Using CDC Replication with IBM DataStage
As part of the configuration process in Management Console, you can generate a definition file (*.dsx) that is imported into IBM® DataStage®.
To generate an IBM DataStage definition file, you must first complete the configuration steps in Management Console.
The .dsx definition file that you generate in Management Console and import into IBM DataStage contains the information that is used to re-create columns in IBM DataStage, based on the data types of the source columns as determined by your table mapping choices. The .dsx file also records the connection method that you select when you map your tables. For V11.4 and later, the supported connection type is Flat File, which uses a file system to deposit source changes for IBM DataStage to retrieve.
CDC Replication sends flat files to IBM DataStage either when data limits are reached (as determined by the Batch Size Threshold settings that you specified in the IBM DataStage Properties dialog box in Management Console after mapping your tables) or when a refresh or mirroring operation ends.
Understanding the Flat File workflow
- SPFolderPath: The full path name for the folder that IBM DataStage searches for the source flat files created by the CDC Replication Engine for InfoSphere® DataStage.
- SPFileNamePattern: The file name pattern used to identify the source flat files created by the CDC Replication Engine for InfoSphere DataStage.
- SPEndFileNamePattern: The file name that IBM DataStage creates when subscriptions stop mirroring. The name of this file signals IBM DataStage to stop. If you do not want IBM DataStage to stop, you can change the name of the file with this parameter.
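The way the three parameters above work together can be pictured with a short Python sketch. This is an illustration only, not how IBM DataStage implements its lookup: the function name scan_flat_file_folder is hypothetical, and the sketch assumes that the patterns behave like shell-style wildcards, which the text above does not specify.

```python
import fnmatch
import os


def scan_flat_file_folder(sp_folder_path, sp_file_name_pattern,
                          sp_end_file_name_pattern):
    """Make one pass over the folder (hypothetical helper).

    Returns (data_files, stop_requested):
      data_files      - sorted full paths of files matching the data pattern
      stop_requested  - True if a file matching the end pattern is present
    Assumption: both patterns are shell-style wildcards.
    """
    names = sorted(os.listdir(sp_folder_path))
    data_files = [
        os.path.join(sp_folder_path, name)
        for name in names
        if fnmatch.fnmatch(name, sp_file_name_pattern)
        and not fnmatch.fnmatch(name, sp_end_file_name_pattern)
    ]
    stop_requested = any(
        fnmatch.fnmatch(name, sp_end_file_name_pattern) for name in names
    )
    return data_files, stop_requested
```

In this sketch, changing the end-file pattern so that it no longer matches the file that is written when mirroring stops means that stop_requested stays False, which mirrors the effect described for SPEndFileNamePattern.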
For the Flat File connection method, CDC Replication creates units of work that are picked up and processed by IBM DataStage. The process starts when a refresh or mirroring operation begins: the CDC Replication Engine for InfoSphere DataStage writes change information to temporary data files, but only for those tables in the subscription that have changes. When the Batch Size Threshold or the Time Limit Threshold is reached, whichever comes first, the CDC Replication Engine for InfoSphere DataStage hardens the temporary data files at the subscription level, adds timestamps to the file names, and saves the files to the flat file location. No data files are produced for tables that have no changes. When the refresh or mirroring operation ends, <TABLE_NAME>.STOPPED files, which serve as status flags, are produced for each table in the subscription, and then the bookmark is updated. At that point, the files are ready for consumption by the IBM DataStage job.
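The threshold behavior described above can be sketched in Python as follows. This is a simplified model, not the engine's implementation: the class FlatFileBatcher, the data file naming scheme (table name, timestamp, sequence number), and the choice to count the batch size in rows are all assumptions made for illustration. The sketch also buffers changes in memory instead of writing them to temporary files as they arrive.

```python
import os
import time


class FlatFileBatcher:
    """Hypothetical model of threshold-based hardening of flat files."""

    def __init__(self, out_dir, tables, batch_size_threshold,
                 time_limit_seconds, clock=time.monotonic):
        self.out_dir = out_dir
        self.tables = list(tables)          # every table in the subscription
        self.batch_size_threshold = batch_size_threshold
        self.time_limit_seconds = time_limit_seconds
        self.clock = clock
        self.pending = {}                   # only tables that have changes
        self.batch_started = clock()
        self.sequence = 0

    def add_change(self, table, row):
        self.pending.setdefault(table, []).append(row)
        rows = sum(len(r) for r in self.pending.values())
        elapsed = self.clock() - self.batch_started
        # Harden on whichever limit is reached first.
        if rows >= self.batch_size_threshold or elapsed >= self.time_limit_seconds:
            self.harden()

    def harden(self):
        written = []
        if self.pending:
            self.sequence += 1
            stamp = time.strftime("%Y%m%d%H%M%S")
            for table, rows in self.pending.items():
                tmp = os.path.join(self.out_dir, f"{table}.tmp")
                # Assumed naming scheme; the real file name format may differ.
                final = os.path.join(
                    self.out_dir, f"{table}.D{stamp}.{self.sequence:06d}")
                with open(tmp, "w", encoding="utf-8") as fh:
                    fh.writelines(line + "\n" for line in rows)
                os.replace(tmp, final)      # rename makes the file visible at once
                written.append(final)
            self.pending.clear()
        self.batch_started = self.clock()
        return written

    def stop(self):
        written = self.harden()
        # A status flag is produced for every table, changed or not.
        for table in self.tables:
            open(os.path.join(self.out_dir, f"{table}.STOPPED"), "w").close()
        return written
```

Writing to a temporary name and renaming afterward is one common way to make sure a reader never sees a half-written file, which is the role that hardening plays in the description above.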
- On the computer where the source database is installed, the CDC Replication Engine for InfoSphere DataStage service for the database reads the transaction log to capture changes.
- The CDC Replication Engine for InfoSphere DataStage server transfers the change data according to the replication definition.
- The CDC Replication Engine for InfoSphere DataStage server hardens the files and deposits them in the flat file location.
- The IBM DataStage sequential file reader retrieves the flat files as part of an IBM DataStage job and transforms them.
- The IBM DataStage sequential file reader deposits the transformed flat files in the new flat file location.
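The last two steps above, retrieving a flat file, transforming it, and depositing the result in a new location, can be pictured with a minimal Python sketch. An actual IBM DataStage job is built from stages rather than written as Python, so this only illustrates the data flow; the function process_flat_file and the line-by-line transform callback are hypothetical.

```python
import os


def process_flat_file(source_path, target_dir, transform):
    """Read one hardened flat file, transform each record, and write the
    result under the same file name in target_dir (hypothetical helper).

    transform is any function that maps one input line to one output line.
    """
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, os.path.basename(source_path))
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(transform(line.rstrip("\n")) + "\n")
    return target_path
```

For example, process_flat_file(path, "/data/transformed", str.upper) would write an uppercased copy of the file; the target folder name here is a placeholder.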