Using CDC Replication with IBM InfoSphere DataStage
As part of the configuration process in Management Console, you can generate a definition file (*.dsx) that is imported into InfoSphere® DataStage®.
To generate an InfoSphere DataStage definition file, you must complete the configuration steps in Management Console. Two connection methods are available:
- Flat File
- Uses a file system to deposit source changes for InfoSphere DataStage to retrieve.
- Direct Connect
- Uses TCP/IP as the transport protocol to stream data from the CDC Replication Engine for InfoSphere DataStage to InfoSphere DataStage. To use the full functionality of the Direct Connect option, including the autostart option, you must have Management Console and Access Server installed, and the CDC Replication Engine for InfoSphere DataStage must be installed on the same server as InfoSphere DataStage, a component of IBM® Information Server version 8.5.
Depending on the connection method you choose, CDC Replication sends flat files (Flat File) or streams data (Direct Connect) to InfoSphere DataStage either when data limits are reached or when a refresh or mirroring operation ends. The data limits are determined by the Batch Size Threshold settings that you specify in the InfoSphere DataStage Properties dialog box in Management Console after mapping your tables.
Understanding the Flat File workflow
- SPFolderPath
- The full path name for the folder that InfoSphere DataStage searches for the source flat files created by the CDC Replication Engine for InfoSphere DataStage.
- SPFileNamePattern
- The file name pattern used to identify the source flat files created by the CDC Replication Engine for InfoSphere DataStage.
- SPEndFileNamePattern
- The file name that InfoSphere DataStage creates when subscriptions stop mirroring. The name of this file signals InfoSphere DataStage to stop. If you do not want InfoSphere DataStage to stop, you can change the name of the file with this parameter.
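As an illustration only, the following Python sketch shows how a consumer might combine the three parameters above during one pass over the folder. It is not how the InfoSphere DataStage job is implemented. It assumes that the patterns use shell-style wildcards, and the function name `scan_flat_files` and the file names in the usage note are hypothetical.

```python
import fnmatch
import os


def scan_flat_files(sp_folder_path, sp_file_name_pattern, sp_end_file_name_pattern):
    """Return (data_files, stop_requested) for one pass over the folder.

    Assumption: both patterns are shell-style wildcards. The real pattern
    syntax is defined by the product, not by this sketch.
    """
    names = sorted(os.listdir(sp_folder_path))
    # Files matching the end pattern are status flags, never data.
    end_files = [n for n in names if fnmatch.fnmatchcase(n, sp_end_file_name_pattern)]
    data_files = [
        os.path.join(sp_folder_path, n)
        for n in names
        if fnmatch.fnmatchcase(n, sp_file_name_pattern) and n not in end_files
    ]
    # The presence of any end file is the signal to stop after this pass.
    return data_files, bool(end_files)
```

For example, with a hypothetical data pattern of `ORDERS.*` and an end pattern of `ORDERS.STOPPED`, the function returns the data files in name order and reports a stop request only once a file named `ORDERS.STOPPED` is present in the folder.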
For the Flat File connection method, CDC Replication creates units of work that are picked up and processed by InfoSphere DataStage. The process begins when a refresh or mirroring operation begins, and the CDC Replication Engine for InfoSphere DataStage starts writing change information to temporary data files for only those tables in the subscription for which there are changes. When the Batch Size Threshold limits or the Time Limit Threshold limit are met, whichever comes first, the CDC Replication Engine for InfoSphere DataStage hardens the temporary data files at the subscription level with timestamps in the file names and saves them to the flat file location. No data files are produced for tables that have no changes. When the refresh or mirroring operation ends, <TABLE_NAME>.STOPPED files, which serve as status flags, are produced for each table in the subscription, and then the bookmark is updated. The hardened files are then ready for consumption by the InfoSphere DataStage job.
- On the computer where the source database is installed, the CDC Replication Engine for InfoSphere DataStage service for the database reads the transaction log to capture changes.
- The CDC Replication Engine for InfoSphere DataStage server transfers the change data according to the replication definition.
- The CDC Replication Engine for InfoSphere DataStage server hardens the files and deposits them in the flat file location.
- The InfoSphere DataStage sequential file reader retrieves the flat files as part of an InfoSphere DataStage job and transforms them.
- The InfoSphere DataStage sequential file reader deposits the transformed flat files in the new flat file location.
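The hardening behavior described in this section can be sketched in Python as follows. This is a simplified illustration, not the engine's implementation: it works per table rather than at the subscription level, it checks the time limit only when a change arrives, and the class name `FlatFileBatcher`, the `.tmp` suffix, and the hardened file name format are all assumptions.

```python
import os
import time


class FlatFileBatcher:
    """Per-table writer: temp file -> hardened, timestamped file -> STOPPED flag."""

    def __init__(self, out_dir, table, batch_size_threshold, time_limit_seconds,
                 clock=time.monotonic):
        self.out_dir = out_dir
        self.table = table
        self.batch_size_threshold = batch_size_threshold
        self.time_limit_seconds = time_limit_seconds
        self.clock = clock
        self._tmp = None      # temporary file, created only when a change arrives
        self._rows = 0
        self._started = None
        self._seq = 0

    def _tmp_path(self):
        return os.path.join(self.out_dir, self.table + ".tmp")

    def write_change(self, line):
        if self._tmp is None:
            self._tmp = open(self._tmp_path(), "w", encoding="utf-8")
            self._started = self.clock()
        self._tmp.write(line + "\n")
        self._rows += 1
        # Harden on whichever limit is reached first.
        if (self._rows >= self.batch_size_threshold
                or self.clock() - self._started >= self.time_limit_seconds):
            self.harden()

    def harden(self):
        """Close the temp file and rename it to a timestamped name."""
        if self._tmp is None:
            return None       # no changes, so no data file is produced
        self._tmp.close()
        self._tmp = None
        self._rows = 0
        self._seq += 1
        stamp = time.strftime("%Y%m%d%H%M%S")
        final = os.path.join(self.out_dir, f"{self.table}.{stamp}.{self._seq:04d}")
        os.replace(self._tmp_path(), final)
        return final

    def end_replication(self):
        """Harden any pending data, then write the <TABLE_NAME>.STOPPED flag."""
        self.harden()
        stopped = os.path.join(self.out_dir, self.table + ".STOPPED")
        open(stopped, "w").close()
        return stopped
```

Creating the temporary file lazily mirrors the rule that tables without changes produce no data files, while `end_replication` still writes a STOPPED flag for every table. Renaming with `os.replace` means a consumer that matches only hardened names never sees a partially written file.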
Understanding the Direct Connect workflow
For the Direct Connect connection method, the process is similar. The size and time limits set in the InfoSphere DataStage Properties dialog box determine when data is sent, and the matching Project Name, Job Name, and Connection Key information set in the InfoSphere DataStage Properties dialog box permit the CDC Replication Engine for InfoSphere DataStage to send the data to InfoSphere DataStage directly, without saving any of the data as flat files.
For the Direct Connect connection method, the data is not written to a file. Instead, it is sent over a TCP/IP connection directly to InfoSphere DataStage, where it is processed by a specific InfoSphere DataStage job. You identify that job by specifying the matching Project Name, Job Name, and Connection Key in the InfoSphere DataStage Properties dialog box in Management Console after mapping your tables. The InfoSphere DataStage Connector processes the data, then transforms and translates it into a format recognized by the InfoSphere DataStage job.
Additionally, with the Direct Connect connection method, you can enable the autostart feature to run in active mode, which allows InfoSphere DataStage to start a job when appropriate and begin to stream data to InfoSphere DataStage. Running with autostart enabled requires both the CDC Replication Engine for InfoSphere DataStage and InfoSphere DataStage to be installed on the same server. If autostart is not enabled, you must run jobs from InfoSphere DataStage before the Direct Connect data stream can begin. For instructions on enabling autostart, see the Management Console documentation.
- On the computer where the source database is installed, the CDC Replication Engine for InfoSphere DataStage service for the database reads the transaction log to capture changes.
- The CDC Replication Engine for InfoSphere DataStage server transfers the change data according to the replication definition.
- The CDC Replication Engine for InfoSphere DataStage server sends data to the CDC Transaction stage through a TCP/IP session that is created when replication begins. Periodically, the CDC Replication Engine for InfoSphere DataStage server also sends a COMMIT message along with bookmark information to mark the transaction boundary in the captured log.
- In the InfoSphere DataStage job, the data flows over links from the CDC Transaction stage to the target database connector stage. The bookmark information is sent over a bookmark link. For each COMMIT message sent by the CDC Replication Engine for InfoSphere DataStage server, the CDC Transaction stage creates an end-of-wave (EOW) marker that is sent on output links to the target database connector stage.
- The target database connector stage connects to the target database and sends data over the session. When the target database connector stage receives an EOW marker on all input links, it writes bookmark information to the bookmark table and then commits the transaction to the target database.
- Periodically, the CDC Replication Engine for InfoSphere DataStage server requests bookmark information from the bookmark table on the target database. In response to the request, the CDC Transaction stage fetches the bookmark information through ODBC and returns it to the CDC Replication Engine for InfoSphere DataStage server.
- The CDC Replication Engine for InfoSphere DataStage server receives the bookmark information, which is used to determine:
  - The starting point in the transaction log where changes are read when replication begins. This starting point is the ending point from the previous replication.
  - Whether the existing transaction log can be cleaned up.
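The commit behavior in the steps above, where the bookmark is written in the same transaction as the data once an EOW marker has arrived on every input link, can be sketched in Python with the standard `sqlite3` module. This is an illustration under stated assumptions, not the connector stage itself: the class `TargetConnectorSketch`, the single-row `bookmark` table layout, and the link names are hypothetical.

```python
import sqlite3


class TargetConnectorSketch:
    """Writes rows, then commits data and bookmark together on a full EOW."""

    def __init__(self, conn, input_links):
        self.conn = conn
        self.input_links = set(input_links)
        self.eow_seen = set()
        self.bookmark = None
        # Hypothetical bookmark table holding a single row.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS bookmark "
            "(id INTEGER PRIMARY KEY CHECK (id = 1), position TEXT)"
        )

    def on_row(self, sql, params):
        # Data is written inside an open transaction, not yet committed.
        self.conn.execute(sql, params)

    def on_bookmark(self, position):
        # Bookmark information arrives over the bookmark link.
        self.bookmark = position

    def on_eow(self, link):
        """Record an EOW marker; commit once every input link has one."""
        self.eow_seen.add(link)
        if self.eow_seen != self.input_links:
            return False
        self.conn.execute(
            "INSERT OR REPLACE INTO bookmark (id, position) VALUES (1, ?)",
            (self.bookmark,),
        )
        self.conn.commit()   # data and bookmark become durable together
        self.eow_seen.clear()
        return True

    def read_bookmark(self):
        """Return the stored position, or None if nothing is committed yet."""
        row = self.conn.execute("SELECT position FROM bookmark WHERE id = 1").fetchone()
        return row[0] if row else None
```

Because the bookmark row is written in the same transaction as the data, the position returned by `read_bookmark` always corresponds to changes that are already committed on the target. That property is what makes it safe to use the bookmark as the restart point in the transaction log.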