Using CDC Replication with IBM InfoSphere DataStage

As part of the configuration process in Management Console, you can generate a definition file (*.dsx) that is imported into InfoSphere® DataStage®.

To generate a InfoSphere DataStage definition file, you must complete the configuration steps in Management Console.

The .dsx definition file you generate in Management Console and import into InfoSphere DataStage contains the information that is used to re-create columns in InfoSphere DataStage based on the data types of the source columns as determined by your table mapping choices. The .dsx file also contains information on which of the connection methods that you select when you map your tables. The connection type options are:

Flat File: Uses a file system to deposit source changes for InfoSphere DataStage to retrieve.
Direct Connect: Uses TCP/IP as the transport protocol to stream data from the CDC Replication Engine for InfoSphere DataStage to InfoSphere DataStage. Note that to use the full functionality of the Direct Connect option, including the autostart option, you must have Management Console and Access Server installed as well as having the CDC Replication Engine for InfoSphere DataStage installed on the same server as InfoSphere DataStage, a component of IBM® Information Server version 8.5.

Depending on the connection method you choose, flat files are sent (Flat File) or data is streamed (Direct Connect) to InfoSphere DataStage by CDC Replication when either when data limits are reached (determined by the Batch Size Threshold settings you've indicated in the InfoSphere DataStage Properties dialog box in Management Console after mapping your tables) or when a refresh or mirroring operation ends.

Understanding the Flat File workflow

For the Flat File connection method, the package consists of a job sequence, a parallel job, and two utility routines that are used by the job sequence. The job sequence has three parameters. The values for these parameters are specified by Management Console when it generates the InfoSphere DataStage .dsx definition file:

SPFolderPath: The full path name for the folder that InfoSphere DataStage searches for the source flat files created by the CDC Replication Engine for InfoSphere DataStage.
SPFileNamePattern: The file name pattern used to identify the source flat files created by the CDC Replication Engine for InfoSphere DataStage.
SPEndFileNamePattern: the file name thatInfoSphere DataStage creates when subscriptions stop mirroring. The name of this file signals InfoSphere DataStage to stop. If you do not want InfoSphere DataStage to stop, you can change the name of the file with this parameter.

For the Flat File connection method, CDC Replication creates units of work that will be picked up and processed by InfoSphere DataStage. The process begins once a refresh or mirroring operation begins, and the CDC Replication Engine for InfoSphere DataStage starts writing change information to temporary data files for only those tables in the subscription for which there are changes. Once the Batch Size Threshold limits (or the Time Limit Threshold limit, whichever comes first) are met, the CDC Replication Engine for InfoSphere DataStage hardens the temporary data files at the subscription level with timestamps in the filenames and saves them to the flat file location. No data files are produced for tables that have no changes. Once the refresh or mirroring operation is ended, <TABLE_NAME>.STOPPED files, which serve as status flags, are produced for each table in the subscription, then the bookmark is updated. These files are ready for consumption by the InfoSphere DataStage job.

Attention: If you kill a refresh or mirroring operation using the dmterminate command, the temporary data files cannot be hardened at the subscription level, no <TABLE_NAME>.STOPPED status flag files are generated for the tables in the subscription, and the bookmark is not updated. You must restart the refresh or mirroring process. Be aware that restarting uses the last-saved bookmark and starts a new set of temporary data files to be hardened as thresholds are met. To ensure that the temporary data files are hardened, and the <TABLE_NAME>.STOPPED status flag files are created, use a Normal or Scheduled End shutdown in Management Console, or you can issue a dmshutdown command with the appropriate flags for the severity level. If you use the Abort or Immediate shutdown options, the CDC Replication Engine for InfoSphere DataStage may opt not harden the temporary data files as a way of facilitating these more rapid shutdown requests.

The following diagram illustrates the basic workflow of an InfoSphere DataStage job that using the .dsx definition file generated in Management Console for use in the Flat File connection method. Note that this represents the most basic workflow for the Flat File replication method. Once you have generated the .dsx definition file and imported it into the InfoSphere DataStage Designer, you can define additional stages as necessary and configure the business logic in InfoSphere DataStage Designer to suit your data transformation requirements.

A depiction of the data flow between InfoSphere CDC and InfoSphere DataStage in the Flat File replication method.

On the computer where the source database is installed, the CDC Replication Engine for InfoSphere DataStage service for the database reads the transaction log to capture changes.
The CDC Replication Engine for InfoSphere DataStage server transfers the change data according to the replication definition.
The CDC Replication Engine for InfoSphere DataStage server hardens the files and deposits them in the flat file location.
The InfoSphere DataStage sequential file reader retrieves the flat files as part of an InfoSphere DataStage job and transforms them.
The InfoSphere DataStage sequential file reader deposits the transformed flat files in the new flat file location.

Note: This represents the most basic workflow for the Flat File replication method. Once you have generated the .dsx definition file and imported it into the InfoSphere DataStage Designer, you can define additional stages as necessary and configure the business logic in InfoSphere DataStage Designer to suit your data transformation requirements.

Understanding the Direct Connect workflow

For the Direct Connect connection method, the process is similar. The size and time limits set in the InfoSphere DataStage Properties dialog box determine when data is sent, and the matching Project Name, Job Name, and Connection Key information set in the InfoSphere DataStage Properties dialog box permit the CDC Replication Engine for InfoSphere DataStage to send the data to InfoSphere DataStage directly, without saving any of the data as flat files.

For the Direct Connect connection method, the data is not written to a file, but is sent instead over a TCP/IP connection directly to InfoSphere DataStage to be processed by a specific InfoSphere DataStage job that you have identified by specifying the matching Project Name, Job Name, and Connection Key in the InfoSphere DataStage Properties dialog box in Management Console after mapping your tables. The InfoSphere DataStage Connector processes the data, then transforms and translates it into a format recognized by the InfoSphere DataStage job.

Additionally, with the Direct Connect connection method, you can enable the autostart feature to run in active mode, which allows InfoSphere DataStage to start a job when appropriate and begin to stream data to InfoSphere DataStage. Running with autostart enabled requires both the CDC Replication Engine for InfoSphere DataStage and InfoSphere DataStage to be installed on the same server. If autostart is not enabled, you must run jobs from InfoSphere DataStage before the Direct Connect data stream can begin. For instructions on enabling autostart, see the Management Console documentation.

Important: In order to use the autostart function in the Direct Connect method on a UNIX or Linux system, ensure that you have correctly set the database library directory in the dsenv file for InfoSphere Information Server. See the topic on Ensuring that InfoSphere DataStage users have the correct localization settings (Linux, UNIX) and Configuring the dsenv file in the InfoSphere Information Server Planning, Installation, and Configuration Guide.

The following figure illustrates workflow when you use the Direct Connect connection method:

A depiction of the data flow between InfoSphere CDC and InfoSphere DataStage in the Direct Connect replication method.

On the computer where the source database is installed, the CDC Replication Engine for InfoSphere DataStage service for the database reads the transaction log to capture changes.
The CDC Replication Engine for InfoSphere DataStage server transfers the change data according to the replication definition.
The CDC Replication Engine for InfoSphere DataStage server sends the CDC Transaction stage through a TCP/IP session that is created with replication begins. Periodically, the CDC Replication Engine for InfoSphere DataStage server also sends a COMMIT message along with bookmark information to mark the transaction boundary in the captured log.
In the InfoSphere DataStage job, the data flows over links from the CDC Transaction stage to the target database connector stage. The bookmark information is sent over a bookmark link. For each COMMIT message sent by the CDC Replication Engine for InfoSphere DataStage server, CDC Transaction stage creased end-of-wave (EOW) marker that is sent on output links to the target database connector stage.
The target database connector stage connects to the target database and sends data over the session. When the target database connector stage receives an EOW marker on all input links, it writes bookmark information to the bookmark table and then commits the transaction to the target database.
Periodically, the CDC Replication Engine for InfoSphere DataStage server requests bookmark information from the bookmark table on the target database. In response to the request, the CDC Transaction stage fetches the bookmark information through ODBC and returns it to the CDC Replication Engine for InfoSphere DataStage server.
The CDC Replication Engine for InfoSphere DataStage server receives the bookmark information which is used to determine:
- the starting point in the transaction log where changes are read when replication begins; the starting point in the transaction log is the ending point from the previous replication.
- if the existing transaction log can be cleaned up.