Using CDC Replication with IBM DataStage

As part of the configuration process in Management Console, you can generate a definition file (*.dsx) that is imported into IBM® DataStage®.

To generate an IBM DataStage definition file, you must complete the configuration steps in Management Console.

The .dsx definition file that you generate in Management Console and import into IBM DataStage contains the information that is used to re-create columns in IBM DataStage, based on the data types of the source columns as determined by your table mapping choices. The .dsx file also records which connection method you selected when you mapped your tables. For V11.4 and later, the supported connection type is Flat File, which uses a file system to deposit source changes for IBM DataStage to retrieve.

CDC Replication sends flat files to IBM DataStage either when data limits are reached (as determined by the Batch Size Threshold settings that you specified in the IBM DataStage Properties dialog box in Management Console after mapping your tables) or when a refresh or mirroring operation ends.

Understanding the Flat File workflow

For the Flat File connection method, the package consists of a job sequence, a parallel job, and two utility routines that are used by the job sequence. The job sequence has three parameters. The values for these parameters are specified by Management Console when it generates the IBM DataStage .dsx definition file:
SPFolderPath
The full path name for the folder that IBM DataStage searches for the source flat files created by the CDC Replication Engine for InfoSphere® DataStage.
SPFileNamePattern
The file name pattern used to identify the source flat files created by the CDC Replication Engine for InfoSphere DataStage.
SPEndFileNamePattern
The file name that IBM DataStage creates when subscriptions stop mirroring. The name of this file signals IBM DataStage to stop. If you do not want IBM DataStage to stop, you can change the name of the file with this parameter.
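
To make the role of the three parameters concrete, the following Python sketch shows one polling pass that a consumer of the flat file folder might perform. It is an illustration only, not the logic of the generated job sequence: the function name scan_folder, the use of glob-style matching, and the file names in the demo are all assumptions.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def scan_folder(sp_folder_path, sp_file_name_pattern, sp_end_file_name_pattern):
    """Return (data_files, stop_requested) for one pass over the folder.

    Assumption: both patterns are treated as glob-style patterns.
    """
    data_files = []
    stop_requested = False
    for entry in sorted(Path(sp_folder_path).iterdir()):
        if not entry.is_file():
            continue
        if fnmatch(entry.name, sp_end_file_name_pattern):
            # The end file signals that processing should stop.
            stop_requested = True
        elif fnmatch(entry.name, sp_file_name_pattern):
            data_files.append(entry.name)
    return data_files, stop_requested


# Demo with hypothetical file names in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    for name in ("ORDERS.D2024001.T120000", "ORDERS.D2024001.T120500",
                 "ORDERS.STOPPED", "notes.txt"):
        (Path(folder) / name).touch()
    data_files, stop_requested = scan_folder(folder, "ORDERS.D*", "ORDERS.STOPPED")
    print(data_files, stop_requested)
```

In this sketch, pointing the third argument at a name that is never created means the stop condition is never met, which mirrors the effect described for SPEndFileNamePattern.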

For the Flat File connection method, CDC Replication creates units of work that are picked up and processed by IBM DataStage. The process begins when a refresh or mirroring operation begins, and the CDC Replication Engine for InfoSphere DataStage starts writing change information to temporary data files for only those tables in the subscription for which there are changes. Once the Batch Size Threshold limits or the Time Limit Threshold limit is met, whichever comes first, the CDC Replication Engine for InfoSphere DataStage hardens the temporary data files at the subscription level with timestamps in the file names and saves them to the flat file location. No data files are produced for tables that have no changes. Once the refresh or mirroring operation ends, <TABLE_NAME>.STOPPED files, which serve as status flags, are produced for each table in the subscription, and then the bookmark is updated. These files are ready for consumption by the IBM DataStage job.
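
The "whichever comes first" rule can be illustrated with a small simulation. The sketch below groups change arrival times into batches; it is not the engine's implementation. The function name plan_batches is hypothetical, the time limit is assumed to be measured from the first change in a batch, and the limit is checked only when the next change arrives, whereas a real engine would presumably use a timer.

```python
def plan_batches(change_times, batch_size_threshold, time_limit_seconds):
    """Group change arrival times (in seconds) into batches.

    A batch is closed when it holds batch_size_threshold changes or when
    time_limit_seconds have passed since it was opened, whichever comes first.
    """
    batches = []
    current = []
    opened_at = None
    for t in change_times:
        # Time limit reached: close the open batch before adding this change.
        if current and t - opened_at >= time_limit_seconds:
            batches.append(current)
            current = []
        if not current:
            opened_at = t
        current.append(t)
        # Size limit reached: close the batch immediately.
        if len(current) >= batch_size_threshold:
            batches.append(current)
            current = []
    if current:
        # Remaining changes are flushed when the operation ends.
        batches.append(current)
    return batches


batches = plan_batches([0, 1, 2, 3, 10, 11, 30],
                       batch_size_threshold=3, time_limit_seconds=5)
print(batches)  # [[0, 1, 2], [3], [10, 11], [30]]
```

The first batch closes on the size limit, the second and third close on the time limit, and the last is flushed at the end of the operation.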

Attention: If you kill a refresh or mirroring operation using the dmterminate command, the temporary data files cannot be hardened at the subscription level, no <TABLE_NAME>.STOPPED status flag files are generated for the tables in the subscription, and the bookmark is not updated. You must restart the refresh or mirroring process. Be aware that restarting uses the last-saved bookmark and starts a new set of temporary data files to be hardened as thresholds are met. To ensure that the temporary data files are hardened and the <TABLE_NAME>.STOPPED status flag files are created, use a Normal or Scheduled End shutdown in Management Console, or issue a dmshutdown command with the appropriate flags for the severity level. If you use the Abort or Immediate shutdown options, the CDC Replication Engine for InfoSphere DataStage may opt not to harden the temporary data files as a way of facilitating these more rapid shutdown requests.
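
After a shutdown, one way to check whether the operation ended cleanly is to look for the <TABLE_NAME>.STOPPED flag of every table in the subscription. The following Python sketch does that under stated assumptions: the helper name missing_stopped_flags is hypothetical, the flag files are assumed to sit directly in the flat file location, and the table names in the demo are made up.

```python
import tempfile
from pathlib import Path


def missing_stopped_flags(flat_file_dir, table_names):
    """Return the tables that have no <TABLE_NAME>.STOPPED file in flat_file_dir."""
    directory = Path(flat_file_dir)
    return [table for table in table_names
            if not (directory / f"{table}.STOPPED").is_file()]


# Demo: only one of two hypothetical tables has its status flag.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "ORDERS.STOPPED").touch()
    missing = missing_stopped_flags(folder, ["ORDERS", "CUSTOMERS"])
    print(missing)  # ['CUSTOMERS']
```

An empty result suggests that the status flags were written for all tables; a non-empty result is consistent with a terminated or aborted operation that needs to be restarted.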
The following diagram illustrates the basic workflow of an IBM DataStage job that uses the .dsx definition file generated in Management Console for the Flat File connection method. Note that this represents the most basic workflow for the Flat File replication method. Once you have generated the .dsx definition file and imported it into the IBM DataStage Designer, you can define additional stages as necessary and configure the business logic in IBM DataStage Designer to suit your data transformation requirements.
A depiction of the data flow between CDC Replication and IBM DataStage in the Flat File replication method.
  1. On the computer where the source database is installed, the CDC Replication Engine for InfoSphere DataStage service for the database reads the transaction log to capture changes.
  2. The CDC Replication Engine for InfoSphere DataStage server transfers the change data according to the replication definition.
  3. The CDC Replication Engine for InfoSphere DataStage server hardens the files and deposits them in the flat file location.
  4. The IBM DataStage sequential file reader retrieves the flat files as part of an IBM DataStage job and transforms them.
  5. The IBM DataStage sequential file reader deposits the transformed flat files in the new flat file location.