IBM InfoSphere Foundation Tools, IBM InfoSphere Information Server, Version 8.5

IBM InfoSphere DataStage and QualityStage Designer

The IBM® InfoSphere® DataStage® and QualityStage® Designer helps you create, manage, and design jobs. You can also use the Designer client to define tables and access metadata services.

Table definitions
Accessing metadata services
Creating jobs
Designing jobs
Stage properties
Complex Flat File stage
Transformer stage
Slowly Changing Dimension stage
Dynamic Relational stage
SQL builder
Job sequences

Table definitions

You can import, create, and edit table definitions from many sources (for example, one source of table definitions is metadata from IBM InfoSphere Information Analyzer). When you edit or view a table, the Table Definitions window opens, as Figure 1 shows.

Figure 1. Table Definitions window

Table Definitions window of the InfoSphere DataStage and QualityStage Designer

This window has the following pages:

General: Contains data source and description information.
Columns: Contains information about the columns including key values, SQL type, and length.
Format: Contains information that describes data format when the data is read from or written to a sequential file.
Relationships: Provides foreign key information about the table.
Parallel: Shows extended properties for table definitions that you can use in parallel jobs.
Layout: Shows the schema format of the column definitions in a table.
Locator: Enables you to view and edit the data resource locator that is associated with the table definition. The data resource locator describes the real-world object.
NLS (if installed): Shows the current character set map for the table definitions.
Analytical information: Shows metadata that InfoSphere Information Analyzer generated.

Accessing metadata services

InfoSphere DataStage and QualityStage Designer accesses the metadata repository to obtain live access to current metadata about integration projects and your organization's enterprise data. You access data that is generated by metadata bridges or InfoSphere Information Analyzer by using the Designer client. The following services provide designers with access to metadata:

Simple and advanced find service: Enables you to search the repository for objects
Where used or impact analysis service: Shows both “used by” and “depends on” relationships

An option in the InfoSphere DataStage and QualityStage Designer shows differences between jobs or table definitions in an InfoSphere DataStage context. Figure 2 shows a textual report with links to the relevant editor in the Designer client. You can also view differences for subsets of jobs such as shared containers and routines. This report can optionally be saved as an XML file.

Figure 2. Job difference report

Job difference report in the InfoSphere DataStage and QualityStage Designer

Creating jobs

When you use the Designer client, you choose the type of job to create and how to create it, as Figure 3 shows.

Figure 3. Choosing a job type

Window for choosing a job type in the Designer

Different job types include parallel, mainframe, and job sequences. Job templates help you build jobs quickly by providing predefined job properties that you can customize. Job templates also provide a basis for commonality between jobs and job designers.

You use the design canvas window and tool palette to design, edit, and save the job, as shown in Figure 4.

Figure 4. Simple IBM InfoSphere DataStage job

Example of a simple InfoSphere DataStage job with a file source, Transformer stage, and load into an Oracle database

Figure 4 shows the most basic IBM InfoSphere DataStage job, which contains three stages:

Data source (input) stage
Transformation (processing) stage
Target (output) stage

IBM InfoSphere DataStage jobs can be as sophisticated as required by your company's data integration needs. Figure 5 is an example of a more complex job.

Figure 5. More complex IBM InfoSphere DataStage job

This figure is described in the surrounding text.

Designing jobs

With the Designer client, you draw the integration process and then add the details for each stage. This method helps you build and reuse components across jobs. The Designer client minimizes the coding that is required to define even the most difficult and complex integration process.

Each data source and each processing step is a stage in the job design. The stages are linked to show the flow of data. You drag stages from the tool palette to the canvas. This palette contains icons for stages and groups that you can customize to organize stages, as shown in Figure 6.

Figure 6. Tool palette

After stages are in place, they are linked together in the direction that the data will flow. For example, in Figure 4, two links were added:

One link between the data source (Sequential File stage) and Transformer stages
One link between the Transformer stage and the Oracle target stage

You load table definitions for each link from a stage property editor, or select definitions from the repository and drag them onto a link.

Stage properties

Each stage in a job has properties that specify how the stage performs or processes data. Stage properties include file name for the Sequential File stage, columns to sort and the ascending-descending order for the Sort stage, database table name for a database stage, and so on. Each stage type uses a graphical editor.

Complex Flat File stage

The Complex Flat File (CFF) stage allows the reading and writing of data files that contain numerous record formats in a single file. Figure 7 shows a three-record join. This stage supports both fixed and variable-length records and joins data from different record types in a logical transaction into a single data record for processing. For example, you might join customer, order, and units data.

Figure 7. Complex Flat File stage window

The CFF stage and Slowly Changing Dimension stage offer a Fast Path concept for improved usability and faster implementation. The Fast Path walks you through the screens and tables of the stage properties that are required for processing the stage. Help is available for each tab by hovering the mouse over the "i" in the lower left.

Transformer stage

Transformer stages can have one primary input link, multiple reference input links, and multiple output links. The link from the main data input source is designated as the primary input link. You use reference links for lookup operations, for example, to provide information that might affect the way the data is changed, but not supplying the actual data to be changed.

Input columns are shown on the left and output columns are shown on the right. The upper panes show the columns with derivation details. The lower panes show the column metadata.

Some data might have to pass through the Transformer stage unaltered, but it is likely that data from some input columns must be transformed first. You can specify such an operation by entering an expression or selecting a transform to apply to the data, called a derivation. IBM InfoSphere DataStage has many built-in functions to use inside the derivations. You can also define custom transform functions that are then stored in the repository for reuse.

You can also specify constraints that operate on entire output links. A constraint is an expression that specifies criteria that data must meet before it can pass to the output link.

Slowly Changing Dimension stage

A typical design for an analytical system is based on a dimensional database that consists of a central fact table that is surrounded by a single layer of smaller dimension tables, each containing a single primary key. This design is also known as a star schema.

Star schema data is typically found in the transactional and operational systems that capture customer information, sales data, and other critical business information. One of the major differences between a transactional system and an analytical system is the need to accurately record the past. Analytical systems often must detect trends to enable managers to make strategic decisions. For example, a product definition in a sales tracking data mart is a dimension that will likely change for many products over time but this dimension typically changes slowly. One major transformation and movement challenge is how to enable systems to track changes that occur in these dimensions over time. In many situations, dimensions change only occasionally.

Figure 8 shows a typical primary key, the product sales keeping unit (PRODSKU).

Figure 8. Looking up primary key for a dimension table

The Slowly Changing Dimension (SCD) stage processes source data for a dimension table within the context of a star schema database structure. The stage lets you overwrite the existing dimension (known as a Type-1 change), update while preserving rows (known as Type 2), or have a hybrid of both types. To prepare data for loading, the SCD stage performs the following process for each changing dimension in the star schema:

Business keys from the source are used to look up a surrogate key in each dimension table. Typically the dimension row is found.
If a dimension row is not found, a row must be created with a surrogate key.
If a dimension row is found but must be updated (Type-1), the update must be done.
For preserving history (Type-2), a new row is added and the original row is marked. A surrogate key is added to the source data and non-fact data is deleted.

In a Type-2 update, a new row with a new surrogate primary key is inserted into the dimension table to capture changes. All the rows that describe a dimension contain attributes that uniquely identify the most recent instance and historical dimensions. Figure 9 shows how the new product dimension is redefined to include the data that goes into the dimension table and also contains the surrogate key, expiry date, and the currency indicator.

Figure 9. Redefining a dimension table

Finally, the new record is written into the dimension table (with all surrogate keys), reflecting the change in product dimension over time. Although the product sales keeping unit has not changed, the database structure enables the user to identify sales of current versions versus earlier versions of the product.

Dynamic Relational stage

Although IBM InfoSphere DataStage provides specific connectivity to virtually any database management system, the Dynamic Relational stage allows the binding of the type (for example, Oracle, IBM DB2®, or SQL Server) to be specified at run time rather than design time. The Dynamic Relational stage reads data from or writes data to a database. Figure 10 shows the general information about the database stage including the database type, name, user ID, and password that is used to connect. Passwords can be encrypted.

Figure 10. Designing for the Dynamic Relational stage

Database information shown while designing for the Dynamic Relational stage

SQL builder

For developers who need to use SQL expressions to define database sources, the SQL builder utility provides a graphical interface for building simple-to-complex SQL query statements. The SQL builder supports DB2, Oracle, SQL Server, Teradata, and ODBC databases. Although ODBC can be used to build SQL that will work for a broad range of databases, the database-specific parsers help you take advantage of database-specific functionality. Figure 11 shows how the SQL builder guides developers in creating well-formed SQL queries.

Figure 11. SQL builder utility

Job sequences

IBM InfoSphere DataStage provides a graphical job sequencer in which you can specify a sequence of jobs to run. The sequence can also contain control information. For example, the sequence might indicate different actions depending on whether a job in the sequence succeeds or fails. After you define a job sequence, you can schedule and run the sequence by using the Director client, the command line, or an API. The sequence appears in the repository and in the Director client as a job.

Designing a job sequence is similar to designing jobs. You create the job sequence in the InfoSphere DataStage and QualityStage Designer, and add activities (rather than stages) from the tool palette. You then join activities with triggers (rather than links) to define control flow. Each activity has properties that can be tested in trigger expressions and passed to other activities farther down the sequence. Activities can also have parameters, which supply job parameters and routine arguments.

The job sequence has properties and can have parameters that can be passed to the activities that it is sequencing. The sample job sequence in Figure 12 shows a typical sequence that is triggered by an arriving file. The job also contains exception handling with looping and flow control.

Figure 12. Sample job sequence

The job sequence supports the following types of activities:

Job: Specifies an IBM InfoSphere DataStage job.
Routine: Specifies a routine.
ExecCommand: Specifies an operating system command to run.
E-mail notification: Specifies that an e-mail notification must be sent at this point of the sequence by using Simple Mail Transfer Protocol (SMTP). This method is often used in exception and error handling.
Wait-for-file: Waits for a specified file to appear or disappear. This activity can send a stop message to a sequence after waiting a specified period of time for a file to appear or disappear.
Run-activity-on-exception: Only one run-activity-on-exception is allowed in a job sequence. This activity runs if a job in the sequence fails to run. (Other exceptions are handled by triggers.)
Checkpoint, restart option for job sequences:: The checkpoint property on job sequences allows a sequence to be restarted at the failed point.
Looping stages: StartLoop and EndLoop activities make the job sequencer more flexible and give you more control.
User expressions and variables: Enables you to define and set variables. You can use these variables to evaluate expressions within a job sequence flow.
Abort-activity-on-exception: Stops job sequences when problems occur.

Job management

The Designer client manages the project data, enabling you to view and edit items that are stored in the metadata repository. This functionality enables you to import and export items between different IBM InfoSphere DataStage systems and exchange metadata with other tools. You can request reports on items in the metadata repository.

The Designer client provides the following capabilities:

Importing and exporting DSX and XML files
EE configuration file editor
Table definitions import
Message Handler Manager
MetaBroker® import and export
Importing Web service definitions
Importing IMS™ definitions
JCL templates editor

Figure 13 shows the Designer client window for importing table definitions.

Figure 13. Importing table definitions

Importing table definitions in the InfoSphere DataStage and QualityStage Designer

Importing and exporting jobs

The InfoSphere DataStage and QualityStage Designer enables you to import and export components for moving jobs among IBM InfoSphere DataStage development, test, and production environments. You can import and export any component in the repository, including a job.

The export facility is also valuable for generating XML documents that describe objects in the repository. You can use a Web browser to view these documents. The Designer client also includes an import facility for importing InfoSphere DataStage components from XML documents.

This topic is also in the IBM InfoSphere Information Server Introduction.

Update timestamp

Last updated: 2012-9-20