Data lineage and business lineage reports

Data lineage reports show the movement of data through a job or multiple jobs. These reports can show the order of activities within a run of a job. Business lineage reports show a simplified view of lineage that highlights the transformation and aggregation of data that is needed by a business user. Business lineage reports do not show jobs and mapping specification asset types.

When you run reports, they display information assets in the context of your enterprise goals. You see them not as isolated database tables, database columns, jobs, or stages, but as integrated parts of the process that extracts, loads, investigates, cleanses, transforms, and reports on your data. Your lineage reports can also include virtual assets, which represent data sources that were not imported or created in the catalog, but are accessed by a job.

On request, lineage reports can show the impact dependencies in addition to data flows. Impacted assets are assets that are influenced by processes other than data flow. Such processes include job scheduling, job optimization management, and rule invocation. For example, you use the Balanced Optimization process in IBM® InfoSphere® DataStage® and QualityStage® Designer to analyze a root job and then to create an optimized job that does the same thing, but with improvements in performance and resource usage. The root job is linked to the optimized job so that impact flow, when enabled, shows the dependency.

Report types

You can run these types of reports:
Data Lineage
Data lineage reports can show different types of information:
  • The flow of data to or from a selected information asset, through stages and stage columns, through one or more jobs, into databases and business intelligence (BI) reports.
  • The order of activities within a job run, including the database tables that the jobs write to or read from.

A user of an external program, such as IBM Cognos®, can create a data lineage report for an asset. The user must have at least the Information Governance Catalog User role. The report is displayed in a new window in the web browser.

Business Lineage

Business lineage reports do not display extension mapping documents or jobs from IBM InfoSphere DataStage and QualityStage. Data still flows through assets that are not displayed in the report.

The Information Governance Catalog Information Asset Administrator configures which information assets are displayed in business lineage reports. The business lineage report displays the graphical and textual components for only source, target, and intermediate assets that are configured to be included in business lineage.

Automatic update of lineage flows and information latency in lineage reports

IBM InfoSphere Information Governance Catalog analyzes and indexes lineage flows to make them available for lineage reports on subsequent requests. When changes occur in assets that affect lineage, InfoSphere Information Governance Catalog analyzes them and automatically updates the lineage flows in the metadata repository.

Configuring automatic flow updates
By default, automatic update of lineage flows is enabled. Automatic update might affect system performance, especially when other tasks that you run use a lot of memory. Therefore, if you do not use the lineage feature, you can disable automatic update of lineage flows. Alternatively, you can disable the update for selected asset types only.
  1. Go to Administration > Lineage Management > Lineage Administration > Lineage Configuration.
  2. To disable automatic update for selected asset types, clear the check boxes for asset types that you want to skip.
  3. To disable automatic update for all asset types, in Do you want to enable automatic update of lineage flows?, select No.
Note: When you disable automatic update, the lineage flows are not recalculated and lineage reports are not updated. You can recalculate the flows manually by detecting lineage relationships. You can also enable automatic update again. In this case, the changes that were made since you disabled automatic update are analyzed and the flows are updated.
Information latency in lineage reports
Because flows are automatically analyzed and updated, any changes in the catalog might take some time before they are reflected in your lineage report. This delay applies to all assets whose content is relevant for lineage reports, such as:
  • Designs of jobs that are included for lineage
  • Operational metadata from runs of jobs that are included for lineage
  • Extension mapping documents
  • Mapping specifications that are included for lineage
  • Database views
  • BI reports and BI models
  • MDM models
  • Data connection mappings
  • Database schema identity mappings
  • Manual stage bindings
The maximum amount of latency is the sum of the following factors:
Polling interval
The minimum interval is 30 seconds. The default value is 5 minutes.
You can change the polling interval on the Lineage Administration page (Administration > Lineage Management > Lineage Administration > Lineage Configuration).
Number and complexity of changed assets
Flow publication of a change might take from 1 to 30 seconds for each asset, depending on the complexity of the asset.
Current publication request
The flow publication that is being serviced.
Number of publication requests in the queue
Flow publication occurs sequentially. As a result, a publication might be queued.
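
As a rough illustration of how the factors above add up, the following Python sketch estimates a worst-case latency. The function name and its parameters are hypothetical and are not part of the product. The time remaining for the current request and the time for queued requests are values that you would have to estimate yourself.

```python
def worst_case_latency_seconds(
    polling_interval_s: int = 300,         # default polling interval: 5 minutes
    changed_assets: int = 1,               # number of assets that changed
    per_asset_s: int = 30,                 # upper bound stated above: 1 to 30 seconds per asset
    current_request_remaining_s: int = 0,  # assumed time left on the request being serviced
    queued_requests_s: int = 0,            # assumed total time for requests queued ahead
) -> int:
    """Sum the latency factors to get a rough upper bound, in seconds."""
    if polling_interval_s < 30:
        raise ValueError("the minimum polling interval is 30 seconds")
    return (
        polling_interval_s
        + changed_assets * per_asset_s
        + current_request_remaining_s
        + queued_requests_s
    )


# Ten changed assets, default polling interval, empty queue: 300 + 10 * 30
print(worst_case_latency_seconds(changed_assets=10))  # 600
```

With the default polling interval and an empty queue, ten complex assets might therefore take up to about 10 minutes to appear in a lineage report.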

You can monitor pending and completed tasks on the Lineage Administration page (Administration > Lineage Management > Lineage Administration > Monitor Lineage Tasks). While tasks are pending, lineage reports might not reflect the latest changes in the catalog.

At times, a lineage task might stay in the Pending state for several days, either because the task is not valid or for some other reason. You can delete pending lineage tasks by clicking Delete in the Pending Tasks column on the Monitor Lineage Tasks page. By default, only tasks that were submitted at least 72 hours ago are deleted. To delete tasks that were submitted more recently, decrease the time interval by completing the following steps:
  1. Open a command-line window on the server where InfoSphere Information Governance Catalog is installed.
  2. In the command-line window, go to the installation_directory\ASBServer\bin directory, where installation_directory is the directory where IBM InfoSphere Information Server is installed.
  3. Run the following command. The value specifies the time interval in hours. In this example, the time interval is set to 1 hour, which means that tasks that were submitted at least 1 hour ago are deleted.
    • Linux, UNIX: ./iisAdmin.sh -set -key com.ibm.iis.gov.vr.setting.pendingArchivedTasksPurgeHours -value 1
    • Windows: iisAdmin.bat -set -key com.ibm.iis.gov.vr.setting.pendingArchivedTasksPurgeHours -value 1
In addition, lineage tasks that have been in the Pending state for several days are automatically deleted whenever InfoSphere Information Governance Catalog is started.
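
To make the effect of the purge interval concrete, the following Python sketch shows one reading of "submitted N hours before or earlier". The function and the sample timestamps are hypothetical. The product's actual comparison logic is not documented here, so treat this as an interpretation and not as the implementation.

```python
from datetime import datetime, timedelta


def is_purgeable(submitted_at: datetime, now: datetime, purge_hours: int = 72) -> bool:
    """Return True if the task was submitted at least purge_hours before now."""
    return submitted_at <= now - timedelta(hours=purge_hours)


now = datetime(2024, 1, 10, 12, 0)
old_task = datetime(2024, 1, 7, 11, 0)      # submitted 73 hours before now
recent_task = datetime(2024, 1, 10, 10, 0)  # submitted 2 hours before now

print(is_purgeable(old_task, now))                    # True with the default of 72 hours
print(is_purgeable(recent_task, now))                 # False with the default of 72 hours
print(is_purgeable(recent_task, now, purge_hours=1))  # True after the interval is set to 1
```

Under this reading, setting the value to 1 makes the 2-hour-old task eligible for deletion, whereas the default of 72 leaves it in place.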