How Do Scanners Work?

Glossary of Terms

What Is a Scan?

IBM Automatic Data Lineage defines a scan as the overall process of capturing, analyzing, loading, and merging lineage metadata into the Manta repository. Scans are typically run against a set of connections for the selected technologies (such as all Oracle databases), which Manta Administrators configure into workflows in Manta Process Manager. Execution is flexible: a scan can be invoked via a shell script on the command line, the REST API, or the graphical user interface. Sites generally decide how frequently to run scans based on user needs and the change dynamics of their code and metadata. Each scan results in a revision (see Understanding Revisions for Administrators and Understanding Revisions for more details) that encapsulates what was captured, and when, as part of the scan. Automatic Data Lineage looks exclusively at metadata (code, schema definitions, etc.) when performing lineage analysis. It generally does not require or use direct access to any formal data when performing its parsing operations; for the privileges required by individual scanners, see the respective scanner integration requirements pages under Databases and Storage.

What Is an Export?

You can also take advantage of the export feature, which runs either a technology-specific export (licensed by IBM) or a more generic export. These export capabilities support third-party or home-grown solutions where Manta lineage metadata is loaded into other complementary systems. Quite often, these are data catalog solutions, but there are many other cases where Manta lineage metadata is used to augment other applications. Once configured, the export phase obtains the selected Manta metadata and either packages it into a set of CSV files (in the generic case) or places it into proprietary API calls for a third party (usually a REST API payload in XML or JSON) and makes those calls on behalf of the Automatic Data Lineage user or site. This loads the lineage metadata into the other application, where it can be used for additional purposes.

Phases of a Scan

A scan is organized into phases that run in sequence: first the required metadata is extracted from the source systems, then it is analyzed, harvested, and persisted into the Manta repository, and finally the metadata and lineage for the last revision are exported into independent solutions. Each phase consists of scenarios that perform specific actions.

The complete flow of the scan has three phases as illustrated below. Predefined workflows make it easy to run a complete analysis with all three phases (the Run All workflow) or just to run a selected phase (e.g., Run Extract or Run Analyze). Users also have the option to control what will be executed in the scan by creating their own workflows in Manta Process Manager.

Three phases of the scan: Extract, Analyze, Export

There are many ways these options can be utilized. For example, you may choose to run the extraction phase independently of the analysis phase, running the extraction only on selected connections that have just been refreshed (to speed up runtime) and then running the analysis on everything — the refreshed connections as well as those that were previously extracted. Similarly, you can review the lineage in Automatic Data Lineage to ensure it meets expectations before pushing it into third-party applications.
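The selective use of phases described above can be sketched in a few lines of Python. This is a minimal illustration, not Manta Process Manager itself: `ScanState`, `run_workflow`, and the connection names are hypothetical, and the real product configures workflows through its own tooling.

```python
from dataclasses import dataclass, field

PHASES = ("extract", "analyze", "export")  # fixed order of the three phases


@dataclass
class ScanState:
    """Hypothetical in-memory stand-in for extracted inputs and results."""
    extracted: dict = field(default_factory=dict)  # connection -> times extracted
    analyzed: list = field(default_factory=list)   # connections in last analysis
    exported: list = field(default_factory=list)   # connections in last export


def run_workflow(state, phases, connections):
    """Run the requested phases in canonical order and return what ran."""
    log = []
    for phase in PHASES:
        if phase not in phases:
            continue
        if phase == "extract":
            # Only the listed connections are refreshed.
            for conn in connections:
                state.extracted[conn] = state.extracted.get(conn, 0) + 1
        elif phase == "analyze":
            # Analysis covers everything extracted so far, old and new.
            state.analyzed = sorted(state.extracted)
        elif phase == "export":
            state.exported = list(state.analyzed)
        log.append(phase)
    return log


state = ScanState()
run_workflow(state, {"extract", "analyze", "export"}, ["oracle_dwh", "mssql_crm"])
run_workflow(state, {"extract"}, ["oracle_dwh"])  # refresh one connection only
run_workflow(state, {"analyze"}, [])              # analyze everything again
```

After the last call, the analysis covers both connections even though only `oracle_dwh` was extracted a second time, and the export results are left unchanged until an export is requested explicitly.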

What Is a Scenario?

Each step of the scan process described above is called a scenario. Scenarios within the same phase can run in parallel, controlled by Automatic Data Lineage itself based on the resources available. The details below describe the activities in each phase as they relate to specific technologies. This is all managed for you by Manta Process Manager, but it is described here to give you a deeper understanding of what occurs as Automatic Data Lineage retrieves and analyzes your lineage metadata.
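To make the relationship between phases and scenarios concrete, the following sketch runs the scenarios of one phase concurrently while keeping the phases themselves strictly sequential. It is an assumption-laden illustration: `run_phase`, `run_scan`, the scenario names, and the fixed `max_workers` value are hypothetical, whereas the product chooses its degree of parallelism from the resources available.

```python
from concurrent.futures import ThreadPoolExecutor


def run_phase(scenarios, max_workers=4):
    """Run all scenarios of one phase concurrently; return results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in scenarios.items()}
        # result() blocks until that scenario finishes and re-raises its error.
        return {name: future.result() for name, future in futures.items()}


def run_scan(phases):
    """Run phases one after another; a phase starts only when the previous ends."""
    return [(name, run_phase(scenarios)) for name, scenarios in phases]


results = run_scan([
    ("extract", {"oracle_extract": lambda: "done", "mssql_extract": lambda: "done"}),
    ("analyze", {"oracle_analyze": lambda: "done"}),
])
```

Because `run_phase` returns only after every scenario has produced a result, no analysis scenario can start before all extraction scenarios have finished.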

Phase 1: Extraction

Metadata is extracted from the host system. Where possible, this is done via a direct connection, which ensures that the metadata being retrieved is the most current and relevant and reflects how the system is running today. The connection usually goes through APIs, but the mechanism depends on the technology. For most databases, the database catalog is accessed directly via a JDBC connection. The extract phase first retrieves database dictionary information (primarily tables and columns) and then picks up the assets that define lineage. These are typically views and stored procedures for a database, but they might be external data access steps and transformation modules for an ETL (extract/transform/load) tool. Business intelligence (reporting) solutions deliver lineage details about their queries along with the columns in the report and their potential transformations. The extracted metadata is then ready for analysis.
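The two-step catalog read described above (dictionary first, lineage-defining assets second) can be illustrated with Python's built-in `sqlite3` module. This is only a sketch of the general idea: Manta's scanners use JDBC and technology-specific catalogs, and `extract_catalog` is a hypothetical helper. Note that only catalog metadata is queried; no table rows are read.

```python
import sqlite3


def extract_catalog(conn):
    """Return (dictionary, views) read from SQLite's own catalog tables."""
    dictionary = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        dictionary[table] = [column[1] for column in columns]
    # Views carry the SQL text that defines lineage between objects.
    views = dict(conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'view' ORDER BY name"
    ).fetchall())
    return dictionary, views


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE VIEW v_customer AS SELECT id, name FROM customer;
""")
dictionary, views = extract_catalog(conn)
```

Here `dictionary` maps each table to its column names, and `views` maps each view name to its defining SQL, which is the kind of input an analysis step would later parse.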

Metadata can also be ingested manually via an Agent’s filesystem or through a Git Ingest connection, in which an Agent downloads the files and folders from a Git repository and uses them as part of the extracted metadata.

For more detailed information on which specific scenarios are run for each scanner during the extraction phase, check out our list of scanner guides. Click on the name of the relevant scanner to read about its specific extraction scenarios.

Phase 2: Analysis

Each Manta scanner is developed only after extensive research into the selected technology. This research yields a deep understanding of how a particular syntax is derived, what constitutes a source or a target, and how data transformations are defined and stored. Specific dialects for certain languages are explored (such as the subtle differences between SQL-based solutions), and important structural issues are reviewed (a tool or technology’s folder, project, and process structure). This allows the scanner, when implemented, to outline the exact flow of data through the selected tool or technology. With every scanner, Automatic Data Lineage strives to provide detailed insight for data flows at the individual column or element level, while also capturing exact transformation syntax. During the analysis phase, the dictionary of column information (created earlier during extract) supports validation and proper column usage.

At the beginning of the analysis phase, a new revision scenario is called, which creates a new revision in the internal metadata repository so it is ready to accept new metadata.

For more detailed information on which specific scenarios are run for each scanner during the analysis phase, check out our list of scanner guides. Click on the name of the relevant scanner to read about its specific analysis scenarios.

At the end of the analysis phase, three scenarios are called.

Repository post-processing scenario — performs the post-processing of metadata in the internal metadata repository.
Commit revision scenario — commits the current revision in the metadata repository so the new metadata is accessible to users.
Prune revision scenario — removes old revisions from the metadata repository.
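
The revision lifecycle around the analysis phase (new revision at the start, then post-processing, commit, and prune at the end) can be sketched as follows. Everything here is hypothetical: the `Repository` class, its method names, and especially the "keep the newest N revisions" pruning rule are assumptions made for illustration, since the text above states only that old revisions are removed.

```python
class Repository:
    """Hypothetical in-memory repository holding numbered revisions."""

    def __init__(self, keep=2):
        self.keep = keep           # assumed retention: newest `keep` revisions
        self.committed = {}        # revision number -> metadata
        self.open_revision = None  # (number, metadata) while analysis runs
        self._next = 1

    def new_revision(self):
        self.open_revision = (self._next, {})
        self._next += 1
        return self.open_revision[0]

    def load(self, metadata):
        self.open_revision[1].update(metadata)

    def post_process(self):
        # Placeholder only; the actual post-processing is not described here.
        self.open_revision[1]["post_processed"] = True

    def commit(self):
        number, metadata = self.open_revision
        self.committed[number] = metadata  # now visible to users
        self.open_revision = None

    def prune(self):
        numbers = sorted(self.committed)
        surplus = max(len(numbers) - self.keep, 0)
        for number in numbers[:surplus]:
            del self.committed[number]


def analysis_phase(repo, metadata):
    """Mirror the scenario order: new revision, load, post-process, commit, prune."""
    repo.new_revision()
    repo.load(metadata)
    repo.post_process()
    repo.commit()
    repo.prune()


repo = Repository(keep=2)
for scan_number in range(3):
    analysis_phase(repo, {"scan": scan_number})
```

With the assumed retention of two revisions, three consecutive analysis runs leave only revisions 2 and 3 in `repo.committed`.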

Phase 3: Export

For the export phase, there are several scenarios. Each is used independently, depending on the requirements.

Open Manta basic export scenario — exports all metadata from the internal metadata repository to general CSV files suitable for import to any relational database.
Open Manta integration export scenario — exports all metadata from the internal metadata repository to CSV files designed specifically for ingestion to external applications and solutions.
Alation/Collibra/EDC/IGC — specialized licensed exports that push Manta lineage metadata to the respective vendor solution repositories.
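
As a rough picture of what a generic CSV export involves, the sketch below writes lineage edges to CSV with Python's standard `csv` module. The column layout (`source`, `target`, `transformation`) and the function `export_edges_to_csv` are hypothetical and do not represent the actual Open Manta file format, which is defined by the product.

```python
import csv
import io


def export_edges_to_csv(edges, out):
    """Write lineage edges as CSV rows preceded by a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["source", "target", "transformation"])
    for edge in edges:
        writer.writerow([edge["source"], edge["target"], edge["transformation"]])


edges = [
    {"source": "customer.name", "target": "v_customer.name", "transformation": "SELECT name"},
    {"source": "customer.id", "target": "v_customer.id", "transformation": "SELECT id"},
]
buffer = io.StringIO()
export_edges_to_csv(edges, buffer)
csv_text = buffer.getvalue()
```

A file written this way can be loaded into a relational table with any standard CSV import tool, which is the use case the basic export targets.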