Cross-platform lineage explained
Overview
In complex data environments, data flows across multiple platforms and systems; for example, ETL tools usually implement data pipelines that move data from one database to another. Similarly, reporting tools present aggregated data that comes from a database or data lake. To connect the lineage across different technologies and provide complete visibility into the data pipeline, IBM Manta Data Lineage needs to understand and correctly map the (database) connections that ETL, reporting, and modeling tools point to.
Manta Data Lineage implements several heuristics to automatically map the connection based on the information available in the connection that comes from the ETL, reporting, or modeling tool itself and the details available in Manta Data Lineage database connection. If the selected database was scanned by Manta Data Lineage but heuristics cannot match it to the appropriate connection, a user-defined alias mapping override is needed. The best way to do this is by using Automapping UI. These cases often manifest in the following ways only after a Manta Data Lineage scan is done.
-
Disconnected lineage in Manta Flow Viewer
-
Deduced or duplicate objects that are displayed in Manta Flow Viewer
-
Messages in the dataflow logs. For examples, see User-defined alias mapping.
To establish lineage across these systems, Manta Data Lineage uses three mapping techniques:
- Dictionary mapping
- Automapping
- User-defined alias mapping
These methods work by connecting calling systems (system from which the connection is established, such as ETL, reporting, or modeling tools) to receiving systems (like databases).
The following diagram contains an example of connecting the calling and receiving systems.
Dictionary mapping is a fully automated process and is the initial method used to establish a connection between calling and receiving systems. For details about technology-specific mapping algorithms, see Specific Mapping Algorithms.
When a connection object cannot be automatically matched to a scanned receiving system by using dictionary mapping, alias mapping is used if it is defined. Automapping UI contains a list of suggested user-defined alias mappings to be defined by users. Review the list first before you manually create alias mappings.
Terminology
Calling system : The system that initiates a connection to another system, typically to read from or to write to it. It can be for example an ETL tool or a Business Intelligence (BI) platform, or a database, in case of Database links or Linked servers.
Receiving system : The system that receives the connection that was initiated by the calling system. It is usually a database or data storage system.
Connection object : A collection of properties (such as technology type, connection string, server name, database name) that the calling system uses to establish a connection to the receiving system. These properties are logged during analysis and are critical for mapping the alias to the correct system. For details about the connection object, see Detailed Information About Connection Mappings.
Connection ID : The specific identifier within the connection object that is used for alias mapping. It is typically the connection string or another unique attribute that Manta Data Lineage uses to match the alias with a scanned system. It links the calling system (PowerBI) to the correct receiving system (SQL Server) within Manta Data Lineage.
Automapping (as of R42)
When automapping is used, schemas and tables in the query are compared with those in the scanned dictionaries. If a match is found, the calling system is temporarily linked to the matching receiving system, and the connection is automatically mapped.
Automapping is used only when a single, clear match is found, which makes the process reliable, and the risk of false positives is minimized.
Limitations of Automapping:
- Automapping impacts performance by increasing the processing load during scans as it dynamically evaluates and matches schemas and tables.
- Automapping is run for each table and each query separately, which means that it might result in a different receiving system match even though they are referred by the same source system connection.
- When a new database connection is added, it might change the results of automapping. For example, there is no longer a single match found, and as a result automapping is not run.
Automapping is disabled by default in versions R42.0 - R42.6. You can enable it in Configuration > CLI > Common > Common Config > Enable dictionary automapping.
Starting in version R42.7, automapping is enabled by default.
Automapping UI (as of R42.7)
The Automapping UI provides a centralized view of all aliases that Manta Data Lineage detected that do not have a user-defined mapping. It contains information about whether aliases were automatically mapped to receiving systems or not. You can also find there the alias that is used in the connection ID field.
You can access automapping in Manta Admin UI.
The table contains details about aliases found during a scan. The Status column contains information about the status of each alias:
- Automapped - The alias was successfully matched to a receiving system during the last scan.
- Not Mapped - Either a match for the alias was not found, or two or more systems matched.
- Review - Aliases were mapped to multiple receiving systems, or have mixed results. It requires further investigation.
The Count column shows the number of unique occurrences where the scenario + calling system + alias + status combination was detected during the scan. Start creating user-defined aliases for the entries that have the highest count number as those resolve most of the cross-system lineage gaps.
Each time the connection associated with an alias is scanned, the status, count, and receiving system details are updated. If a user defines an alias mapping for that alias and rescans the connection, the corresponding row is removed.
Configuring user-defined alias mappings
After each scan, review the automapping results and configure user-defined alias mappings.
- Review automapping results. In case of the Review status, select the correct receiving system for that particular connection ID.
- Configure user-defined alias mappings for all aliases that you identified during the review. These are usually all aliases with the Automapped status and the selected aliases with the Review status.
- Copy alias details from the Alias column.
- Click the Alias Mapping button to open existing alias mapping section.
- Create a new alias mapping. Use the connection details that you copied earlier to create a new user-defined alias mapping. Alternatively, copy the connection details from the log messages.
A connection ID cannot be mapped to multiple systems, because later only one randomly selected mapping is used.
- Rescan the calling system to map every instance of the alias to the appriopriate receiving system. Rescanning also removes all rows related to the alias from the Automapping UI table.
Example scenarios for automapping
Scenario 1
- Context - Microsoft Power BI (PowerBI) loads data from Oracle HR (OracleHR) and Microsoft SQL Server (MSSQL).
- Automapping results - Manta Data Lineage automatically maps Power BI to Oracle HR but cannot find a match for Microsoft SQL Server.
- Data displayed in the table
- One row showing PowerBI → MyOracleAlias → OracleHR with a status Automapped.
- One row showing PowerBI → MyMSSQLAlias → [blank] with a status Not Mapped.
- Action - Copy the Microsoft SQL Server alias details and create a user-defined alias mapping in the alias mapping screens.
Scenario 2
- Context - Informatica Power Center (IFPC) writes data to Oracle Finance by using Alias_OraFin and OracleClaims by using Alias_OraClaims.
- Automapping results - Manta Data Lineage incorrectly automatically maps IFPC (Alias_OraFin) to both Oracle Finance and Oracle HR, but cannot find a match for Alias_OraClaims.
- Data displayed in the table
- A summary row showing IFPC → Alias_OraFin with a status of Review. When you expand this row, the following rows are displayed:
- One for Alias_OraFin → OracleHR (status: Automapped)
- One for Alias_OraFin → OracleFinance (status: Automapped)
- One row showing IFPC → Alias_OraClaims → [blank] with a status Not Mapped.
- A summary row showing IFPC → Alias_OraFin with a status of Review. When you expand this row, the following rows are displayed:
- Action - Copy the Alias_OraFin details and create an alias mapping to OracleFinance. Copy the Alias_OraClaims details and create an alias mapping to OracleClaims.
User-defined alias mapping
You manually define the alias mappings by explicitly associating the connection ID (alias) with the appropriate database (identified by its dictionary ID). You can define alias mappings in Admin UI > Configuration > CLI > Target Database technology
> Alias Mapping.
Before you start defining alias mappings manually, review the suggestions on Automapping UI.
To configure mappings manually, use the connectionString
value from the following log message in the Connection ID field for the connection that you want to be recognized under this name. You can use serverName
if connectionString
is null.
2022-07-11 17:42:42.388 [CLI] 0 WARN eu.profinit.manta.dataflow.generator.common.query.impl.DataflowQueryServiceImpl [Context: \ZenDesk Metrics.pbix]
DATABASE_CONNECTION_ERRORS CANNOT_FIND_SYSTEM_FOR_CONNECTION "ODBC dsn=ENT_ODS"
User message: Cannot find any matching system for Connection [type=ODBC, connectionString=dsn=ENT_ODS, serverName=null, databaseName=null, schemaName=null, userName=null].
Technical message: Cannot find any matching system for Connection [type=ODBC, connectionString=dsn=ENT_ODS, serverName=null, databaseName=null, schemaName=null, userName=null].
Solution: Identify the particular database technology that "dsn=ENT_ODS" points to and if the technology is supported by Manta, configure it as a new connection or configure the manual mapping for that database technology in Manta Admin UI.
Impact: SINGLE_INPUT
2022-05-11 20:00:32.769 [main] WARN eu.profinit.manta.connector.mssql.MssqlReaderProvider [Context: \IMEDW\wf_ZenDesk_Metrics.xml]
DICTIONARY_MANTA_MAPPING_ERRORS NO_MAPPING_FOR_CONNECTION
User message: Connection in use could not be automatically mapped to one of the database connections configured in Manta.
Technical message: There is no mapping for the connection Connection [type=MSSQL, connectionString=null, serverName=CTC2EDDB01, databaseName=EDW, schemaName=null, userName=].
Solution: Identify the particular database technology MSSQL leading to "null" and configure it as a new connection or configure the manual mapping for that database technology in Manta Configurator.
Impact: SINGLE_INPUT