Databricks Integration Requirements
The following are the prerequisites necessary for IBM Automatic Data Lineage to connect to this third-party system, which you may choose to do at your sole discretion. Note that while these are usually sufficient to connect to this third-party system, we cannot guarantee the success of the connection or integration since we have no control, liability, or responsibility for third-party products or services, including their performance.
The Manta Databricks scanner uses the Databricks API to connect to the Databricks instance. The Automatic Data Lineage instance must have network access to the Databricks API (hosted by Databricks). To access the Databricks API, it is necessary to provide a personal access token (PAT). The token can be obtained through the Databricks UI and is used to authenticate the Manta Databricks scanner to the scanned Databricks instance. To extract all assets, the simplest option is for the user that the PAT belongs to to be a metastore admin. Otherwise, the following privileges are needed to extract individual entities.
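As a minimal sketch of how the PAT authenticates API calls, the following builds a bearer-token request against the Unity Catalog catalogs endpoint; the workspace URL and token shown are placeholders, and this is an illustration of the connectivity prerequisite, not the scanner's own code:

```python
import json
import urllib.request


def catalogs_request(host: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the Unity Catalog catalogs list."""
    url = f"{host}/api/2.1/unity-catalog/catalogs"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def list_catalogs(host: str, token: str) -> list:
    """Fetch the catalogs visible to the PAT owner (requires network access)."""
    with urllib.request.urlopen(catalogs_request(host, token)) as resp:
        return json.load(resp).get("catalogs", [])


if __name__ == "__main__":
    # Hypothetical workspace URL and token; substitute your own values.
    req = catalogs_request("https://adb-1234567890123456.7.azuredatabricks.net",
                           "dapi-EXAMPLE-TOKEN")
    print(req.full_url)
```

If the request fails with HTTP 401/403, the token is missing the required privileges; a network error indicates the Automatic Data Lineage instance cannot reach the Databricks API.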
- Catalogs — Only catalogs that the user owns or on which the user has the USE_CATALOG privilege are extracted
- Schemas — Only schemas that the user owns or on which the user has the USE_SCHEMA privilege are extracted; the user must also be the owner of the parent catalog or have the USE_CATALOG privilege on it
- Tables/Views — Only tables/views that the user owns or on which the user has the SELECT privilege are extracted; the user must also be the owner of (or have USE_CATALOG on) the parent catalog and the owner of (or have USE_SCHEMA on) the parent schema
- Functions — Only functions that the user owns or on which the user has the EXECUTE privilege are extracted; the user must also be the owner of (or have USE_CATALOG on) the parent catalog and the owner of (or have USE_SCHEMA on) the parent schema
- Notebooks — Only notebooks that the user owns or has (at least) view access to are extracted
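For a non-admin PAT owner, the per-entity privileges above can be granted with Databricks SQL GRANT statements. The sketch below generates them for one catalog and schema; the function, principal, and object names are illustrative assumptions:

```python
def minimal_grants(principal: str, catalog: str, schema: str,
                   tables: list, functions: list) -> list:
    """Emit Databricks SQL GRANT statements covering the scanner's
    per-entity extraction privileges (USE CATALOG, USE SCHEMA,
    SELECT, EXECUTE) for a non-admin PAT owner."""
    stmts = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
    ]
    stmts += [f"GRANT SELECT ON TABLE {catalog}.{schema}.{t} TO `{principal}`;"
              for t in tables]
    stmts += [f"GRANT EXECUTE ON FUNCTION {catalog}.{schema}.{f} TO `{principal}`;"
              for f in functions]
    return stmts


# Hypothetical scanner principal and objects:
for stmt in minimal_grants("scanner@example.com", "main", "sales",
                           ["orders"], ["mask_email"]):
    print(stmt)
```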
Hive metastore assets are also accessed with the personal access token. Here, the SELECT privilege is needed for all the assets to be extracted (e.g., schemas, tables). A Hive JDBC driver must be placed in the <MANTA_AGENT_DIR_HOME>/manta-flow-agent-dir/lib-ext folder; otherwise, the extraction from Hive Metastore won't be performed. For more information, go to IBM Support. As of Automatic Data Lineage R42, if the driver is not provided, the extraction will always produce an error reminding the user about the missing driver. If the driver was intentionally not provided — for example, if nothing from Hive Metastore should be extracted — the hive_metastore catalog should be included in the excluded catalogs list in the connection configuration.
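The effect of an excluded-catalogs list can be illustrated as a simple filter over the catalogs returned by the API; this is only an illustration of the exclusion concept, not the scanner's actual implementation, and the case-insensitive matching is an assumption:

```python
def catalogs_to_extract(all_catalogs: list, excluded: list) -> list:
    """Drop catalogs named in the exclusion list (compared case-insensitively)."""
    excluded_lower = {c.lower() for c in excluded}
    return [c for c in all_catalogs if c.lower() not in excluded_lower]


# With hive_metastore excluded, only Unity Catalog catalogs remain:
print(catalogs_to_extract(["hive_metastore", "main", "dev"], ["hive_metastore"]))
```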
Requirements to extract Unity Catalog Lineage
- The workspace must have Unity Catalog enabled (https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-your-workspace-for-unity-catalog).
- Tables must be registered in a Unity Catalog metastore.
- Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces.
- To view lineage for a table or view, users must have the SELECT privilege on the table or view.
- To view lineage information for notebooks and workflows, users must have permissions on these objects as defined by the access control settings in the workspace.
- To view lineage for a Unity Catalog-enabled workflow, you must have CAN_VIEW permissions on the pipeline.
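Once these requirements are met, Unity Catalog lineage can be read over the lineage-tracking REST API. The sketch below builds such a request; the exact endpoint path and parameters should be checked against Databricks' data-lineage API documentation, and the table name is a placeholder:

```python
import urllib.parse
import urllib.request


def table_lineage_request(host: str, token: str,
                          table_name: str) -> urllib.request.Request:
    """Build a GET request for upstream/downstream lineage of one table,
    assuming the lineage-tracking endpoint of the Databricks REST API."""
    query = urllib.parse.urlencode({
        "table_name": table_name,              # three-level name: catalog.schema.table
        "include_entity_lineage": "true",      # also return notebook/job entities
    })
    url = f"{host}/api/2.0/lineage-tracking/table-lineage?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    # Hypothetical workspace URL, token, and table name:
    req = table_lineage_request("https://example.cloud.databricks.com",
                                "dapi-EXAMPLE-TOKEN", "main.sales.orders")
    print(req.full_url)
```

The caller only sees lineage for objects it is permitted to read, per the SELECT and CAN_VIEW requirements above.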
Supported Extraction Features
- Fetching of information about notebooks in the Databricks instance through Databricks APIs
- Fetching of information about standalone queries in the Databricks instance through Databricks APIs
- Fetching of information about database assets (e.g., tables, views, functions) in the Databricks instance through Databricks APIs
- Extraction of dictionaries from Hive Metastore
- Extraction of dictionaries from Unity Catalog
Supported Data Flow Analysis Features
- Visualization of lineage information for Unity Catalog views and functions by scanning extracted SQL definitions
- Visualization of lineage information for standalone queries as returned by the Unity Catalog API
- Visualization of lineage information for notebooks as returned by the Unity Catalog API
- Visualization of lineage information for jobs and workflows as returned by the Unity Catalog API
Supported SQL Features
- SELECT, INSERT, and basic expression handling
- CREATE|ALTER|DROP CATALOG
- CREATE|ALTER|DROP SCHEMA
- CREATE|ALTER|DROP DATABASE
- CREATE|ALTER|DROP TABLE
- CREATE|ALTER|DROP VIEW
- USE CATALOG
- USE DATABASE
- USE SCHEMA
- UPDATE — see https://docs.databricks.com/sql/language-manual/delta-update.html
- LOAD DATA — see https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-dml-load.html
- CREATE FUNCTION ... AS ... USING JAR
- CREATE FUNCTION ... RETURNS ... — see https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
- Basic parsing and resolution of lambda function parameters
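To illustrate the last item, Databricks SQL passes lambdas such as `x -> x + 1` or `(acc, x) -> acc + x` to higher-order functions like transform() and reduce(). The toy extractor below shows what "basic parsing of lambda function parameters" means; it is an illustration, not the scanner's actual parser:

```python
import re

# Matches lambda headers in Databricks SQL higher-order function calls:
# either a single bare parameter ("x ->") or a parenthesized list ("(acc, x) ->").
_LAMBDA_HEAD = re.compile(r"(?:\(([^()]*)\)|(\w+))\s*->")


def lambda_parameters(sql_expr: str) -> list:
    """Return the parameter-name list of each lambda found in a SQL expression."""
    params = []
    for grouped, single in _LAMBDA_HEAD.findall(sql_expr):
        if single:
            params.append([single])
        else:
            params.append([p.strip() for p in grouped.split(",")])
    return params


print(lambda_parameters("transform(arr, x -> x + 1)"))           # [['x']]
print(lambda_parameters("reduce(arr, 0, (acc, x) -> acc + x)"))  # [['acc', 'x']]
```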
Known Unsupported Features
Automatic Data Lineage does not support the following Databricks features. This list includes all of the features that IBM is aware are unsupported, but it might not be comprehensive.
- Scanning of notebook Scala commands (depending on the use case, the OpenLineage Scanner could be a solution)
- Scanning of notebook R commands
- Analysis of custom libraries
- Extraction of definitions of functions and views from Hive Metastore