Discovering assets (Watson Knowledge Catalog)
Discover assets to get insight about the quality and business content of tables and files analyzed from various data connections. You can choose between a quick scan and automated discovery.
When the number and size of assets are unknown and you need a quick view of their data quality, quick scan is the fastest method. Once you have a first-level understanding of the assets, you can take a subset of them, perhaps the most interesting or most useful, and run a deeper analysis by using automated discovery.
- Quick scan
- Automated discovery
- Supported connections
- Required user permissions and data quality project roles
Quick scan
Quick scan analyzes a sample of each table or file to quickly provide analysis results, including data quality score, and automatically assigned data classes and business terms. You can review the assets, any data quality information, and term and data class assignments from the quick scan results. After reviewing, you can publish the assets along with the other information to one or more catalogs.
Quick scan is best suited to get a fast initial analysis of large numbers of tables and files from data sources that you might not be familiar with. After quick scan completes, you can review the results and decide on which data assets you want to run a deeper analysis, expanding beyond the initial sampling.
Automated discovery
Automated discovery provides detailed analysis results of all assets from data sources. Unlike quick scan, with automated discovery the metadata and analysis results are automatically imported into the default catalog. The analysis results are available for viewing in a data quality project and include data quality score, automatically assigned data classes and business terms, data types, formats, frequency distributions, and more.
This type of discovery is suitable for smaller numbers of tables and files from data sources, or from subsets (schemas or file paths) of data sources. You can use automated discovery when you already have a general overview of the quality and business content of your data, and you want to see and review the additional details.
Supported connections
The following table lists the data sources from which quick scan or automated discovery can discover assets.
Important:
- If a platform connection is edited, changes to the description, the API key, the password, or the SSL certificate might not be synced to the discovery connection. This applies to MongoDB connections that were added to discovery before 4.0.3 and any connections to other data sources that were added to discovery before 4.0.2 if such connections were used in quick scan jobs and published to the default catalog. To update that information in a discovery connection manually, go to Catalog > Metadata import and edit the connection there.
- The list of available platform connections can contain connections that are configured to use Cloud Pak for Data credentials for authentication. You cannot use such connections in discovery jobs. If you select such a platform connection, an error message is displayed.
- The list of available platform connections can contain SSL-enabled connections that use a base64 encoded certificate. You cannot use such connections in discovery jobs. Connections with decoded certificates will work.
- When you select a platform connection that is configured to use secrets as credentials for discovery, these credentials are still treated as shared credentials. The connection can be used by any user with the permission to configure and run discovery jobs.
- For Apache HDFS data sources, the following types of credentials from a vault are not supported for discovery connections: Service principal, Keytab under Kerberos.
- Personal connection credentials are not supported. If the connection was created with personal credentials, the following credentials are used instead in discovery jobs:
  - If the connection was originally created in the default catalog, the credentials of the user who created the connection there
  - If the connection was originally created in Platform connections, the credentials of the user who added the connection to quick scan or automated discovery
- For Db2, Db2 on Cloud, and Db2 Warehouse data sources, SSL connections for discovery use the following security mechanisms:
  - Connect with username and password: ENCRYPTED_USER_AND_PASSWORD_SECURITY (securityMechanism=9)
  - Connect with API key: PLUGIN_SECURITY (securityMechanism=15)

  The data sources must be configured in the same way. Otherwise, an error occurs when the platform connection is added as a discovery connection and this message is returned:
  Security mechanism not supported
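The mapping between authentication method and security mechanism for Db2 SSL connections can be sketched as follows. This is a minimal illustration, not product code. Only the two securityMechanism values come from the list above. The host, port, and database name are hypothetical, and the URL layout assumes the property=value; convention and the sslConnection property of the IBM Data Server Driver for JDBC.

```python
from enum import IntEnum


class SecurityMechanism(IntEnum):
    """Security mechanisms that discovery uses for Db2 SSL connections."""
    ENCRYPTED_USER_AND_PASSWORD_SECURITY = 9
    PLUGIN_SECURITY = 15


def discovery_security_mechanism(uses_api_key: bool) -> SecurityMechanism:
    """Pick the mechanism that matches the authentication method."""
    if uses_api_key:
        return SecurityMechanism.PLUGIN_SECURITY
    return SecurityMechanism.ENCRYPTED_USER_AND_PASSWORD_SECURITY


def db2_ssl_jdbc_url(host: str, port: int, database: str, uses_api_key: bool) -> str:
    """Build an illustrative Db2 JDBC URL (assumed layout, hypothetical values)."""
    mechanism = discovery_security_mechanism(uses_api_key)
    return (
        f"jdbc:db2://{host}:{port}/{database}:"
        f"sslConnection=true;securityMechanism={int(mechanism)};"
    )


# Hypothetical host and database, username and password authentication.
print(db2_ssl_jdbc_url("db2.example.com", 50001, "BLUDB", uses_api_key=False))
```

Comparing the securityMechanism value produced here with the value configured on the data source is one way to check for a mismatch before the connection is added to discovery.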
| Data source | Connection type | Quick scan | Automated discovery | Synchronization¹ |
|---|---|---|---|---|
| Amazon DynamoDB | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Amazon Redshift | Generic JDBC | Platform connection | Platform connection | Not supported. Add asset to the catalog directly. |
| Amazon S3 (CSV files only) | Amazon S3 | Not supported | Connection created through metadata import² | Data assets; Connection |
| Apache Cassandra | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Apache Kudu | Generic JDBC | Platform connection. Include the actual values of any driver configuration options such as SSL options directly in the JDBC URL. Values that you define in the JDBC properties field and add to the JDBC URL as variables are not resolved. | Platform connection. Include the actual values of any driver configuration options such as SSL options directly in the JDBC URL. Values that you define in the JDBC properties field and add to the JDBC URL as variables are not resolved. | Not supported. Add asset to the catalog directly. |
| Data Virtualization⁴ | Db2 | Platform connection. Non-SSL connections only. | Platform connection. Non-SSL connections only. | Data assets; Connection (Db2 JDBC) |
| Data Virtualization Manager for z/OS | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Db2 | Db2 | Platform connection | Platform connection | Data assets; Connection (native and JDBC) |
| Db2 Warehouse | Db2 Warehouse | Platform connection | Platform connection | Data assets; Connection |
| Google BigQuery | JDBC | Not supported | Connection created through metadata import² | Data assets |
| HDFS | Apache HDFS | Not supported | Platform connection or created through metadata import². Additional considerations apply for this type of connection. For more information, see Known issues with Hive or HDFS connections for data discovery. | Data assets |
| Hive | Apache Hive | Platform connection or created through metadata import². Additional considerations apply for this type of connection. For more information, see Known issues with Hive or HDFS connections for data discovery. | Platform connection or created through metadata import². Additional considerations apply for this type of connection. For more information, see Known issues with Hive or HDFS connections for data discovery. | Data assets; Connections |
| Microsoft Azure Data Lake Store | Microsoft Azure Data Lake Store | Not supported | Connection created through metadata import² | Data assets |
| Microsoft SQL Server | Microsoft SQL Server | Platform connection | Platform connection | Data assets; Connection (JDBC) |
| MongoDB | MongoDB | Platform connection | Platform connection | Data assets |
| MySQL (Enterprise Edition) | ODBC | Not supported | Connection created through metadata import² | Data assets |
| Netezza Performance Server | Generic JDBC | Platform connection | Platform connection | Data assets |
| Netezza Performance Server | ODBC or IBM Netezza Connector | Not supported | Connection created through metadata import² | |
| Oracle | Oracle | Platform connection | Platform connection | Data assets; Connections (JDBC) |
| Greenplum (Greenplum) | ODBC | Not supported | Connection created through metadata import² | Data assets |
| PostgreSQL | JDBC | Connection created through metadata import². Additional configuration is required for publishing results.³ | Connection created through metadata import² | Data assets |
| SAP ASE | ODBC | Not supported | Connection created through metadata import² | Data assets |
| SAP HANA | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Snowflake | Generic JDBC | Platform connection | Platform connection | Data assets; Connections |
| Teradata | Generic JDBC | Platform connection | Platform connection | Data assets; Connections |
Table notes:
1) For automated discovery, information assets are synchronized from the Information assets view to the default catalog. For quick scan, assets are synchronized from the default catalog to the Information assets view. The following assets are synchronized:
- Tables and their associated columns
- Files and their associated columns
- Connections as indicated for the data source
For more information, see Information assets view.
2) Legacy metadata import must be enabled. For more information about creating such connections, see Creating metadata import connections for discovery.
3) To publish quick scan results from a PostgreSQL connection created through metadata import, you must define a platform connection with exactly the same name (case-sensitive), selecting Third party: PostgreSQL as the type of data source. Otherwise, publishing fails with the error Connection not found. This platform connection is for publishing purposes only. You cannot use it for discovery.
4) If a workaround is applied, the connection type Data Virtualization and the connection in the default catalog can be used. However, only non-SSL connections are supported.
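The exact-name requirement in table note 3 can be illustrated with a short check. This is a sketch under stated assumptions: the function name, its return strings, and the connection names are hypothetical, and the product does not expose this check. It only shows why a name that differs in letter case does not count as a match.

```python
def check_publishing_connection(discovery_name: str, platform_names: list[str]) -> str:
    """Report whether a platform connection matches the discovery connection name.

    The match must be exact, including letter case (hypothetical helper).
    """
    if discovery_name in platform_names:
        return "ok"
    # A name that differs only in case is a likely cause of "Connection not found".
    near_misses = [n for n in platform_names if n.casefold() == discovery_name.casefold()]
    if near_misses:
        return f"case mismatch: {near_misses[0]!r}"
    return "missing"


# Hypothetical connection names.
print(check_publishing_connection("Sales_PG", ["sales_pg", "Inventory_PG"]))
```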
In addition to the specific JDBC connections listed, automated discovery can be run on other generic JDBC connections to additional data sources. Such connections are synced to the default catalog with a placeholder for the JDBC driver name. Therefore, you must edit the respective connection asset in the default catalog and update the driver name with the driver name specified in the discovery connection. Go to the default catalog, open the connection asset, and edit the connection from the Asset tab.
If the connection was not created as a platform connection and then added for discovery, you must also upload the required JDBC driver JAR file to the default catalog when you update the driver name.
Required user permissions and data quality project roles
To run automated discovery or quick scan, you need the following user permission and data quality project roles:
- Manage asset discovery
- The Data Operator role in the data quality project referenced in the discovery job
- The Business Analyst role in the data quality project referenced in the discovery job
Permissions required for working with discovery jobs and results are as follows:
| Action | Permission | Role |
|---|---|---|
| View discovery jobs | Access data quality or Manage asset discovery | Owner of the discovery job, isadmin user, or Data Steward role or both the Data Operator and the Business Analyst roles in the data quality project referenced in the discovery job |
| Cancel discovery jobs | Manage asset discovery | Owner of the discovery job, isadmin user, or Data Operator and Business Analyst roles in the data quality project referenced in the discovery job |
| Delete discovery jobs | Manage asset discovery | Owner of the discovery job, isadmin user, or Data Operator and Business Analyst roles in the data quality project referenced in the discovery job |
| Rerun discovery jobs | Manage asset discovery | Owner of the discovery job, isadmin user, or Data Operator and Business Analyst roles in the data quality project referenced in the discovery job |
| Review and publish analysis results | Access data quality or Manage asset discovery | Business Analyst or Data Steward role in the data quality project referenced in the discovery job. To publish quick scan results, you must also be a collaborator with the Admin or the Editor role in the catalog to which you want to publish. For automated discovery results, the required catalog collaborator role depends on the configuration. |
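The rules in this table can be read as simple predicates. The sketch below is an illustration only, not a product API: the User type and the function names are hypothetical. It assumes that a user needs both a listed permission and a listed role (or job ownership, or the isadmin user), and that "Data Operator and Business Analyst roles" means holding both roles. Catalog collaborator roles for publishing are left out.

```python
from dataclasses import dataclass

MANAGE = "Manage asset discovery"
ACCESS = "Access data quality"


@dataclass(frozen=True)
class User:
    """Hypothetical view of a user in the context of one discovery job."""
    permissions: frozenset
    roles: frozenset          # roles in the referenced data quality project
    owns_job: bool = False
    is_isadmin: bool = False


def _operator_and_analyst(user: User) -> bool:
    return {"Data Operator", "Business Analyst"} <= user.roles


def can_view(user: User) -> bool:
    has_permission = bool({ACCESS, MANAGE} & user.permissions)
    has_role = (user.owns_job or user.is_isadmin
                or "Data Steward" in user.roles or _operator_and_analyst(user))
    return has_permission and has_role


def can_cancel_delete_or_rerun(user: User) -> bool:
    has_role = user.owns_job or user.is_isadmin or _operator_and_analyst(user)
    return MANAGE in user.permissions and has_role


def can_review_results(user: User) -> bool:
    has_permission = bool({ACCESS, MANAGE} & user.permissions)
    has_role = bool({"Business Analyst", "Data Steward"} & user.roles)
    return has_permission and has_role


steward = User(frozenset({ACCESS}), frozenset({"Data Steward"}))
print(can_view(steward), can_cancel_delete_or_rerun(steward), can_review_results(steward))
```

Under these assumptions, a Data Steward with only the Access data quality permission can view jobs and review results but cannot cancel, delete, or rerun jobs.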
Learn more
- Running automated discovery
- Running a quick scan
- Information assets view
- Creating metadata import connections for discovery