Data governance (Watson Knowledge Catalog)

Data governance is the process of tracking and controlling data assets based on asset metadata. Catalogs are workspaces where you provide controlled access to governed assets.

The Watson Knowledge Catalog service is not available by default. An administrator must install this service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

A catalog contains assets and collaborators. Collaborators are the people who add assets into the catalog and the people who need to use the assets. You can customize data governance to enrich and control data assets in catalogs.

Data governance approaches

You can set up data governance in an iterative manner. You can start with a simple implementation of data governance that relies on predefined artifacts and default features. Then, as your needs change, you can customize your data governance framework to better describe and protect your data assets.

Simplest implementation of data governance

You use a catalog to share assets across your organization. A catalog can act as a feature store by containing data sets with columns that are used as features (inputs) in machine learning models. A Watson Knowledge Catalog administrator must create a catalog for sharing assets and add data engineers, data scientists, and business analysts as collaborators.

Catalogs store and track assets. Projects are where users prepare data assets and build models. Assets move between the catalog and projects.

Catalog collaborators can add assets to the catalog to share with others or find and use assets in the following ways:

Data engineers add cleansed data, virtualized data, and integrated data to the catalog.
Data engineers import tables or files from a data source to the catalog.
Data scientists and business analysts find data assets in catalogs and add them to projects to work with the data.

Data assets accumulate metadata over time in the following ways:

Data assets are profiled, which automatically assigns predefined data classes that describe the format of the data and analyzes data quality.
Catalog collaborators add tags, predefined business terms, data classes, classifications, relationships, and ratings to assets.
All actions on assets are automatically saved in the asset history.

See Creating a catalog.

Customization options for data governance

You can add any of these customization options to your data governance implementation, or update them, at any time. When your data changes, you can reimport metadata about the tables or files and enrich your data assets with your business vocabulary and data quality analysis. As you expand your business vocabulary, you can create increasingly precise rules to protect data. Throughout the data governance cycle, your data scientists and other data consumers can find trusted data in catalogs. The following illustration shows how data governance is a continuous cycle of refreshing the metadata for data assets to reflect changes in the data and changes in your business vocabulary.

The cycle of data governance tasks

Establish your business vocabulary

Your governance team can establish a business vocabulary that describes the meaning of data with business terms and the format of data with data classes. A business vocabulary helps your business users more easily find what they are looking for using nontechnical terms.
Your team can quickly establish your business vocabulary by importing your existing business vocabulary or by importing Knowledge Accelerators, which provide from dozens to thousands of governance artifacts.
Your Watson Knowledge Catalog administrator can customize the workflow, organization, properties, and relationships of governance artifacts.

See Planning to implement a governance framework.

Import and enrich data assets with your business vocabulary

Data stewards can regularly run metadata import and enrichment jobs that update the catalog with changes to tables or files from your data sources and automatically assign the appropriate business terms and data classes.
When your team adds governance artifacts, the metadata enrichment jobs suggest the new artifacts for new or updated data assets.
When data stewards confirm or adjust business term assignments during metadata enrichment, the machine learning algorithms for term assignment become more accurate for your data.
Data stewards can configure metadata import and enrichment to run only when changes are detected.
You can import lineage for data assets with MANTA Automated Data Lineage for IBM Cloud Pak for Data.
You can use custom term assignment algorithms that you train in your own model to improve accuracy.

See Planning to curate data assets to share in catalogs.

Analyze data quality

Data stewards can analyze data quality with default settings during metadata enrichment. Data quality analysis is applied to each asset as a whole and to columns in tables.
Data stewards can create custom data quality definitions and apply them in data quality rules.

See Planning to curate data assets to share in catalogs.

Protect your data with rules

Your governance team can create a plan for data protection rules by writing policies that document your organization’s standards and guidelines for protecting and managing data. For example, a policy can describe a specific regulation and how a data protection rule ensures compliance with that regulation.
Your governance team can create data protection rules that define how to keep private information private. Data protection rules are automatically evaluated for enforcement every time a user attempts to access a data asset in any governed catalog on the platform. Data protection rules can define how to control access to data, mask sensitive values, or filter rows from data assets.
Data engineers can enforce data protection rules on virtualized data.
Data engineers can permanently mask data in data assets with masking flows.
Your team can start with data protection rules that are based on predefined data classes or classifications, tags, or users. When your governance team adds governance artifacts, the team can define data protection rules based on your business vocabulary.
You can extend the enforcement of data protection rules by integrating Watson Knowledge Catalog with IBM Guardium Data Protection.

See Planning to protect data with rules.
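
The three kinds of outcome described above (controlling access, masking sensitive values, and filtering rows) can be pictured with a small conceptual sketch. This is not how Watson Knowledge Catalog evaluates rules internally, and it does not use any product API. The `evaluate` function, the dictionary layout for assets, users, and rules, and the `XXXX` mask value are all hypothetical and exist only for illustration.

```python
def evaluate(asset, user, rules):
    """Return the rows the user may see, or None when access is denied.

    Hypothetical model: each rule is a dict with an "applies" predicate
    and an "action" of "deny", "mask", or "filter".
    """
    # Work on copies so the stored asset is never modified.
    rows = [dict(row) for row in asset["rows"]]
    for rule in rules:
        if not rule["applies"](asset, user):
            continue
        if rule["action"] == "deny":
            return None
        if rule["action"] == "mask":
            for row in rows:
                if rule["column"] in row:
                    row[rule["column"]] = "XXXX"
        elif rule["action"] == "filter":
            rows = [row for row in rows if rule["keep"](row)]
    return rows


# Hypothetical asset with a tag and a data class assigned to one column.
asset = {
    "tags": ["customer"],
    "column_classes": {"email": "Email Address"},
    "rows": [
        {"name": "Ana", "email": "ana@example.com", "country": "US"},
        {"name": "Ben", "email": "ben@example.com", "country": "DE"},
    ],
}

rules = [
    # Mask email values for everyone except data stewards.
    {
        "applies": lambda a, u: "Email Address" in a["column_classes"].values()
        and u["role"] != "Data Steward",
        "action": "mask",
        "column": "email",
    },
    # Users in the US region see only US rows of customer-tagged assets.
    {
        "applies": lambda a, u: "customer" in a["tags"] and u["region"] == "US",
        "action": "filter",
        "keep": lambda row: row["country"] == "US",
    },
]

analyst = {"role": "Business Analyst", "region": "US"}
visible = evaluate(asset, analyst, rules)
print(visible)
```

In this sketch the rules are checked on every access attempt and the stored rows stay unchanged, which mirrors the distinction in the list above between rule enforcement at access time and permanent masking with masking flows.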

Getting started with Watson Knowledge Catalog

The tasks to get started with Watson Knowledge Catalog depend on your goal. The actions that you can take are defined by your roles and permissions. Some actions also have workspace role requirements, such as being a collaborator in a catalog or category.

To see which roles and permissions you have, click your user avatar, select Profile and settings, and then view the Permissions page. If you need more permissions, contact your Cloud Pak for Data administrator. To understand your roles, see Predefined roles and permissions.

The following table shows common goals, the required roles, and links to information to get you started.

Goal	Required Cloud Pak for Data service access role	More information
Set up or administer Watson Knowledge Catalog	Manager	• Planning to implement data governance • Managing Watson Knowledge Catalog
Find assets or features in a catalog	Any role	• Finding assets in a catalog • Searching for assets across the platform • Adding a catalog asset to a project
Curate data	Data Steward, Data Quality Analyst, or Data Engineer	• Curating data • Planning to curate data
Manage data quality	Data Quality Analyst or Data Engineer	Managing data quality
Create governance artifacts	Data Steward or Data Engineer	• Managing governance artifacts • Importing Knowledge Accelerators • Planning to implement a governance framework
Create data protection rules	Data Steward or Data Engineer	• Data protection rules • Planning to protect data with rules
Run Watson Knowledge Catalog APIs	The same role that is required to perform the task in the UI	• Use Watson Knowledge Catalog APIs
Generate reports on Watson Knowledge Catalog	Reporting administrator	• Setting up reporting

Use Watson Knowledge Catalog APIs

To use Watson Knowledge Catalog APIs in your application, you can call endpoints with a request URL in this format:

https://{web-client}/{API-path}?{API-query}

Replace these variables:

Variables for calling API endpoints
Variable	Replace with
`{web-client}`	The IP address or name of your Cloud Pak for Data web client.
`{API-path}`	The path for the API. For example, use `/v2/catalogs` to return the list of catalogs.
`{API-query}`	The query string for the API, if applicable. For example, use `catalog_id=5` with the path `/v2/asset_types` to return the list of asset types in the catalog with the ID of 5.
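
As a minimal sketch of the URL format above, the following Python snippet assembles a request URL from the three variables by using only the standard library. The `build_request_url` helper and the host name `cpd.example.com` are hypothetical. Sending the request, including any authentication that your platform requires, is not shown.

```python
from urllib.parse import urlencode, urlunsplit


def build_request_url(web_client, api_path, api_query=None):
    """Assemble https://{web-client}/{API-path}?{API-query}.

    web_client: IP address or name of the Cloud Pak for Data web client.
    api_path:   API path, for example "/v2/catalogs".
    api_query:  optional dict of query parameters.
    """
    path = "/" + api_path.lstrip("/")  # tolerate a missing leading slash
    query = urlencode(api_query) if api_query else ""
    return urlunsplit(("https", web_client, path, query, ""))


# Hypothetical host name; replace it with your own web client.
catalogs_url = build_request_url("cpd.example.com", "/v2/catalogs")
asset_types_url = build_request_url(
    "cpd.example.com", "/v2/asset_types", {"catalog_id": "5"}
)
print(catalogs_url)
print(asset_types_url)
```

Building the query string with `urlencode` rather than by string concatenation keeps parameter values correctly escaped if they contain spaces or reserved characters.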

For more information, see the Watson Data API documentation.