Configuring synchronization with external repositories (Watson Knowledge Catalog)

To enable the synchronization of catalog assets and governance artifact assignments with an external repository, set up and configure Watson Knowledge Catalog as a member of an Egeria open metadata repository cohort.

Tech preview: This is a technology preview and is not yet supported for use in production environments.

Starting with Cloud Pak for Data 4.6.4, this feature is supported for use in production.

A cohort is a collection of servers that share metadata by using Open Metadata Repository Services (OMRS). These services are provided by the Egeria Open Metadata and Governance (OMAG) Server Platform. After Watson Knowledge Catalog becomes a member of the cohort, it can share metadata with, and receive metadata from, other instances of Watson Knowledge Catalog, Information Governance Catalog, or Apache Atlas (supported through an Egeria connector) if these are also members of this Egeria cohort.

Complete these setup steps:

  1. Configure secure Kafka communication.
  2. Set up the cohort.
  3. Register external repositories.

Administer the configuration by using these API calls.

Required role
To complete this task, you must be an administrator of the project (namespace) where you deployed the Watson Knowledge Catalog service.

Configure secure Kafka communication

Complete the following prerequisite setup.

Add the Kafka secret to the deployment

For secure communication between the Watson Knowledge Catalog service instances in two Cloud Pak for Data deployments, add the Kafka secret to the wkc-glossary-service deployment.

Complete the following steps:

  1. Log in to the cluster by running the following command:

    oc login <OpenShift_URL>:<port> -u <username> -p <password>
    

    Alternatively, you can use ssh to log in to the cluster.

  2. Edit the wkc-glossary-service deployment to open the deployment file:

    oc edit deployment wkc-glossary-service -n <namespace>
    
  3. Find the wdp-certs volume entry and add the following entry. Use spaces to match the current indentation level for the existing secret entry. Do not use the Tab key.

    - secret:
        name: omrs-kafka-certs
        optional: true
    
  4. Save the deployment file and exit it. The pod is automatically restarted.

  5. Wait for the new wkc-glossary-service pod to be created and running.

  6. Log out of the cluster.
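
Step 5 can be scripted before you log out. The following is a minimal sketch, assuming the oc CLI is on the path and that waiting on the deployment rollout is an acceptable way to detect that the new pod is running. The function name wait_for_glossary_rollout and the OC override variable are hypothetical and exist only for this illustration.

```shell
# wait_for_glossary_rollout is a hypothetical helper name.
# OC defaults to the oc CLI; set OC=echo to print the command instead of running it.
wait_for_glossary_rollout() {
    namespace="$1"
    if [ -z "$namespace" ]; then
        echo "usage: wait_for_glossary_rollout <namespace>" >&2
        return 1
    fi
    # Blocks until the rollout triggered by the deployment edit has completed.
    ${OC:-oc} rollout status deployment/wkc-glossary-service -n "$namespace"
}
```

Call it with the namespace where the Watson Knowledge Catalog service is deployed, for example wait_for_glossary_rollout <namespace>.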

Complete the required Kafka setup

A Kafka topic must be available in the Kafka server in the namespace where you deployed the Watson Knowledge Catalog service. If no Kafka topic exists before activation of the OMAG instance for the business glossary service, error messages are written to the Watson Knowledge Catalog business glossary service log files during activation. The logs contain the topic name that you can copy and use to create a topic.

To create the topic in Kafka, run the following command replacing cohort-name with the cohort name that you want to use. Use the same cohort name when you register the cohort later (see Set up the cohort for the glossary service, step 2):

kubectl --namespace=zen exec -it kafka-0 -- /opt/kafka/bin/kafka-topics.sh --create --zookeeper <zookeeper-host>:<port>/kafka --topic egeria.omag.openmetadata.repositoryservices.cohort.<cohort-name>.OMRSTopic --replication-factor 1 --partitions 3
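
Because the cohort name must match in the topic and in the later registration call, it can help to derive the topic name from a single variable. This is a small sketch; omrs_topic_name is a hypothetical helper, not part of Kafka or Cloud Pak for Data, and it only assembles the topic name pattern shown in the command above.

```shell
# omrs_topic_name is a hypothetical helper that builds the OMRS topic name
# for a cohort, following the pattern used in the kafka-topics.sh command.
omrs_topic_name() {
    cohort_name="$1"
    if [ -z "$cohort_name" ]; then
        echo "usage: omrs_topic_name <cohort-name>" >&2
        return 1
    fi
    printf 'egeria.omag.openmetadata.repositoryservices.cohort.%s.OMRSTopic\n' "$cohort_name"
}
```

The result can be passed to the --topic option, for example --topic "$(omrs_topic_name <cohort-name>)".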

To secure Kafka communication, copy the Kafka certificate (usually the kafka_ca.crt file) to the remote cluster. Then, create a secret to store that certificate by running the following command replacing /tmp/kafka_ca.crt with the full path to the certificate:

oc create secret generic omrs-kafka-certs --from-file=omrs-kafka-cert1.pem=/tmp/kafka_ca.crt

Important: Do not change the name of the secret. You must use the name omrs-kafka-certs.

Set up the cohort

To set up the cohort membership for the glossary service and the catalog and asset management service (CAMS), complete the following steps.

Set up the cohort for the glossary service

  1. Generate a bearer token for authentication as described in Creating a CPD bearer token in the Watson Data API documentation. The token is needed in subsequent steps. Replace values as follows:

    • <cpd_cluster_host>: the hostname of the Cloud Pak for Data cluster
    • <username> and <password>: the credentials of a user with the Administrator role in Cloud Pak for Data

    The bearer token is in the access_token field of the response.

  2. Register the cohort. Submit the following API call and replace values as follows:

    • <hostname>: the hostname of the Cloud Pak for Data cluster.
    • <cohort-name>: the name under which you want to register the cohort. This name is embedded in the Kafka topic and must be the same for all repositories that are to join the same cohort. The repositories, for example, can be different business glossaries in different Cloud Pak for Data deployments.
    • <token>: the token obtained in step 1.
    • <kafkaserver>:<port>: the hostname or IP address of the Kafka server and the port to use, for example, kafka:9093.
    • <admin-user> and <password>: the credentials of a user with admin permissions. This is required only for secured Kafka communication. To ensure that the credentials are correct, go to /opt/kafka/config in the Kafka pod and view the kafka_server_jaas.conf file.
    • <sending-user-id>: the user ID of a user on the source system. This ID determines which governance artifacts are synced: only categories (and their children) for which this user has the Admin collaborator role, or a collaborator role that includes the Add collaborators and assign roles permission, are considered.
      This setting is optional. If you want only to receive governance artifacts from the external repositories and not send any, do not set this value.

    Important: The entries starting with sasl or ssl and the security.protocol entry are required only for secured Kafka communication.

    curl --location --request POST 'https://<hostname>/v3/glossary_terms/admin/open-metadata/cohorts/<cohort-name>?topic_url_root=egeria.omag' \
    --header 'Authorization: Bearer <token>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
       "producer":{
          "bootstrap.servers":"<kafkaserver>:<port>",
          "acks":"all",
          "retries":"0",
          "batch.size":"16384",
          "linger.ms":"1",
          "buffer.memory":"33554432",
          "max.request.size":"10485760",
          "key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
          "value.serializer":"org.apache.kafka.common.serialization.StringSerializer",
          "kafka.omrs.topic.id": "kafka-omrs-topic",
          "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<admin-user>\" password=\"<password>\";",
          "sasl.mechanism": "PLAIN",
          "security.protocol": "SASL_SSL",
          "ssl.truststore.location":"/opt/java/lib/security/cacerts",
          "ssl.truststore.type":"JKS",
          "ssl.endpoint.identification.algorithm":"",
          "ssl.truststore.password":"changeit"
       },
       "consumer":{
          "bootstrap.servers":"<kafkaserver>:<port>",
          "zookeeper.session.timeout.ms":"400",
          "zookeeper.sync.time.ms":"200",
          "fetch.message.max.bytes":"10485760",
          "max.partition.fetch.bytes":"10485760",
          "key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
          "value.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
          "kafka.omrs.topic.id": "kafka-omrs-topic",
          "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<admin-user>\" password=\"<password>\";",
          "sasl.mechanism": "PLAIN",
          "security.protocol": "SASL_SSL",
          "ssl.truststore.location":"/opt/java/lib/security/cacerts",
          "ssl.truststore.type":"JKS",
          "ssl.endpoint.identification.algorithm":"",
          "ssl.truststore.password":"changeit"
       },
       "sending_user_id":"<sending-user-id>"
    }'
    

    The call should return the response code 201 CREATED to indicate successful registration.

    If you registered more than one cohort, you can get a list of the available cohorts by submitting this API call:

    curl -k -X GET "https://<hostname>/v3/glossary_terms/admin/open-metadata/cohorts" -H "Authorization: Bearer <token>" -H "Content-Type: application/json"
    

    The response lists the cohorts by name in the resources section.

  3. Activate the cohort. Submit the following API call and replace values as follows:

    • <hostname>: the hostname of the Cloud Pak for Data cluster
    • <token>: the token obtained in step 1

    curl --location --request POST 'https://<hostname>/v3/glossary_terms/admin/open-metadata/instance' \
    --header 'Authorization: Bearer <token>' \
    --data-raw ''
    

Complete the same steps on the target Cloud Pak for Data cluster.
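
When these steps are scripted, two small checks are useful: reading the access_token field from the token response, and confirming that registration returned 201. The following is a sketch under stated assumptions. Both function names are hypothetical, the token response is assumed to be a JSON document whose access_token value contains no escaped quotes, and no token endpoint is called here.

```shell
# Hypothetical helpers for illustration only.

# Read a token response (JSON) on stdin and print the access_token value.
# Assumes the value contains no escaped double quotes.
extract_access_token() {
    tr -d '\n' | sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed only if the actual HTTP status code equals the expected one.
check_status() {
    expected="$1"
    actual="$2"
    if [ "$actual" != "$expected" ]; then
        echo "expected HTTP $expected but got $actual" >&2
        return 1
    fi
}
```

For example, capture the status code of the registration call with the curl option -w '%{http_code}' and pass it to check_status 201.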

Set up the cohort for the catalog

  1. Generate a bearer token for the owner or an administrator of the catalog to be synchronized as described in Creating a CPD bearer token in the Watson Data API documentation. The token is needed in subsequent steps.

    Replace values as follows:

    • <cpd_cluster_host>: the hostname of the Cloud Pak for Data cluster
    • <username> and <password>: the credentials of the catalog owner or a user with the Administrator catalog collaborator role

    The bearer token is in the access_token field of the response.

  2. Determine the ID of the catalog that you want to synchronize. Submit the following API call to retrieve the list of catalogs. Use the bearer token obtained in step 1:

    curl -k -H "Authorization: Bearer <token>" https://<cpd-cluster-host>/v2/catalogs
    

    You can find the catalog IDs in the metadata/guid fields of the response.

    The REST API Swagger documentation for the catalog asset management service (CAMS) is available at: https://<cpd-cluster-host>/v2/cams/explorer.

  3. Register the cohort. Submit the following API call and replace values as follows:

    • <hostname>: the hostname of the Cloud Pak for Data cluster.
    • <cohort-name>: the name under which you want to register the cohort. This name is embedded in the Kafka topic and must be the same for all repositories that are to join the same cohort.
    • <token>: the token obtained in step 1.
    • <catalogID>: the ID of the catalog to be synchronized as obtained in step 2.
    • <kafkaserver>:<port>: the hostname or IP address of the Kafka server and the port to use, for example, kafka:9093.
    • <admin-user> and <password>: the credentials of a user with admin permissions. This is required only for secured Kafka communication. To ensure that the credentials are correct, go to /opt/kafka/config in the Kafka pod and view the kafka_server_jaas.conf file.

    Important: The entries starting with sasl and the security.protocol entry are required only for secured Kafka communication.

    curl --location --request POST 'https://<hostname>/v2/catalogs/<catalogID>/open-metadata/cohorts/<cohort-name>' \
    --header 'Authorization: Bearer <token>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
       "producer":{
          "bootstrap.servers":"<kafkaserver>:<port>",
          "acks":"all",
          "retries":"0",
          "batch.size":"16384",
          "linger.ms":"1",
          "buffer.memory":"33554432",
          "max.request.size":"10485760",
          "key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
          "value.serializer":"org.apache.kafka.common.serialization.StringSerializer",
          "kafka.omrs.topic.id": "kafka-omrs-topic",
          "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<admin-user>\" password=\"<password>\";",
          "sasl.mechanism": "PLAIN",
          "security.protocol": "SASL_SSL"
       },
       "consumer":{
          "bootstrap.servers":"<kafkaserver>:<port>",
          "zookeeper.session.timeout.ms":"400",
          "zookeeper.sync.time.ms":"200",
          "fetch.message.max.bytes":"10485760",
          "max.partition.fetch.bytes":"10485760",
          "key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
          "value.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
          "kafka.omrs.topic.id": "kafka-omrs-topic",
          "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<admin-user>\" password=\"<password>\";",
          "sasl.mechanism": "PLAIN",
          "security.protocol": "SASL_SSL"
       }
    }'
    
  4. Activate the Open Metadata and Governance (OMAG) instance for the catalog. Submit the following API call and replace values as follows:

    • <token>: the token obtained in step 1
    • <hostname>: the hostname of the Cloud Pak for Data cluster
    • <catalogID>: the catalog ID used in step 3

    curl -vk -H "Authorization: Bearer <token>" -X POST https://<hostname>/v2/catalogs/<catalogID>/open-metadata/instance
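
To pick the catalog ID in step 2 without reading the whole response, the guid values can be filtered out of the output. This is a rough sketch: list_catalog_guids is a hypothetical helper, and it assumes that every "guid" key in the response is a catalog's metadata/guid value, which might not hold for every response.

```shell
# list_catalog_guids is a hypothetical helper. It reads the /v2/catalogs
# response (JSON) on stdin and prints one guid value per line.
# Assumption: "guid" keys appear only in the metadata section of each catalog.
list_catalog_guids() {
    tr -d '\n' \
        | grep -o '"guid"[[:space:]]*:[[:space:]]*"[^"]*"' \
        | sed 's/.*"\([^"]*\)"$/\1/'
}
```

The output of the curl call in step 2 can be piped directly into list_catalog_guids.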
    

Unregister from the cohort

To unregister the glossary service or a catalog from a cohort, use these APIs:

  • Glossary service

    1. DELETE /v3/glossary_terms/admin/open-metadata/cohorts/<cohort-name>
    2. POST /v3/glossary_terms/admin/open-metadata/instance
  • Catalog

    1. DELETE /v2/catalogs/<catalogID>/open-metadata/cohorts/<cohort-name>
    2. POST /v2/catalogs/<catalogID>/open-metadata/instance
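
The two glossary service calls can be issued in sequence with curl. The following is a minimal sketch, assuming the same bearer token and -k option as in the other examples in this topic. The function name unregister_glossary_cohort and the CURL override variable are hypothetical.

```shell
# unregister_glossary_cohort is a hypothetical helper.
# CURL defaults to curl; set CURL=echo to print the calls instead of sending them.
unregister_glossary_cohort() {
    host="$1"
    cohort="$2"
    token="$3"
    base="https://$host/v3/glossary_terms/admin/open-metadata"
    # Call 1: remove the cohort registration. Stop if it fails.
    ${CURL:-curl} -k -X DELETE "$base/cohorts/$cohort" \
        -H "Authorization: Bearer $token" || return 1
    # Call 2: POST to the instance endpoint, as listed for the glossary service.
    ${CURL:-curl} -k -X POST "$base/instance" \
        -H "Authorization: Bearer $token"
}
```

Running it with CURL=echo first shows the exact requests before anything is sent.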

Register external repositories

Register the external repositories to which you want to synchronize data. You can use the default configuration, a customized configuration, or both. If you defined both types, the settings of the customized configuration take precedence over the default configuration.

Consider the following information to decide which type of configuration to use: If you trust all the external repositories configured to the cohorts, you can use the default configuration and give a role that includes the Manage governance categories permission to the user ID that you set for receiving_user_id. Otherwise, create a custom configuration for any source external repository in question.

Complete these steps only on the target Cloud Pak for Data cluster.

Submit one of the following API requests and provide the following information as required for the configuration. The expected response code for either request is 200 OK.

  • <hostname>: the hostname of the server where the source repository is

  • <token>: the bearer token that you obtained for registering the glossary service.

  • receiving_user_id: the user ID of a user on the target system. This ID is used to create artifacts in the target repository. This user is the owner of all synced categories and artifacts. To find the user ID, go to Administration > Access control.

    The receiving user should have the roles required to create, update, and delete the artifacts received from the external repository. If you want root categories to be synced from the external repository, the receiving user must have a role that includes the Manage governance categories permission.

    This parameter is required for the default configuration.

  • default_category_artifact_id: the ID of the category that you want to be the default category in the target repository for synced items. This category becomes the parent category for any synced artifacts that do not have a parent category. You can create a category in your target system or select an existing one, and provide the artifact ID of that category when you register external repositories. You can use the same default category for different remote repositories.

    This parameter is optional. If you do not provide this parameter, a default category for the specified source repositories is automatically created and named [<source-repository-name> :: uncategorized].

  • exceptions_category_artifact_id: the ID of the category with which artifacts are associated if the synchronization failed. Any artifacts that could not be created successfully are associated with the exceptions category. You can create a new exceptions category in the target system or select an existing category, and provide the artifact ID of that category when you register external repositories. You can use the same exceptions category for different remote repositories.

    This parameter is optional. If you do not provide this parameter, an exceptions category for the specified source repositories is automatically created and named [<source-repository-name> :: Other-Artifacts-Created-By-Egeria].

  • source_repository_id: the ID of the external repository. You can obtain this ID by running an OMRS health check for the source system; the value is returned in the metadata_collection_id attribute. This parameter is required for a custom configuration.

  • source_repository_name: the name of the external source repository. This parameter is required for a custom configuration.

  • root_categories: a list of root categories to be created to receive artifacts. Artifacts from categories in this list are directly synced to the correct category. Any other artifacts are synced to the exceptions category and can be moved later.

    A root category contains the following attributes:

    • external_id: the ID of the category in the external repository.
    • name: the name of the category in the external repository. This parameter is optional.

Default configuration

curl -k -X POST "https://<hostname>/v3/glossary_terms/admin/open-metadata/external_repositories" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
--data '{
    "default_configuration": {
        "receiving_user_id": "<receiving-user-id>",
        "default_category_artifact_id": "<default-category-id>",
        "exceptions_category_artifact_id": "<exceptions-category-id>"
    }
}'
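
Because the two category IDs are optional in the default configuration, the request body can be assembled so that they are left out when not provided. This is a sketch only: default_config_payload is a hypothetical helper, and it does no JSON escaping, so it assumes that the IDs contain no quotes or backslashes.

```shell
# default_config_payload is a hypothetical helper.
# Usage: default_config_payload <receiving-user-id> [<default-category-id>] [<exceptions-category-id>]
default_config_payload() {
    receiving_user_id="$1"
    default_cat="$2"
    exceptions_cat="$3"
    if [ -z "$receiving_user_id" ]; then
        echo "receiving_user_id is required for the default configuration" >&2
        return 1
    fi
    body="\"receiving_user_id\": \"$receiving_user_id\""
    # Optional fields are added only when a value was passed.
    if [ -n "$default_cat" ]; then
        body="$body, \"default_category_artifact_id\": \"$default_cat\""
    fi
    if [ -n "$exceptions_cat" ]; then
        body="$body, \"exceptions_category_artifact_id\": \"$exceptions_cat\""
    fi
    printf '{"default_configuration": {%s}}\n' "$body"
}
```

The printed body can be passed to curl with --data "$(default_config_payload <receiving-user-id>)".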

Custom configuration

curl -k -X POST "https://<hostname>/v3/glossary_terms/admin/open-metadata/external_repositories" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
--data '{
    "repositories": [
        {
            "source_repository_id": "<source-repository-id>",
            "source_repository_name": "<source-repository-name>",
            "receiving_user_id": "<receiving-user-id>",
            "default_category_artifact_id": "<default-category-id>",
            "exceptions_category_artifact_id": "<exceptions-category-id>",
            "root_categories": [
                {
                    "external_id": "<external-id>",
                    "name": "<name>"
                }
            ]
        }
    ]
}'

Further administration APIs

List external repositories

To get a list of all configured external repositories, submit the following API call:

curl -k -X GET "https://<hostname>/v3/glossary_terms/admin/open-metadata/external_repositories" -H "Authorization: Bearer <token>" -H "Content-Type: application/json"

Update external repositories

To change, for example, the list of root categories or the receiving user ID, update the external repositories. You can also use this API to modify the default configuration by passing the value default as repository_id.

curl -k -X PATCH "https://<hostname>/v3/glossary_terms/admin/open-metadata/external_repositories/{repository_id}" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  --data '{
    "source_repository_name": "<source_repository_name>",
    "receiving_user_id": "<receiving_user_id>",
    "default_category_artifact_id": "<default_category_id>",
    "exceptions_category_artifact_id": "<exception_category_id>"
  }'

Expected response code: 200 OK

Delete an external repository

If you want to delete a repository configuration, for example because the configuration wasn't correct, you can submit the following API request:

curl -k -X DELETE "https://<hostname>/v3/glossary_terms/admin/open-metadata/external_repositories/{repository_id}" -H "Authorization: Bearer <token>" -H "Content-Type: application/json"

To delete a default repository configuration, replace {repository_id} in the API call with the value default.

Expected response code: 200 OK

OMRS health check

To check whether synchronization with Egeria cohorts is enabled and the connection is healthy, submit the following API call:

curl -k -X GET "https://<hostname>/v3/glossary_terms/admin/open-metadata/healthcheck" -H "Authorization: Bearer <token>"

Expected response:

{
    "metadata_collection_id": "<source_repository_id>",
    "status": "CONNECTED"
}

Note: A status of CONNECTED returned from this method does not mean that the connections to all the cohorts are healthy.
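
The health check response can be read in a script, for example to obtain the metadata_collection_id value that is used as source_repository_id in a custom configuration. The following sketch assumes the response has the shape shown above and that values contain no escaped quotes. All three function names are hypothetical.

```shell
# Hypothetical helpers; each reads the health check response (JSON) on stdin.

# Print the string value of a top-level field.
json_string_field() {
    field="$1"
    tr -d '\n' | sed -n 's/.*"'"$field"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Print the metadata collection ID of the checked system.
omrs_metadata_collection_id() {
    json_string_field metadata_collection_id
}

# Succeed only if the reported status is CONNECTED.
omrs_is_connected() {
    [ "$(json_string_field status)" = "CONNECTED" ]
}
```

As the note above says, a successful omrs_is_connected check does not prove that every cohort connection is healthy.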
