Adding a storage-catalog pair

You can organize your data in watsonx.data by adding a storage and associating a catalog with it.

Applies to: watsonx.data Developer edition, watsonx.data on Red Hat® OpenShift®, and watsonx.data SaaS on AWS.

About this task

A storage is an existing, externally managed storage that serves as one of the data sources for watsonx.data. A catalog defines the schemas and metadata for a data source.
Note: watsonx.data supports the following storage types: Amazon S3, IBM Storage Ceph, Hadoop Distributed File System (HDFS), IBM Cloud Object Storage (COS), and MinIO.

Procedure

To add a storage, complete the following steps:

  1. Log in to the watsonx.data console.
  2. From the navigation menu, select Infrastructure manager.
  3. To define and connect a storage, click Add component and select Add storage.
  4. In the Add storage window, select a storage from the Storage type drop-down list and provide the details to connect to the existing, externally managed storage. The list includes the following storage types:
    • IBM Cloud Object Storage
    • IBM Storage Ceph
    • Amazon S3
    • Hadoop Distributed File System (HDFS)
    • MinIO
  5. Based on the storage type that you selected, configure the storage details as follows:
    IBM Cloud Object Storage or Amazon S3
    If you select IBM Cloud Object Storage or Amazon S3 from the Storage type drop-down list, configure the following details:
    1. In the Add storage window, enter the following details:
      • Storage type: Select the storage type from the list.
      • Region: Select the region where the storage is available.
      • Bucket name: Enter the name of your existing bucket.
      • Display name: Enter the name to be displayed.
      • Endpoint: Enter the endpoint URL. The Test connection feature is enabled only after you enter the endpoint.
      • Access key: Enter your access key.
      • Secret key: Enter your secret key.
      • Connection status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
      • Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within it.
      • Activate now: Activate the storage immediately or activate it later. Activating the storage terminates any in-flight queries against data in any bucket.
      • Catalog type: Select the catalog type from the list. Apache Iceberg is the recommended catalog; the other options are Apache Hive, Apache Hudi, and Delta Lake.
      • Catalog name: Enter the name of the catalog.
    IBM Storage Ceph or MinIO
    If you select IBM Storage Ceph or MinIO from the Storage type drop-down list, configure the following details:
    1. In the Add storage window, enter the following details:
      • Storage type: Select the storage type from the list.
      • Bucket name: Enter the name of your existing bucket.
      • Display name: Enter the name to be displayed.
      • Endpoint: Enter the endpoint URL. The Test connection feature is enabled only after you enter the endpoint.
      • Access key: Enter your access key.
      • Secret key: Enter your secret key.
      • Connection status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
      • Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within it.
      • Activate now: Activate the storage immediately or activate it later. Activating the storage terminates any in-flight queries against data in any bucket.
      • Catalog type: Select the catalog type from the list. Apache Iceberg is the recommended catalog; the other options are Apache Hive, Apache Hudi, and Delta Lake.
      • Catalog name: Enter the name of the catalog.
    Hadoop Distributed File System (HDFS)
    If you select Hadoop Distributed File System (HDFS) from the Storage type drop-down list, configure the following details:
    1. In the Add storage window, enter the following details:
      • Storage type: Select the storage type from the list.
      • Display name: Enter the name to be displayed.
      • Thrift URI: Enter the Thrift URI.
      • Thrift port: Enter the Thrift port.
      • Kerberos authentication (for version 1.1.4): Toggle the Kerberos authentication switch to the on position for a secure connection, and enter the following information:
        1. HDFS principal
        2. Hive client principal
        3. Hive server principal
        Upload the following files:
        1. Kerberos config file (.config)
        2. HDFS keytab file (.keytab)
        3. Hive keytab file (.keytab)
      • Upload core site file (.xml): Upload the core site configuration file (core-site.xml).
      • Upload HDFS site file (.xml): Upload the HDFS site configuration file (hdfs-site.xml).
      • Associated catalog: Add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within it.
      • Catalog type: The supported catalog type is Apache Hive.
      • Catalog name: Enter the name of the catalog.
  6. To associate a catalog with the added storage, select the Associate Catalog checkbox and then select the suitable catalog from the drop-down list.
    Note: A catalog defines the schemas and metadata for a data source. Depending on the storage type, Iceberg, Hive, Hudi, and Delta Lake catalogs are supported.
  7. Click Register.

    Limitation:

    Presto does not support creating schemas and tables in Delta Lake. The Delta Lake connector relies on the Hive metastore to find the location of Delta Lake tables.

    Workaround:

    You can create a table in Delta Lake by specifying an external location.

    Example:
    CREATE TABLE sales.apac.sales_data_new (dummyColumn INT)
    WITH (external_location = 's3://db-sa-datasets/presto/sales_data_new');
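
    The table created by the workaround can then be listed and queried through the same catalog. A short continuation, using the same names as the example above:

    SHOW TABLES FROM sales.apac;
    SELECT * FROM sales.apac.sales_data_new;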
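
    After you click Register, the associated catalog becomes available to the Presto engine, and you can verify the storage-catalog pair from the query interface. The following is a minimal sketch, assuming an Apache Iceberg catalog named iceberg_data and a hypothetical bucket my-bucket; substitute the catalog and bucket names that you chose during registration:

    -- Confirm that the new catalog is registered
    SHOW CATALOGS;

    -- Create a schema in the associated bucket
    CREATE SCHEMA iceberg_data.sales WITH (location = 's3a://my-bucket/sales');

    -- Create and query an Iceberg table in the new schema
    CREATE TABLE iceberg_data.sales.orders (order_id INT, amount DOUBLE);
    SELECT * FROM iceberg_data.sales.orders;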