Adding a storage-catalog pair

You can organize your data in watsonx.data by adding a storage and associating a catalog with it.

Applies to: watsonx.data Developer edition, watsonx.data on Red Hat® OpenShift®, and watsonx.data SaaS on AWS.

About this task

A storage is an existing, externally managed storage that serves as one of the data sources for watsonx.data. A catalog defines the schemas and metadata for a data source.
Note: watsonx.data supports the following storage types: Amazon S3, IBM Storage Ceph, Hadoop Distributed File System (HDFS), IBM Cloud Object Storage (COS), and MinIO.

Procedure

To add a storage, complete the following steps:

  1. Log in to the watsonx.data console.
  2. From the navigation menu, select Infrastructure manager.
  3. To define and connect a storage, click Add component and select Add storage.
  4. In the Add storage window, select a storage from the Storage type drop-down list and provide the details to connect to the existing, externally managed storage. The list includes the following storage types:
    • IBM Cloud Object Storage
    • IBM Storage Ceph
    • Amazon S3
    • Hadoop Distributed File System (HDFS)
    • MinIO
  5. Based on the storage type that you selected, configure the storage details as follows:
    IBM Cloud Object Storage or Amazon S3
    If you select IBM Cloud Object Storage or Amazon S3 from the Storage type drop-down list, configure the following details:
    1. In the Add storage window, enter the following details:
      • Storage type: Select the storage type from the list.
      • Region: Select the region where the storage is available.
      • Bucket name: Enter the name of your existing bucket.
      • Display name: Enter the name to be displayed.
      • Endpoint: Enter the endpoint URL. The Test connection feature is enabled only after you enter the endpoint.
      • Access key: Enter your access key.
      • Secret key: Enter your secret key.
      • Connection status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
      • Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within it.
      • Activate now: Activate the storage immediately or activate it later. Activating the storage terminates any in-flight queries against data in any bucket.
      • Catalog type: Select the catalog type from the list. Apache Iceberg is the recommended catalog; the other options are Apache Hive, Apache Hudi, and Delta Lake.
      • Catalog name: Enter the name of the catalog.
    IBM Storage Ceph or MinIO
    If you select IBM Storage Ceph or MinIO from the Storage type drop-down list, configure the following details:
    1. In the Add storage window, enter the following details:
      • Storage type: Select the storage type from the list.
      • Bucket name: Enter the name of your existing bucket.
      • Display name: Enter the name to be displayed.
      • Endpoint: Enter the endpoint URL. The Test connection feature is enabled only after you enter the endpoint.
      • Access key: Enter your access key.
      • Secret key: Enter your secret key.
      • Connection status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
      • Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within it.
      • Activate now: Activate the storage immediately or activate it later. Activating the storage terminates any in-flight queries against data in any bucket.
      • Catalog type: Select the catalog type from the list. Apache Iceberg is the recommended catalog; the other options are Apache Hive, Apache Hudi, and Delta Lake.
      • Catalog name: Enter the name of the catalog.
    Hadoop Distributed File System (HDFS)
    If you select Hadoop Distributed File System (HDFS) from the Storage type drop-down list, configure the following details:
    1. In the Add storage window, enter the following details:
      • Storage type: Select the storage type from the list.
      • Display name: Enter the name to be displayed.
      • Thrift URI: Enter the Thrift URI.
      • Thrift port: Enter the Thrift port.
      • Kerberos authentication (for version 1.1.4): Toggle the Kerberos authentication switch to the on position for a secure connection, and enter the following information:
        1. HDFS principal
        2. Hive client principal
        3. Hive server principal
        Upload the following files:
        1. Kerberos config file (.config)
        2. HDFS keytab file (.keytab)
        3. Hive keytab file (.keytab)
      • Upload core site file (.xml): Upload the core site configuration file (core-site.xml).
      • Upload HDFS site file (.xml): Upload the HDFS site configuration file (hdfs-site.xml).
      • Associated catalog: Add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within it.
      • Catalog type: The supported catalog type is Apache Hive.
      • Catalog name: Enter the name of the catalog.
  6. To associate a catalog with the added storage, select the Associate Catalog checkbox and then select the suitable catalog from the drop-down list.
    Note: A catalog defines the schemas and metadata for a data source. Depending on the storage type, Iceberg, Hive, Hudi, and Delta Lake catalogs are supported.
  7. Click Register.

    Limitation:

    Presto does not support creating schemas and tables in Delta Lake. The Delta Lake connector relies on the Hive metastore to find the location of Delta Lake tables.

    Workaround:

    You can create a table in Delta Lake by specifying an external location.

    Example:
    CREATE TABLE sales.apac.sales_data_new (dummyColumn INT)
    WITH (external_location = 's3://db-sa-datasets/presto/sales_data_new');
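
    The table created by the workaround can then be listed and queried through the same catalog. A short continuation, using the same names as the example above:

    SHOW TABLES FROM sales.apac;
    SELECT * FROM sales.apac.sales_data_new;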
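
    After you click Register, the associated catalog becomes available to the Presto engine, and you can verify the storage-catalog pair from the query interface. The following is a minimal sketch, assuming an Apache Iceberg catalog named iceberg_data and a hypothetical bucket my-bucket; substitute the catalog and bucket names that you chose during registration:

    -- Confirm that the new catalog is registered
    SHOW CATALOGS;

    -- Create a schema in the associated bucket
    CREATE SCHEMA iceberg_data.sales WITH (location = 's3a://my-bucket/sales');

    -- Create and query an Iceberg table in the new schema
    CREATE TABLE iceberg_data.sales.orders (order_id INT, amount DOUBLE);
    SELECT * FROM iceberg_data.sales.orders;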