Adding a storage-catalog pair
You can organize your data in watsonx.data by adding a storage and associating a catalog with it.
Applies to:
- watsonx.data Developer edition
- watsonx.data on Red Hat® OpenShift®
- watsonx.data SaaS on AWS
About this task
A storage is an existing, externally managed storage and is one of the data sources for watsonx.data. A catalog defines the schemas and metadata for a data source.
Note: watsonx.data supports the following storage types: Amazon S3, Ceph, Hadoop Distributed File System (HDFS), and IBM Cloud Object Storage (COS).
Procedure
To add a storage, complete the following steps:
- Log in to the watsonx.data console.
- From the navigation menu, select Infrastructure manager.
- To define and connect a storage, click Add component and select Add storage.
- In the Add storage window, select a storage type from the Storage type drop-down list and provide the details to connect to the existing, externally managed storage. The list includes the following storage types:
- IBM Cloud Object Storage
- IBM Storage Ceph
- Amazon S3
- Hadoop Distributed File System (HDFS)
- MinIO
- Based on the storage type selected, click the respective link to configure the storage
details.
- IBM Cloud Object Storage or Amazon S3
- If you select IBM Cloud Object Storage or Amazon S3 from the Storage type drop-down list, enter the following details in the Add storage window:
  - Storage type: Select the storage type from the list.
  - Region: Select the region where the storage is available.
  - Bucket name: Enter the name of your existing bucket.
  - Display name: Enter the name to be displayed.
  - Endpoint: Enter the endpoint URL. The endpoint is required to activate the test connection feature.
  - Access key: Enter your access key.
  - Secret key: Enter your secret key.
  - Connection Status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
  - Associate Catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
  - Activate now: Activate the storage immediately or activate it later. Activation terminates any in-flight queries against data in the bucket.
  - Catalog type: Select the catalog type from the list. The recommended catalog is Apache Iceberg; the other options are Apache Hive, Apache Hudi, and Delta Lake.
  - Catalog name: Enter the name of the catalog.
- IBM Storage Ceph or MinIO
- If you select IBM Storage Ceph or MinIO from the Storage type drop-down list, enter the following details in the Add storage window:
  - Storage type: Select the storage type from the list.
  - Bucket name: Enter the name of your existing bucket.
  - Display name: Enter the name to be displayed.
  - Endpoint: Enter the endpoint URL. The endpoint is required to activate the test connection feature.
  - Access key: Enter your access key.
  - Secret key: Enter your secret key.
  - Connection Status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
  - Associate Catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
  - Activate now: Activate the storage immediately or activate it later. Activation terminates any in-flight queries against data in the bucket.
  - Catalog type: Select the catalog type from the list. The recommended catalog is Apache Iceberg; the other options are Apache Hive, Apache Hudi, and Delta Lake.
  - Catalog name: Enter the name of the catalog.
- Hadoop Distributed File System (HDFS)
- If you select Hadoop Distributed File System (HDFS) from the Storage type drop-down list, enter the following details in the Add storage window:
  - Storage type: Select the storage type from the list.
  - Display name: Enter the name to be displayed.
  - Thrift URI: Enter the Thrift URI.
  - Thrift port: Enter the Thrift port.
  - Kerberos authentication (for version 1.1.4): Toggle the Kerberos authentication switch to the on position for a secure connection, then enter the following information:
    - HDFS principal
    - Hive client principal
    - Hive server principal
    Upload the following files:
    - Kerberos config file (.config)
    - HDFS keytab file (.keytab)
    - Hive keytab file (.keytab)
  - Upload core site file (.xml): Upload the core-site file (.xml).
  - Upload HDFS site file (.xml): Upload the HDFS-site file (.xml).
  - Associated Catalog: Add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
  - Catalog type: The supported catalog is Apache Hive.
  - Catalog name: Enter the name of the catalog.
- To associate a catalog with the added storage, select the Associate Catalog checkbox and then select the suitable catalog from the drop-down list. Note: A catalog defines the schemas and metadata for a data source. Depending on the storage type, Iceberg, Hive, Hudi, and Delta Lake catalogs are supported.
- Click Register.
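After you click Register, the storage-catalog pair appears in the Infrastructure manager and the catalog can be queried from the Presto engine. The statements below are a minimal sketch of how you might verify the pair; the catalog name iceberg_data, the schema name demo, and the bucket path are assumptions for illustration, not names the product creates for you:

```sql
-- List the schemas that the newly associated catalog exposes
SHOW SCHEMAS FROM iceberg_data;

-- Create a schema whose data lives in the registered bucket
-- (my-bucket is a hypothetical bucket name)
CREATE SCHEMA iceberg_data.demo WITH (location = 's3a://my-bucket/demo');
```

If the catalog was associated correctly, SHOW SCHEMAS returns without error and the new schema becomes visible in the watsonx.data Data manager.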
Limitation:
Presto does not support the creation of schemas and tables in Delta Lake. The Delta Lake connector relies on the Hive metastore to find the location of Delta Lake tables.
Workaround:
You can create a table in Delta Lake by using an external location.
Example:
CREATE TABLE sales.apac.sales_data_new (dummyColumn INT) WITH (external_location = 's3://db-sa-datasets/presto/sales_data_new');
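After the table is registered against the external location, it can be read through the Delta Lake catalog like any other table. A short sketch that reuses the table from the example above:

```sql
-- Query the Delta Lake table that was registered with an external location
SELECT count(*) FROM sales.apac.sales_data_new;
```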