Adding a storage-catalog pair

You can organize your data in watsonx.data by adding a storage and its associated catalog.

Applies to: watsonx.data on Red Hat® OpenShift®, watsonx.data Developer edition, and watsonx.data SaaS on AWS.

About this task

A storage is an existing, externally managed storage system and is one of the data sources for watsonx.data. A catalog defines the schemas and metadata for a data source.
Note: watsonx.data supports the following storage types: Amazon S3, IBM Storage Ceph, MinIO, Hadoop Distributed File System (HDFS), IBM Cloud Object Storage (COS), Google Cloud Storage, and Azure Data Lake Storage (Gen1 Blob and Gen2).
Note: The out-of-the-box MinIO object storage is provided for exploratory purposes only. It does not have all the security features and is not configured to provide high-speed data access. Register your own S3 bucket that meets your security and performance requirements.

Procedure

To add a storage, complete the following steps:

  1. Log in to watsonx.data console.
  2. From the navigation menu, select Infrastructure manager.
  3. To define and connect a storage, click Add component and select Add storage.
  4. In the Add storage window, select a storage from the Storage type drop-down list and provide the details to connect to the existing, externally managed storage. Configure the details for the selected storage type as described in the following sections.
    IBM Cloud Object Storage or Amazon S3
    If you select IBM Cloud Object Storage or Amazon S3 from the Storage type drop-down list, configure the following details:
    Storage type: Select the storage type from the list.
    Region: Select the region where the storage is available.
    Bucket name: Enter the name of your existing bucket.
    Display name: Enter the name to be displayed.
    Endpoint: Enter the endpoint URL. The endpoint is required to activate the test connection feature.
    Access key: Enter your access key.
    Secret key: Enter your secret key.
    Connection status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
    Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
    Activate now: Choose whether to activate the storage immediately or later. Activation terminates any in-flight queries against data in the bucket.
    Catalog type: Select the catalog type from the list. The recommended catalog is Apache Iceberg. The other options are Apache Hive, Apache Hudi, and Delta Lake.
    Catalog name: Enter the name of the associated catalog.
    IBM Storage Ceph or MinIO
    If you select IBM Storage Ceph or MinIO from the Storage type drop-down list, configure the following details:
    Storage type: Select the storage type from the list.
    Bucket name: Enter the name of your existing bucket.
    Display name: Enter the name to be displayed.
    Endpoint: Enter the endpoint URL. The endpoint is required to activate the test connection feature.
    Access key: Enter your access key.
    Secret key: Enter your secret key.
    Connection status: Click the Test connection link to test the bucket connection. If the connection is successful, a success message appears.
    Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
    Activate now: Choose whether to activate the storage immediately or later. Activation terminates any in-flight queries against data in the bucket.
    Catalog type: Select the catalog type from the list. The recommended catalog is Apache Iceberg. The other options are Apache Hive, Apache Hudi, and Delta Lake.
    Catalog name: Enter the name of the associated catalog.
    Hadoop Distributed File System (HDFS)
    If you select Hadoop Distributed File System (HDFS) from the Storage type drop-down list, configure the following details:
    Storage type: Select the storage type from the list.
    Display name: Enter the name to be displayed.
    Thrift URI: Enter the Thrift URI.
    Thrift port: Enter the Thrift port.
    Kerberos authentication: Use the toggle switch to enable or disable Kerberos authentication. If enabled, enter the following information:
    1. HDFS principal
    2. Hive client principal
    3. Hive server principal
    Also upload the following files:
    1. Kerberos config file (.config)
    2. HDFS keytab file (.keytab)
    3. Hive keytab file (.keytab)
    Upload core site file (.xml): Upload the core-site.xml file.
    Upload HDFS site file (.xml): Upload the hdfs-site.xml file.
    Associated catalog: Add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
    Catalog type: The supported catalog is Apache Hive.
    Catalog name: Enter the name of the associated catalog.
    Google Cloud Storage
    If you select Google Cloud Storage from the Storage type drop-down list, configure the following details:
    Bucket name: Enter the bucket name.
    Display name: Enter the name to be displayed.
    Upload JSON key file (.json): Upload the JSON key file, which is used to authenticate a Google Cloud service account with Google Cloud Storage.
    Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
    Catalog type: Select the catalog type from the list. The supported catalogs are Apache Iceberg and Apache Hive.
    Catalog name: Enter the name of the associated catalog.
    Azure Data Lake Storage Gen1 Blob
    If you select Azure Data Lake Storage Gen1 Blob from the Storage type drop-down list, configure the following details:
    Display name: Enter the name to be displayed.
    Container name: Enter the container name.
    Storage account name: Enter the storage account name.
    Endpoint: Enter the endpoint URL.
    Authentication mode: Select the authentication mode.
    • SAS: Enter your SAS token.
    • Account key: Enter your access key.
    Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
    Catalog type: Select the catalog type from the list. The recommended catalog is Apache Iceberg. The other options are Apache Hive, Apache Hudi, and Delta Lake.
    Catalog name: Enter the name of the associated catalog.
    Azure Data Lake Storage Gen2
    If you select Azure Data Lake Storage Gen2 from the Storage type drop-down list, configure the following details:
    Display name: Enter the name to be displayed.
    Container name: Enter the container name.
    Storage account name: Enter the storage account name.
    Endpoint: Enter the endpoint URL.
    Authentication mode: Select the authentication mode.
    • SAS: Enter your SAS token.
    • Service Principal: Enter the application ID, directory ID, and secret key.
    Associate catalog: Select the checkbox to add a catalog for your storage. This catalog is associated with your storage and serves as your query interface for the data stored within.
    Catalog type: Select the catalog type from the list. The recommended catalog is Apache Iceberg. The other options are Apache Hive, Apache Hudi, and Delta Lake.
    Catalog name: Enter the name of the associated catalog.
  5. Click Register.
    Note: A catalog defines the schemas and metadata for a data source. Depending on the storage type, Iceberg, Hive, Hudi, and Delta Lake catalogs are supported.
    Note: For Google Cloud Storage, multiple buckets of different service accounts cannot be configured.

    Features

    For the Iceberg connector:
    1. You can delete data from tables by using the DELETE FROM statement.
    2. You can specify the table property delete_mode for new tables, using either copy-on-write mode or merge-on-read mode (the default).
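    The delete_mode property is set when the table is created. A minimal sketch (the catalog, schema, and table names are hypothetical):

    ```sql
    -- Create an Iceberg table with copy-on-write deletes instead of
    -- the default merge-on-read mode.
    CREATE TABLE iceberg_data.sales.orders (
        order_id   BIGINT,
        order_date DATE,
        amount     DOUBLE
    )
    WITH (delete_mode = 'copy-on-write');
    ```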
    The DELETE FROM statement for the Iceberg connector has the following restrictions:
    1. Filter columns support only comparison operators, such as = (equals), < (less than), or <= (less than or equals).
    2. Deletes can occur only on the latest snapshot.
    3. For V1 tables, the Iceberg connector can delete data only in one or more entire partitions. All columns in the filter must be identity-transformed partition columns of the target table.
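    A delete that satisfies these restrictions might look like the following sketch (table and column names are hypothetical; for a V1 table, order_date would need to be an identity-transformed partition column):

    ```sql
    -- The filter uses only a comparison operator, as the connector requires.
    DELETE FROM iceberg_data.sales.orders
    WHERE order_date <= DATE '2023-01-01';
    ```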
    For the Iceberg connector, ALTER TABLE operations on a column support the following data type conversions:
    1. INT to BIGINT
    2. FLOAT to DOUBLE
    3. DECIMAL(num1, dec_digits) to DECIMAL(num2, dec_digits), where num2 > num1
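    As an illustrative sketch using Trino-style syntax (verify the exact form against your engine version; the names are hypothetical), the widening conversions can be written as:

    ```sql
    -- INT to BIGINT
    ALTER TABLE iceberg_data.sales.orders
        ALTER COLUMN order_id SET DATA TYPE BIGINT;

    -- FLOAT to DOUBLE
    ALTER TABLE iceberg_data.sales.orders
        ALTER COLUMN amount SET DATA TYPE DOUBLE;
    ```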

    Limitations for SQL statements

    1. For the Iceberg, Memory, and Hive connectors, DROP SCHEMA uses RESTRICT behavior by default: a schema can be dropped only if it is empty.
    2. For database-based catalogs, the CREATE SCHEMA, CREATE TABLE, DROP SCHEMA, DROP TABLE, DELETE, DROP VIEW, ALTER TABLE, and ALTER SCHEMA statements are not available in the Data manager UI.
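    The default RESTRICT behavior can be sketched as follows (catalog and schema names are hypothetical):

    ```sql
    -- Fails if the schema still contains tables; RESTRICT is implied.
    DROP SCHEMA iceberg_data.sales;

    -- Equivalent explicit form.
    DROP SCHEMA iceberg_data.sales RESTRICT;
    ```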

    Limitations for data types

    1. For the Iceberg connector, the maximum number of digits that a column of data type FLOAT or DOUBLE can accommodate is 37. Trying to insert anything larger results in a decimal overflow error.
    2. When fields of data type REAL have six or more digits in the decimal part, with the digits being predominantly zero, the queried values are rounded. The rounding occurs differently depending on the precision of the values: for example, 10.89000 is rounded to 10.89, whereas 10.89009 is not rounded. This is an inherent consequence of the representational limits of binary floating-point formats and can have a significant impact when a query involves sorting.
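    The behavior can be observed without a table by casting literals, as in this sketch (REAL is a 32-bit binary float, so some decimal values have no exact representation):

    ```sql
    SELECT CAST('10.89000' AS REAL) AS predominantly_zero,
           CAST('10.89009' AS REAL) AS higher_precision;
    -- The displayed values may differ from the literals because of
    -- binary floating-point rounding.
    ```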

    For more information about mixed-case feature flag behavior, supported SQL statements, and supported data type matrices, see the Support content.