What's new in watsonx.data

Read about the new features and enhancements in the current and previous releases of IBM® watsonx.data.

watsonx.data Developer edition

watsonx.data on Red Hat® OpenShift®

IBM watsonx.data is a new open architecture lakehouse that combines elements of data warehouses and data lakes. Its features and optimizations make it well suited to next-generation data analytics and automation. It is released in three versions:

  • Software
  • Developer
  • Cloud

IBM watsonx.data Version 2.0.0

A new version of watsonx.data was released in June 2024.

This release of watsonx.data includes the following features and updates:
Azure Data Lake Storage Gen2 (ADLS), Azure Blob and Google Cloud Storage
You can now use the following storage types:
  • Azure Blob, Azure Data Lake Storage Gen2 (ADLS), and Google Cloud Storage, which you can add to watsonx.data.
  • Azure Data Lake Storage (ADLS) Gen1 and Gen2, which you can use to store your data while submitting Spark applications.

For more information, see Adding a storage-catalog pair.

New Arrow Flight service based data sources

You can now use the following data sources with Arrow Flight service:

  • Greenplum
  • Salesforce
  • MariaDB
  • Apache Derby

For more information, see Arrow Flight service.

New data sources

You can now use the following data sources:

  • Cassandra
  • BigQuery
  • ClickHouse
  • Apache Pinot

For more information, see Adding a database-catalog pair.

New page for Bring Your Own JAR (BYOJ) process for SAP HANA data source
You can now use the new Driver manager section on the new Configurations page to manage drivers for the SAP HANA data source. Each driver undergoes a series of validations.

For more information, see SAP HANA.

Apache Ranger policies
IBM watsonx.data now supports Apache Ranger policies to allow integration with Presto engines.

For more information, see Apache Ranger policy.

Provision Spark as a native engine
In addition to registering external Spark engines, you can now provision a native Spark engine in watsonx.data. With a native Spark engine, you can manage the engine configuration, manage access to Spark engines, and view applications by using REST API endpoints from watsonx.data.

For more information, see Native Spark engine.

Query Optimizer to improve query performance
You can now use Query Optimizer to improve the performance of queries that are processed by the Presto (C++) engine. If Query Optimizer determines that optimization is feasible, it rewrites the query; otherwise, the native engine optimization takes precedence.

For more information, see Query Optimizer overview.

New name for Presto engine in watsonx.data
Presto is renamed to Presto (Java).
New engine, Presto (C++), in watsonx.data
You can provision a Presto (C++) engine (version 0.286) in watsonx.data to run SQL queries on your data source and fetch the queried data.

For more information, see Presto (C++) overview.

API Customization feature
You can now use catalog and engine API Customization for Presto (Java) and Presto (C++) engines in watsonx.data.

For more information, see IBM API docs.

Mixed case feature flag for Presto (Java) engine
The mixed-case feature flag, which allows you to switch between case-sensitive and case-insensitive behavior in Presto (Java), is now available. The flag is set to OFF by default and can be set to ON during the deployment of watsonx.data.

For more information, see Presto (Java) mixed-case support overview.

Using proxy to access S3 and S3 compatible buckets
External applications and query engines can access the S3 and S3 compatible buckets managed by watsonx.data through an S3 proxy.

For more information, see Using CAS proxy to access S3 and S3 compatible buckets.

Semantic automation for data enrichment
Semantic automation for data enrichment uses generative AI with IBM Knowledge Catalog to understand your data and enrich it automatically, making it more valuable for analysis.

For more information, see Semantic automation for data enrichment in watsonx.data.

Hive Metastore (HMS) access in watsonx.data
You can now fetch metadata information for Hive Metastore by using REST APIs instead of getting the information from the engine details. HMS details are used by external entities to integrate with watsonx.data. You must have an Admin, Metastore Admin, or Metastore Viewer role to run the API.
Manage resource quota limits for your Spark engine

You can now manage the resource usage quota for the Spark engine in Cloud Pak for Data by using the REST API or from the Spark engine details page.

For more information, see Managing resource quota.
Version upgrade
  • Presto (Java) engine is now upgraded to version 0.286.
  • Milvus service is now upgraded to version 2.4.0. Important features include:
    • Better performance (low memory utilization)
    • Support for sparse data
    • Built-in SPLADE engine for sparse vector embedding
    • BGE M3 hybrid (dense + sparse) search
Command to retrieve ingestion history
You can now retrieve the status of all submitted ingestion jobs by using the ibm-lh get-status --all-jobs CLI command. The command returns the history records that you have access to.

For more information, see Options and parameters supported in ibm-lh tool.
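As a minimal sketch of calling this command from a script, the following Python wrapper builds the argument list for ibm-lh get-status --all-jobs and, optionally, runs it. The function names are hypothetical, and the sketch assumes that the ibm-lh tool is installed and on your PATH. Any connection or authentication options that your deployment requires are not shown here.

```python
import shlex
import subprocess


def build_get_status_command():
    """Return the argument list for retrieving the status of all ingestion jobs."""
    # Only the subcommand and flag named in this release note are used.
    return ["ibm-lh", "get-status", "--all-jobs"]


def run_get_status():
    """Run the command and return its standard output as text.

    Assumes ibm-lh is installed and configured; raises
    FileNotFoundError if the tool is missing and
    subprocess.CalledProcessError if it exits with a nonzero status.
    """
    result = subprocess.run(
        build_get_status_command(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Show the command line that would be executed, without running it.
print(shlex.join(build_get_status_command()))
```

Call run_get_status() only in an environment where the ibm-lh tool is available. The output format is not described in this release note, so the sketch returns it unparsed.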

New operations for Db2 data source
You can perform the following operations for BLOB and CLOB data types for Db2 data source:
  • INSERT
  • CREATE
  • CTAS
  • ALTER
  • DROP
New data types for data sources
The following new data types are now available for some data sources. You can access these data types on the Data manager page under the Add column option.

BLOB

  • Db2
  • Teradata
  • Oracle
  • MySQL
  • SingleStore

CLOB

  • Db2
  • Teradata
  • Oracle

BINARY

  • SQL Server
  • MySQL

Because the numeric data type is not supported in watsonx.data, you can use the decimal data type as an equivalent for the Netezza data source.

You can now use the BLOB and CLOB data types with the SELECT statement in the Query workspace to build and run queries against your data for Oracle and SingleStore data sources.

You can now use the BLOB and CLOB data types for MySQL and PostgreSQL data sources as equivalents to LONGTEXT, TEXT, and BYTEA, because these data types are not compatible with Presto (Java). If a data source has existing tables with LONGTEXT, TEXT, or BYTEA columns, those columns are mapped to CLOB and BLOB in Presto (Java).

  • MySQL (CLOB as equivalent to LONGTEXT)
  • PostgreSQL (CLOB as equivalent to TEXT)
  • PostgreSQL (BLOB as equivalent to BYTEA)
  • Netezza (decimal as equivalent to numeric)
  • Oracle (BLOB and CLOB with the SELECT statement)
  • SingleStore (BLOB and CLOB with the SELECT statement)
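
The equivalents listed above can be captured in a small lookup table, for example when you check a schema before adding columns. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, and the table contains only the pairs named in this release note, not a complete type mapping.

```python
# (data source, source data type) -> equivalent data type in watsonx.data,
# taken only from the equivalents listed in this release note.
TYPE_EQUIVALENTS = {
    ("MySQL", "LONGTEXT"): "CLOB",
    ("PostgreSQL", "TEXT"): "CLOB",
    ("PostgreSQL", "BYTEA"): "BLOB",
    ("Netezza", "NUMERIC"): "DECIMAL",
}


def equivalent_type(data_source, source_type):
    """Return the listed equivalent type, or None if this note lists none."""
    # Type names are compared case-insensitively; data source names are not.
    return TYPE_EQUIVALENTS.get((data_source, source_type.upper()))


print(equivalent_type("PostgreSQL", "bytea"))
print(equivalent_type("Netezza", "numeric"))
```

A result of None means only that this release note does not list an equivalent for that pair, not that the data type is unsupported.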