Ingesting data by using Spark
You can ingest data into IBM® watsonx.data by using the IBM Analytics Engine (Spark) engine through the web console.
- watsonx.data on Red Hat® OpenShift®
- watsonx.data SaaS on AWS
Before you begin
- You must have the Administrator role and privileges in the catalog to perform ingestion through the web console.
- Add and register IBM Analytics Engine (Spark). See Registering an engine.
- Add buckets for the source data files and target catalog. See Adding a storage-catalog pair.
- Optionally, you can create a schema in the catalog for the target table. See Creating schemas.
- Optionally, you can also create a target table in the schema. See Creating tables.
- To enable your Spark application to work with the watsonx.data catalog and storage, you must have the Metastore admin role. Without the Metastore admin privilege, you cannot ingest data into storage by using the native Spark engine. Add the following configuration to your application payload (see the sketch after this list):
  spark.hive.metastore.client.plain.username=ibmlhapikey
  spark.hive.metastore.client.plain.password=<api-key-of-the-user-which-has-metastore-admin-role>
  spark.hadoop.wxd.apiKey=Basic base64(ibmlhapikey_ibmcloudid:apikey)
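The same properties can also be set programmatically when the Spark session is created. The following is a minimal, illustrative PySpark sketch, not the definitive payload format; the application name and placeholder values are assumptions, and only the property names come from the configuration listed above:

```python
from pyspark.sql import SparkSession

# Placeholder values: replace <api-key> with the API key of a user
# that has the Metastore admin role.
spark = (
    SparkSession.builder.appName("wxd-ingestion")  # hypothetical app name
    # Properties from the payload configuration listed above
    .config("spark.hive.metastore.client.plain.username", "ibmlhapikey")
    .config("spark.hive.metastore.client.plain.password", "<api-key>")
    .config("spark.hadoop.wxd.apiKey", "Basic <base64(ibmlhapikey_ibmcloudid:apikey)>")
    .enableHiveSupport()
    .getOrCreate()
)
```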
Procedure
Limitations
The following are some of the limitations of Spark ingestion:
- Spark ingestion supports only source data files from an object storage bucket. Local files are not supported.
- The default buckets in watsonx.data are not exposed to the Spark engine. Hence, iceberg-bucket and hive-bucket are not supported as the source or target. Use your own MinIO or S3-compatible buckets that are exposed to and accessible by the Spark engine (a minimal ingestion sketch follows this list).
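To illustrate these constraints, the following is a minimal, hypothetical PySpark ingestion sketch that reads a Parquet file from a user-owned S3-compatible bucket and appends it to an existing target table. The bucket name, path, catalog, schema, and table names are placeholders; it assumes the target schema and table were created as described in the prerequisites:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wxd-spark-ingest").getOrCreate()

# Read the source data files from your own S3-compatible bucket
# (local files are not supported).
source_df = spark.read.parquet("s3a://my-source-bucket/data/orders/")

# Append the data to a target table registered in the watsonx.data catalog.
# "my_catalog", "my_schema", and "orders" are placeholders for the catalog,
# schema, and table created in the prerequisites above.
source_df.writeTo("my_catalog.my_schema.orders").append()

spark.stop()
```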