Ingesting data by using Spark

You can ingest data into IBM® watsonx.data through the web console by using the IBM Analytics Engine (Spark) engine.

Applies to: watsonx.data on Red Hat® OpenShift® and watsonx.data SaaS on AWS

Before you begin

  • You must have the Administrator role and privileges on the catalog to perform ingestion through the web console.
  • Add and register IBM Analytics Engine (Spark). See Registering an engine.
  • Add buckets for the source data files and target catalog. See Adding a storage-catalog pair.
  • Optionally, you can create a schema in the catalog for the target table. See Creating schemas.
  • Optionally, you can also create a target table in the schema. See Creating tables.
  • To enable your Spark application to work with the watsonx.data catalog and storage, you must have the Metastore admin role. Without the Metastore admin privilege, you cannot ingest data to storage by using the native Spark engine. Add the following configuration to your application payload (a payload sketch follows this list):
    spark.hive.metastore.client.plain.username=ibmlhapikey
    spark.hive.metastore.client.plain.password=<api-key-of-the-user-which-has-metastore-admin-role>
    spark.hadoop.wxd.apiKey=Basic base64(ibmlhapikey_ibmcloudid:apikey)
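
For reference, the following is a minimal sketch of submitting a Spark application payload that carries these properties, assuming the IBM Analytics Engine serverless Spark applications REST API. The region, instance ID, application path, IAM token, and API key are placeholders, not values from this documentation.

    import base64
    import requests

    # Placeholders -- substitute your own values.
    API_KEY = "<api-key-of-user-with-metastore-admin-role>"
    IBM_CLOUD_ID = "<ibmcloudid>"
    INSTANCE_ID = "<analytics-engine-instance-guid>"
    IAM_TOKEN = "<iam-bearer-token>"

    # Assumed endpoint shape for the Analytics Engine Spark applications API.
    url = ("https://api.us-south.ae.cloud.ibm.com"
           f"/v3/analytics_engines/{INSTANCE_ID}/spark_applications")

    # Base64-encode "ibmlhapikey_<ibmcloudid>:<apikey>" for spark.hadoop.wxd.apiKey.
    token = base64.b64encode(f"ibmlhapikey_{IBM_CLOUD_ID}:{API_KEY}".encode()).decode()

    payload = {
        "application_details": {
            # Hypothetical application location; point this at your own script.
            "application": "cos://<bucket>.<cos-service-name>/ingest_job.py",
            "conf": {
                "spark.hive.metastore.client.plain.username": "ibmlhapikey",
                "spark.hive.metastore.client.plain.password": API_KEY,
                "spark.hadoop.wxd.apiKey": f"Basic {token}",
            },
        }
    }

    response = requests.post(url, headers={"Authorization": f"Bearer {IAM_TOKEN}"}, json=payload)
    response.raise_for_status()
    print(response.json())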

Procedure

  1. Log in to the watsonx.data console.
  2. From the navigation menu, select Data manager and open the Ingest data window in one of the following ways:
    1. Select the Ingestion jobs tab and click Create ingestion job. The Ingest data window opens with an auto-generated job ID.
    2. From the Data objects tab, select a table from the list of tables under the schema and click Ingest data. The Ingest data window opens on the Engine tab with an auto-generated job ID.
  3. If required, modify the auto-generated ingestion job ID in the Enter job ID field.
  4. Select the co-located IBM Analytics Engine (Spark) from the Select engine list. The registered Spark engines are listed here.
  5. Configure Spark driver cores, executor cores, and memory resources. Click Next.
    Note: For IBM Cloud, the Spark driver and executor vCPU-to-memory combinations must be in a 1:2, 1:4, or 1:8 ratio (for example, 1 vCPU with 2, 4, or 8 GB of memory). The default configuration values are prefilled. See Default limits and quotas for Analytics Engine instances.
  6. In the Select file(s) tab, click Select remote files.
  7. From the Bucket drop-down, select the bucket from where you want to ingest the data.
  8. Select the required file type based on the source data. The available options are CSV and Parquet.
  9. From the source directory, select the source data files to be ingested and click Next.
    Note: For CSV files, you can configure the Header, Encoding, Escape character, Field delimiter, and Line delimiter options. The sketch after this procedure shows how these options map to Spark CSV reader options.
  10. In the Target tab, select the target catalog from the Select catalog list.
  11. Select one of the schema options:
    1. Existing schema: To ingest source data into an existing schema. Corresponding target schemas are listed in the Select schema drop-down.
    2. New schema: Enter the target schema name in Schema name to create a new schema from the source data.
  12. Select the corresponding Target table options based on the selection in step 11.
    1. Existing table: To ingest source data into an existing table. Corresponding target tables are listed in the Select table drop-down.
    2. New table: Enter the target table name in Table name to create a new table from the source data.
  13. Click Next.
  14. Validate the details on the Summary page. Click Ingest.
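
The console performs these steps for you. For readers scripting a comparable ingestion job, the following is a minimal PySpark sketch of how the CSV options from step 9 and the target selection from steps 10 through 12 might look. The bucket, directory, catalog, schema, and table names are placeholders.

    from pyspark.sql import SparkSession

    # Assumes a session already configured with the metastore properties
    # shown in "Before you begin".
    spark = SparkSession.builder.appName("csv-ingest-sketch").getOrCreate()

    df = (
        spark.read.format("csv")
        .option("header", "true")      # Header
        .option("encoding", "UTF-8")   # Encoding
        .option("escape", '"')         # Escape character
        .option("sep", ",")            # Field delimiter
        .option("lineSep", "\n")       # Line delimiter
        .load("s3a://<source-bucket>/<source-directory>/")
    )

    # Write to the target catalog, schema, and table chosen in the Target tab.
    df.writeTo("<catalog>.<schema>.<table>").createOrReplace()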

Limitations

Spark ingestion has the following limitations:

  • Spark ingestion supports only source data files from an object storage bucket; local files are not supported.
  • The default buckets in watsonx.data are not exposed to the Spark engine. Hence, iceberg-bucket and hive-bucket are not supported as the source or target. You can use your own MinIO or S3-compatible buckets that are exposed to and accessible by the Spark engine, as in the configuration sketch that follows.
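
The following is a minimal sketch of pointing a Spark session at a self-managed S3-compatible bucket, assuming the standard Hadoop S3A connector. The endpoint, credentials, and bucket names are placeholders.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("s3a-bucket-sketch")
        # Standard Hadoop S3A settings for an S3-compatible endpoint.
        .config("spark.hadoop.fs.s3a.endpoint", "https://<minio-host>:9000")
        .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
        .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")  # MinIO typically requires path-style access
        .getOrCreate()
    )

    # Source files can then be read from the exposed bucket.
    df = spark.read.parquet("s3a://<source-bucket>/<source-directory>/")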