Joining data sources (Watson Machine Learning)

When you specify the data for an AutoAI experiment, you can choose to combine two or more data sources that share a common column, or key. The result is a new data table that combines the data based on the join keys you specify.

Spark environments are not available by default. An administrator must install the Analytics Engine Powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Notes about joining data sources

Each data source must be a CSV file.
You can join up to 20 files, with each file less than 4GB and a combined maximum of 20GB.
If the total size of the joined data is more than 1GB, a 1GB sample is used to train the model.
The maximum depth of connections is three. For example, the main source (A) can be connected to source B, which is connected to source C, which in turn is connected to source D. Source D cannot be connected to another source.
The type of join created is a left join, which returns all records from the left table, and the matching records from the right table.
Each join must have at least one join key, or common column, specified. If no key is specified, the join is ignored when the experiment runs.
If you configure more than one join, AutoAI will determine the best order for running the joins.
After you run the experiment, you can download the joined data to review the schema and to see the new columns added as a result of feature engineering.
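
The file limits in the notes above can be checked before uploading. The following is a minimal sketch, not part of AutoAI: the function name `check_sources` is hypothetical, and it assumes that "GB" means binary gigabytes, which the notes do not specify.

```python
GB = 1024 ** 3  # assumption: binary gigabytes; the notes only say "GB"


def check_sources(sizes):
    """sizes maps a file name to its size in bytes; returns a list of problems."""
    problems = []
    if len(sizes) > 20:
        problems.append(f"{len(sizes)} files selected; the limit is 20")
    for name, size in sizes.items():
        if not name.lower().endswith(".csv"):
            problems.append(f"{name}: not a CSV file")
        if size >= 4 * GB:
            problems.append(f"{name}: must be less than 4GB")
    if sum(sizes.values()) > 20 * GB:
        problems.append("combined size exceeds 20GB")
    return problems


# Hypothetical file sizes for illustration.
print(check_sources({"main.csv": 1 * GB, "customers.csv": 5 * GB}))
```

An empty list means the selection satisfies the file type, file count, and size notes. The check does not cover join depth or join keys.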

Joining data sources

1. Specify a name and description for your experiment.
2. Select a machine learning service instance and a compute configuration and click Create.
3. Choose two or more data files from your project, upload them from your file system, or select them from the asset browser, then click Continue. Tip: Click the Preview icon to review your data.
4. When you are done loading data sources, start the configuration process by selecting one of the sources as the main source for the data join.
5. Click Configure joins to open the canvas for connecting your data sources.
6. To create a join, hover over either end of the main source, and drag a connection to another source. A join displays.
7. Click the join icon to open the panel for specifying the key. A key is a common field that can connect the data sources. AutoAI identifies and suggests common fields.
8. Choose the key to complete the join.
9. Repeat steps 6 through 8 to create more joins and keys to connect data sources.
10. When you are done, click Save join to return to the experiment configuration, then choose a column to predict and run your experiment.

Join examples

These examples show how to create a single join and multiple joins.

Single join

In this example, two data sources are uploaded: group_customer_main.csv and group_customer_customers.csv. The file group_customer_main.csv is designated as the main source for the data join.

[Figure: Single join]

The key for the join is the column group_customer_id. Tip: Use the Schema preview tab in the Join panel to view the column names to help you select a key.
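
To make the left-join behavior concrete, here is a minimal sketch in plain Python. It is not how AutoAI runs the join internally. Only the key name `group_customer_id` comes from the example; the sample rows, the `target` and `region` columns, and the helper `left_join` are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; only the key name comes from the example.
MAIN_CSV = """group_customer_id,target
1,yes
2,no
3,yes
"""
CUSTOMERS_CSV = """group_customer_id,region
1,north
3,south
"""


def left_join(left_rows, right_rows, key):
    """Keep every left row; attach matching right rows, or blanks if none match."""
    right_columns = [c for c in (right_rows[0] if right_rows else {}) if c != key]
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        # A left row with no match still appears once, with empty right columns.
        matches = index.get(left[key]) or [dict.fromkeys(right_columns, "")]
        for match in matches:
            merged = dict(left)
            merged.update({c: match[c] for c in right_columns})
            joined.append(merged)
    return joined


main_rows = list(csv.DictReader(io.StringIO(MAIN_CSV)))
customer_rows = list(csv.DictReader(io.StringIO(CUSTOMERS_CSV)))
result = left_join(main_rows, customer_rows, "group_customer_id")
for row in result:
    print(row)
```

The row with key 2 has no match in the second source, so it is kept with an empty `region` value. This is the defining property of a left join.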

Multiple joins

In this configuration, five tables are joined with four joins, as follows:

Main source	Joined source	Key
group_customers_main	group_customers_customers	group_customer_id
group_customers_main	group_customers_transactions	transaction_id
group_customers_main	group_customers_purchases	group_id
group_customers_transactions	group_customers_products	product_id

[Figure: Multiple joins]

When you complete the joins, click Save join. AutoAI configures the join.
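
The configuration above can also be read as a small tree rooted at the main source. The sketch below, which is illustrative only and not an AutoAI API, records the four joins from the table and computes how far each source is from the main source, so the result can be compared against the maximum connection depth of three. The helper name `connection_depths` is hypothetical.

```python
from collections import deque

# The four joins from the table above: (main source, joined source, key).
JOINS = [
    ("group_customers_main", "group_customers_customers", "group_customer_id"),
    ("group_customers_main", "group_customers_transactions", "transaction_id"),
    ("group_customers_main", "group_customers_purchases", "group_id"),
    ("group_customers_transactions", "group_customers_products", "product_id"),
]


def connection_depths(joins, main):
    """Breadth-first walk from the main source; returns {source: depth}."""
    children = {}
    for left, right, _key in joins:
        children.setdefault(left, []).append(right)
    depths = {main: 0}
    queue = deque([main])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depths:
                depths[child] = depths[node] + 1
                queue.append(child)
    return depths


depths = connection_depths(JOINS, "group_customers_main")
for source, depth in depths.items():
    print(f"{'  ' * depth}{source} (depth {depth})")
```

In this configuration the deepest source, group_customers_products, sits two connections away from the main source, which is within the limit.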

View and edit join settings

From the create experiment page, click Experiment settings to view and edit these settings for the data join.

Stratified sampling limit

Stratified sampling sorts data into subgroups, or strata, for a more accurate representation of your joined data sources. Optionally increase or decrease the number of rows to include in each stratum.
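
As a rough illustration of a per-stratum row limit, the following sketch groups rows by a stratum value and keeps at most a fixed number of rows from each group. It is an assumption about the general technique, not AutoAI's actual sampling algorithm. The function `stratified_sample` and the `label` column are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, stratum_of, limit, seed=0):
    """Keep at most `limit` randomly chosen rows from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for members in strata.values():
        if len(members) > limit:
            members = rng.sample(members, limit)
        sample.extend(members)
    return sample


# Hypothetical data: 10 rows labeled "yes" and 2 rows labeled "no".
rows = [{"label": "yes"}] * 10 + [{"label": "no"}] * 2
sample = stratified_sample(rows, lambda r: r["label"], limit=3)
print(len(sample))
```

Raising the limit keeps more rows from the large strata. Small strata are kept whole, so rare groups are not lost.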

Timestamp columns

AutoAI uses timestamps to extract time-related features. If your data set includes a date/time column and you enable the timestamp threshold, the join result includes only the data from rows before the timestamp threshold, to avoid data leakage.

To establish a threshold, enable the option, then choose the timestamp column and choose the type of date/time data it contains.
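
The idea behind the threshold can be sketched as a simple filter that keeps only rows dated before a cutoff. This is a simplified illustration that assumes a single cutoff value and a known date format. The `purchased_at` column, the sample rows, and the helper `rows_before` are hypothetical, and the way AutoAI applies the threshold internally may differ.

```python
from datetime import datetime


def rows_before(rows, column, threshold, fmt="%Y-%m-%d %H:%M:%S"):
    """Keep rows whose timestamp in `column` is strictly earlier than `threshold`."""
    return [row for row in rows if datetime.strptime(row[column], fmt) < threshold]


# Hypothetical transaction rows and cutoff.
transactions = [
    {"transaction_id": "t1", "purchased_at": "2021-01-05 10:00:00"},
    {"transaction_id": "t2", "purchased_at": "2021-03-01 09:30:00"},
]
cutoff = datetime(2021, 2, 1)
kept = rows_before(transactions, "purchased_at", cutoff)
print([row["transaction_id"] for row in kept])
```

Rows dated on or after the cutoff are dropped, so information recorded after the cutoff cannot leak into the engineered features.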

Feature selectors

Feature selectors are options that help to exclude irrelevant data and improve experiment run time. They include:

Deduplication (enabled by default). Removes duplicated features.
Inconsistency (enabled by default). Removes features with inconsistent distribution between random splits.
Filter (disabled by default). Removes low-correlation data for regression problems, or data with low information gain for classification problems.
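
As an illustration of what a deduplication selector does in general, the sketch below drops any column whose values exactly repeat an earlier column. The source does not define "duplicated features" precisely, so exact equality of values is an assumption. The function `drop_duplicate_features` and the column names are hypothetical.

```python
def drop_duplicate_features(table):
    """table maps a column name to its list of values.

    Returns a new dict that keeps the first of any columns with identical values.
    """
    seen = set()
    kept = {}
    for name, values in table.items():
        signature = tuple(values)
        if signature in seen:
            continue  # an earlier column already carries these exact values
        seen.add(signature)
        kept[name] = values
    return kept


# Hypothetical engineered features; "sum_amount_copy" repeats "sum_amount".
features = {
    "sum_amount": [10, 20, 30],
    "sum_amount_copy": [10, 20, 30],
    "count_orders": [1, 2, 1],
}
print(list(drop_duplicate_features(features)))
```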

Run the experiment

Choose a prediction column and run the experiment. In addition to the infographic for viewing the creation of the pipelines, there is also an infographic and panel for examining the join.

[Figure: Join infographic]

Hover over a join path to view the join keys and the transformations applied to create the join.

[Figure: Join transformation]

From the experiment results page, you can also download the joined data to review the schema and see the feature engineering columns.

Next steps

After you review the results of your experiment, use your experiment to generate predictions:

Save the best pipeline as a model.
Promote the model to a deployment space.
Promote or add the data sets you will use to test the model to the space. Note that you must have an input data source that corresponds to each of the training data sources you used to create the experiment.
Deploy the model.
Create a batch job, specifying the data sources for input and specifying a single output location.
Run the job.
Review the results.
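
Before creating the batch job, it can help to confirm that every training data source has a corresponding input. The following is a minimal sketch of that check. The helper `missing_inputs`, the mapping structure, and the file name `main_test.csv` are hypothetical and are not part of any Watson Machine Learning API.

```python
def missing_inputs(training_sources, batch_inputs):
    """List training sources with no matching entry in a batch job's inputs."""
    return [name for name in training_sources if name not in batch_inputs]


training_sources = ["group_customers_main", "group_customers_customers"]
# Hypothetical mapping of training source -> asset promoted to the deployment space.
batch_inputs = {"group_customers_main": "main_test.csv"}
print(missing_inputs(training_sources, batch_inputs))
```

An empty list means that each training source has a matching input and the batch job can be defined.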

For an example of deploying an AutoAI experiment with joined data, see Tutorial: Build and deploy a data join experiment.

Learn more

AutoAI overview