Joining data sources (Watson Machine Learning)
When you specify the data for an AutoAI experiment, you can choose to combine two or more data sources that share a common column, or key. You are creating a new data table by combining the data based on the specified join keys.
Service Spark environments are not available by default. An administrator must install the Analytics Engine Powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.
Notes about joining data sources
- Each data source must be a CSV file.
- You can join up to 20 files, with each file less than 4GB and a combined maximum of 20GB.
- If the total size of the joined data is more than 1GB, a sample size of 1GB will be used to train the model.
- The max depth of connections is three. For example, the main source (A) can be connected to source B, which is connected to source C, which in turn is connected to source D. Source D cannot be connected to another source.
- The type of join created is a left join, which returns all records from the left table, and the matching records from the right table.
- Each join must have at least one join key, or common column, specified. If no key is specified, the join is ignored when the experiment runs.
- If you configure more than one join, AutoAI will determine the best order for running the joins.
- After you run the experiment, you can download the joined data to review the schema and to see the new columns added as a result of feature engineering.
Joining data sources
- Specify a name and description for your experiment.
- Select a machine learning service instance and a compute configuration and click Create.
- Choose two or more data files from your project,upload them from your file system, or select them from the asset browser, then press Continue. Tip: Click the Preview icon to review your data.
- When you are done loading data sources, start the configuration process by selecting one of the sources as the main source for the data join.
- Next, click Configure joins to open the canvas for connecting your data sources.
- To create a join, hover over either end of the main source, and drag a connection to another source. A join displays.
- Click the join icon to open the panel for specifying the key. A key is a common field that can connect the data sources. AutoAI identifies and suggests common fields.
- Choose the key to complete the join.
- Repeat steps 6 through 8 to create more joins and keys to connect data sources.
- When you are done, click Save join to return to the experiment configuration, choose a column to predict and run your experiment.
Join examples
These examples show how to create a single join and multiple joins.
Single join
In this example, two data sources are uploaded: group_customer_main.csv and group_customer_customers.csv``. The file group_customer_main.csv is designated as the main source for the data join.

The key for the join is the column group_customer_id. Tip: Use the Schema preview tab in the Join panel to view the column names to help you select a key.
Multiple joins
In this configuration, five tables are joined with four joins, as follows:
| Main source | Joined source | Key |
|---|---|---|
| group_customers_main | group_customers_customers | group_customer_id |
| group_customers_main | group_customers_transactions | transaction_id |
| group_customers_main | group_customers_purchases | group_id |
| group_customers_transactions | group_customers_products | product_id |

When you complete the joins, click Save join. AutoAI configures the join
View and edit join settings
From the create experiment page, click Experiment settings to view and edit these settings for the data join.
Stratified sampling limit
Stratified sampling sorts data into subgroups, or strata, for a more accurate representation of your joined data sources. Optionally increase or decrease the number of rows to include in each strata.
Timestamp columns
Timestamps are used by AutoAI to extract time related features. If your data set includes a date/time column and you enable the timestamp threshold, the join result only includes the data from row before the timestamp threshold to avoid data leakage.
To establish a threshold, enable the option, then choose the timestamp column and choose the type of date/time data it contains.
Feature selectors
Feature selectors are options that help to exclude irrelevant data and improve experiment run time. They include:
- Deduplication (enabled by default). Removes duplicated features.
- Inconsistency (enabled by default). Removes features with inconsistent distribution between random splits.
- Filter (disabled by default). Removes low correlation data for regression problems, or low information gains for classification problems.
Run the experiment
Choose a prediction column and run the experiment. In addition to the infographic for viewing the creation of the pipelines, there is also an infographic and panel for examining the join.

Hover over a join path to view the join keys and the transformations applied to create the join.

From the experiment results page you can also download the joined data to review the schema and see the feature engineering columns.
Next steps
After you review the results of your experiment, use your experiment to generate predictions.
- Save the best pipeline as a model.
- Promote the model to a deployment space.
- Promote or add the data sets you will use to test the model to the space. Note that you must have an input data source that corresponds to each of the training data sources you used to create the experiment.
- Deploy the model.
- Create a batch job, specifying the data sources for input and specifyin a single output location.
- Run the job.
- Review the results.
For an example of deploying an AutoAI experiment with joined data, see Tutorial: Build and deploy a data join experiment.
Learn more
Parent topic: AutoAI