AutoAI overview (Watson Machine Learning)
The AutoAI graphical tool analyzes your data and uses algorithms, transformations, and parameter settings to create the best predictive model. AutoAI displays the potential models as model candidate pipelines and ranks them on a leaderboard for you to choose from.
Watson Studio, Watson Machine Learning, Watson OpenScale, and other supplemental services are not available by default. An administrator must install these services on the IBM Cloud Pak for Data platform. To determine whether a service is installed, open the Services catalog and check whether the service is enabled.
Required services:
- Watson Machine Learning
- Watson Studio
Data format: Tabular
- CSV files, with comma (,) delimiter for all types of AutoAI experiments.
- Connected data from Networked File System (NFS).
- Data from supported data connections. For details, refer to AutoAI data use.
Data size: Limits on data files correspond to the compute size that you choose for an experiment. See Choosing a compute configuration for details.
AutoAI data use
Supported data connections for training AutoAI models:
- Amazon S3
- IBM Cloud Databases for DataStax
- Db2
- Exasol
- Generic JDBC
- Generic S3
- Google BigQuery
- Microsoft SQL Server
- MySQL
- PostgreSQL
- Netezza Performance Server
- SingleStoreDB
- Snowflake
- Teradata
- Virtualized data tables created by using Watson Query
- Files in IBM Cloud Object Storage infrastructure
- Files in a Storage Volume Connection
- Files in IBM Cloud Object Storage (S3), including Parquet and Microsoft Excel
Notes for accessing Microsoft Excel sheets:
- If you add data from the Cloud Object Storage connection, you can select the file or the sheet directly.
- If you add connected data, make sure the connected data points to the Excel sheet you want to use.
Notes for integrating with Git projects:
- You can train experiments and save the resulting models to Git-based projects.
- The saved models can be shared with project collaborators, but the training assets and associated metadata are not saved.
- To add an AutoAI model from a Git-based project to a deployment space, create an export ZIP file for the project that contains the model. Then, follow the steps in Importing space and project assets into deployment spaces in Watson Machine Learning to import the ZIP file into a space.
Choosing a compute configuration
When you configure your experiment, you can choose a compute configuration that supplies the computing resources for your experiment. The configuration that you choose governs data limits.
| Compute configuration | Resources | Data limits |
|---|---|---|
| Small | 2 vCPUs and 8 GB RAM | If you use a data file, data connection, or connected data asset (for example, a database table) that exceeds 100 MB, a 100 MB random sample is used. Note: Incremental learning is not available with this configuration. |
| Medium | 4 vCPUs and 16 GB RAM | If you use a data file, data connection, or connected data asset (for example, a database table) that exceeds 100 MB, a 100 MB random sample is used. |
| Large | 8 vCPUs and 32 GB RAM | If you use a data file, data connection, or connected data asset that exceeds 1 GB, a 0.7 GB random sample is used. |
| Extra Large | 16 vCPUs and 64 GB RAM | If you use a data file, data connection, or connected data asset that exceeds 1 GB, a 1 GB random sample is used. |
Notes:
- You can change how your training data is sampled from the Experiment Settings.
- If you are using a larger data source (up to 100 GB), the AutoAI experiment can train with all of the data by using ensembling and incremental learning.
- If you connect to a database as your data source, the database configuration affects data access performance. By default, AutoAI opens 4 parallel connections to the database to speed up the data download. If the database does not permit 4 connections, AutoAI falls back to downloading over a single connection. Configuring the database to accept more connections improves data access performance.
- The maximum number of feature columns for a classification or regression experiment is 5000.
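The parallel-download behavior described in the notes can be sketched as follows. This is a hypothetical illustration, not AutoAI's implementation: `open_connections` and the partition "fetch" are toy stand-ins that simulate a database granting or refusing concurrent sessions.

```python
# Sketch of the strategy: request 4 parallel connections, and fall back to a
# single connection if the database does not grant all of them.
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # AutoAI's default number of parallel connections

def open_connections(limit, wanted=MAX_PARALLEL):
    """Simulate a database that permits at most `limit` concurrent sessions."""
    granted = min(wanted, limit)
    # Fall back to one connection if we cannot get everything we asked for.
    return granted if granted == wanted else 1

def download_table(partitions, db_connection_limit):
    n_conn = open_connections(db_connection_limit)
    with ThreadPoolExecutor(max_workers=n_conn) as pool:
        # Each worker fetches one partition; "fetching" here is just identity.
        rows = list(pool.map(lambda p: p, partitions))
    return n_conn, rows

conns, rows = download_table(list(range(8)), db_connection_limit=4)
print(conns)   # 4: all parallel connections granted
conns2, _ = download_table(list(range(8)), db_connection_limit=2)
print(conns2)  # 1: falls back to a single connection
```

In a real deployment, increasing the database's connection limit to at least 4 lets the download proceed in parallel.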
AutoAI process
Using AutoAI, you can build and deploy a machine learning model with sophisticated training features and no coding. The tool does most of the work for you.
To view the code that created a particular experiment, or interact with the experiment programmatically, you can save an experiment as a notebook.
AutoAI automatically runs the following tasks to build and evaluate candidate model pipelines:
- Data pre-processing
- Automated model selection
- Automated feature engineering
- Hyperparameter optimization
- Ensembling and incremental learning
Understanding the AutoAI process
For additional detail on each of these phases, including links to associated research papers and descriptions of the algorithms applied to create the model pipelines, see AutoAI implementation details.
Data pre-processing
Most data sets contain different data formats and missing values, but standard machine learning algorithms work only with numbers and no missing values. Therefore, AutoAI applies various algorithms, or estimators, to analyze, clean, and prepare your raw data for machine learning. This technique automatically detects and categorizes values based on features such as data type: categorical or numerical. Depending on the categorization, AutoAI uses hyperparameter optimization to determine the best combination of strategies for missing value imputation, feature encoding, and feature scaling for your data.
Automated model selection
AutoAI uses automated model selection to identify the best model for your data. This novel approach tests potential models against small subsets of the data and ranks them based on accuracy. AutoAI then selects the most promising models and increases the size of the data subset until it identifies the best match. This approach saves time and improves performance by gradually narrowing down the potential models based on accuracy.
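The subset-and-rank idea can be illustrated with a toy allocation loop: score all candidates on a small sample, keep the top performers, then re-score the survivors on progressively more data. The models and scores below are synthetic stand-ins, not AutoAI's estimators or ranking metric.

```python
# Hedged sketch: successively larger data subsets narrow the candidate pool.
import random

random.seed(0)

def score(model, n_rows):
    # Pretend each model has a true quality that shows through more clearly
    # as the evaluation sample grows (noise shrinks with sample size).
    noise = random.uniform(-0.2, 0.2) / n_rows ** 0.5
    return model["quality"] + noise

candidates = [{"name": f"model_{i}", "quality": q}
              for i, q in enumerate([0.60, 0.72, 0.68, 0.81])]

subset = 100
while len(candidates) > 1:
    ranked = sorted(candidates, key=lambda m: score(m, subset), reverse=True)
    candidates = ranked[: max(1, len(ranked) // 2)]  # keep the top half
    subset *= 4  # evaluate survivors on a larger data subset

print(candidates[0]["name"])  # "model_3", the highest-quality candidate
```

Weak candidates are eliminated cheaply on small samples, so expensive full-data evaluation is reserved for the most promising models.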
For information on how to review the automatically generated pipelines and select the best model, refer to Selecting an AutoAI model.
Automated feature engineering
Feature engineering identifies the most accurate model by transforming raw data into a combination of features that best represent the problem. This unique approach explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy by using reinforcement learning. This technique results in an optimized sequence of transformations for the data that best matches the algorithms of the model selection step.
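A simplified view of feature construction: try candidate transformations of a raw column and keep those that improve a fitness measure. The transformations and the correlation-based scoring below are toy choices for illustration, not AutoAI's search procedure.

```python
# Sketch: candidate transformations scored against the target.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = np.log(x)  # target has a log-shaped relationship to the raw feature

def fitness(feature, target):
    # Absolute correlation as a simple stand-in for model accuracy gain.
    return abs(np.corrcoef(feature, target)[0, 1])

candidates = {"identity": x, "log": np.log(x),
              "square": x ** 2, "sqrt": np.sqrt(x)}
baseline = fitness(x, y)
kept = {name for name, f in candidates.items() if fitness(f, y) > baseline}
best = max(candidates, key=lambda name: fitness(candidates[name], y))
print(best)  # "log" recovers the generating transformation
```

AutoAI performs this exploration over sequences of transformations, guided by reinforcement learning rather than exhaustive enumeration.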
Hyperparameter optimization
Hyperparameter optimization refines the best performing models. AutoAI uses a novel hyperparameter optimization algorithm that is designed for costly function evaluations, such as model training and scoring, which are typical in machine learning. This approach quickly converges to a good solution despite the long evaluation time of each iteration.
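The shape of the search loop can be illustrated with a simple random search over a stand-in objective. AutoAI's actual optimizer is more sample-efficient; this sketch only shows the loop structure (propose a configuration, evaluate the expensive objective, keep the best), not the algorithm itself.

```python
# Toy hyperparameter search; the objective is a cheap stand-in for
# "train and score a model", peaking near learning_rate=0.1, depth=6.
import random

random.seed(42)

def expensive_objective(learning_rate, depth):
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 6) ** 2

best_score, best_params = float("-inf"), None
for _ in range(20):
    params = {"learning_rate": random.uniform(0.001, 0.5),
              "depth": random.randint(2, 12)}
    s = expensive_objective(**params)
    if s > best_score:
        best_score, best_params = s, params

print(best_score <= 0.0)  # True: the objective is bounded above by 0
```

Because each real evaluation means training a model, a sample-efficient optimizer that needs fewer iterations than random search pays off quickly.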
Ensembling and incremental learning
In this phase, AutoAI builds BatchedTreeEnsemble pipelines on top of the ranked pipelines. The ensemble pipelines provide incremental learning capabilities and can continue training by using the remaining data in a subsampled source, dividing the remaining data into batches if needed. Each batch of training data is scored independently by using the optimized metric, so you can review the performance of each batch when you explore the results. For details, see Incremental learning.
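Batched incremental training can be sketched with scikit-learn's `partial_fit` interface. `SGDClassifier` is an illustrative estimator chosen because it supports incremental updates; it is not the BatchedTreeEnsemble that AutoAI uses, and the per-batch scoring here is a simplified stand-in for scoring with the optimized metric.

```python
# Sketch: consume training data in batches, scoring each batch as it arrives.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SGDClassifier(random_state=0)
classes = np.unique(y)
batch_scores = []
for start in range(0, len(X), 1000):          # divide remaining data into batches
    Xb, yb = X[start:start + 1000], y[start:start + 1000]
    model.partial_fit(Xb, yb, classes=classes)
    batch_scores.append(model.score(Xb, yb))  # score each batch independently

print(len(batch_scores))  # 3: one score per batch to review
```

Recording a score per batch is what lets you inspect, batch by batch, whether continued training on the remaining data is still improving the model.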
Next steps
AutoAI tutorial: Build a Binary Classification Model
Parent topic: Analyzing data and building models