AutoAI Overview (Watson Machine Learning)
The AutoAI graphical tool in Watson Studio analyzes your data and discovers data transformations, algorithms, and parameter settings that work best for your predictive modeling problem. AutoAI displays the results as model candidate pipelines ranked on a leaderboard for you to choose from.
Services: The Watson Studio, Watson Machine Learning, Watson OpenScale, and other supplemental services are not available by default. An administrator must install these services on the IBM Cloud Pak for Data platform. To determine whether a service is installed, open the Services catalog and check whether the service is enabled.
Required services: Watson Machine Learning, Watson Studio
Data formats:
- Tabular: CSV files with a comma (,) delimiter, for all types of AutoAI experiments
- Connected data from a Networked File System (NFS)
- Data from supported data connections. For details, refer to AutoAI data use.
Data size: Limits on data files correspond to the compute size you choose for an experiment. For details, refer to AutoAI data use.
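Because AutoAI expects comma-delimited tabular data, a quick sanity check before upload can catch delimiter or ragged-row problems early. This is an illustrative standard-library sketch, not part of AutoAI itself; the function name and sample data are hypothetical:

```python
import csv
import io

def check_csv(text, delimiter=","):
    """Parse CSV text and verify that every row has the same number of columns."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    widths = {len(r) for r in rows}
    if len(widths) != 1:
        raise ValueError(f"ragged rows: column counts {sorted(widths)}")
    return header, data

header, data = check_csv("age,city,label\n25,NY,1\n40,SF,0\n")
print(header)     # ['age', 'city', 'label']
print(len(data))  # 2
```

A file that fails a check like this is likely to be rejected or misread when used as experiment training data.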
AutoAI data use
Supported data connections for training AutoAI models:
- Generic JDBC
- Snowflake
- IBM Db2
- MySQL
- Microsoft SQL Server
- PostgreSQL
- IBM Netezza Performance Server
- Amazon S3
- Datastax
- Exasol
- Oracle
- Google BigQuery
- Teradata
- Generic S3
- Files in IBM Cloud Object Storage (S3), including Parquet and Microsoft Excel
- Files in IBM Cloud Object Storage infrastructure
- Files in a Storage Volume Connection
Notes for accessing Microsoft Excel sheets:
- If you add data from a Cloud Object Storage connection, you can select the file or the sheet directly.
- If you add connected data, make sure the connected data points to the Excel sheet you want to use.
If you are using a small compute size, these rules apply:
- The data file size for an experiment using a single data source must be less than 100 MB.
- The maximum total file size for an experiment using joined data must be less than 100 MB.
If you are using a large compute size, these rules apply:
- You can upload a file of up to 1 GB for AutoAI experiments with a single data source. If you connect to a data source that exceeds 1 GB, only the first 1 GB of records is used.
- There are no limits on individual file sizes when you join multiple data sources to train an experiment. However, if the combined training data is more than 1 GB, AutoAI automatically subsamples it and uses 1 GB of the data to train the experiment.
Note: If you are connecting to a database as your data source, the configuration of the database affects data access performance. By default, AutoAI opens 15 parallel connections to the database to speed up the data download. If the database configuration does not permit 15 connections, AutoAI falls back to downloading with a single connection. Configuring the database to accept more connections improves data access performance.
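The parallel-download-with-fallback pattern described in the note can be sketched as follows. This is a simplified illustration, not AutoAI's implementation: `fetch_partition` is a hypothetical stand-in for a database query that reads one slice of a table.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioned fetch: each call reads one slice of the table.
def fetch_partition(partition_id):
    return [f"row-{partition_id}-{i}" for i in range(3)]  # stand-in for a DB query

def download(n_partitions, max_connections=15):
    """Fetch partitions in parallel; fall back to one connection on failure."""
    try:
        with ThreadPoolExecutor(max_workers=max_connections) as pool:
            chunks = list(pool.map(fetch_partition, range(n_partitions)))
    except Exception:
        # The database refused the parallel connections: retry serially.
        chunks = [fetch_partition(p) for p in range(n_partitions)]
    return [row for chunk in chunks for row in chunk]

rows = download(n_partitions=5)
print(len(rows))  # 15 rows across 5 partitions
```

The same idea applies whatever client library the connection uses: open many connections when the server allows it, and degrade gracefully to one when it does not.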
AutoAI experiments with a single data source:
- You can upload a file of up to 1 GB
- If you connect to a data source that exceeds 1 GB, only the first 1 GB of records is used
AutoAI experiments with joined data sources:
- You can upload up to 20 files, with each file less than 4 GB and a combined maximum of 20 GB
AutoAI time series experiments
- If the data source contains a timestamp column, the data must be sampled at a uniform frequency; that is, the difference between the timestamps of adjacent rows is always the same. For example, data can be in increments of one minute, one hour, or one day. The specified timestamp is used to determine the lookback window to improve the model accuracy. Note: If the file size is larger than 1 GB, sort the data in descending order by timestamp, because only the first 1 GB is used to train the experiment.
- If the data source does not contain a timestamp column, make sure the data is sampled at regular intervals and sorted in ascending order by the date/time at which it was sampled; that is, the value in the first row is the oldest, and the value in the last row is the most recent. Note: If the file size is larger than 1 GB, truncate the file so that it is smaller than 1 GB.
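The uniform-frequency and sort-order requirements above can be checked before you upload time series data. This standard-library sketch (the function name is illustrative, not an AutoAI API) validates the ascending-order case:

```python
from datetime import datetime, timedelta

def check_uniform_frequency(timestamps):
    """Verify that rows are in ascending order and equally spaced."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    if parsed != sorted(parsed):
        raise ValueError("timestamps must be sorted in ascending order")
    deltas = {b - a for a, b in zip(parsed, parsed[1:])}
    if len(deltas) != 1:
        raise ValueError(f"non-uniform sampling intervals: {sorted(deltas)}")
    return deltas.pop()

step = check_uniform_frequency(
    ["2023-01-01T00:00:00", "2023-01-01T01:00:00", "2023-01-01T02:00:00"]
)
print(step)  # 1:00:00, i.e. hourly sampling
```

For the large-file case with a timestamp column, the sort check would be reversed (descending order), per the note above.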
For more information on choosing the right tool for your data and use case, refer to Choosing a tool.
Data operations in AutoAI
When you load data to train an AutoAI experiment, you can load a single data file, or you can join multiple data files that share common keys into a single training data set.
For data gathered over a specified date/time range (such as stock prices or temperatures), you can create a time series experiment to predict future activity.
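Joining multiple files on shared keys into one training set, as described above, can be sketched with pandas. The table and column names here are illustrative, not taken from AutoAI:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["east", "west", "east"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2],
                       "amount": [10.0, 15.0, 7.5]})

# An inner join on the common key produces one row per matching pair.
training = customers.merge(orders, on="customer_id", how="inner")
print(training.shape)  # (3, 3)
```

AutoAI performs the join for you when you configure joined data sources; a sketch like this is only useful for understanding (or pre-validating) what the combined training set will look like.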
AutoAI process
Using AutoAI, you can build and deploy a machine learning model with sophisticated training features and no coding. The tool does most of the work for you.
To view the code that created a particular experiment, or interact with the experiment programmatically, you can save an experiment as a notebook.
AutoAI automatically runs the following tasks to build and evaluate candidate model pipelines:
- Data pre-processing
- Automated model selection
- Automated feature engineering
- Hyperparameter optimization
Understanding the AutoAI process
For additional detail on each of these phases, including links to associated research papers and descriptions of the algorithms applied to create the model pipelines, see AutoAI implementation details.
Data pre-processing
Most data sets contain mixed data formats and missing values, but standard machine learning algorithms expect numeric values and no missing values. AutoAI applies various algorithms, or estimators, to analyze, clean, and prepare your raw data for machine learning. It automatically detects and categorizes features based on data type, such as categorical or numerical. Depending on the categorization, it uses hyperparameter optimization to determine the best combination of strategies for missing-value imputation, feature encoding, and feature scaling for your data.
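AutoAI carries out this pre-processing internally; as an illustration of the same idea (impute, encode, and scale per column type), here is a scikit-learn sketch with made-up data. It is not AutoAI's actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25.0, np.nan, 40.0],
                   "city": ["NY", "SF", np.nan]})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

# Route each column type through its own imputation/encoding/scaling chain.
prep = ColumnTransformer([("num", numeric, ["age"]),
                          ("cat", categorical, ["city"])],
                         sparse_threshold=0)
X = prep.fit_transform(df)
print(X.shape)  # 3 rows; 1 scaled numeric column + 2 one-hot columns
```

In AutoAI the choice of imputation strategy, encoder, and scaler is itself tuned by hyperparameter optimization rather than fixed as it is here.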
Automated model selection
The next step is automated model selection to match your data. AutoAI tests and ranks candidate algorithms against small subsets of the data, gradually increasing the subset size for the most promising algorithms to arrive at the best match. This approach saves time without sacrificing performance, and it makes it possible to rank a large number of candidate algorithms and select the one that best matches the data.
For information on how to handle automatically-generated pipelines to select the best model, refer to Selecting an AutoAI model.
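The subset-growing selection strategy can be sketched as a simple tournament: score all surviving candidates on a subset, keep the better half, then double the subset size. This scikit-learn sketch is a rough analogy to the approach described above, not AutoAI's algorithm; the candidate list and subset sizes are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(),
}

subset = 250
while len(candidates) > 1:
    # Score every surviving candidate on the current subset.
    scores = {name: cross_val_score(model, X[:subset], y[:subset], cv=3).mean()
              for name, model in candidates.items()}
    # Keep the better half, then grow the subset for the next round.
    keep = sorted(scores, key=scores.get, reverse=True)[:max(1, len(scores) // 2)]
    candidates = {name: candidates[name] for name in keep}
    subset = min(subset * 2, len(X))

best = next(iter(candidates))
print(best)
```

The payoff is that weak candidates are eliminated cheaply on small subsets, so only the strongest algorithms ever see the full data set.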
Automated feature engineering
Feature engineering attempts to transform the raw data into the combination of features that best represents the problem, to achieve the most accurate prediction. AutoAI uses a unique approach that explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy using reinforcement learning. The result is an optimized sequence of data transformations that best matches the algorithms from the model selection step.
For more information on AutoAI features, refer to AutoAI feature comparison.
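To make the idea of structured, non-exhaustive feature search concrete, here is a much simpler greedy sketch: try candidate transformations (pairwise products) and keep only those that improve cross-validated accuracy. AutoAI's actual search is reinforcement-learning based and far more sophisticated; this example, including the planted interaction term, is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
y = y + 5.0 * X[:, 0] * X[:, 1]  # plant an interaction a product feature can capture

def score(features):
    return cross_val_score(Ridge(), features, y, cv=3).mean()

baseline = score(X)
best_X, best_score = X, baseline
# Greedily try pairwise-product features; keep any that improve the CV score.
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        candidate = np.column_stack([best_X, X[:, i] * X[:, j]])
        s = score(candidate)
        if s > best_score:
            best_X, best_score = candidate, s

print(best_X.shape[1])  # number of features after the greedy search
```

The key contrast with an exhaustive search is that each accepted transformation constrains the next round, keeping the number of evaluated feature combinations manageable.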
Hyperparameter optimization
Finally, a hyperparameter optimization step refines the best-performing model pipelines. AutoAI uses a novel hyperparameter optimization algorithm designed for the costly function evaluations, such as model training and scoring, that are typical in machine learning. This approach enables fast convergence to a good solution despite the long evaluation time of each iteration.
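AutoAI's optimizer itself is not exposed as an API here, but the core idea of budgeted search over a hyperparameter space, where each evaluation is an expensive train-and-score cycle, can be illustrated with scikit-learn's RandomizedSearchCV. The parameter range and iteration budget below are arbitrary:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Sample a fixed budget of configurations instead of exhaustively grid-searching,
# since each evaluation (train + cross-validated score) is expensive.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

AutoAI's algorithm goes further by modeling the objective surface to choose each next configuration, rather than sampling at random, which is what enables fast convergence under a small evaluation budget.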
Next steps
Use your own data to build an AutoAI model.
Parent topic: Analyzing data and building models