Building an AutoAI model (Watson Machine Learning)

AutoAI automatically prepares data, applies algorithms, and attempts to build model pipelines that are best suited for your data and use case. Learn how to generate the model pipelines that you can save as machine learning models.

Follow these steps to upload data and have AutoAI create the best model for your data and use case.

  1. Collect your input data
  2. Open the AutoAI tool
  3. Specify details of your model and training data and start AutoAI
  4. View the results

Collect your input data

Collect and prepare your training data. For details on allowable data sources, see AutoAI overview.

Open the AutoAI tool

For your convenience, your AutoAI model creation uses the default storage that is associated with your project to store your data and to save model results.

  1. Open a project and click New asset.
  2. Click AutoAI Experiment.

Note: After you create an AutoAI asset, it is displayed on the Assets page for your project in the AutoAI experiment section, so you can return to it.

Specify details of your experiment

  1. Specify a name and description for your experiment.

  2. Select a compute configuration and click Create. The compute configuration specifies the computing resources to allocate to running the experiment. Larger sizes improve training speed and might be required for larger data sources, but cost more than smaller configurations.

  3. Choose data from your project or upload it from your file system or from the asset browser, then press Continue. You can click the Preview icon after the data source name to review your data. Optionally add a second file as holdout data for testing the trained pipelines.
  4. Choose the Column to predict for the data you want the experiment to predict.
    • Based on analyzing a subset of the data set, AutoAI chooses a default model type: binary classification, multiclass classification, or regression. Binary is selected if the target column has two possible values, multiclass if it has a discrete set of 3 or more values, and regression if the target column is a continuous numeric variable. You can override this selection. Note: Creating a classification experiment when there are many unique values in the prediction column is resource-intensive and affects the performance and training time for the experiment. The limit on values to classify is 200.
    • AutoAI chooses a default metric for optimizing. For example, the default metric for a binary classification model is Accuracy.
    • By default, 10 percent of the training data is held out to test the performance of the model.
  5. (Optional) Click Experiment settings to view or customize options for your AutoAI run. To edit the settings for your experiment, click:
    • Data source, where you can adjust:
      • whether to subsample data. If you have a large data set, you can choose to train with a representative sample of the data to speed up pipeline creation. You can specify whether subsampling should be done by a percentage of the training data or by a specified number of rows.
      • whether to automatically detect text. If you enable this option, columns that contain text are transformed into vectors so that AutoAI can better analyze semantic similarity between strings. Note: transforming and comparing strings is a resource-intensive activity and might increase experiment run time. For details, see Creating a text analysis experiment.
      • the percentage of training data vs holdout data. Training data is used to train the model, and holdout data is withheld from training the model and used to measure the performance of the model. You can either split a single data source into training and testing (holdout) data, or you can use a second data file specifically for the testing data.
      • the number of folds, from the default of three folds to a maximum of 10. Cross validation divides training data into folds, or groups, for testing model performance.
      • columns to include. You can choose to include columns with data that supports the prediction column, and exclude irrelevant columns to speed up pipeline performance.
    • Prediction settings, where you can:
      • change the model type. AutoAI selects a model type that best suits a sampling of the data, but you can override it. For example, if the sample data for the prediction column contains only two types of values, AutoAI chooses binary classification as the model type. If you know that there are more than two values in the column, you can override the setting and choose multiclass classification instead. For binary classification models you can also edit the positive class.
      • choose whether to optimize for the best score only, or for the best score in the shortest run time.
      • change the metric to be optimized for the experiment. Note: For a binary classification experiment, if you change the metric to Precision, Average Precision, Recall, or F1, a Positive Class is required. Confirm that the Positive Class is correct or the experiment might generate inaccurate results.
      • optionally specify which algorithms AutoAI should consider for pipeline creation. Only checked algorithms are considered during the model selection phase of the experiment.
      • change the number of algorithms to use to create pipelines. By default, AutoAI chooses the top two performing algorithms of the ones it considers and uses them to generate eight pipelines that you can view and compare. You can change the number of algorithms in the range 1 - 4. For example, if you select three algorithms, AutoAI identifies the top three performing algorithms and uses them to generate a total of 12 pipelines that you can view, compare, and save as models. Note that more pipelines increase the training time for the experiment and use more resources.
    • Runtime settings, where you can review experiment settings or change the compute configuration for the experiment.
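
The default model type described in step 4 can be approximated with a short heuristic. The following Python sketch is illustrative only and is not AutoAI's actual implementation. The function name suggest_model_type, the constant MAX_CLASSES, and the rule that treats numeric values with a fractional part as continuous are assumptions made for this example. Only the two-value rule, the three-or-more rule, and the 200-value limit come from the text above.

```python
from numbers import Real

MAX_CLASSES = 200  # limit on values to classify, as stated in the text


def suggest_model_type(target_values):
    """Suggest a model type from a sample of the prediction column.

    Hypothetical helper for illustration; AutoAI's real logic may differ.
    """
    unique = set(target_values)
    if len(unique) < 2:
        raise ValueError("prediction column needs at least two distinct values")
    if len(unique) == 2:
        return "binary classification"

    # Assumption: a column counts as continuous when every value is numeric
    # and at least one value has a fractional part.
    numeric = all(isinstance(v, Real) and not isinstance(v, bool) for v in unique)
    if numeric and any(not float(v).is_integer() for v in unique):
        return "regression"

    if len(unique) > MAX_CLASSES:
        raise ValueError(
            f"{len(unique)} unique values exceeds the limit of {MAX_CLASSES}; "
            "consider regression or fewer classes"
        )
    return "multiclass classification"


print(suggest_model_type(["yes", "no", "yes"]))
print(suggest_model_type(["red", "green", "blue"]))
print(suggest_model_type([12.5, 7.25, 3.0]))
```

Because the suggestion is based on a sample, it can be wrong for the full data set, which is why the model type can be overridden in Prediction settings. In this sketch, an integer-valued column with more than 200 unique values raises an error rather than being treated as regression; that is a simplification of the example, not documented behavior.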

Click Run Experiment to begin model pipeline creation.
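
As a rough check of what a run will produce, the arithmetic behind the settings above can be sketched in Python. The function plan_run and its parameter names are hypothetical and are not part of any AutoAI API. The factor of four pipelines per algorithm is inferred from the examples in the text (two algorithms yield eight pipelines, three yield 12). The rounding of holdout rows and the equal fold sizes are assumptions of this sketch, and the lower bound of three folds is one reading of "from the default of three folds to a maximum of 10".

```python
PIPELINES_PER_ALGORITHM = 4  # inferred from the text: 2 algorithms -> 8, 3 -> 12


def plan_run(total_rows, holdout_fraction=0.10, folds=3, algorithms=2):
    """Estimate split sizes and pipeline count for one experiment run.

    Hypothetical helper for illustration only.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    if not 3 <= folds <= 10:
        raise ValueError("folds must be between 3 and 10")
    if not 1 <= algorithms <= 4:
        raise ValueError("algorithms must be between 1 and 4")

    # Assumption: the holdout size is rounded to the nearest whole row.
    holdout_rows = round(total_rows * holdout_fraction)
    training_rows = total_rows - holdout_rows
    return {
        "training_rows": training_rows,
        "holdout_rows": holdout_rows,
        # Assumption: folds are equal in size and any remainder is ignored.
        "rows_per_fold": training_rows // folds,
        "pipelines": algorithms * PIPELINES_PER_ALGORITHM,
    }


print(plan_run(1000))
print(plan_run(1000, algorithms=3))
```

With the default settings, a 1,000-row data set gives 900 training rows, 100 holdout rows, and eight pipelines. Raising the number of algorithms to three gives 12 pipelines, at the cost of longer training time.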

An infographic shows you the creation of pipelines for your data. The duration of this phase depends on the size of your data set. A notification message informs you whether the processing time will be brief or require more time. You can work in other parts of the product while the pipelines build.

Building model pipelines

Hover over nodes in the infographic to explore the factors that pipelines share and the properties that make each pipeline unique. For a guide to the data in the infographic, click the Legend tab in the information panel. Or, to see a different view of the pipeline creation, click the Experiment details tab of the notification pane, then click Switch views to view the progress map. In either view, click a pipeline node to view the associated pipeline in the leaderboard.

View the results

When the pipeline generation process completes, you can view the leading model candidates and evaluate them before saving a pipeline as a model.

Next step

Follow the steps in Selecting an AutoAI model for details on how to evaluate the pipelines as model candidates, then save a model.

Preview

Watch this video to see how to run an AutoAI experiment to build a binary classification model.

This video provides a visual method as an alternative to following the written steps in this documentation.

Learn more

AutoAI overview
