Machine learning models on a remote Hadoop cluster

You can build and train models on a remote Hadoop cluster by using the utility functions available in Watson Studio Local and the Hadoop Integration Service. You can then transfer models built on Hadoop to Watson Studio Local to leverage all of its model management capabilities.

Watson Studio Local supports the following machine learning model types on Hadoop:

  • Spark ML
  • scikit-learn
  • XGBoost
  • Keras

See Working with models and custom classes for more information.

Choose the environment

Environment in Watson Studio Local
When you load a model built on Hadoop into Watson Studio Local, ensure that the version of Spark on the Hadoop cluster is compatible with the version of Spark in the Jupyter environment. If the Spark versions are incompatible, you get the following error when loading the model: Py4JJavaError: An error occurred while calling o34.load.
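A simple guard in the notebook can catch this mismatch before the load fails. The sketch below assumes (this is not guaranteed by the product) that versions sharing the same major.minor are compatible; adjust the rule for your clusters, and in a live session compare `spark.version` values rather than literals.

```python
# Sketch: guard against Spark version mismatches before loading a model.
# Assumption: matching major.minor versions are treated as compatible.

def spark_versions_compatible(local_version: str, hadoop_version: str) -> bool:
    """Return True if the two Spark versions share the same major.minor."""
    local = local_version.split(".")[:2]
    remote = hadoop_version.split(".")[:2]
    return local == remote

# In a live session, compare the Jupyter environment's spark.version
# with the cluster's Spark version instead of these literals.
print(spark_versions_compatible("2.1.1", "2.1.0"))
print(spark_versions_compatible("2.1.1", "2.3.0"))
```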
Virtual Environment image on Hadoop
The notebooks that build, train, batch score, and evaluate models on Hadoop might need to use one of the following options to ensure the required packages are available during execution:
  • If the packages are present in the Jupyter Environment in Watson Studio Local, and are available on Hadoop as a virtual environment image, then choose the virtual environment image for the Spark session. Learn more.
  • If the packages are installed by a Watson Studio Local user, then save the environment, push the virtual environment image to Hadoop, and choose the virtual environment image for the Spark session. Learn more.
  • If you don't choose a virtual environment image, then the Spark session runs with the packages natively available on Hadoop.
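Under the covers, choosing a virtual environment image boils down to standard Spark-on-YARN settings: ship a packed environment archive to the executors and point the Python workers at its interpreter. Watson Studio Local's session setup normally fills these in for you; the archive path and `VENV` alias below are hypothetical, shown only to illustrate the mechanism.

```python
# Sketch of the generic Spark-on-YARN configuration behind a virtual
# environment image. The HDFS path and "VENV" alias are hypothetical.

venv_archive = "hdfs:///user/envs/my-venv.tar.gz#VENV"  # hypothetical path

spark_conf = {
    # Ship the packed environment to every executor, unpacked as VENV.
    "spark.yarn.dist.archives": venv_archive,
    # Run Python workers with the interpreter inside the shipped archive.
    "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "VENV/bin/python",
    "spark.executorEnv.PYSPARK_PYTHON": "VENV/bin/python",
}

# These settings would be applied when building the SparkSession, e.g.:
#   builder = SparkSession.builder
#   for key, value in spark_conf.items():
#       builder = builder.config(key, value)
print(spark_conf)
```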

Build and train models on Hadoop

Use the Hadoop integration utility methods to build, train, and save models on a remote Hadoop cluster. See the following sample notebooks for guidance:
Note: If you're working with HDFS or Hive data sets within a notebook or RStudio environment, using the Insert to code feature is not recommended. Refer to the sample notebooks to learn how to work with HDFS and Hive data.
  • "Using Watson Studio Machine Learning Service for Model Training and Making Predictions on Hadoop Data" (requires the numpy library)
  • "Modeling Weather Geographies using Scikit Learn on Hadoop"
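As a minimal local illustration of the train-and-save step such a notebook performs, the sketch below trains a scikit-learn model (one of the supported model types) on toy data and serializes it. The data, file name, and use of pickle are illustrative; on the cluster, the Hadoop integration utility methods manage training and storage for you.

```python
# Minimal local sketch of the train-and-save step of a model notebook.
# The toy data and file name are illustrative, not from the product docs.
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Toy training data: two features, binary label (label equals feature 2).
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]

model = LogisticRegression().fit(X, y)

# Serialize the trained model so it can be saved as a project asset.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Reload and sanity-check a prediction.
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)
print(restored.predict([[0.0, 1.0]]))
```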

Transfer models to Watson Studio Local

Use the Hadoop integration utility methods to transfer models from a remote Hadoop cluster to Watson Studio Local.

After a model is built and trained on Hadoop, and saved as a model asset with your project, you can generate scripts for batch scoring and evaluation of the model on the Hadoop cluster. You can also perform real-time scoring, create model groups, and compare models with Watson Studio Local.

Batch score on Hadoop

To run batch prediction jobs that read in a data set, score the data, and output the predictions in a CSV file on HDFS or to a Hive table, complete the following steps:

  1. In the Models page of your project, click Batch score next to the model.

    Batch score

  2. Specify the details to generate the batch script:
    • For Spark cluster, select the Hadoop system on which you want to batch score the model.
    • If you want to use a virtual environment pushed to the Hadoop cluster, select it from Virtual environment.
    • In the Input data set field, select the data set for batch scoring. You can see only the data sets that are associated with the Hadoop system you selected.
    • Select the output data set type. You can select HDFS Output Directory or Hive table.
    • Select an output update action.

    Click Generate batch script. Watson Studio Local automatically generates a Python script that you can edit directly in the Result view.

  3. Click Run now to immediately create and run a job for the script. Alternatively, click Advanced settings to save the script as either a .py script or a .ipynb notebook in your project (ensure the file name is unique). Later, from the Jobs page of your project, you can create a scheduled job for the saved script or notebook with Type set to Batch scoring.

    Create job

Check the log to verify that the job completed successfully. An output CSV file on HDFS or a Hive table should appear on the Hadoop cluster.
Tip: If the job reports Success but no CSV file was written, the job might still have failed. Validate that the input table exists by using the remote data set in a notebook.

From the job details page, you can click on each run to view results and logs. You can also view a batch scoring history from the model details.
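In essence, the generated batch-scoring script reads the input data set, scores every row, and writes the rows with an appended prediction column. The local sketch below mimics that flow with scikit-learn and plain CSV files; on the cluster the generated script does the same with Spark against HDFS or Hive, and the file names and toy model here are illustrative.

```python
# Local sketch of a batch-scoring flow: read input rows, score them,
# write a predictions CSV. File names and the toy model are illustrative.
import csv
import os
import tempfile

from sklearn.linear_model import LogisticRegression

# Stand-in for the saved model asset (boundary near feature value 1.5).
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.csv")
output_path = os.path.join(workdir, "predictions.csv")

# Stand-in for the remote input data set.
with open(input_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["feature"])
    writer.writerows([[0.5], [2.5]])

# Score each row and append the prediction column, as the batch job does.
with open(input_path, newline="") as src, \
        open(output_path, "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow(header + ["prediction"])
    for row in reader:
        pred = model.predict([[float(value) for value in row]])[0]
        writer.writerow(row + [int(pred)])

print(open(output_path).read())
```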

Evaluate on Hadoop

To evaluate the performance of a model, complete the following steps:

  1. In the Models page of your project, click Evaluate next to the model.

    Evaluate

  2. Specify the details to generate the evaluation script:
    • For Spark cluster, select the Hadoop system on which you want to evaluate the model.
    • If you want to leverage a virtual environment pushed to the Hadoop cluster, select it from Virtual environment.
    • In the Input data set field, select the data set for evaluation. You can see only the data sets associated with the Hadoop system you selected.
    • For each evaluator, you can customize the threshold metric and specify what fraction of the overall data must be relevant for the model to be considered healthy. For Spark 2.1 model evaluations, the output data set field is ignored.

    Click Generate evaluation script. Watson Studio Local automatically generates a Python script that you can edit directly in the Result view.

  3. Click Run now to immediately create and run a job for the script. Alternatively, click Advanced settings to save the script as either a .py script or a .ipynb notebook in your project (ensure the file name is unique). Later, from the Jobs page of your project, you can create a scheduled job for the saved script or notebook with Type set to Model evaluation.

    Create job

From the job details page, you can click on each run to view results and logs. Go to the model details page to view the evaluation history.
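The core of such an evaluation job is a health check: compute a metric over the scored data and compare it with the configured threshold. The sketch below uses accuracy as the metric and an arbitrary threshold purely for illustration; the actual evaluators and their metrics come from the generated script.

```python
# Sketch of an evaluation health check: compute a metric over scored data
# and compare it with a threshold. Metric and threshold are illustrative.

def evaluate(labels, predictions, threshold=0.8):
    """Return (metric, healthy) for a simple accuracy-based evaluator."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    accuracy = correct / len(labels)
    return accuracy, accuracy >= threshold

# Example: 4 of 5 predictions match, so accuracy is 0.8, above threshold.
metric, healthy = evaluate([1, 0, 1, 1, 0], [1, 0, 1, 0, 0], threshold=0.75)
print(metric, healthy)  # 0.8 True
```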

Real-time score

In the Models page of your project, click Real-time score next to the model to enter input data and view simulated predictions as a pie chart or bar graph.

Real-time score

Compare models

See Compare model versions in a model group.