Analyzing Apache Hadoop data (Execution Engine for Apache Hadoop)

You can build and train models on a Hadoop cluster. If you have data in a Hive or HDFS storage system on a Hadoop cluster, you can work with that data directly on the Hadoop cluster.

The Execution Engine for Apache Hadoop service is not available by default. An administrator must install this service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

In a Watson Studio analytics project, you can find Hadoop environment definitions on the Environments page. See Hadoop environments.

You can use Hadoop environments in these ways:

You can train a model on the Hadoop cluster by selecting a Hadoop environment in a Jupyter notebook.
You can manage a model on the Hadoop cluster by running Hadoop integration utility methods within a Jupyter notebook.
You can run Data Refinery flows on the Hadoop cluster by selecting a Hadoop environment for the Data Refinery job.

This diagram shows how data scientists working in an analytics project on a Cloud Pak for Data cluster can use a notebook to train a model on a Hadoop cluster with data that resides on that Hadoop cluster.

Figure: Hadoop architecture

Outside of Cloud Pak for Data, you can manage models and data on Hadoop clusters in these ways:

You can manage models on HDFS with Hadoop integration utility methods in a remote Livy session.
You can transfer files between the HDP cluster and the Cloud Pak for Data cluster with Python utility functions.
If your Hadoop cluster includes the Alluxio data orchestration software, you can interact with the data through Alluxio.
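
As a rough illustration of what a remote Livy session involves, the following sketch builds the two requests that the Apache Livy REST API uses to open an interactive session and to submit code to it. It does not use the Hadoop integration utility methods mentioned above. The host name, the port, and the HDFS path are hypothetical. Authentication and any gateway in front of Livy are omitted, so treat this as a starting point under those assumptions rather than a working configuration for a secured cluster.

```python
import json
from urllib import request


def build_session_request(livy_url, kind="pyspark"):
    """Build (but do not send) the POST request that creates a Livy session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return request.Request(
        url=livy_url.rstrip("/") + "/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_statement_request(livy_url, session_id, code):
    """Build (but do not send) the POST request that runs code in a session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return request.Request(
        url=f"{livy_url.rstrip('/')}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req):
    """Send a prepared request and decode the JSON response from Livy."""
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical usage (assumes an unsecured Livy endpoint; left commented out
# so that nothing is sent over the network when this file is loaded):
#
# livy = "http://livy.example.com:8998"
# session = send(build_session_request(livy))
# code = 'spark.read.parquet("hdfs:///models/example").count()'
# result = send(build_statement_request(livy, session["id"], code))
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before anything reaches the cluster. In practice, a new session takes some time to start, so a client would typically poll the session state before submitting statements.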
