Overview of InfoSphere Information Server on Hadoop

You can profile, validate, cleanse, transform, and integrate your big data on Hadoop, an open source framework that can manage large volumes of structured and unstructured data.

InfoSphere® Information Server provides these capabilities on Hadoop:

A graphical development environment that you can use to solve simple to complex data integration challenges by using prebuilt components and avoiding hand coding.
Scalability to manage large volumes of data across many Hadoop data nodes.
The same design platform that other InfoSphere Information Server offerings use, which eliminates the learning curve for integrating and cleansing data on Hadoop.
The flexibility to design data transformation and data quality routines once, and then reuse those assets wherever needed by other developers.
Enterprise-level governance that automatically captures profiling, data validation, cleansing, transformation, and integration logic independent of its run time environment for data lineage and data dependency analysis.
Data locality functionality, which improves performance by starting jobs on nodes where HDFS blocks are available and moving the computation closer to where the data is.
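
The data locality idea in the last item can be illustrated with a minimal sketch. This is not the product's scheduler. The function name `pick_node`, its inputs, and the tie-breaking rule are assumptions made for illustration only. Given the hosts that hold replicas of each HDFS block of an input file, and the set of hosts that currently have free capacity, it prefers the free host that holds the most blocks, so that computation runs where the data already is.

```python
from collections import Counter


def pick_node(block_locations, free_nodes):
    """Pick the free node that holds the most blocks of the input file.

    block_locations: list of lists; each inner list names the hosts that
                     hold a replica of one HDFS block (hypothetical input).
    free_nodes:      set of host names that currently have capacity.
    Returns None when no node is free.
    """
    # Count, per free host, how many blocks of the file are stored locally.
    counts = Counter(
        host
        for replicas in block_locations
        for host in replicas
        if host in free_nodes
    )
    if counts:
        # Most local blocks wins; ties are broken by host name (an assumption).
        return min(counts, key=lambda h: (-counts[h], h))
    # No free node holds a block, so fall back to any free node.
    return min(free_nodes) if free_nodes else None


blocks = [["n1", "n2"], ["n2", "n3"], ["n2", "n4"]]
print(pick_node(blocks, {"n2", "n3"}))
```

In this example, `n2` holds a replica of all three blocks, so it is chosen over `n3`, which holds only one.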

This functionality is available with the following product offerings:

IBM BigIntegrate. Provides data integration features of InfoSphere Information Server.
IBM BigQuality. Provides data quality features of InfoSphere Information Server.
InfoSphere Information Server Enterprise Editions. Provides both data integration and data quality features.

For more information on the features that are included with each offering, see Components in the InfoSphere Information Server suite.

If you are already familiar with InfoSphere Information Server, you can continue working with the tools that you know, such as IBM® InfoSphere DataStage® and QualityStage® Designer, and run your jobs on Hadoop.

To use the functionality on Hadoop, you install the engine tier on a Hadoop edge node in a Hadoop cluster. You can instead install on a full node, but an edge node provides better performance. You configure the product to send jobs to the InfoSphere Information Server engine tier in Hadoop so that the jobs that you create with these tools run on the Hadoop cluster.

The following figure illustrates how the engine tier node communicates with YARN to run a job on the compute nodes on a Hadoop cluster. Open source Apache Hadoop YARN is the framework for job scheduling and cluster resource management.

Figure 1. Work flow on a Hadoop cluster

These steps show how jobs are processed on Hadoop:

Step 1: The conductor process manages the section leader and player processes that run on the InfoSphere Information Server engine. The conductor process on the engine tier receives a job run request for an InfoSphere DataStage or InfoSphere QualityStage job. This job might be generated from an InfoSphere Information Analyzer analysis.
Step 2: The conductor connects to the YARN client, which assigns an Application Master to the job from the pool of available Application Masters that it maintains. If no Application Master is available in the pool, the client starts a new one for this job. The conductor connects to the Application Master and sends the details about the resources that are required for running the job.
Step 3: The Application Master requests resources from the YARN resource manager. The job's processes run in YARN containers, with each container running a section leader and players. The YARN container designates resource requirements such as CPU and memory. When the resources are allocated, the conductor sends the process commands for the section leader to the Application Master, which starts those commands on the allocated resources.
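
The pooling behavior in step 2 and the per-container resource requests in step 3 can be sketched as a small simulation. This is a toy model, not the product's code or the YARN API. The names `YarnClient`, `ApplicationMaster`, `ContainerRequest`, `assign`, `release`, and `plan_containers` are hypothetical, and requesting one container per section leader is an assumption based on the statement that each container runs a section leader and its players.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ApplicationMaster:
    am_id: int
    job: Optional[str] = None  # name of the job currently served, if any


@dataclass
class YarnClient:
    """Hypothetical stand-in for the client that pools Application Masters."""
    idle: List[ApplicationMaster] = field(default_factory=list)
    started: int = 0  # how many Application Masters were started so far

    def assign(self, job: str) -> ApplicationMaster:
        # Step 2: reuse an idle Application Master, or start a new one.
        if self.idle:
            am = self.idle.pop()
        else:
            self.started += 1
            am = ApplicationMaster(am_id=self.started)
        am.job = job
        return am

    def release(self, am: ApplicationMaster) -> None:
        # Return the Application Master to the pool when its job ends.
        am.job = None
        self.idle.append(am)


@dataclass
class ContainerRequest:
    vcores: int      # CPU requirement of the container
    memory_mb: int   # memory requirement of the container


def plan_containers(section_leaders: int, vcores: int, memory_mb: int):
    # Step 3: one request per section leader (an assumption of this sketch).
    return [ContainerRequest(vcores, memory_mb) for _ in range(section_leaders)]


client = YarnClient()
am = client.assign("job_a")        # pool is empty, so a new one is started
client.release(am)
reused = client.assign("job_b")    # the idle Application Master is reused
requests = plan_containers(section_leaders=3, vcores=2, memory_mb=4096)
print(reused.am_id, client.started, len(requests))
```

The second job reuses the Application Master that the first job released, so only one is ever started in this run.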