Run a Hadoop application on LSF
Use the bsub command to submit the Hadoop application to LSF.
Before you begin
You must have a Hadoop application compiled and ready to run as a JAR file.
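If you still need to compile and package the application, the following is a minimal sketch, assuming a single-file WordCount.java source and a hadoop command on your path; the file names are illustrative:
# Create the output directory, then compile against the Hadoop client libraries
# (the hadoop classpath subcommand prints them)
mkdir -p wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
# Package the compiled classes into wordcount.jar
jar -cf wordcount.jar -C wordcount_classes .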
Procedure
-
Optional. If you intend to submit multiple MapReduce workloads as a single LSF job, create a script with multiple Hadoop commands.
-
Prepare a script with multiple Hadoop commands.
For example, create a file named mrjob.sh with the following content:
#!/bin/bash
hadoop jar sort.jar /gpfs/mapreduce/data/sort/input /gpfs/mapreduce/data/sort/
hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output
- Change the file permissions on the script to make it executable.
This script runs with the lsfhadoop.sh connector script when it is submitted as an LSF job.
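For example, assuming the mrjob.sh script from the previous step:
chmod +x mrjob.sh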
-
Use the bsub command to submit the Hadoop application to LSF
through the lsfhadoop.sh connector script.
Note: To make Hadoop jobs run efficiently, use the -x option to request an exclusive job allocation.
The lsfhadoop.sh connector script creates a temporary directory under the job's working directory and places data files in this temporary directory. To allow a job-based Hadoop cluster to access these files, you must set the job's working directory to a location in a shared file system. Use the -cwd option to specify the job's working directory.
- To specify a Hadoop command to run the Hadoop workload, use hadoop jar as
a subcommand of the lsfhadoop.sh connector script:
bsub bsub_options lsfhadoop.sh [-h] [--hadoop-dir file_path] [--java-home file_path] [--config-dir file_path] [--use-hdfs] [--debug] hadoop jar jarfile_name jarfile_parameters
- To specify a script that runs multiple Hadoop commands within a single LSF job,
run the script as an argument to the lsfhadoop.sh connector
script:
bsub bsub_options lsfhadoop.sh [-h] [--hadoop-dir file_path] [--java-home file_path] [--config-dir file_path] [--use-hdfs] [--debug] script_name
The following are the lsfhadoop.sh options:
- --hadoop-dir
- Specifies the Hadoop installation directory. The default is the value of the $HADOOP_HOME environment variable.
- --java-home
- Specifies the location of the Java runtime environment. The default is the value of the $JAVA_HOME environment variable.
- --config-dir
- Specifies the Hadoop configuration file directory, if you want to define your own customized Hadoop configuration. The default is the Hadoop installation directory.
- --use-hdfs
- Specifies that the MapReduce job runs on an HDFS file system. By default, lsfhadoop.sh sets up a job-based Hadoop cluster to directly access a shared file system.
- --debug
- Enables the job to retain the intermediate work files for troubleshooting purposes.
- --A
- Specifies the end of the lsfhadoop.sh options and disables further option processing, which allows you to specify generic Hadoop options, such as -D parameter=value, after the lsfhadoop.sh options.
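The following is an illustrative combination of these options, assuming a customized Hadoop configuration directory at /gpfs/mapreduce/conf (a hypothetical path):
bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh --config-dir /gpfs/mapreduce/conf --debug hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output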
For example, to run a wordcount application on Hadoop with three hosts, run the following command:
bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output
To run the mrjob.sh script (which contains multiple hadoop jar commands) on Hadoop with three hosts, run the following:
bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh mrjob.sh
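After you submit the job, you can check its progress with standard LSF commands; the job ID shown here is illustrative:
bjobs -l 12345
bpeek 12345
The bjobs -l command displays detailed status information for the LSF job, and bpeek displays the standard output of the running Hadoop workload.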