Run a Hadoop application on LSF
Use the bsub command to submit the Hadoop application to LSF.
Before you begin
You must have a Hadoop application compiled and ready to run as a JAR file.
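If you still need to compile and package the application, the following is a minimal sketch, assuming a single-file WordCount.java source and a hadoop command on your path; the file names are illustrative:
# Create the output directory, then compile against the Hadoop client libraries
# (the hadoop classpath subcommand prints them)
mkdir -p wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
# Package the compiled classes into wordcount.jar
jar -cf wordcount.jar -C wordcount_classes .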
Procedure
-
Optional. If you intend to submit multiple MapReduce workloads as a single LSF job, create a script with multiple Hadoop commands.
-
Prepare a script with multiple Hadoop commands.
For example, create a file named mrjob.sh with the following content:
#!/bin/bash
hadoop jar sort.jar /gpfs/mapreduce/data/sort/input /gpfs/mapreduce/data/sort/
hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output
- Change the file permissions on the script to make it executable.
This script runs with the lsfhadoop.sh connector script when it is submitted as an LSF job.
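For example, assuming the mrjob.sh script from the previous step:
chmod +x mrjob.sh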
-
Use the bsub command to submit the Hadoop application to LSF
through the lsfhadoop.sh connector script.
Note: To make Hadoop jobs run efficiently, use the -x option to request an exclusive job allocation.
The lsfhadoop.sh connector script creates a temporary directory under the job's working directory and places data files in this temporary directory. To allow a job-based Hadoop cluster to access these files, you must set the job's working directory to a location in a shared file system. Use the -cwd option to specify the job's working directory.
- To specify a Hadoop command to run the Hadoop workload, use hadoop jar as
a subcommand of the lsfhadoop.sh connector script:
bsub bsub_options lsfhadoop.sh [-h] [--hadoop-dir file_path] [--java-home file_path] [--config-dir file_path] [--use-hdfs] [--debug] hadoop jar jarfile_name jarfile_parameters
- To specify a script that runs multiple Hadoop commands within a single LSF job,
run the script as an argument to the lsfhadoop.sh connector
script:
bsub bsub_options lsfhadoop.sh [-h] [--hadoop-dir file_path] [--java-home file_path] [--config-dir file_path] [--use-hdfs] [--debug] script_name
The following are the lsfhadoop.sh options:
- --hadoop-dir
- Specifies the Hadoop installation directory. The default is the value of the $HADOOP_HOME environment variable.
- --java-home
- Specifies the location of the Java runtime environment. The default is the value of the $JAVA_HOME environment variable.
- --config-dir
- Specifies the Hadoop configuration file directory, if you want to define your own customized Hadoop configuration. The default is the Hadoop installation directory.
- --use-hdfs
- Specifies that the MapReduce job runs on an HDFS file system. By default, lsfhadoop.sh sets up a job-based Hadoop cluster to directly access a shared file system.
- --debug
- Enables the job to retain the intermediate work files for troubleshooting purposes.
- --A
- Specifies the end of the lsfhadoop.sh options and disables further option processing, which allows you to specify generic Hadoop options, such as -D parameter=value, after the lsfhadoop.sh options.
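The following is an illustrative combination of these options, assuming a customized Hadoop configuration directory at /gpfs/mapreduce/conf (a hypothetical path):
bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh --config-dir /gpfs/mapreduce/conf --debug hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output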
For example, to run a wordcount application on Hadoop with three hosts, run the following command:
bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output
To run the mrjob.sh script (which contains multiple hadoop jar commands) on Hadoop with three hosts, run the following:
bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh mrjob.sh
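After you submit the job, you can check its progress with standard LSF commands; the job ID shown here is illustrative:
bjobs -l 12345
bpeek 12345
The bjobs -l command displays detailed status information for the LSF job, and bpeek displays the standard output of the running Hadoop workload.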