Run a Hadoop application on LSF

Use the bsub command to submit the Hadoop application to LSF.

Before you begin

You must have a Hadoop application compiled and ready to run as a JAR file.
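For example, you might compile and package a simple application as follows. This is a sketch: the WordCount class name and source file are hypothetical, and the hadoop classpath command supplies the Hadoop libraries.

    javac -classpath "$(hadoop classpath)" WordCount.java
    jar cf wordcount.jar WordCount*.class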

Procedure

  1. Optional. To run multiple MapReduce workloads within a single LSF job, create a script that contains multiple Hadoop commands.
    1. Prepare a script with multiple Hadoop commands.

      For example, create a file named mrjob.sh with the following content:

      #!/bin/bash
      hadoop jar sort.jar /gpfs/mapreduce/data/sort/input /gpfs/mapreduce/data/sort/
      hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output
    2. Change the file permissions on the script to make it executable.
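
      For example, to make the mrjob.sh script from the previous step executable:

      chmod +x mrjob.sh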

    This script runs under the lsfhadoop.sh connector script when you submit it as an LSF job.

  2. Use the bsub command to submit the Hadoop application to LSF through the lsfhadoop.sh connector script.
    Note: To make Hadoop jobs run efficiently, use the -x option to request exclusive job allocation.

    The lsfhadoop.sh connector script creates a temporary directory under the job's working directory and places data files in this temporary directory. To allow a job-based Hadoop cluster to access these files, you must set the job's working directory to a location in a shared file system. Use the -cwd option to specify the job's working directory.

    • To run a single Hadoop workload, specify hadoop jar as a subcommand of the lsfhadoop.sh connector script:

      bsub bsub_options lsfhadoop.sh [-h] [--hadoop-dir file_path] [--java-home file_path] [--config-dir file_path] [--use-hdfs] [--debug] hadoop jar jarfile_name jarfile_parameters

    • To run multiple Hadoop commands within a single LSF job, specify the script as an argument to the lsfhadoop.sh connector script:

      bsub bsub_options lsfhadoop.sh [-h] [--hadoop-dir file_path] [--java-home file_path] [--config-dir file_path] [--use-hdfs] [--debug] script_name

    The following are the lsfhadoop.sh options:

    --hadoop-dir
    Specifies the Hadoop installation directory. The default is the value of the $HADOOP_HOME environment variable.
    --java-home
    Specifies the location of the Java runtime environment. The default is the value of the $JAVA_HOME environment variable.
    --config-dir
    Specifies the Hadoop configuration file directory, if you want to define your own customized Hadoop configuration. The default is the Hadoop installation directory.
    --use-hdfs
    Specifies that the MapReduce job runs on an HDFS file system. By default, lsfhadoop.sh sets up a job-based Hadoop cluster to directly access a shared file system.
    --debug
    Enables the job to retain the intermediate work files for troubleshooting purposes.
    --A
    Specifies the end of the lsfhadoop.sh options and disables further option processing for lsfhadoop.sh. This allows you to specify generic Hadoop options, such as -D parameter=value, after the lsfhadoop.sh options.
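
    The following sketch shows how --A might be used to pass a generic Hadoop option through to the application. It reuses the wordcount example shown below; mapreduce.job.reduces is a standard Hadoop property that sets the number of reduce tasks:

    bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh --A hadoop jar wordcount.jar -D mapreduce.job.reduces=2 /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output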

    For example, to run a wordcount application on Hadoop with three hosts:

    bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh hadoop jar wordcount.jar /gpfs/mapreduce/data/input /gpfs/mapreduce/data/output

    To run the mrjob.sh script (which contains multiple hadoop jar commands) on Hadoop with three hosts:

    bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh mrjob.sh
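
    If your input data already resides in HDFS, you can add the --use-hdfs option so that the MapReduce job reads from and writes to HDFS instead of the shared file system. The following is a sketch; the HDFS paths are hypothetical and assume an accessible HDFS deployment:

    bsub -x -R"span[ptile=1]" -n 3 -cwd /gpfs/mapreduce/work lsfhadoop.sh --use-hdfs hadoop jar wordcount.jar /user/lsfadmin/input /user/lsfadmin/output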