Running TPC-H/Hive

This section lists the steps for running the TPC-H benchmark on Hive.

Execute the following steps:
  1. Download TPC-H from the official TPC website, and download the TPC-H_on_Hive package from Running TPC-H queries on Hive.

    Download TPC-H_on_Hive_2009-08-14.tar.gz.

  2. Extract TPC-H_on_Hive_2009-08-14.tar.gz into $TPC_H_HIVE_HOME.
  3. Download DBGEN from the TPC-H website:
    Note: Register your information and download the TPC-H_Tools_v<version>.zip.
  4. Extract TPC-H_Tools_v<version>.zip and build dbgen:
    #unzip TPC-H_Tools_v<version>.zip
    Assuming the extracted tools are located under $TPCH_DBGEN_HOME:
    #cd $TPCH_DBGEN_HOME
    #cd dbgen
    #cp makefile.suite makefile
    
    Update the following values in $TPCH_DBGEN_HOME/dbgen/makefile accordingly:
    #vim makefile
    
    CC = gcc
    DATABASE = SQLSERVER
    MACHINE=LINUX
    WORKLOAD = TPCH
    
    Modify $TPCH_DBGEN_HOME/dbgen/tpcd.h so that the SQLSERVER block generates Hive-compatible statements:
    vim $TPCH_DBGEN_HOME/dbgen/tpcd.h
    
    #ifdef  SQLSERVER
    #define GEN_QUERY_PLAN   "EXPLAIN;"
    #define START_TRAN       "START TRANSACTION;\n"
    #define END_TRAN         "COMMIT;\n"
    #define SET_OUTPUT       ""
    #define SET_ROWCOUNT     "limit %d;\n"
    #define SET_DBASE        "use %s;\n"
    #endif
    Execute make to build dbgen:
    #cd $TPCH_DBGEN_HOME/dbgen/ 
    #make
    …
    gcc  -g -DDBNAME=\"dss\" -DLINUX -DSQLSERVER -DTPCH -DRNG_TEST 
    -D_FILE_OFFSET_BITS=64  -O -o qgen build.o bm_utils.o qgen.o 
    rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm
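The manual edits in this step can also be scripted. The following is a minimal sketch (not part of the TPC-H tools): it assumes the stock makefile.suite placeholder lines (CC =, DATABASE =, MACHINE =, WORKLOAD =) and, purely for illustration, falls back to a scratch copy when $TPCH_DBGEN_HOME is not set.

```shell
#!/bin/sh
# Sketch only: apply the four makefile settings from this step with sed.
set -e
DBGEN_DIR="${TPCH_DBGEN_HOME:-}/dbgen"
if [ ! -f "$DBGEN_DIR/makefile.suite" ]; then
    # Illustration fallback: fabricate the stock placeholder lines.
    DBGEN_DIR=$(mktemp -d)
    printf 'CC      =\nDATABASE=\nMACHINE =\nWORKLOAD =\n' \
        > "$DBGEN_DIR/makefile.suite"
fi
cp "$DBGEN_DIR/makefile.suite" "$DBGEN_DIR/makefile"
# Set the four build variables in place (a .bak copy is kept).
sed -i.bak \
    -e 's/^CC[[:space:]]*=.*/CC      = gcc/' \
    -e 's/^DATABASE[[:space:]]*=.*/DATABASE= SQLSERVER/' \
    -e 's/^MACHINE[[:space:]]*=.*/MACHINE = LINUX/' \
    -e 's/^WORKLOAD[[:space:]]*=.*/WORKLOAD = TPCH/' \
    "$DBGEN_DIR/makefile"
# Show the resulting settings.
grep -E '^(CC|DATABASE|MACHINE|WORKLOAD)' "$DBGEN_DIR/makefile"
```

After running the script, verify the printed values before executing make.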
  5. Generate the benchmark data:
    #cd $TPCH_DBGEN_HOME/dbgen/
    #./dbgen -s 500
    
    The -s option sets the scale factor; -s 500 generates approximately 500 GB of raw data. Change the value according to your benchmark requirements.
    
    The generated data is stored under $TPCH_DBGEN_HOME/dbgen/, and all files have the suffix .tbl. Sample listing (file sizes depend on the scale factor):
    # ls -la *.tbl
    -rw-r--r-- 1 root root  24346144 Jul  5 08:41 customer.tbl
    -rw-r--r-- 1 root root 759863287 Jul  5 08:41 lineitem.tbl
    -rw-r--r-- 1 root root      2224 Jul  5 08:41 nation.tbl
    -rw-r--r-- 1 root root 171952161 Jul  5 08:41 orders.tbl
    -rw-r--r-- 1 root root 118984616 Jul  5 08:41 partsupp.tbl
    -rw-r--r-- 1 root root  24135125 Jul  5 08:41 part.tbl
    -rw-r--r-- 1 root root       389 Jul  5 08:41 region.tbl
    -rw-r--r-- 1 root root   1409184 Jul  5 08:41 supplier.tbl
    
    Note: All eight files above are required. If any file is missing, regenerate the data.
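The check in the note above can be scripted. The following is a small sketch (not part of dbgen or the TPC-H_on_Hive package) that verifies all eight .tbl files exist and are non-empty before you continue:

```shell
#!/bin/sh
# Sketch only: verify that dbgen produced all eight non-empty .tbl files.
# Pass the dbgen output directory as the first argument.
check_tbl_files() {
    dir="$1"
    missing=""
    for t in customer lineitem nation orders partsupp part region supplier; do
        # -s: file exists and has a size greater than zero
        [ -s "$dir/$t.tbl" ] || missing="$missing $t"
    done
    if [ -n "$missing" ]; then
        echo "missing or empty:$missing"
        return 1
    fi
    echo "all 8 .tbl files present"
}

# Example invocation; point at your real dbgen directory.
check_tbl_files "${TPCH_DBGEN_HOME:-.}/dbgen" || true
```

If the script reports missing tables, rerun dbgen before proceeding to the data-preparation step.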
  6. Prepare the data for TPC-H:
    #cd $TPCH_DBGEN_HOME/dbgen/
    #mv *.tbl $TPC_H_HIVE_HOME/data/
    
    #cd $TPC_H_HIVE_HOME/data/
    #./tpch_prepare_data.sh
    
    Note: Before executing tpch_prepare_data.sh, ensure that IBM Storage Scale® is active (/usr/lpp/mmfs/bin/mmgetstate -a), the file system is mounted (/usr/lpp/mmfs/bin/mmlsmount <fs-name> -L), and HDFS Transparency is active (/usr/lpp/mmfs/bin/mmhadoopctl connector getstate).
  7. Check your prepared data:
    #hadoop fs -ls /tpch

    Check that all the files listed in Step 5 are present.
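This check can also be scripted. The sketch below (check_tpch_listing is a hypothetical helper, not part of the package) compares an HDFS listing against the expected table names; it assumes each listing line ends with the file path and that the prepared files keep their .tbl names.

```shell
#!/bin/sh
# Sketch only: check an HDFS directory listing for the eight TPC-H tables.
# Reads the output of `hadoop fs -ls /tpch` on standard input.
check_tpch_listing() {
    listing=$(cat)
    missing=""
    for t in customer lineitem nation orders partsupp part region supplier; do
        # Match "/<table>" or "/<table>.tbl" at end of line; the anchored
        # pattern keeps "part" from matching "partsupp".
        printf '%s\n' "$listing" | grep -q "/${t}\(\.tbl\)\{0,1\}$" \
            || missing="$missing $t"
    done
    if [ -n "$missing" ]; then
        echo "missing in HDFS:$missing"
        return 1
    fi
    echo "all 8 tables found"
}

# On a cluster node you would run:
#   hadoop fs -ls /tpch | check_tpch_listing
# Demonstration with a canned, incomplete listing:
printf '/tpch/customer.tbl\n/tpch/part.tbl\n' | check_tpch_listing || true
```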

  8. Run TPC-H:
    #cd $TPC_H_HIVE_HOME/
    #export HADOOP_HOME=/usr/hdp/current/hadoop-client
    #export HADOOP_CONF_DIR=/etc/hadoop/conf
    #export HIVE_HOME=/usr/hdp/current/hive-client
    
    #vim benchmark.conf
    
    Change the variables, such as LOG_FILE, if required, and then run the benchmark script that ships with the TPC-H_on_Hive package:
    #./tpch_benchmark.sh
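Before launching the benchmark, the environment exported above can be sanity-checked. The following is a minimal sketch (check_env is a hypothetical helper, not part of the package; the paths it validates are the HDP defaults used in this step):

```shell
#!/bin/sh
# Sketch only: confirm the step-8 environment variables are set and
# point at existing directories before running the benchmark.
check_env() {
    ok=1
    for v in HADOOP_HOME HADOOP_CONF_DIR HIVE_HOME; do
        eval "val=\${$v:-}"          # indirect lookup of $v
        if [ -z "$val" ]; then
            echo "$v is not set"
            ok=0
        elif [ ! -d "$val" ]; then
            echo "$v does not point to a directory: $val"
            ok=0
        fi
    done
    if [ "$ok" -eq 1 ]; then
        echo "environment looks good"
    else
        return 1
    fi
}

check_env || true
```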