Running TPC-H/Hive

This section lists the steps for running the TPC-H benchmark on Hive.

Execute the following steps:
  1. Download TPC-H from the official TPC website, and download the TPC-H_on_Hive package from Running TPC-H queries on Hive.

    Download TPC-H_on_Hive_2009-08-14.tar.gz.

  2. Extract TPC-H_on_Hive_2009-08-14.tar.gz into $TPC_H_HIVE_HOME.
  3. Download DBGEN from the TPC-H website:
    Note: Register your information and download the TPC-H_Tools_v<version>.zip.
  4. Extract TPC-H_Tools_v<version>.zip and build dbgen:
    #unzip TPC-H_Tools_v<version>.zip
    Assuming the extracted tools are located under $TPCH_DBGEN_HOME:
    #cd $TPCH_DBGEN_HOME
    #cd dbgen
    #cp makefile.suite makefile
    
    Update the following values in $TPCH_DBGEN_HOME/dbgen/makefile accordingly:
    #vim makefile
    
    CC = gcc
    DATABASE = SQLSERVER
    MACHINE=LINUX
    WORKLOAD = TPCH
    
    Modify $TPCH_DBGEN_HOME/dbgen/tpcd.h so that the SQLSERVER block generates Hive-compatible statements:
    vim $TPCH_DBGEN_HOME/dbgen/tpcd.h
    
    #ifdef  SQLSERVER
    #define GEN_QUERY_PLAN   "EXPLAIN;"
    #define START_TRAN       "START TRANSACTION;\n"
    #define END_TRAN         "COMMIT;\n"
    #define SET_OUTPUT       ""
    #define SET_ROWCOUNT     "limit %d;\n"
    #define SET_DBASE        "use %s;\n"
    #endif
    Execute make to build dbgen:
    #cd $TPCH_DBGEN_HOME/dbgen/ 
    #make
    …
    gcc  -g -DDBNAME=\"dss\" -DLINUX -DSQLSERVER -DTPCH -DRNG_TEST 
    -D_FILE_OFFSET_BITS=64  -O -o qgen build.o bm_utils.o qgen.o 
    rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm
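The manual edits in this step can also be scripted. The following is a minimal sketch (not part of the TPC-H tools): it assumes the stock makefile.suite placeholder lines (CC =, DATABASE =, MACHINE =, WORKLOAD =) and, purely for illustration, falls back to a scratch copy when $TPCH_DBGEN_HOME is not set.

```shell
#!/bin/sh
# Sketch only: apply the four makefile settings from this step with sed.
set -e
DBGEN_DIR="${TPCH_DBGEN_HOME:-}/dbgen"
if [ ! -f "$DBGEN_DIR/makefile.suite" ]; then
    # Illustration fallback: fabricate the stock placeholder lines.
    DBGEN_DIR=$(mktemp -d)
    printf 'CC      =\nDATABASE=\nMACHINE =\nWORKLOAD =\n' \
        > "$DBGEN_DIR/makefile.suite"
fi
cp "$DBGEN_DIR/makefile.suite" "$DBGEN_DIR/makefile"
# Set the four build variables in place (a .bak copy is kept).
sed -i.bak \
    -e 's/^CC[[:space:]]*=.*/CC      = gcc/' \
    -e 's/^DATABASE[[:space:]]*=.*/DATABASE= SQLSERVER/' \
    -e 's/^MACHINE[[:space:]]*=.*/MACHINE = LINUX/' \
    -e 's/^WORKLOAD[[:space:]]*=.*/WORKLOAD = TPCH/' \
    "$DBGEN_DIR/makefile"
# Show the resulting settings.
grep -E '^(CC|DATABASE|MACHINE|WORKLOAD)' "$DBGEN_DIR/makefile"
```

After running the script, verify the printed values before executing make.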
  5. Generate the benchmark data:
    #cd $TPCH_DBGEN_HOME/dbgen/
    #./dbgen -s 500
    
    The -s option sets the scale factor; -s 500 generates approximately 500 GB of raw data. Change the value according to your benchmark requirements.
    
    The generated data is stored under $TPCH_DBGEN_HOME/dbgen/, and all files have the suffix .tbl. Sample listing (file sizes depend on the scale factor):
    # ls -la *.tbl
    -rw-r--r-- 1 root root  24346144 Jul  5 08:41 customer.tbl
    -rw-r--r-- 1 root root 759863287 Jul  5 08:41 lineitem.tbl
    -rw-r--r-- 1 root root      2224 Jul  5 08:41 nation.tbl
    -rw-r--r-- 1 root root 171952161 Jul  5 08:41 orders.tbl
    -rw-r--r-- 1 root root 118984616 Jul  5 08:41 partsupp.tbl
    -rw-r--r-- 1 root root  24135125 Jul  5 08:41 part.tbl
    -rw-r--r-- 1 root root       389 Jul  5 08:41 region.tbl
    -rw-r--r-- 1 root root   1409184 Jul  5 08:41 supplier.tbl
    
    Note: All eight files above are required. If any file is missing, regenerate the data.
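The check in the note above can be scripted. The following is a small sketch (not part of dbgen or the TPC-H_on_Hive package) that verifies all eight .tbl files exist and are non-empty before you continue:

```shell
#!/bin/sh
# Sketch only: verify that dbgen produced all eight non-empty .tbl files.
# Pass the dbgen output directory as the first argument.
check_tbl_files() {
    dir="$1"
    missing=""
    for t in customer lineitem nation orders partsupp part region supplier; do
        # -s: file exists and has a size greater than zero
        [ -s "$dir/$t.tbl" ] || missing="$missing $t"
    done
    if [ -n "$missing" ]; then
        echo "missing or empty:$missing"
        return 1
    fi
    echo "all 8 .tbl files present"
}

# Example invocation; point at your real dbgen directory.
check_tbl_files "${TPCH_DBGEN_HOME:-.}/dbgen" || true
```

If the script reports missing tables, rerun dbgen before proceeding to the data-preparation step.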
  6. Prepare the data for TPC-H:
    #cd $TPCH_DBGEN_HOME/dbgen/
    #mv *.tbl $TPC_H_HIVE_HOME/data/
    
    #cd $TPC_H_HIVE_HOME/data/
    #./tpch_prepare_data.sh
    
    Note: Before executing tpch_prepare_data.sh, ensure that IBM Storage Scale® is active (/usr/lpp/mmfs/bin/mmgetstate -a), the file system is mounted (/usr/lpp/mmfs/bin/mmlsmount <fs-name> -L), and HDFS Transparency is active (/usr/lpp/mmfs/bin/mmhadoopctl connector getstate).
  7. Check your prepared data:
    #hadoop fs -ls /tpch

    Check that all the files listed in Step 5 are present.
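This check can also be scripted. The sketch below (check_tpch_listing is a hypothetical helper, not part of the package) compares an HDFS listing against the expected table names; it assumes each listing line ends with the file path and that the prepared files keep their .tbl names.

```shell
#!/bin/sh
# Sketch only: check an HDFS directory listing for the eight TPC-H tables.
# Reads the output of `hadoop fs -ls /tpch` on standard input.
check_tpch_listing() {
    listing=$(cat)
    missing=""
    for t in customer lineitem nation orders partsupp part region supplier; do
        # Match "/<table>" or "/<table>.tbl" at end of line; the anchored
        # pattern keeps "part" from matching "partsupp".
        printf '%s\n' "$listing" | grep -q "/${t}\(\.tbl\)\{0,1\}$" \
            || missing="$missing $t"
    done
    if [ -n "$missing" ]; then
        echo "missing in HDFS:$missing"
        return 1
    fi
    echo "all 8 tables found"
}

# On a cluster node you would run:
#   hadoop fs -ls /tpch | check_tpch_listing
# Demonstration with a canned, incomplete listing:
printf '/tpch/customer.tbl\n/tpch/part.tbl\n' | check_tpch_listing || true
```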

  8. Run TPC-H:
    #cd $TPC_H_HIVE_HOME/
    #export HADOOP_HOME=/usr/hdp/current/hadoop-client
    #export HADOOP_CONF_DIR=/etc/hadoop/conf
    #export HIVE_HOME=/usr/hdp/current/hive-client
    
    #vim benchmark.conf
    
    Change the variables, such as LOG_FILE, if required, and then run the benchmark script that ships with the TPC-H_on_Hive package:
    #./tpch_benchmark.sh
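Before launching the benchmark, the environment exported above can be sanity-checked. The following is a minimal sketch (check_env is a hypothetical helper, not part of the package; the paths it validates are the HDP defaults used in this step):

```shell
#!/bin/sh
# Sketch only: confirm the step-8 environment variables are set and
# point at existing directories before running the benchmark.
check_env() {
    ok=1
    for v in HADOOP_HOME HADOOP_CONF_DIR HIVE_HOME; do
        eval "val=\${$v:-}"          # indirect lookup of $v
        if [ -z "$val" ]; then
            echo "$v is not set"
            ok=0
        elif [ ! -d "$val" ]; then
            echo "$v does not point to a directory: $val"
            ok=0
        fi
    done
    if [ "$ok" -eq 1 ]; then
        echo "environment looks good"
    else
        return 1
    fi
}

check_env || true
```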