Running TPC-H/Hive
This section lists the steps for running TPC-H for Hive. Execute the following steps:
- Download TPC-H from the official TPC website, and download TPC-H_on_Hive_2009-08-14.tar.gz from the Running TPC-H queries on Hive project page.
- Extract TPC-H_on_Hive_2009-08-14.tar.gz into $TPC_H_HIVE_HOME.
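The two steps above can be sketched as follows. This is only an illustration: the value of $TPC_H_HIVE_HOME is a placeholder you choose, and depending on the tarball layout you may need to point it at the extracted subdirectory instead.

```shell
# Hypothetical setup sketch: TPC_H_HIVE_HOME is a placeholder path of your choosing.
export TPC_H_HIVE_HOME="$HOME/tpch_hive"
mkdir -p "$TPC_H_HIVE_HOME"

# Extract the downloaded tarball into it (uncomment once the file is present;
# adjust TPC_H_HIVE_HOME if the archive creates its own top-level directory):
# tar -xzf TPC-H_on_Hive_2009-08-14.tar.gz -C "$TPC_H_HIVE_HOME"

echo "TPC_H_HIVE_HOME set to $TPC_H_HIVE_HOME"
```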
- Download DBGEN from the TPC-H website.
Note: Register your information and then download TPC-H_Tools_v<version>.zip.
- Extract TPC-H_Tools_v<version>.zip and build dbgen:
#unzip TPC-H_Tools_v<version>.zip
Assuming it is extracted under $TPCH_DBGEN_HOME:
#cd $TPCH_DBGEN_HOME
#cd dbgen
#cp makefile.suite makefile
Update the following values in $TPCH_DBGEN_HOME/dbgen/makefile accordingly:
#vim makefile
CC = gcc
DATABASE = SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
Modify $TPCH_DBGEN_HOME/dbgen/tpcd.h as follows:
#vim $TPCH_DBGEN_HOME/dbgen/tpcd.h
#ifdef SQLSERVER
#define GEN_QUERY_PLAN "EXPLAIN;"
#define START_TRAN "START TRANSACTION;\n"
#define END_TRAN "COMMIT;\n"
#define SET_OUTPUT ""
#define SET_ROWCOUNT "limit %d;\n"
#define SET_DBASE "use %s;\n"
#endif
Execute make to build dbgen:
#cd $TPCH_DBGEN_HOME/dbgen/
#make
…
gcc -g -DDBNAME=\"dss\" -DLINUX -DSQLSERVER -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -O -o qgen build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm
- Generate the transaction data:
#cd $TPCH_DBGEN_HOME/dbgen/
#./dbgen -s 500
This generates 500 GB of data; change the value accordingly for your benchmark.
The generated data is stored under $TPCH_DBGEN_HOME/dbgen/ and all files are named with the suffix .tbl:
# ls -la *.tbl
-rw-r--r-- 1 root root  24346144 Jul 5 08:41 customer.tbl
-rw-r--r-- 1 root root 759863287 Jul 5 08:41 lineitem.tbl
-rw-r--r-- 1 root root      2224 Jul 5 08:41 nation.tbl
-rw-r--r-- 1 root root 171952161 Jul 5 08:41 orders.tbl
-rw-r--r-- 1 root root 118984616 Jul 5 08:41 partsupp.tbl
-rw-r--r-- 1 root root  24135125 Jul 5 08:41 part.tbl
-rw-r--r-- 1 root root       389 Jul 5 08:41 region.tbl
-rw-r--r-- 1 root root   1409184 Jul 5 08:41 supplier.tbl
Note: You will need all of the above files. If any file is missing, regenerate the data.
- Prepare the data for TPC-H:
#cd $TPCH_DBGEN_HOME/dbgen/
#mv *.tbl $TPC_H_HIVE_HOME/data/
#cd $TPC_H_HIVE_HOME/data/
#./tpch_prepare_data.sh
Note: Before executing tpch_prepare_data.sh, ensure that IBM Storage® Scale is active (/usr/lpp/mmfs/bin/mmgetstate -a), the file system is mounted (/usr/lpp/mmfs/bin/mmlsmount <fs-name> -L), and HDFS Transparency is active (/usr/lpp/mmfs/bin/mmhadoopctl connector getstate).
- Check your prepared data:
#hadoop dfs -ls /tpch
Check that all of the .tbl files generated in the data generation step are present.
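This completeness check can also be scripted. The sketch below compares the HDFS listing against the eight table names produced by dbgen; it is a hedged illustration that skips the HDFS call when the hadoop CLI is not on the PATH.

```shell
# The eight TPC-H tables that dbgen produces and tpch_prepare_data.sh uploads.
tables="customer lineitem nation orders partsupp part region supplier"

if command -v hadoop >/dev/null 2>&1; then
  # Capture what actually landed in HDFS under /tpch.
  listing=$(hadoop dfs -ls /tpch 2>/dev/null)
else
  listing=""   # hadoop CLI not available here; nothing to compare against
fi

missing=""
for t in $tables; do
  # A table counts as present if its name appears anywhere in the listing.
  case "$listing" in *"$t"*) ;; *) missing="$missing $t" ;; esac
done

if [ -n "$missing" ]; then
  echo "not found in /tpch:$missing -- re-run the prepare step"
else
  echo "all 8 tables present in /tpch"
fi
```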
- Run TPC-H:
# cd $TPC_H_HIVE_HOME/
# export HADOOP_HOME=/usr/hdp/current/hadoop-client
# export HADOOP_CONF_DIR=/etc/hadoop/conf
# export HIVE_HOME=/usr/hdp/current/hive-client
# vim benchmark.conf
Change the variables, such as LOG_FILE, if required.
# ./tpch_prepare_data.sh
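Before launching, it can help to confirm that the exported locations actually exist. The following is a minimal sanity-check sketch; the three variable names come from the step above, while the check itself is an illustration (the HDP paths shown earlier may differ on your cluster).

```shell
# Sanity-check the environment variables the benchmark scripts rely on.
bad=0
for v in HADOOP_HOME HADOOP_CONF_DIR HIVE_HOME; do
  # Indirectly expand the variable named in $v (POSIX-sh compatible).
  eval "dir=\$$v"
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "$v is unset or not a directory: '$dir'"
    bad=$((bad + 1))
  fi
done

if [ "$bad" -eq 0 ]; then
  echo "environment looks ready"
fi
```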