IBM Support

Inspect Files tooling for IBM Db2 Big SQL - Hadoop Dev

Technical Blog Post


Abstract

Inspect Files tooling for IBM Db2 Big SQL - Hadoop Dev

Body

The Inspect Files tool presented in this blog helps users improve query performance on IBM Db2 Big SQL tables by optimizing file layout and size. Query performance can suffer when a table's data is fragmented into many small files. The Inspect Files tool detects when this condition exists and recommends corrective action.

Introduction

A good practice for resolving the small files issue is to run compaction on directories containing many small files that logically belong together. Merging these files improves Big SQL read performance by minimizing the metadata to be processed and aligning file sizes more efficiently to HDFS blocks.

The Inspect Files tool identifies problematic small files at the storage level and provides recommendations for file compaction in HDFS directories. It connects to the HDFS NameNode of a remote cluster and recursively traverses the directory tree from the main directory provided, collecting file details and computing size-related statistics. Based on these statistics, it detects small files relative to the HDFS block size and suggests file compaction and storage optimization with the HDFS block size as the size target.
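The tool itself ships as a jar, but the general approach described above (connecting to the NameNode and walking the directory tree while collecting file sizes) can be illustrated with the public Hadoop FileSystem API. The sketch below is not the tool's source code; the directory URI and the size threshold used to flag a file as "small" are illustrative assumptions only.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListSmallFiles {
    public static void main(String[] args) throws Exception {
        // Hypothetical directory and block size, for illustration only
        String dir = "hdfs://namenode.example.com:8020/test/data";
        long blockSize = 128L * 1024 * 1024;

        FileSystem fs = FileSystem.get(URI.create(dir), new Configuration());
        // Recursively traverse the directory tree from the main directory provided
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(dir), true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            // Flag files that are much smaller than the block size (this threshold is an
            // assumption for the sketch, not the tool's actual definition of a small file)
            if (f.isFile() && f.getLen() < blockSize / 4) {
                System.out.printf("small file: %s (%d bytes)%n", f.getPath(), f.getLen());
            }
        }
    }
}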

Installation

The tool is packaged as a .jar file named bigsql-inspectfiles-1.0.jar; a copy of the license is included for review.
To install the tool, download bigsql-inspectfiles-1.0.jar_.zip and unzip it with this command:

unzip bigsql-inspectfiles-1.0.jar_.zip  

 

The jar file includes all the dependencies needed to run the tool and a copy of the license for future reference after download.

Usage

A typical use case is to run this tool against a Big SQL table in HDFS whose query performance has degraded over time, potentially due to the small files problem. These small files are usually located together, for example as files belonging to a Big SQL table or partition. Passing the table or partition location to the tool provides detailed file compaction recommendations for the table or partition sub-directories, with the goal of improving query performance.

The compaction recommendations provided by this tool can be executed using the tools described in the blog entry Optimizing ORC and Parquet files for Big SQL queries performance. This tool does not perform the compaction itself.

Inspect Files tool input parameters:

-b, --block <arg>       HDFS block size in MB. If not provided, the default value is 128 MB
-d, --directory <arg>   Comma separated list of directories containing the files to inspect
-j, --json              Creates output report in JSON format for the compaction recommendation
-h, --help              Display the tool help

 

It returns:

  • A report showing the directory tree, including file names and sizes, starting from the main directory.
  • Statistics for each nested directory, including: number of files, maximum file size, minimum file size, mean file size, histogram of file sizes, standard deviation of file size, and the ratio of the standard deviation to the block size in bytes and as a percentage. The standard deviation of file size is computed relative to the block size (instead of the mean) to capture file size dispersion; this helps to detect small files in the directories. Directories with a standard deviation to block size ratio larger than 40% are recommended for optimization.
  • List of small files for each directory in the table or partition, if present.
  • Compaction recommendations. Based on a modified bin packing algorithm, the tool reports an optimal way to merge the files at the directory level, targeting the HDFS block size (a simplified sketch of this approach is shown after this list). If the --json flag is part of the input arguments, the tool also creates a .json file with the compaction recommendations in addition to the console output. The name of the automatically created output file follows this pattern: compaction_<dir_name>_<time_stamp>.json
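The exact statistics and bin packing implementation used by the tool are not published. As a rough illustration of the two ideas above (the deviation-to-block-size heuristic and the grouping of files up to the block size), here is a minimal sketch that uses a first-fit-decreasing packing and the dir1 file sizes from the example output below. It is an approximation for explanatory purposes, not the tool's algorithm.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {

    // Group files into bins whose total size does not exceed the HDFS block size
    // (first-fit decreasing; the tool describes a "modified bin packing", details unknown).
    static List<List<String>> pack(Map<String, Long> fileSizes, long blockSize) {
        List<Map.Entry<String, Long>> files = new ArrayList<>(fileSizes.entrySet());
        files.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));   // largest first

        List<List<String>> bins = new ArrayList<>();
        List<Long> totals = new ArrayList<>();
        for (Map.Entry<String, Long> f : files) {
            int target = -1;
            for (int i = 0; i < bins.size(); i++) {
                if (totals.get(i) + f.getValue() <= blockSize) { target = i; break; }
            }
            if (target < 0) { bins.add(new ArrayList<>()); totals.add(0L); target = bins.size() - 1; }
            bins.get(target).add(f.getKey());
            totals.set(target, totals.get(target) + f.getValue());
        }
        return bins;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 64 * mb;                      // block size from the example below

        // File sizes taken from the dir1 example in this blog
        Map<String, Long> dir1 = new LinkedHashMap<>();
        dir1.put("file11.txt", 23 * mb); dir1.put("file12.txt", 20 * mb);
        dir1.put("file13.txt", 18 * mb); dir1.put("file14.txt", 13 * mb);
        dir1.put("file15.txt", 11 * mb); dir1.put("file16.txt", 6 * mb);
        dir1.put("file17.txt", 6 * mb);  dir1.put("file18.txt", 2 * mb);

        // Dispersion heuristic: standard deviation of file sizes computed against the
        // block size (not the mean); values above ~40% suggest the directory needs compaction.
        double meanSqDev = dir1.values().stream()
                .mapToDouble(s -> Math.pow((double) blockSize - s, 2)).average().orElse(0);
        System.out.printf("deviation to block size = %.2f %%%n",
                100 * Math.sqrt(meanSqDev) / blockSize);

        // Suggested merge groups
        for (List<String> bin : pack(dir1, blockSize)) {
            long total = 0;
            for (String name : bin) total += dir1.get(name);
            System.out.println(bin + " -> compacted size = " + total / mb + " MB");
        }
    }
}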

Examples

Usage examples:

java -jar bigsql-inspectfiles-1.0.jar -b 64 -j -d hdfs://<NameNode_server>:<port>/test/data
java -jar bigsql-inspectfiles-1.0.jar --block 128 --json --directory hdfs://<NameNode_server>:<port>/apps/hive/warehouse/bigsql.db
java -jar bigsql-inspectfiles-1.0.jar -d hdfs://<NameNode_server>:<port>/test/data1,hdfs://<NameNode_server>:<port>/test/data2
java -jar bigsql-inspectfiles-1.0.jar --help

This is an example of the output produced by the first command above:

Block Size = 64 MB
Directory = [hdfs://:/test/data]

********************************************************************************************
 Files in directory: hdfs://:/test/data
********************************************************************************************
data.txt = 48 MB
[dir1]
    file11.txt = 23 MB
    file12.txt = 20 MB
    file13.txt = 18 MB
    file14.txt = 13 MB
    file15.txt = 11 MB
    file16.txt = 6 MB
    file17.txt = 6 MB
    file18.txt = 2 MB
[dir10]
    file101.txt = 40 MB
    file102.txt = 48 MB
    file103.txt = 3 KB
    file104.txt = 3 KB
[dir2]
    [dir21]
        [dir211]
            file2111.txt = 18 MB
            file2112.txt = 13 MB
...

============================================================================================
 Statistics
============================================================================================

hdfs://:/test/data = [data.txt = 48 MB]
Number of files = 1 | Maximun = 48 MB | Minimum = 48 MB | Mean = 48 MB
Histogram of file sizes = [0, 0, 0, 1, 0]
    1 file in (3*(block size)/4, (block size)]
Standard deviation = 0 bytes | Deviation to HDFS block size = 15 MB = 23.63 %
No small sizes detected

hdfs://:/test/data/dir1 = [file11.txt = 23 MB, file12.txt = 20 MB, file13.txt = 18 MB, file14.txt = 13 MB, file15.txt = 11 MB, file16.txt = 6 MB, file17.txt = 6 MB, file18.txt = 2 MB]
Number of files = 8 | Maximun = 23 MB | Minimum = 2 MB | Mean = 12 MB
Histogram of file sizes = [5, 3, 0, 0, 0]
    5 files in (0, (block size)/4]
    3 files in ((block size)/4, (block size)/2]
Standard deviation = 7 MB | Deviation to HDFS block size = 51 MB = 80.68 %
Small files list [4] = [file15.txt = 11 MB, file16.txt = 6 MB, file17.txt = 6 MB, file18.txt = 2 MB]

hdfs://:/test/data/dir10 = [file102.txt = 48 MB, file101.txt = 40 MB, file103.txt = 3 KB, file104.txt = 3 KB]
Number of files = 4 | Maximun = 48 MB | Minimum = 3 KB | Mean = 22 MB
Histogram of file sizes = [2, 0, 1, 1, 0]
    2 files in (0, (block size)/4]
    1 file in ((block size)/2, 3*(block size)/4]
    1 file in (3*(block size)/4, (block size)]
Standard deviation = 26 MB | Deviation to HDFS block size = 47 MB = 74 %
Small files list [2] = [file103.txt = 3 KB, file104.txt = 3 KB]

...

============================================================================================
 Optimization
============================================================================================

hdfs://:/test/data/dir1 = [file11.txt = 23 MB, file12.txt = 20 MB, file13.txt = 18 MB, file14.txt = 13 MB, file15.txt = 11 MB, file16.txt = 6 MB, file17.txt = 6 MB, file18.txt = 2 MB]
Suggested compaction for files smaller than HDFS block size:
    [ file11.txt = 23 MB + file12.txt = 20 MB + file13.txt = 18 MB + file18.txt = 2 MB]    Compacted file size = 63 MB
    [ file14.txt = 13 MB + file15.txt = 11 MB + file16.txt = 6 MB + file17.txt = 6 MB]    Compacted file size = 38 MB

hdfs://:/test/data/dir10 = [file102.txt = 48 MB, file101.txt = 40 MB, file103.txt = 3 KB, file104.txt = 3 KB]
Suggested compaction for files smaller than HDFS block size:
    [ file102.txt = 48 MB + file103.txt = 3 KB + file104.txt = 3 KB]    Compacted file size = 48 MB

hdfs://:/test/data/dir2 = [file23.txt = 23 MB, file22.txt = 20 MB, file21.txt = 18 MB, file24.txt = 13 MB]
Suggested compaction for files smaller than HDFS block size:
    [ file23.txt = 23 MB + file22.txt = 20 MB + file21.txt = 18 MB]    Compacted file size = 61 MB

...
-------------------- JSON FORMAT (compaction_data_201802221413608.json) --------------------

{"directories": [
    {
        "directory": "hdfs://:/test/data/dir1",
        "compactions": [
            {
                "new_file_size": "63 MB",
                "files_to_merge": [
                    "file11.txt",
                    "file12.txt",
                    "file13.txt",
                    "file18.txt"
                ]
            },
            {
                "new_file_size": "38 MB",
                "files_to_merge": [
                    "file14.txt",
                    "file15.txt",
                    "file16.txt",
                    "file17.txt"
                ]
            }
        ]
    },
    {
        "directory": "hdfs://:/test/data/dir10",
        "compactions": [{
            "new_file_size": "48 MB",
            "files_to_merge": [
                "file102.txt",
                "file103.txt",
                "file104.txt"
...

 

The Optimization section in the console output and in the JSON file provides suggestions for file compaction.

In our example, based on the output, the files in /test/data/dir1 could be merged as follows: file11.txt + file12.txt + file13.txt + file18.txt to create a larger file of 63 MB, which would align more efficiently with the HDFS block size (64 MB in the example). Similarly, file14.txt + file15.txt + file16.txt + file17.txt could be compacted to create a new larger file of 38 MB.

Once compacted, /test/data/dir1 would have a better distribution of files, with two larger files closer to the HDFS block size instead of multiple sub-optimal small files.
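If the --json report is to be fed into a separate compaction step (for example, the tools referenced earlier for ORC and Parquet files), it can be read back programmatically. The sketch below assumes the Jackson databind library is available on the classpath and simply walks the structure shown in the JSON sample above; it is an illustration, not part of the Inspect Files tool.

import java.io.File;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReadCompactionReport {
    public static void main(String[] args) throws Exception {
        // Path to a report produced with --json, e.g. compaction_data_<time_stamp>.json
        File report = new File(args[0]);
        JsonNode root = new ObjectMapper().readTree(report);

        for (JsonNode dir : root.get("directories")) {
            System.out.println("Directory: " + dir.get("directory").asText());
            for (JsonNode compaction : dir.get("compactions")) {
                System.out.println("  target size: " + compaction.get("new_file_size").asText());
                for (JsonNode file : compaction.get("files_to_merge")) {
                    // Each group lists the files that could be merged into one larger file
                    System.out.println("    " + file.asText());
                }
            }
        }
    }
}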

Notes:

  • If you need to identify the HDFS block size in MB for your file system, you can run the following command:
echo $[`hdfs getconf -confKey dfs.blocksize`/1024/1024]

  Similarly, if you need to determine the HDFS block size in MB for a given file, run the following command (a programmatic alternative using the HDFS FileSystem API is sketched after these commands):

echo $[`hdfs dfs -stat %o /path/to/file`/1024/1024]
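For completeness, the same block size information can also be obtained programmatically through the standard Hadoop FileSystem API. In the sketch below, the NameNode URI and file path are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeLookup {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and file path, for illustration only
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"),
                new Configuration());
        Path file = new Path("/test/data/dir1/file11.txt");

        long defaultBlock = fs.getDefaultBlockSize(file);           // cluster default (dfs.blocksize)
        long fileBlock = fs.getFileStatus(file).getBlockSize();     // block size recorded for this file
        System.out.printf("default = %d MB, file = %d MB%n",
                defaultBlock / (1024 * 1024), fileBlock / (1024 * 1024));
    }
}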

 

Thanks to Robin Noble, Rishabh Patel and Pierre Regazzoni for the testing effort, comments and reviews.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259765