File formats that are supported by Db2 Big SQL
The Hadoop environment supports a large number of file formats. The formats that are described here are available either by using explicit SQL syntax, such as STORED AS ORC, or by using installed interfaces such as Avro. Columnar storage saves both time and space during big data processing. The ORC and Parquet file formats provide excellent performance advantages when used with Db2® Big SQL.
Optimized Row Columnar (ORC)
The ORC file format provides a highly efficient way to store data. ORC files store collections of rows in a columnar format, which enables parallel processing of row collections across your cluster. As of Db2 Big SQL 5.0.2, the ORC file format is recommended for optimal performance and functionality.
The ORC file format uses type-specific encoders for each column and divides the file into large stripes. Each stripe uses indexes that enable the Db2 Big SQL readers to skip sets of rows that do not satisfy filter conditions. A footer contains metadata that includes byte range information for each stripe and type information for the file. The amount of resources that Db2 Big SQL uses to process ORC files is affected by the ORC stripe size. The recommended stripe size for ORC files, which is determined by the orc.stripe.size property, is 64 MB. Stripe sizes larger than 256 MB should be avoided.
The recommended compression type for this file format is zlib (the default).
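The following sketch shows how a table might be created in the ORC format with the recommended 64 MB stripe size and the default zlib compression. The table and column names are illustrative; orc.stripe.size is the property described above, and orc.compress is the standard ORC table property for the compression codec.
CREATE TABLE orders_orc (
  order_id INT,
  order_total DOUBLE)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',
  'orc.stripe.size' = '67108864');  -- 64 MB, expressed in bytes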
Parquet
The Parquet file format is an open source columnar storage format for Hadoop that supports efficient compression and encoding schemes. The following SET HADOOP PROPERTY statements show example settings that affect how Parquet files are written:
- SET HADOOP PROPERTY 'dfs.blocksize' = 268435456;
- SET HADOOP PROPERTY 'parquet.page.size' = 65536;
- SET HADOOP PROPERTY 'parquet.dictionary.page.size' = 65536;
- SET HADOOP PROPERTY 'parquet.block.size' = 268435456;
- SET HADOOP PROPERTY 'parquet.enable.dictionary' = 'true';
- SET HADOOP PROPERTY 'parquet.compression' = 'SNAPPY';
The recommended compression types for this file format are snappy (the default) and gzip.
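After setting the session properties shown above, a Parquet table can be defined with the STORED AS PARQUET clause. The following is an illustrative sketch; the table and column names are examples only.
CREATE TABLE sales_parquet (
  sale_id INT,
  region VARCHAR(20),
  amount DOUBLE)
STORED AS PARQUET;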
Text
The text file format is the default storage format for a table. The underlying data is stored in delimited form, with one record per line and newline characters separating individual records.
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
NULL DEFINED AS '\N'
...
Generally, your actual data must not contain delimiter characters. The only exception is that field terminator characters can exist within your data if you specify the ESCAPED BY clause and these characters are properly escaped. Newline characters cannot be escaped.
If an incompatible column value is provided (if, for example, you attempt to insert the value 'a' into a column that is defined as INT), that value is treated as a null value. If null values are not allowed, an error is returned.
If your input data contains more columns (fields) than the table, the additional fields in the data are ignored. If there are fewer fields in the input data than columns in the table, null values are inserted if the columns are defined as nullable; otherwise, an error is returned.
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\'
...
For example, a field value that contains the field terminator must be escaped, as in Hello\, how are you?
Because the text file format requires type conversion for all non-character columns, it is not as efficient as a binary storage format.
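Putting these clauses together, a complete text-format table definition might look like the following sketch; the table and column names are illustrative.
CREATE TABLE reviews_text (
  review_id INT,
  review_text VARCHAR(200))
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS '\N'
STORED AS TEXTFILE;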
This file format does not support compression.
Avro
Avro data types map to Db2 Big SQL data types as follows:

| Avro data type | Db2 Big SQL data type |
|---|---|
| BOOLEAN | BOOLEAN |
| INT | INT |
| LONG | BIGINT |
| FLOAT | DOUBLE |
| DOUBLE | DOUBLE |
| STRING | VARCHAR(max) |
| ENUM | VARCHAR(max) |
The Db2 Big SQL schema can be inferred from the Avro schema. You can specify the Avro schema as table properties that are passed to the Avro SerDe, either as a URL that points to a schema file (avro.schema.url) or as a literal JSON string (avro.schema.literal).
...TBLPROPERTIES (
'avro.schema.url' = 'file:///path/to/the/schema/test_serializer.avsc'
)
...
;
...TBLPROPERTIES (
'avro.schema.literal' =
'{"namespace": "com.howdy",
"name": "some_schema",
"type": "record",
"fields": [{ "name":"string1","type":"string"}]}'
)
...
;
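As an illustrative sketch, and assuming that the STORED AS AVRO clause is available for Avro tables, a complete definition that embeds the schema literal shown above might look like the following; the table name is an example only.
CREATE TABLE some_schema_table
STORED AS AVRO
TBLPROPERTIES (
  'avro.schema.literal' =
    '{"namespace": "com.howdy",
      "name": "some_schema",
      "type": "record",
      "fields": [{ "name":"string1","type":"string"}]}'
);
Because no column list is specified, the column definitions are derived from the Avro schema, which matches the best practice described later in this section.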
Support for Db2 Big SQL and Hive schema evolution with tables in the Avro file format is primarily driven by the SerDe's understanding of the Avro table definition. When you make changes to the schema, it is critical that you keep the column definitions and the Avro schema in synchrony.
ALTER TABLE...
ADD COLUMN col3 INT;
Because the Db2 Big SQL column definitions and the Avro schema need to be kept in synchrony to ensure that data is written and read with the new Avro SerDe definition, the table property 'avro.schema.literal' or the HDFS file (containing the JSON schema) that is referenced by 'avro.schema.url' also needs to be updated. For example:
ALTER TABLE...
SET TBLPROPERTIES (
'avro.schema.literal' =
'{\"type\": \"record\",
\"name\": \"TUPLE_1\",
\"fields\": [
...
{
\"name\": \"col3\", \"type\": [\"null\", \"int\"
}
]
}'
);
You must specify the full new schema. After this statement runs, you can write data to and select data from the altered table, and Db2 Big SQL handles that data according to the new table layout.
Best practice is to not specify a column list; the correct details are then automatically extracted from the specified Avro schema. If you do specify a column list for the table, ensure that the number of columns matches the number of columns in the Avro schema; otherwise, the CREATE TABLE statement returns an error.
The following compression types are supported for the Avro file format:
- bzip2
- deflate
- gzip
- snappy
For more information, see the Avro storage format.
Record Columnar (RC)
The RC file format uses binary key/value pairs. It partitions rows horizontally into row splits and then partitions each row split vertically. The metadata pertaining to a row split is the key part, and all of the actual data in the row split is stored as the value part of a record.
The following compression types are supported for the RC file format:
- bzip2
- deflate
- gzip
- snappy
The following example creates a table in the RC file format by specifying the ColumnarSerDe:
CREATE TABLE my_table (
i INT,
s STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFILE;
Sequence
The sequence file format is used to hold arbitrary data that might not otherwise be splittable. For example, in a text file, newline characters (\n) are used to determine the boundaries of a record, so a DFS block can be processed simply by looking for newline characters. However, if the data in that file is in binary form or is compressed with an algorithm that does not maintain markers for a record boundary, reading that block is impossible.
- A binary sequence file stores data in a binary format by using the Hive LazyBinarySerDe. With binary storage, the data requires very little conversion processing while being read.
- A text sequence file stores delimited data within the sequence file format, which enables the use of compression algorithms on textual data that would not otherwise be splittable.
The following compression types are supported for the sequence file format:
- bzip2
- deflate
- gzip
- snappy
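As a minimal sketch, a text sequence table can be defined with the STORED AS SEQUENCEFILE clause. The table and column names are illustrative, and the compression-related Hadoop property names are the standard MapReduce output settings rather than values taken from this document.
-- Assumed standard Hadoop properties that enable block-compressed output for data written in this session
SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress' = 'true';
SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress.type' = 'BLOCK';
SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress.codec' = 'org.apache.hadoop.io.compress.GzipCodec';

CREATE TABLE events_seq (
  event_id INT,
  payload VARCHAR(200))
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;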