CREATE TABLE (HADOOP) statement
The CREATE TABLE (HADOOP) statement defines a Db2® Big SQL table that is based on a Hive table for the Hadoop environment. The definition must include its name and the names and attributes of its columns. The definition can include other attributes of the table, such as its primary key or check constraints.
Invocation
This statement can be embedded in an application program or issued by using dynamic SQL statements. It is an executable statement that can be dynamically prepared only if DYNAMICRULES run behavior is in effect for the package (SQLSTATE 42509).
Authorization
- One of the following privileges or authorities:
- USE privilege on the table space
- SYSADM authority
- SYSCTRL authority
- Plus one of these privileges or authorities:
- IMPLICIT_SCHEMA authority on the database, if the implicit or explicit schema name of the table does not exist
- CREATEIN privilege on the schema, if the schema name of the table refers to an existing schema
- SCHEMAADM authority on the schema, if the schema name of the table refers to an existing schema
- REFERENCES privilege on the table
- REFERENCES privilege on each column of the specified parent key
- CONTROL privilege on the table
- SCHEMAADM authority on the schema, if the schema name of the parent table refers to an existing schema
- DBADM authority
- DATAACCESS authority
- The following privileges on the files and the directory in the
HDFS:
- READ and WRITE privileges on the directory and all files in the directory, including files in any subdirectories that might be recursively specified in the LOCATION clause
- EXECUTE privileges on all the directories and subdirectories in the specified LOCATION clause
Syntax
- 1 The HADOOP keyword is required unless you enable the COMPATIBILITY_MODE global variable and use specific Hadoop table clauses. For more information about this requirement, see the Hadoop keyword usage note.
- 2 For compatibility with Hive, when you use the LIKE clause, you can specify only the LOCATION, TBLPROPERTIES, or HINTS clause. This clause is not supported for Hadoop MQTs.
- 3 Each clause can be used only once.
- 4 Specify values for column-name only when you are creating a Hadoop MQT or, optionally, when you are using the AS keyword, not when you are creating a regular Hadoop table. You do not need to specify values for column-name when you use the AS keyword, because the column details can be inferred from the query. Similarly, you do not need to specify values for column-name when you create a table by using a SerDe that dynamically determines the column list from an external data source, such as an Avro schema.
- 5 Db2 Big SQL supports the data types that are described in Data types. However, some data types are syntactically supported for compatibility with Hive but are treated as different data types within the Big SQL runtime environment. For more information, see the mapping of data types that are used in the CREATE TABLE statement.
- 6 If you enable the COMPATIBILITY_MODE global variable, CHAR acts as a synonym for VARCHAR when you create the table, and the system catalog displays the VARCHAR data type.
- 7 Because the STRING data type has no specific size associated with it, the SQL processor might assume that each value in a STRING column always contains 32 KB of data, and performance might be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: run the SET HADOOP PROPERTY command, or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
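For example, the following command sets the default STRING mapping for the current session (a sketch; the size value is illustrative):

```sql
-- Map STRING columns to VARCHAR(255) by default for tables
-- created in this session (the value 255 is illustrative)
SET HADOOP PROPERTY 'bigsql.string.size'='255';
```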
Description
- EXTERNAL
- Indicates that the data in the table is not managed by the database manager. If you drop the table, the table definition is removed from both the database manager and Hive catalogs, but the data remains unaffected. This keyword is typically used with the LOCATION keyword to define a table over existing data. If no location is specified for an explicitly defined external table, or for an HDP 3.0 nonmanaged table that is implicitly external (see the usage note on the implicit application of EXTERNAL), Hive stores the data in the default warehouse location, even if a nonstandard location is specified for the schema in which the table is created.
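For example, the following statement (a sketch with hypothetical names and paths) defines an external table over existing delimited files:

```sql
-- Hypothetical external table over existing comma-delimited files;
-- dropping this table removes only the definition, not the data
CREATE EXTERNAL HADOOP TABLE staging.weblogs (
  client_ip   VARCHAR(15),
  request_url VARCHAR(500)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/bigsql/weblogs';
```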
- HADOOP
- This keyword is required when you are defining a Hadoop table unless you enable the
COMPATIBILITY_MODE global variable. You can verify that you have defined a HADOOP table by querying
the PROPERTY column in the SYSCAT.TABLES catalog view. For
example:
SELECT SUBSTRING(PROPERTY,22,1) FROM SYSCAT.TABLES WHERE TABNAME='<table-name>';
- IF NOT EXISTS
- Checks whether the specified table name already exists. If the name exists, no error is returned.
- table-name
- Specifies a unique name for the new table. The table name must be a valid identifier.
You can use the RENAME statement to rename a Hadoop table. When you rename a Hadoop table, the name in both the local database catalog and the Hive catalog is updated. When you rename a managed Hadoop table, which is a table that is created without the EXTERNAL keyword, the distributed file system storage directory is also updated.
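For example, assuming a table myschema.sales exists (hypothetical names):

```sql
-- Renames the table in both the local database catalog and the Hive
-- catalog; for a managed table, the storage directory is renamed too
RENAME TABLE myschema.sales TO sales_archive;
```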
- element-list
- Defines the elements of a table, including the definition of columns and constraints on the table.
- column-definition
- Defines the attributes of the columns for a new Hadoop table that is not derived from another
table by using the AS clause.
- column-name
- Names a column of the table that is not created by using the AS clause. The name cannot be qualified, and the same name cannot be used for more than one column of the table. The column name must be a valid identifier.
- data-type
- Specifies the data type of the column.
- primitive-type
- There are several built-in data types:
- TINYINT
- A 1-byte integer with a range of -128 to 127.
Db2 Big SQL automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.
- SMALLINT | INT2
- A 2-byte integer with a precision of 5 digits and a range of -32,768 to 32,767.
- INTEGER | INT | INT4
- A 4-byte integer with a precision of 10 digits and a range of -2,147,483,648 to +2,147,483,647.
- BIGINT | INT8
- An 8-byte integer with a precision of 19 digits and a range of -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
- FLOAT(integer) | REAL | FLOAT4 | DOUBLE | FLOAT8
- A single-precision or double-precision floating-point number.
A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38.
A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.
A REAL number is a single-precision floating-point number. A DOUBLE number is a double-precision floating-point number.
- CHARACTER(integer) | CHAR(integer)
- A character string is a sequence of characters. The length of the string (integer) is the number of characters in the sequence. The default is 1.
- STRING | VARCHAR(integer) | CHARACTER VARYING(integer) | CHAR VARYING(integer)
- The STRING data type has no specific size associated with it. The default is VARCHAR(32672).
- TIMESTAMP | DATETIME
- A six- or seven-part value (year, month, day, hour, minute, second, and optional fractional
seconds) that designates a date and time.
If specified, integer must be between 0 and 9 and represents the precision of fractional seconds from 0 (seconds) to 9 (nanoseconds).
- DATE
- A three-part value (year, month, and day) that designates a date. The range of the year part is 0001 to 9999. The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
- DECIMAL | DEC | NUMERIC | NUM (precision-integer, scale-integer)
- A decimal number with defined precision and scale. The precision is the total number of digits and can range from 1 to 31. The scale is the number of digits to the right of the decimal point and can range from 0 to the precision of the number. The default precision is 10 and scale is 0.
- BINARY | VARBINARY (integer)
- A varying length binary value is specified as VARBINARY(length) or
BINARY(length) and can be up to 32,672 bytes long. VARBINARY is an alias for
BINARY, and they are functionally identical in every way. If no length is provided, the length is
assumed to be 32,672.
However, it is recommended that an explicit length be specified. The Db2 Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.
With this data type, the Java™ I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Db2 Big SQL.
- BOOLEAN
- A Boolean value.
- array-type | row-type
- For detailed information about these types, see Array values and
Row values.
Each ARRAY or ROW type can be used in one of two ways: the SQL standard version or the Hive version. The SQL standard version, which is shown in the syntax diagram, is the recommended usage. When you specify the number of elements in the array as part of the table definition, the database manager can compute the maximum size of the column, which enhances query optimization.
With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Db2 Big SQL.
- ARRAY
- An ARRAY is a list that can be indexed. The elements in the array must be of the same type.
Assuming that column C1 is declared as an ARRAY with two or more elements, you can access these
elements by using the following notation:
...WHERE C1[2]=1...
An ARRAY of ROW types is valid.
- integer-constant
- For simple arrays, an integer constant that indicates the cardinality of the array, which is the number of elements in the array.
- INT | INTEGER | VARCHAR (int)
- The data type of the associative array index.
- ROW
- A ROW object contains a value for each attribute of the SQL structured type that it represents
(java.sql.Struct). The ROW type is equivalent to the STRUCT type in Hive tables. The ROW type
contains field definitions that contain field names and their data types. An entry is created in the
SYSIBM.SYSATTRIBUTES table for each field in a ROW definition.
- field-name
- Specifies the name of a field for the ROW type. The name cannot be the same as another field of this ROW type (SQLSTATE 42711).
- data-type
- Specifies the data type of the field. The data type must be a primitive data type.
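The SQL standard forms of the ARRAY and ROW types can be sketched as follows (hypothetical names; specifying the array cardinality lets the database manager compute the maximum size of the column):

```sql
-- ARRAY with an explicit cardinality and a ROW type with primitive
-- fields; the ORC file format is recommended for these data types
CREATE HADOOP TABLE myschema.customers (
  cust_id INT,
  phones  VARCHAR(20) ARRAY[5],
  address ROW(street VARCHAR(100),
              city   VARCHAR(50),
              zip    VARCHAR(10))
)
STORED AS ORC;
```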
- column-options
- You can define column options including a comment, the nullability attribute, or constraints.
- COMMENT column-comment
- A comment that provides additional information about the column.
- NULL | NOT NULL
- NOT NULL specifies that the column will not contain null values. Because the Hadoop table is external to any local database server, Db2 Big SQL enforces nullability at query execution time (SQL5104N). The nullability attribute of a column is enforced during an INSERT or LOAD USING operation. Do not assign a null value to a column that is defined as NOT NULL (SQLSTATE 23502).
- CONSTRAINT constraint-name
- Specifies the name of a constraint. You can define the following constraints on a column:
- PRIMARY KEY
- A primary key must be unique and must have the NOT NULL attribute. A table cannot have more than one primary key.
- UNIQUE
- A unique key prevents duplicate values in one or more NOT NULL columns of a table. A table can have multiple unique keys.
- REFERENCES table-name
- Ensures that required table relationships are maintained and that data entry rules are followed. The column on which this parameter is specified becomes the foreign key to the referenced parent table table-name.
- CHECK check-condition
- Sets restrictions on data that is added to a specific table. This parameter is enforced during
INSERT operations, but is not enforced during a LOAD HADOOP operation, or on data that is made
available outside of the database manager (by using Hive operations, for example).
- search-condition
- The search-condition must be true or unknown for every row of the table, and
has the following restrictions:
- A column reference must pertain to a column of the table that is being created.
- The search condition cannot contain a TYPE predicate.
- The search condition cannot contain any of the following items (SQLSTATE 42621):
- Subqueries
- CAST specifications with a SCOPE clause
- Column functions
- Functions that are not deterministic
- Functions that are defined to have an external action
- User-defined functions that are defined with either MODIFIES SQL or READS SQL DATA
- OLAP specifications
- Global variables
- References to generated columns other than the identity column
- An error tolerant nested table expression
Check constraints with search-condition are enforced when rows in the table are inserted or updated. A check constraint that is defined on a table automatically applies to all subtables of that table.
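A column-level check constraint might look like the following sketch (hypothetical names); note that it is enforced during INSERT operations but not during a LOAD HADOOP operation:

```sql
CREATE HADOOP TABLE myschema.orders (
  order_id INT NOT NULL,
  qty      INT,
  CONSTRAINT chk_qty CHECK (qty > 0)
)
STORED AS PARQUET;
```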
- functional-dependency
- Defines a functional dependency between columns or sets of columns.
The parent set of columns precedes the DETERMINED BY clause, and the child set of columns follows the DETERMINED BY clause. All restrictions on search-condition apply to both parent set and child set columns, and only simple column references are allowed in the column sets (SQLSTATE 42621). The same column must not be identified more than once in a functional dependency (SQLSTATE 42709). No column in the child set of columns can be a nullable column (SQLSTATE 42621).
A functional dependency is not enforced by the database manager during normal operations such as insert, update, or delete. The functional dependency might be used during query rewrite for optimization. Incorrect results might be returned if the integrity of a functional dependency is not maintained.
- constraint-attributes
- Defines attributes that are associated with primary key, unique, referential integrity, or check constraints.
- NOT ENFORCED | ENFORCED
- Specifies whether the constraint is enforced by the database manager during normal operations
such as insert, update, or delete.
- NOT ENFORCED
- Specifies that the constraint is not enforced by the database manager. This is the default for
Hadoop tables. A primary key or unique constraint cannot be NOT ENFORCED if there is a dependent
ENFORCED referential constraint.
- TRUSTED | NOT TRUSTED
- Specifies whether the data can be trusted to conform to the constraint.
- TRUSTED
- Specifies that the data can be trusted to conform to the constraint. This is the default. Use this option only if the table data is independently known to conform to the constraint; otherwise, unpredictable results might occur.
- NOT TRUSTED
- Specifies that the data cannot be trusted to conform to the constraint. This option is valid only for referential integrity constraints (SQLSTATE 42613). Use this option when the data conforms to the constraint for most rows, but it is not independently known whether all current and future rows will conform. If a constraint is both not trusted and enabled for query optimization, it will not be used for any optimization that depends on the data being in complete conformity with the constraint.
- ENFORCED
- Although the ENFORCED keyword is available for compatibility with the syntax for local tables that are created with the CREATE TABLE statement, an error is returned when you specify this option for Hadoop tables (SQLSTATE 42858).
- ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION
- Specifies whether the constraint or functional dependency can be used for query optimization
under appropriate circumstances.
- ENABLE QUERY OPTIMIZATION
- Specifies that the constraint is assumed to be true and can be used for query optimization. This is the default.
- DISABLE QUERY OPTIMIZATION
- Specifies that the constraint cannot be used for query optimization. This option cannot be specified for primary key and unique constraints (SQLSTATE 42613).
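For example, an informational primary key (hypothetical names) that the optimizer can exploit:

```sql
-- NOT ENFORCED is the default for Hadoop tables; ENABLE QUERY
-- OPTIMIZATION lets the optimizer assume that the constraint holds
CREATE HADOOP TABLE myschema.products (
  product_id INT NOT NULL,
  name       VARCHAR(100),
  PRIMARY KEY (product_id) NOT ENFORCED ENABLE QUERY OPTIMIZATION
)
STORED AS PARQUET;
```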
- column-name
- Names a column of the source table when you create a new Hadoop (target) table by using the AS clause. If a column name is not specified, column names for the target table are derived from the result set of a query against the source table. You cannot explicitly specify the data types of these columns, because the types are derived from data types in the result set (example 20).
- unique-constraint
- Defines a unique key. This constraint does not allow duplicate values in one or more columns of the table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.
- referential-constraint
- Defines a referential constraint.
- CONSTRAINT constraint-name
- Specifies a name for the referential constraint.
- FOREIGN KEY (column-name, ...)
- The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the parent table, and the same column must not be identified more than once. The number of foreign key columns must be the same as the number of parent key columns, and the data types of the corresponding columns must be compatible (SQLSTATE 42830).
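A referential constraint between two Hadoop tables might be sketched as follows (hypothetical names; when no parent key columns are listed, the primary key of the referenced table is assumed):

```sql
CREATE HADOOP TABLE myschema.order_items (
  item_id  INT NOT NULL,
  order_id INT NOT NULL,
  CONSTRAINT fk_order FOREIGN KEY (order_id)
    REFERENCES myschema.orders NOT ENFORCED
)
STORED AS PARQUET;
```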
- check-constraint
- Defines a check constraint.
- constraint-name
- Specifies a name for the check constraint.
- check-condition
- See CHECK check-condition.
- LIKE table-name
- Specifies that the table columns are to have exactly the same name and description as the columns of a table, view, or nickname that exists in the catalog, or a declared temporary table. The table name can refer only to another Hadoop table (SQLSTATE 42858).You cannot specify a typed table or view (SQLSTATE 428EC), and the Hadoop table that is referenced in table-name cannot contain an ARRAY or ROW column (SQLSTATE 428H2).
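For example (hypothetical names), combining LIKE with the LOCATION clause:

```sql
-- Copies the column names and descriptions of myschema.sales;
-- only LOCATION, TBLPROPERTIES, or HINTS can accompany LIKE
CREATE HADOOP TABLE myschema.sales_2024 LIKE myschema.sales
LOCATION '/user/bigsql/sales_2024';
```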
- COMMENT string-constant
- Specifies a comment that provides additional information about the table. The maximum length of string-constant is 254 bytes.
- PARTITIONED BY partitioning-element-list
- Specifies how the data is to be partitioned. The action of including the PARTITIONED BY clause creates a directory in the distributed file system that contains data for each unique value in the column that is being partitioned. At query time, Db2 Big SQL tries to scan only those directories, or partitions, that contain data matching the query's search criteria.
- column-definition
- Specifies one or more columns by which data that is inserted into the table is partitioned. The
definition includes the column name, the column data type, and any column options. This column
definition does not apply to tables that are created by using the AS clause. The following data
types cannot be used as partitioning columns:
- ARRAY
- BINARY
- BOOLEAN
- DOUBLE
- FLOAT
- MAP
- REAL
- ROW
- STRUCT
- TIMESTAMP
- VARBINARY
Important: A partitioning key value cannot include any trailing blank spaces. This can be a problem if you plan to partition on a CHAR data type, because CHAR values are always padded to their full length with blank spaces. Db2 Big SQL also does not support partitioning on a zero-length string.
- column-name
- Names a column of the source table that identifies a partitioning column that is to be used when creating a new (target) table or MQT by using the AS clause. The data type of the partitioning column is derived from the result set of a query against the source table. If you specify a list of column names, the list must contain as many names as there are columns in the result set. Each column name must be unique and unqualified.
- partition-expression
- Specifies one or more partitioning expressions by which data that is inserted into the table is
partitioned. The action of including the PARTITIONED BY clause creates a directory in the
distributed file system (DFS) that contains data for each unique value in the partitioning
expression. At query time, Db2 Big SQL tries to scan only
those directories, or partitions, that contain data that match the search criteria of the
query.The expression can reference one or more columns in the columns list. The AS clause specifies a name for the column that is derived from the expression. The data type of this identifier matches the result type of the expression. If a query references the column that is derived from the expression, a new predicate is automatically injected into the query so that Db2 Big SQL scans only the qualifying partitions. The data type of the result of the expression cannot be one of the following types:
- ARRAY
- BINARY
- BOOLEAN
- DOUBLE
- FLOAT
- MAP
- REAL
- ROW
- STRUCT
- TIMESTAMP
- VARBINARY
Important: A partitioning key value cannot include any trailing blank spaces. This can be a problem if you plan to partition on a CHAR data type, because CHAR values are always padded to their full length with blank spaces. Db2 Big SQL also does not support partitioning on a zero-length string.
The partitioning expression cannot contain any of the following items (SQLSTATE 42621):
- Subqueries
- Column functions
- Dereference operations or DEREF functions
- User-defined or built-in functions that are non-deterministic
- User-defined functions that use the EXTERNAL ACTION option
- User-defined functions that are not defined with NO SQL
- Host variables or parameter markers
- Special registers or built-in functions that depend on the value of a special register
- Global variables
- An expression on a partitioning expression column
- An expression that uses a JSON function
Important: The LOAD HADOOP statement does not automatically generate column values for a table that is partitioned by a partition-expression; the INSERT statement does. If values for the derived column must be generated automatically, use the INSERT statement instead of the LOAD HADOOP statement. If all column values are generated in advance and stored in a file, you can use the LOAD HADOOP statement.
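The column form of partitioning can be sketched as follows (hypothetical names); each distinct sale_date value gets its own directory in the distributed file system, and queries that filter on sale_date scan only the matching partitions:

```sql
CREATE HADOOP TABLE myschema.daily_sales (
  sale_id BIGINT,
  amount  DECIMAL(10,2)
)
PARTITIONED BY (sale_date DATE)
STORED AS PARQUET;
```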
- ROW FORMAT
- Specifies the format of data rows.
- delimited-row-format
- Specifies a delimiter at the table level.
- DELIMITED
- Valid values for the FIELDS TERMINATED BY and ESCAPED BY parameters are the following items. Important: Use a different value for each of these parameters; otherwise, the behavior is undefined.
- A literal single-byte character, such as a vertical bar (|) or a comma (,).
- A value of the format \nnn, where nnn is the octal value of a character. For example, \001 is the ASCII character that represents CTRL-A. When you specify a delimiter as an octal value (\nnn), nnn must be in the range 0 to 177, which is the range for lower ASCII characters.
- A value of the format \xYY, where YY is a two-digit hexadecimal value of a character. This representation is retained for compatibility with Db2 Big SQL v1; however, the literal character or octal representations are recommended.
- A decimal integer value in the range -128 to 127. For example, the vertical bar (|) character is represented either by the octal value \174 or by the decimal integer value 124. To specify a higher (or extended) ASCII character, use a negative decimal value; negative values are subtracted from 256 to get the extended ASCII code value. For example, you can specify a section sign character (§, extended ASCII code 245) as -11.
- FIELDS TERMINATED BY terminator-char
- Specifies a delimiter for structured fields. You can include any single character, but the default is \001, the ASCII character that represents CTRL-A. Characters that match the field terminator character can be part of your data if you specify the ESCAPED BY clause and those characters are properly escaped.
- COLLECTION ITEMS TERMINATED BY terminator-char
- Specifies a delimiter for the individual items in a collection. You can include any single character, but the default is \002, the ASCII character that represents CTRL-B.
- LINES TERMINATED BY terminator-char
- Specifies a delimiter for data rows. The only valid delimiter is a newline character (\n). Newline characters must not exist within your data because they cannot be escaped.
- MAP KEYS TERMINATED BY terminator-char
- Specifies a delimiter for map keys. A map is a simple set of key/value pairs in a
list. It is a collection, so the collection items terminator enables you to split out each map
entry. However, within each map entry, the map keys terminator separates the key from its value. For
example, pf#500$epf#200 represents two map entries (pf mapped to 500 and epf mapped to 200); the collection items terminator is $ and the map keys terminator is #.
- NULL DEFINED AS null-value
- Specifies a value to represent the null value. The default is a literal backslash character followed by an uppercase N (\N).
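Putting the delimiter clauses together, a delimited text table might be sketched as follows (hypothetical names; TEXTFILE is assumed as the underlying format):

```sql
CREATE HADOOP TABLE myschema.csv_input (
  id   INT,
  name VARCHAR(100)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  NULL DEFINED AS '-'
STORED AS TEXTFILE;
```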
- serde-row-format
- Specifies the format of data rows as a SerDe class.
- SERDE serde-class
- Specifies the name of a Java class in the Db2 Big SQL server CLASSPATH that implements the Hive SerDe interface org.apache.hadoop.hive.serde2.SerDe.
- WITH SERDEPROPERTIES ('string-constant1' = 'string-constant2')
- Associates SerDe properties with the SerDe class. For example, if you want to eventually load HBase data into a Hive table, create the table by using this option and the hbase.columns.mapping parameter.
To use a SerDe JAR file, complete the following steps to copy the JAR file to the appropriate libraries:
- Stop the Db2 Big SQL server.
- Copy the JAR file to $BIGSQL_HOME/userlib and $HIVE_HOME/lib.
- Restart the Db2 Big SQL server.
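For example, the following sketch assumes a JSON SerDe class that is already on the server CLASSPATH (the class name is illustrative):

```sql
CREATE HADOOP TABLE myschema.json_events (
  event_id INT,
  message  VARCHAR(200)
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```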
- STORED AS file-format
- Specifies the format of the underlying data files. Db2 Big SQL can read most available file storage formats. For more information about the supported file formats, see File formats supported by Db2 Big SQL.
- STORED BY storage-handler
- Specifies the name of a Java class in the Db2 Big SQL server CLASSPATH that implements the storage handler interface. For example, you can use org.apache.hadoop.hive.ql.metadata.HiveStorageHandler for the Hive interface. Use org.apache.hadoop.hive.hbase.HBaseStorageHandler for the HBase interface.
- LOCATION string-constant
- Specifies that a particular distributed file system directory must be used to store data files.
Access to the path that is specified in string-constant is controlled by the HDFS
file permissions. Access can also be
granted.
Note: The LOCATION clause must always specify a directory that contains the files for the table that is being created. The files must have the same structure. If you have files with different structures and you want to create a different table for each structure, the files must be in separate directories.
Db2 Big SQL provides configuration parameters to enforce the requirement that the LOCATION clause must be used with the EXTERNAL clause. These configuration parameters ensure that the DROP statement does not accidentally drop data that is outside of the hive/warehouse directory.
For more information, see Using S3 Object Storage with Db2 Big SQL tables.
- TBLPROPERTIES ('string-constant1' = 'string-constant2')
- Defines job properties that can configure input or output formats, and that you can use to specify more detailed information about the table. The properties are passed as-is to the underlying InputFormat and OutputFormat classes; properties that are not implemented by these classes are ignored. Db2 Big SQL does not validate the properties.
- SORT BY column-name
- Specifies one or more columns by which data that is inserted into the table is sorted. Use this clause to improve query performance with some file formats (such as ORC).
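The file format and table property clauses can be combined; the following sketch uses hypothetical names and an illustrative ORC property:

```sql
-- TBLPROPERTIES values are passed as-is to the underlying
-- InputFormat/OutputFormat classes and are not validated
CREATE HADOOP TABLE myschema.metrics (
  metric_id  BIGINT,
  metric_val DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
```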
- as-result-table
- Specifies that the table is to be based on a SELECT statement. The table name cannot refer to a
Hadoop table that contains an ARRAY or ROW column (SQLSTATE 428H2).
- AS (fullselect)
- Specifies that, for each column in the derived result table of the
fullselect, a corresponding column is to be defined for the table. Each defined
column adopts the following attributes from its corresponding column of the result table (if
applicable to the data type):
- Column name
- Column description
- Data type, length, precision, and scale
- Nullability
The following attributes are not included (although the default value and identity attributes can be included by using copy-options):
- Default value
- Identity attributes
- Hidden attribute
- ROW CHANGE TIMESTAMP
- Any other optional attributes of the tables or views that are referenced in the fullselect
The following restrictions apply:
- Every select list element must have a unique name (SQLSTATE 42711). The AS clause can be used in the select clause to provide unique names.
- The fullselect cannot refer to host variables or include parameter markers.
- The data types of the result columns of the fullselect must be data types that are valid for columns of a table.
- If row or column level access control (RCAC) is activated for any table that is specified in the fullselect, RCAC is not cascaded to the new table.
- The fullselect cannot include a data-change-table-reference clause (SQLSTATE 428FL).
- Any valid fullselect that does not reference a typed table or a typed view can be specified.
- WITH NO DATA
- Specifies that the target table is not to be populated with data.
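A CREATE ... AS sketch (hypothetical names; clause order follows the syntax diagram, and the column names and types are derived from the fullselect):

```sql
-- WITH NO DATA creates the table definition only;
-- omit it to populate the table from the query result
CREATE HADOOP TABLE myschema.big_orders
STORED AS PARQUET
AS (SELECT order_id, qty FROM myschema.orders WHERE qty > 100)
WITH NO DATA;
```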
- DATA INITIALLY DEFERRED
- Specifies that data is not inserted into the table as part of the CREATE TABLE (HADOOP) statement.
- REFRESH DEFERRED
- Specifies that the table will be populated with data at a later time.
- MAINTAINED BY USER
- Specifies that the data in the materialized query table is maintained by the user. The user is allowed to perform insert or truncate operations against user-maintained materialized query tables. The REFRESH TABLE statement cannot be run against user-maintained materialized query tables.
- ENABLE QUERY OPTIMIZATION
- Specifies that the materialized query table can be used for query optimization.
- DISABLE QUERY OPTIMIZATION
- Specifies that the materialized query table is not used for query optimization. The table can still be queried directly. This is the default.
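A user-maintained Hadoop MQT might be sketched as follows (hypothetical names); because it is MAINTAINED BY USER, you populate it with INSERT rather than REFRESH TABLE:

```sql
CREATE HADOOP TABLE myschema.sales_by_region (region, total)
AS (SELECT region, SUM(amount) FROM myschema.sales GROUP BY region)
DATA INITIALLY DEFERRED REFRESH DEFERRED MAINTAINED BY USER;
```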
Usage notes
- The maximum number of Hadoop tables that can be created in Db2 Big SQL is 1,048,576.
- By default, the HADOOP keyword is a required element of the CREATE TABLE
statement and is recommended. However, if you enable the COMPATIBILITY_MODE global variable and the
HADOOP keyword is not included on the CREATE TABLE statement, the type of table that is created is
determined by the presence of one or more of the following Hadoop table clauses:
- COMMENT
- EXTERNAL
- LOCATION
- PARTITIONED BY
- ROW FORMAT
- STORED AS
- STORED BY
- STRING
- TBLPROPERTIES
- WITH HINTS
- One default consequence of running the CREATE TABLE (HADOOP) statement is that GRANT statements are applied for a connected user in Hive. If Hive impersonation is disabled, you must complete the following steps to change the hive.security.metastore.authorization.manager property in the hive-site.xml file:
- In the Ambari dashboard, open the Hive configuration settings.
- Scroll to find the hive.security.metastore.authorization.manager property.
- Type the following value into the field: org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
- Click Save at the top of the Configs page. When the Save Configuration window opens, type a description of your change and click Save. Click OK in the confirmation window.
- Restart the Hive service and related components, and then click Confirm Restart All.
- If the current schema, which might have been created with either the USE or SET SCHEMA command, does not exist in the Hive metastore, an attempt is made to automatically create the schema in Hive.
- All referential integrity constraints and nullability indicators are advisory and are not enforced at data load time.
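For example, an informational primary key can be declared with the NOT ENFORCED attribute. The following sketch (table and column names are illustrative) documents uniqueness for the query optimizer, but duplicate ORDERKEY values are not rejected when data is loaded:

```sql
CREATE HADOOP TABLE orders_info (
  orderkey INT NOT NULL,
  o_comment VARCHAR(20),
  PRIMARY KEY (orderkey) NOT ENFORCED  -- advisory only; not checked at load time
);
```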
- When you use data types such as STRING and BINARY, the SQL processor might
assume that such cells always contain 32 KB of data, which can impact performance. Instead, use data
types that have a specific length. To improve performance, use a VARCHAR(n) with
a defined size instead of a STRING. Alter the table to define an appropriate length for the column
by using the following syntax:
ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
This alter operation changes the definition of the table both in Db2 Big SQL and in Hive. If data in the VARCHAR(n) column exceeds <size>, a null value is returned when that data is queried.
You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways:
- Run the SET HADOOP PROPERTY command.
- Set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
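For example, a session-level setting might look like the following sketch (the value 255 is an illustrative choice, not a recommended default):

```sql
-- Illustrative: map STRING columns to VARCHAR(255) for this session
SET HADOOP PROPERTY 'bigsql.string.size'='255';
```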
- If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(9), which is compatible with the maximum precision of a TIMESTAMP value in Db2 Big SQL or Hive.
- An ARRAY or ROW column cannot be compared directly with another ARRAY or ROW column (SQLSTATE 42818). The elements in an ARRAY or the fields in a ROW can be compared only to another expression of a primitive type. For example, assume that A and B are ARRAY columns, and that C and D are ROW columns.
- The following statement is not allowed:
...where a = b...
- The following statement is valid if the element type is a primitive type:
...where a[1] = b[1]...
- The following statement is not allowed:
...where c = d...
- The following statement is valid if the field type is a primitive type:
...where c.field1 = d.field1...
- If you create a table in a schema that does not exist, the schema is automatically created with the default name (your user ID).
- The columns and data types for an Avro table are fixed at the time that you run the CREATE TABLE
(HADOOP) statement. If you create the table with an EXTERNAL schema by using the
avro.schema.url configuration property and then change the schema, the change
is not immediately recognized. This change can result in query execution errors.
To ensure that your changes are recognized, you must drop the local table from the Db2 Big SQL catalog, and then import the table from the Hive catalog. Use the following commands to drop and import the table:
SET SYSHADOOP.CATALOG_SYNC_MODE=0;
DROP TABLE avro_table;
SET SYSHADOOP.CATALOG_SYNC_MODE=1;
CALL HCAT_SYNC_OBJECTS('schema1', 'avro_table');
- CMX compression is supported in Db2 Big SQL. To enable CMX compression, set the following compression properties before running the CREATE TABLE (HADOOP) statement:
SET HADOOP PROPERTY 'mapred.output.compress'='true';
SET HADOOP PROPERTY 'mapred.output.compression.type'='BLOCK';
SET HADOOP PROPERTY 'mapred.output.compression.codec'='org.apache.hadoop.io.compress.CmxCodec';
- The following table shows the data type mappings between types that are supported in the CREATE TABLE (HADOOP) syntax, the corresponding types in the Hive catalog, and the native SQL types that Db2 Big SQL uses at run time:
Table 1. Data type mappings

| CREATE TABLE (HADOOP) data type (1) | Hive data type (2) | SQL data type (3) |
|---|---|---|
| BIGINT | BIGINT | BIGINT |
| BINARY | BINARY (starting with Hive 0.8.0) | VARBINARY |
| BOOLEAN | BOOLEAN | BOOLEAN |
| CHAR | CHAR | CHAR |
| DATE | DATE | DATE |
| DECIMAL | DECIMAL (introduced in Hive 0.11.0 with a precision of 38 digits) | DECIMAL |
| DOUBLE | DOUBLE | DOUBLE |
| FLOAT | DOUBLE | FLOAT |
| INT | INT | INT |
| REAL | FLOAT | REAL |
| SMALLINT | SMALLINT | SMALLINT |
| STRING | STRING | VARCHAR(32672) |
| TIMESTAMP | TIMESTAMP | TIMESTAMP(9) |
| TIMESTAMP(n) | TIMESTAMP | TIMESTAMP(n) |
| TINYINT | TINYINT | SMALLINT |
| VARBINARY | BINARY | VARBINARY |
| VARCHAR | VARCHAR | VARCHAR |

- 1 This column lists the data types that you can specify in the CREATE TABLE (HADOOP) statement.
- 2 This column lists the corresponding data types as defined in the Hive metastore. For example, a column that is defined as FLOAT in the statement is listed as a DOUBLE column in the Hive catalogs, and table values for that column must conform to Hive (or SerDe) rules for DOUBLE values.
- 3 This column lists the corresponding native Db2 Big SQL data types. When such a column interacts with other columns through a SELECT or INSERT statement, for example, the data must be treated as the Db2 Big SQL type.
- If Db2 Big SQL is reinstalled, existing Hadoop MQTs are resynchronized from Hive and are no longer defined as Hadoop MQTs. These tables must be dropped and re-created; otherwise, the following error is returned:
SQL0159N The statement references an object that identifies an unexpected object type. Object: "<TABLE NAME>". Object type: "NON MQT". Expected object type: "MQT". SQLSTATE=42809
- Hadoop tables, HBase tables, and local Db2 tables all use the same namespace. If you create tables in Hive that are synchronized to Db2 Big SQL, ensure that these tables do not conflict with local Db2 tables in Big SQL.
- When a table is created in
Db2 Big SQL by using the CREATE TABLE (HADOOP) statement, Db2 Big SQL in turn creates a table on the Hive metastore. Starting with HDP 3.0, Hive tables are managed tables by default (for
background information on managed tables, see Managed vs.
External Tables). Hive tables that are implicitly created by Db2 Big SQL, however, are not Hive managed tables. To create a table, Db2 Big SQL runs a CREATE EXTERNAL TABLE statement rather than a
CREATE TABLE statement in Hive. But unlike what happens with Hive tables that are explicitly created
as EXTERNAL, when a Big SQL table is dropped, Db2 Big SQL
deletes the data that was associated with the table, as you would expect with a managed Hive table.
You can use output from the db2look utility to determine whether a table is an external table or a Hive managed table. For example, consider the following statement:
CREATE HADOOP TABLE foo(c1 INT) STORED AS ORC;
Running db2look against this statement returns the following output, which in this example shows that the table is an external table. The table property 'external.table.purge'='true' indicates that the data that is associated with the table will be deleted when the table is dropped.
CREATE EXTERNAL HADOOP TABLE "BIGSQL "."FOO" (
  C1 INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs:///warehouse/tablespace/external/hive/bigsql.db/foo'
TBLPROPERTIES(
  'bucketing_version'='2',
  'external.table.purge'='true'
);
- The following syntax alternatives are supported for compatibility
with other database products:
- You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type. For example, you can specify ARRAY<VARCHAR(8)> instead of VARCHAR(8) ARRAY[].
- You can use the MAP syntax that Hive uses to define an associative ARRAY data type. For example, you can specify MAP<VARCHAR(10), DOUBLE> instead of DOUBLE ARRAY[VARCHAR(10)].
- You can use the STRUCT syntax that Hive uses to define a ROW data type. For example, you can specify STRUCT<F1:INTEGER, F2:DOUBLE> instead of ROW(F1 INTEGER, F2 DOUBLE).
Restrictions
- The table-name in the same schema must not already exist in the Hive catalog,
regardless of case differences. For example, the second table definition is not
valid:
CREATE HADOOP TABLE "Foo" (c1 INT);
CREATE HADOOP TABLE "FOO" (c1 INT);
- The ENFORCED constraint attribute cannot be applied to Db2 Big SQL Hadoop tables (SQLSTATE 42858).
- If you are defining a local database table with the LIKE clause while the COMPATIBILITY_MODE global variable is enabled, you cannot include the LOCATION, TBLPROPERTIES, or HINTS clause.
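By contrast, for a Hadoop table the LIKE clause can be combined with the LOCATION, TBLPROPERTIES, or HINTS clause, as in the following sketch (table names and path are illustrative):

```sql
-- Hadoop table: LIKE plus LOCATION is permitted
CREATE HADOOP TABLE sales_copy LIKE sales
  LOCATION '/user/hive/warehouse/sales_copy';
```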
- The following restrictions apply to the ARRAY and ROW data types:
- The first column of a table cannot be defined with the ARRAY or ROW data type.
- Use of the ARRAY or ROW data type in a key definition or constraint is not supported (SQLSTATE 42962).
- Columns that are defined with the ARRAY or ROW data type cannot be used as the anchor object of an anchored type (SQLSTATE 428HS).
- Referencing a table that contains a column of type ARRAY or ROW in the AS or LIKE clause is not supported (SQLSTATE 428H2).
- The maximum number of ARRAY or ROW definitions in a database is 32,763 (SQLSTATE 54035). This limit also includes any user-defined types that are created by using the CREATE TYPE statement.
- Db2 Big SQL ARRAY and ROW types are not compatible with SQL PL ARRAY and ROW types.
- Use of an ARRAY type as a type within an ARRAY type for Hadoop tables is not supported.
- The result type of a column in a fullselect cannot be ARRAY.
- The use of scrollable cursors for Hadoop tables is not supported.
Examples
- Create a simple Hadoop
table.
CREATE HADOOP TABLE t1 ( c1 INT, c2 VARCHAR(20), c3 DOUBLE, c4 REAL );
- Create a table with an underlying delimited text
file.
CREATE HADOOP TABLE t ( i INT, s VARCHAR(10) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- Create tables with underlying BINARY and TEXT sequence
files.
CREATE HADOOP TABLE seq_tab ( c1 INT NOT NULL, c2 VARCHAR(20) ) STORED AS BINARY SEQUENCEFILE;
CREATE HADOOP TABLE t ( i INT, s VARCHAR(10) ) STORED AS TEXT SEQUENCEFILE;
- Use the LazySimpleSerDe class to create a table.
CREATE HADOOP TABLE t ( i INT, s VARCHAR(10) ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS SEQUENCEFILE;
- Create a table with underlying Parquet
files.
CREATE HADOOP TABLE t ( i INT, s VARCHAR(10) ) STORED AS PARQUETFILE;
- Create an Avro table with an inline Avro
schema.
CREATE EXTERNAL HADOOP TABLE bs_rev_profit_by_compaign ( Revenue DOUBLE, GrossProfit DOUBLE, CompaignName VARCHAR(10) ) COMMENT 'A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement' ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/user/biadmin/avro/sheet/' TBLPROPERTIES ( 'avro.schema.literal'='{"type":"record", "name": "TUPLE_3", "fields": [ { "name": "Revenue","type": [ "null","double" ], "doc": "autogenerated from Pig Field Schema"}, {"name":"GrossProfit","type": [ "null", "double" ], "doc": "autogenerated from Pig Field Schema"}, {"name": "CompaignName","type": [ "null", "string" ], "doc":"autogenerated from Pig Field Schema"} ] } ') ;
- Create a table with a partitioning
column.
CREATE HADOOP TABLE orders ( orderkey INT NOT NULL, o_comment VARCHAR(20) COMMENT 'comment column' NOT NULL ) PARTITIONED BY (orderdate VARCHAR(10));
- Create a table with a partition that is based on a partitioning expression on a
column.
CREATE HADOOP TABLE orders ( orderkey INT NOT NULL, o_comment VARCHAR(20) COMMENT 'comment column' NOT NULL, orderdate DATE ) PARTITIONED BY (MONTH(orderdate) AS order_month);
- Create a table that is like the following table:
CREATE HADOOP TABLE employee ( empno INT, fname VARCHAR(100), lname VARCHAR(100), dept VARCHAR(100) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Convert the data to a binary sequence file, which can improve performance.
CREATE HADOOP TABLE employeeBINSEQ ( empno, fname, lname, dept ) STORED AS BINARY SEQUENCEFILE AS (SELECT * FROM employee);
- Create a table by using the structure of another table, but without using any of the data from
the source
table.
CREATE HADOOP TABLE t1 ( c1, c2 ) AS (SELECT x1, x2 FROM t2) WITH NO DATA;
- Create a table with complex data types (STRUCT and
ARRAY).
CREATE HADOOP TABLE rf_del1 ( id VARCHAR(10), c1 STRUCT<f1:varchar(10), f2:varchar(10)>, c2 ARRAY<varchar(10)> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n';
- Create a Hadoop table that can ingest HBase
data.
CREATE HADOOP TABLE staff_hbase ( id SMALLINT, name VARCHAR(100), dept SMALLINT, job VARCHAR(10), years SMALLINT, salary VARCHAR(10), comm VARCHAR(10) ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping' = ':key,info:name,info:dept,role:job,role:years,money:salary,money:comm');
For more information, see Loading HBase data into a Hive table.
- Create a simple table with an array of integers whose maximum cardinality is
10.
CREATE HADOOP TABLE mytable ( id INT, col1 INT ARRAY[10]);
Select the value of the second element in the array.
SELECT col1[2] FROM mytable;
Return all values for ID that are based on a particular value in COL1.
SELECT id FROM mytable WHERE col1[8]=12;
- Create a simple table with a STRUCT
column.
CREATE HADOOP TABLE mytable ( id INT, col1 STRUCT<c1:INT, c2:DOUBLE> );
Select element C1 from COL1.
SELECT col1.c1 FROM mytable;
Return all values for ID and COL1.C2 that are based on a particular value in COL1.C1.
SELECT id, col1.c2 FROM mytable WHERE col1.c1 = 5;
- Create a table with the ROW data
type.
CREATE HADOOP TABLE mytable ( id INT, col1 ROW(c1 INT,c2 DOUBLE) );
Select field C1 from COL1.
SELECT col1.c1 FROM mytable;
- Create a table with the MAP data type, an associative array of integer values that can be
indexed by a VARCHAR(20)
key.
CREATE HADOOP TABLE mytable ( id INT, col1 INT ARRAY[VARCHAR(20)] );
Select an element value from the MAP where the index value is 'MY_STRING'.
SELECT col1['MY_STRING'] FROM mytable;
- Create a view from a Hadoop table named TESTVIEW01 that has a column of type
ARRAY.
CREATE HADOOP TABLE testview01 ( id INT, arr TINYINT ARRAY[5] );
CREATE VIEW view001(i,c1) AS SELECT id, arr FROM testview01 WHERE arr[1] < 8;
Select the value of the first element in the array.
SELECT c1[1] FROM view001;
- Create a table T2 with data types that are different from those in the source table
T1.
CREATE HADOOP TABLE t1 ( c1 INT, c2 INT); CREATE HADOOP TABLE t2 ( c1, c2) AS ( SELECT DOUBLE(c1), DOUBLE(c2) FROM t1);
- Create a Hadoop
MQT.
CREATE HADOOP TABLE hadoopmqt AS ( SELECT c1, c2 FROM t1) DATA INITIALLY DEFERRED REFRESH DEFERRED MAINTAINED BY USER DISABLE QUERY OPTIMIZATION;
- Create a partitioned Hadoop
MQT.
CREATE HADOOP TABLE hadooppartitionedmqt PARTITIONED BY (c2) AS ( SELECT c1, c2 FROM t1) DATA INITIALLY DEFERRED REFRESH DEFERRED MAINTAINED BY USER DISABLE QUERY OPTIMIZATION;