CREATE TABLE (HADOOP) statement

Use the CREATE HADOOP TABLE statement to create Hive tables for the Hadoop environment.

Invocation

This statement is an executable statement that cannot be dynamically prepared.

Syntax

The following syntax fragments are reconstructed from the syntax diagrams. Brackets [ ] indicate optional items, braces { } with vertical bars | indicate alternatives, and [, ...] indicates that the preceding item can be repeated with comma separators. Parenthesized numbers refer to the notes that follow. In array-type, the square brackets after ARRAY are literal characters.

CREATE [EXTERNAL] HADOOP(1) TABLE [IF NOT EXISTS] table-name
    { element-list(2) | LIKE table-name(3)(4) }
    [COMMENT string-constant]
    [PARTITIONED BY partitioning-element-list]
    [ROW FORMAT { delimited-row-format | serde-row-format }]
    [STORED AS file-format | STORED BY storage-handler]
    [LOCATION string-constant]
    [TBLPROPERTIES (string-constant = string-constant [, ...])]
    [CACHED IN pool-name]
    [as-result-table]

element-list
    ( { column-definition | unique-constraint | referential-constraint | check-constraint } [, ...] )

column-definition
    column-name data-type [column-options]

data-type
    { primitive-type | array-type | row-type }

primitive-type(5)
    TINYINT
    SMALLINT | INT2
    INTEGER | INT | INT4
    BIGINT | INT8
    FLOAT | FLOAT8
    REAL | FLOAT4
    DOUBLE [PRECISION] | FLOAT8
    CHARACTER[(integer)] | CHAR[(integer)](6)
    STRING(7)
    VARCHAR(integer) | CHARACTER VARYING(integer) | CHAR VARYING(integer)
    TIMESTAMP[(integer)]
    DATE [STORED AS TIMESTAMP | STORED AS DATE]
    DECIMAL[(integer[, integer])] | DEC | NUMERIC | NUM    (default precision and scale: (10,0))
    BINARY[(integer)]
    VARBINARY[(integer)]

array-type
    { primitive-type | row-type } ARRAY[integer-constant]                (ordinary array; the integer-constant is optional)
    { primitive-type | row-type } ARRAY[INT | INTEGER | VARCHAR(int)]   (associative array)

row-type
    ROW (field-name data-type [, ...])

column-options
    [COMMENT column-comment] [NULL | NOT NULL] [[CONSTRAINT constr-name] constraint-choices [constraint-attributes]]

constraint-choices
    { PRIMARY KEY | UNIQUE | references-clause | CHECK (check-condition) }

references-clause
    REFERENCES table-name [(column-name [, ...])]

constraint-attributes
    { NOT ENFORCED [TRUSTED | NOT TRUSTED] | ENFORCED } [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION]

partitioning-element-list
    ( { column-definition | partition-expression } [, ...] )

partition-expression
    expression AS identifier

delimited-row-format
    DELIMITED [FIELDS TERMINATED BY terminator-char [ESCAPED BY esc-char]]
        [COLLECTION ITEMS TERMINATED BY terminator-char]
        [LINES TERMINATED BY terminator-char]
        [NULL DEFINED AS null-value]

serde-row-format
    SERDE serde-class [WITH SERDEPROPERTIES (string-constant = string-constant [, ...])]

file-format
    { TEXTFILE | SEQUENCEFILE | BINARY SEQUENCEFILE | TEXT SEQUENCEFILE | RCFILE | ORC | PARQUETFILE | PARQUET | custom-file-format }

custom-file-format
    INPUTFORMAT class-name OUTPUTFORMAT class-name [INPUTDRIVER class-name] [OUTPUTDRIVER class-name]

storage-handler
    class-name [WITH SERDEPROPERTIES (string-constant = string-constant [, ...])]

unique-constraint
    [CONSTRAINT constraint-name] { UNIQUE | PRIMARY KEY } (column-name [, ...]) constraint-attributes

referential-constraint
    [CONSTRAINT constraint-name] FOREIGN KEY (column-name [, ...]) references-clause constraint-attributes

check-constraint
    [CONSTRAINT constraint-name] CHECK (check-condition)

check-condition
    { search-condition | functional-dependency }

functional-dependency
    { column-name | (column-name [, ...]) } DETERMINED BY { column-name | (column-name [, ...]) }

as-result-table
    [(column-name [, ...])] AS (full-select) WITH NO DATA
Notes:
  • 1 The HADOOP keyword is required unless you enable the SYSHADOOP.COMPATIBILITY_MODE global variable and use specific Hadoop table clauses. For more information about this requirement see Hadoop keyword usage note.
  • 2 You do not need to specify the element-list during a CREATE TABLE AS operation because the column details can be inferred from the query. You also do not need to specify the element-list when you create a table by using a SerDe that dynamically determines the column list from an external data source, such as an Avro schema.
  • 3 For compatibility with Hive, when you use the LIKE clause, you can specify only the LOCATION, TBLPROPERTIES, and HINTS clauses.
  • 4 Each clause can only be used once.
  • 5 The Big SQL database manager natively supports the data types that are described in Data types. However, there are some data types that are syntactically supported for compatibility with Hive, but are physically treated as different data types within the Big SQL runtime. For more information see the mapping of data types that are used in the CREATE TABLE statement.
  • 6 If you enable COMPATIBILITY_MODE, CHAR acts as a synonym for VARCHAR when you create the table. Then, the system catalogs show the VARCHAR data type.
  • 7 Because the STRING data type has no specific size associated with it, the SQL processor might assume that it needs to manipulate 32K of data in a column all the time. Be aware that performance can be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: issue the SET HADOOP PROPERTY command or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file. A sketch of the session-level approach follows these notes.
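For example, a minimal sketch (the 4096 size and the table definition are illustrative) of setting the property for the current session:

SET HADOOP PROPERTY 'bigsql.string.size'='4096';

-- With the property in effect, the STRING column maps to VARCHAR(4096)
-- instead of the default VARCHAR(32672).
CREATE HADOOP TABLE messages (id INT, body STRING);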

Description

EXTERNAL

The EXTERNAL parameter indicates that the data in the table is not managed by the database manager. When you DROP the table, the definition of the table is removed from the database manager and Hive catalogs, but the data remains unaffected. This is typically used in conjunction with the LOCATION keyword to define a table with pre-existing data.

HADOOP
The HADOOP keyword is required to define a Hadoop table unless you enable the SYSHADOOP.COMPATIBILITY_MODE global variable. You can verify that you have defined a HADOOP table by querying the PROPERTY column in SYSCAT.TABLES.
SELECT SUBSTRING(PROPERTY,22,1) FROM SYSCAT.TABLES WHERE TABNAME='<table-name>';
table-name
Specifies the name for the Big SQL table that is to be created.

If the table name exists, this statement fails. However, if the table name exists and IF NOT EXISTS is specified, no error message is issued.

The table name in the table definition must be a valid identifier.

Use the RENAME statement to rename a Hadoop table. When you rename a Hadoop table, both the name in the local database catalogs and the Hive catalogs are altered. When you rename a managed Hadoop table, which is a table that is created without the EXTERNAL keyword, the distributed file system storage directory is also changed.
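For example, a minimal sketch (names are illustrative):

-- Renames the table in both the local database catalogs and the Hive
-- catalogs; for a managed table, the DFS storage directory is also renamed.
RENAME TABLE myschema.sales_2019 TO sales_archive;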

column-definition
The name, type, constraints, and comments of the column.
Note: the column name must be an identifier that contains only ASCII characters.
column-name
You must specify the list of columns that are to be created as non-partitioned columns in the table. Each column is specified as a column name and data type. All valid Big SQL data types are allowed.

The column name in the table definition must be a valid identifier.

data-type
Specifies the data type of the column.
primitive-type
The built-in data type. Use one of the following types:
TINYINT
A 1-byte integer data type, with the range of -128 to 127.

Hadoop automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.

SMALLINT
A small integer is a 2-byte integer with a precision of 5 digits. The range of small integers is -32,768 to 32,767.
INTEGER or INT
A 4-byte integer with a precision of 10 digits. The range of large integers is -2,147,483,648 to +2,147,483,647.
BIGINT
An 8-byte integer data type with a precision of 19 digits. The range is -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
FLOAT(integer)
A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38.

A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.

The REAL data type is a single-precision floating-point number. The DOUBLE data type is a double-precision floating-point number.

DECIMAL (precision-integer, scale-integer) or DEC(precision-integer, scale-integer)
A decimal number. The first integer is the precision of the number; that is, the total number of digits; it can range from 1 to 31. The second integer is the scale of the number; that is, the number of digits to the right of the decimal point; it can range from 0 to the precision of the number.

The default value is precision 10 and scale 0. The words NUMERIC and NUM can be used as synonyms for DECIMAL and DEC.

CHARACTER or CHAR
A character string is a sequence of characters. The length of the string is the number of characters in the sequence. If the length is zero, the value is called the empty string, which is not the NULL value.
STRING
The STRING data type has no specific size associated with it. By default, STRING is mapped to VARCHAR(32672).
VARCHAR
A VARCHAR value can be up to 32 672 bytes long. You can also use CHARACTER VARYING or CHAR VARYING as synonyms for VARCHAR.
TIMESTAMP
A TIMESTAMP is a six- or seven-part value (year, month, day, hour, minute, second, and optional fractional seconds) that designates a date and time.

The integer must be between 0 and 9 and specifies the precision of fractional seconds from 0 (seconds) to 9 (nanoseconds).
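For example, a minimal sketch (names are illustrative) that contrasts the precisions:

CREATE HADOOP TABLE ts_demo (
  ts_sec  TIMESTAMP(0),  -- whole seconds only
  ts_dflt TIMESTAMP      -- no precision specified; defaults to TIMESTAMP(9)
);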

DATE
A DATE is a three-part value (year, month, and day) that designates a date. The range of the year part is 0001 to 9999. The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
DATE STORED AS TIMESTAMP
Indicates that the DATE data type is mapped and stored as a Hive TIMESTAMP data type. This is the default.
DATE STORED AS DATE
Indicates that the DATE data type is mapped and stored as a Hive DATE data type. For optimal performance when using this type, be sure to use the ORC file format, because the Java™ I/O interface is used to access the DATE data type (see Big SQL readers and writers).
BINARY or VARBINARY
A binary value is a variable length sequence of bytes. The length of the value is the number of bytes that it contains.

A varying length binary value is specified as VARBINARY(length) or BINARY(length) and can be up to 32,672 bytes long (VARBINARY is an alias for BINARY, and they are functionally identical in every way). If no length is provided, the length is assumed to be 32,672.

However, it is recommended that an explicit length be specified. The Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.

With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.

array-type, row-type

You can define at most 32,763 ARRAY or ROW types for the Big SQL database (SQLSTATE 54035). Each instance of ARRAY or ROW consumes one of these entries; for example, a column of an ARRAY of ROW types consumes two entries. To free entries after you reach the limit, you must drop a HADOOP table that contains a column that is defined as a ROW or ARRAY type.

Each ARRAY or ROW type can be declared in one of two ways: the SQL standard form or the Hive form. The SQL standard form, which is shown in the syntax, is the recommended usage. When you include the length of the data type (the number of elements) in the table declaration, the database manager can compute the maximum size of the column, which allows the column to be optimized.

With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.

For information about the usage restrictions, see Array values and Row values.

ARRAY

An ARRAY is an indexable list. The elements in the array must be of the same type. You can access the elements in the array by using the following notation: ...WHERE C1[2]=1..., assuming that column C1 is declared as an ARRAY with two or more elements.

primitive-type or row-type
Specifies the data type of the elements in the array.
primitive-type
Specifies a built-in, or primitive, data type. Built-in types include the data types that are described in primitive-type.
ROW
Specifies the ROW data type. An ARRAY of ROW types is valid.
ROW

A ROW object contains a value for each attribute of the SQL structured type that it represents (java.sql.Struct). A ROW is equivalent to the STRUCT type in Hive tables. The ROW type contains field definitions that contain the field name and the data type.

For each field in a ROW definition, an entry is created in the SYSIBM.SYSATTRIBUTES table.

field-name
Specifies the name of a field within the ROW type. The name cannot be the same as any other field of this ROW type (SQLSTATE 42711).
data-type
Specifies the data type of the field.
primitive-type
Specifies a built-in, or primitive, data type. Built-in types include the data types that are described in primitive-type.
column-options
You can define column options that include comments, nullability, or constraints on the column.
COMMENT column-comment

Each column can contain a COMMENT to provide more information about the column.

NULL or NOT NULL

Each column can specify NOT NULL. The NOT NULL parameter signifies to the Big SQL query optimizer that the column will not contain NULL values. The HADOOP table is external to any local database server, so Big SQL enforces nullability at query execution time (SQL5104N).

NULL and NOT NULL options are used in the following ways:
  • The nullability of a column is used during certain query re-write optimizations.
  • The nullability of a column is enforced during the INSERT or LOAD USING statement. Do not assign a NULL value to a column defined as NOT NULL (SQLSTATE 23502).
CONSTRAINT constraint-name

You can optionally identify each constraint with a name. For all constraints, the default is NOT ENFORCED.

The following constraints can be set on the column:
PRIMARY KEY

Provides a shorthand method of defining a primary key. A primary key must be unique, and must have the NOT NULL attribute. A table cannot have more than one primary key.

UNIQUE

Provides a shorthand method of defining a unique key. This rule forbids duplicate values in one or more columns within a table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.

REFERENCES table-name

If you do not specify column names, the primary key of the referenced table is the assumed set of referenced columns.

By using this parameter, you can define required relationships between tables. The purpose of this parameter is to guarantee that table relationships are maintained and data entry rules are followed. The column on which this parameter is used becomes the FOREIGN KEY to the parent table that is referenced by the REFERENCES keyword.

CHECK check-condition

This parameter sets restrictions on data added to a specific table. It is enforced during INSERT operations, but is not enforced during a LOAD HADOOP, or on data that is made available outside of the database manager, such as by using Hive operations. For example, a CHECK constraint can ensure that a column that contains interest rates never contains a value of zero.

The search-condition must be true or unknown for every row of the table.
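For example, a minimal sketch (names are illustrative) of a column-level check constraint:

CREATE HADOOP TABLE loans (
  loan_id INT NOT NULL,
  rate DECIMAL(5,2) NOT NULL CHECK (rate > 0)
);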
search-condition
The search-condition has the following restrictions:
  • A column reference must be to a column of the table being created.
  • The search-condition cannot contain a TYPE predicate.
  • The search-condition cannot contain any of the following (SQLSTATE 42621):
    • Subqueries
    • CAST specifications with a SCOPE clause
    • Column functions
    • Functions that are not deterministic
    • Functions defined to have an external action
    • User-defined functions defined with either MODIFIES SQL DATA or READS SQL DATA
    • OLAP specifications
    • Global variables
    • References to generated columns other than the identity column
    • An error tolerant nested-table-expression

Check constraints with search-condition are enforced when rows in the table are inserted or updated. A check constraint defined on a table automatically applies to all subtables of that table.

functional-dependency
Defines a functional dependency between columns.

column-name DETERMINED BY column-name or (column-name,...) DETERMINED BY (column-name,...)

The parent set of columns contains the identified columns that immediately precede the DETERMINED BY clause. The child set of columns contains the identified columns that immediately follow the DETERMINED BY clause. All of the restrictions on the search-condition apply to parent set and child set columns, and only simple column references are allowed in the set of columns (SQLSTATE 42621). The same column must not be identified more than once in the functional dependency (SQLSTATE 42709). No column in the child set of columns can be a nullable column (SQLSTATE 42621).

A functional dependency is not enforced by the database manager during normal operations such as insert, update, or delete. The functional dependency might be used during query rewrite to optimize queries. Incorrect results might be returned if the integrity of a functional dependency is not maintained.
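For example, a minimal sketch (names are illustrative); the constraint asserts that city_id determines city_name, and it is not enforced:

CREATE HADOOP TABLE cities (
  city_id INT NOT NULL,
  city_name VARCHAR(40) NOT NULL,
  CONSTRAINT fd_city CHECK (city_name DETERMINED BY city_id) NOT ENFORCED
);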

constraint-attributes
These constraints can improve query performance. The default for Hadoop tables is NOT ENFORCED.

For primary key constraints, unique constraints, and check constraints, you can specify that the constraint can be trusted. For referential integrity constraints, if the constraint is not enforced, you can further specify whether the constraint can be trusted. A not-enforced and not-trusted constraint is also known as a statistical referential integrity constraint. You can specify whether a referential integrity constraint or check constraint is to be used for query optimization.

NOT ENFORCED or ENFORCED

Specifies whether the constraint is enforced by the database manager during normal operations such as insert, update, or delete.

NOT ENFORCED
This parameter is the default for Hadoop tables. The constraint is not enforced by the database manager. A primary key constraint or unique constraint cannot be NOT ENFORCED if there is a dependent ENFORCED referential constraint.
TRUSTED

The data can be trusted to conform to the constraint. TRUSTED must be used only if the data in the table is independently known to conform to the constraint. Query results might be unpredictable if the data does not actually conform to the constraint. This is the default option.

NOT TRUSTED

The data cannot be trusted to conform to the constraint. NOT TRUSTED is intended for cases where the data conforms to the constraint for most rows, but it is not independently known that all the rows or future additions will conform to the constraint. If a constraint is NOT TRUSTED and enabled for query optimization, then it will not be used to perform optimizations that depend on the data conforming completely to the constraint. NOT TRUSTED can be specified only for referential integrity constraints (SQLSTATE 42613).

ENFORCED

The ENFORCED keyword is syntactically available for compatibility with the syntax for local tables (created with CREATE TABLE). However the presence of ENFORCED results in an error (SQLSTATE 42858) that indicates the operation is not supported on Hadoop tables.

ENABLE QUERY OPTIMIZATION or DISABLE QUERY OPTIMIZATION
Specifies whether the constraint or functional dependency can be used for query optimization under appropriate circumstances. The default is ENABLE QUERY OPTIMIZATION.
ENABLE QUERY OPTIMIZATION

The constraint is assumed to be true and can be used for query optimization.

DISABLE QUERY OPTIMIZATION

The constraint cannot be used for query optimization. DISABLE QUERY OPTIMIZATION cannot be specified for primary key and unique constraints (SQLSTATE 42613).
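For example, a minimal sketch (names are illustrative) of informational constraints on a pair of tables:

CREATE HADOOP TABLE dept (
  deptno INT NOT NULL,
  deptname VARCHAR(50),
  CONSTRAINT pk_dept PRIMARY KEY (deptno) NOT ENFORCED
);

CREATE HADOOP TABLE emp (
  empno INT NOT NULL,
  deptno INT NOT NULL,
  CONSTRAINT fk_dept FOREIGN KEY (deptno) REFERENCES dept (deptno)
    NOT ENFORCED NOT TRUSTED ENABLE QUERY OPTIMIZATION
);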

unique-constraint

Provides a shorthand method of defining a unique key. This rule forbids duplicate values in one or more columns within a table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.

referential-constraint
Defines a referential constraint.
CONSTRAINT constraint-name
Names the referential constraint.
FOREIGN KEY (column-name, ...)

Defines a referential constraint with the specified constraint-name.

The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the object table and the same column must not be identified more than once. There must be the same number of foreign key columns as there are in the parent key and the data types of the corresponding columns must be compatible (SQLSTATE 42830).

check-constraint
Defines a check constraint. A check-constraint is a search-condition that must evaluate to not false or a functional dependency that is defined between columns.
constraint-name

Names the check constraint.

check-condition

Defines a check constraint. The search-condition must be true or unknown for every row of the table.

LIKE table-name
Specifies that the columns of the table have exactly the same name and description as the columns of the identified table (table-name), view (view-name) or nickname (nickname). The name specified after LIKE must identify a table, view or nickname that exists in the catalog, or a declared temporary table.

You cannot specify a typed table or typed view (SQLSTATE 428EC). The Hadoop table referenced in table-name cannot contain an ARRAY or ROW column (SQLSTATE 428H2).

The table name can refer to another Hadoop table only. Creating a Hadoop table LIKE a local table results in an error (SQLSTATE 42858).
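For example, a minimal sketch (names and path are illustrative):

-- Copies the column names and types of an existing Hadoop table;
-- only LOCATION, TBLPROPERTIES, and HINTS can accompany LIKE.
CREATE EXTERNAL HADOOP TABLE sales_copy LIKE sales
  LOCATION '/user/bigsql/sales_copy';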

COMMENT string-constant

A table comment as part of a table definition can be used to provide information about the table.

PARTITIONED BY partitioning-element-list
column-definition
Specifies one or more columns by which data that is inserted into the table is partitioned. The action of including a PARTITIONED BY clause creates a directory in the distributed file system (DFS) that contains data for each unique value in the column that is being partitioned. At query time, Big SQL tries to scan only those directories, or partitions, that contain data that match the search criteria of the query. The following data types cannot be used as partitioning columns:
  • ARRAY
  • BINARY
  • BOOLEAN
  • ROW
  • TIMESTAMP
  • VARBINARY
partition-expression
Specifies one or more partitioning expressions by which data that is inserted into the table is partitioned. The action of including a PARTITIONED BY clause creates a directory in the distributed file system (DFS) that contains data for each unique value in the partition expression. At query time, Big SQL tries to scan only those directories, or partitions, that contain data that match the search criteria of the query. The expression can reference one or more columns in the columns list. The AS clause specifies a name for the column that is derived from the expression. The data type of this identifier matches the result type of the expression. If a query references the column that is derived from the expression, a new predicate is automatically injected into the query so that Big SQL scans only the qualifying partitions. The data type of the result of the expression cannot be one of the following types:
  • ARRAY
  • BINARY
  • BOOLEAN
  • ROW
  • TIMESTAMP
  • VARBINARY
The partitioning expression cannot contain any of the following items (SQLSTATE 42621):
  • Subqueries
  • Column functions
  • Dereference operations or DEREF functions
  • User-defined or built-in functions that are non-deterministic
  • User-defined functions that use the EXTERNAL ACTION option
  • User-defined functions that are not defined with NO SQL
  • Host variables or parameter markers
  • Special registers or built-in functions that depend on the value of a special register
  • Global variables
  • An expression on a partitioning expression column
Note: A LOAD HADOOP operation does not automatically generate column values when a table is partitioned by a partition-expression. When you use the LOAD HADOOP statement on such a table, ensure that the column values are generated in advance and stored in a file. The INSERT statement does automatically generate column values for the partition-expression. Therefore, if values for the new column must be generated automatically, use the INSERT statement instead of the LOAD HADOOP statement. If all column values are generated in advance, use the LOAD HADOOP statement.
ROW FORMAT delimited-row-format
Specifies the format of data rows. If ROW FORMAT SERDE is not specified, default values are used for any ROW FORMAT DELIMITED options that are not explicitly specified.
DELIMITED
Specifies a delimiter at the table level. You can make more than one choice from the delimiter items.
The variables that you provide with the FIELDS TERMINATED BY, ESCAPED BY, and LINES TERMINATED BY parameters can be specified as the following characters:
  • A literal single byte character, such as '|' (pipe) or ',' (comma).
  • \NNN, where NNN is the octal value of a character. For example, \001 is the ASCII character CTRL-A.
  • \xYY, where YY is a two-digit hexadecimal value of a character. This representation is retained for compatibility with Big SQL v1; however, the literal character or octal representations are recommended.
FIELDS TERMINATED BY
Specifies a delimiter for structured fields. You can include any single character, but the default is '\001', which is the CTRL-A character.

The field terminator can exist within your data, if you specify the ESCAPED BY clause and those characters are properly escaped.

COLLECTION ITEMS TERMINATED BY
Specifies a delimiter for the items within a collection (ARRAY or ROW) column.
LINES TERMINATED BY
Specifies a delimiter for lines (rows). You can include a newline character only ('\n'). Newlines must not exist within the data itself, and they cannot be escaped.
NULL DEFINED AS
Defines the value to represent the NULL value. The default value for NULL is \N, that is, a literal backslash character followed by an uppercase N.
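For example, a minimal sketch (names and values are illustrative) of a delimited table that represents NULL as the literal NA:

CREATE HADOOP TABLE nd_demo (
  c1 INT, c2 VARCHAR(10))
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    NULL DEFINED AS 'NA'
  STORED AS TEXTFILE;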
ROW FORMAT serde-row-format
Specifies the format of data rows as a SerDe class. You specify the name of a Java class in the Big SQL server CLASSPATH that implements the Hive Serde interface org.apache.hadoop.hive.serde2.SerDe.
WITH SERDEPROPERTIES

Use the WITH SERDEPROPERTIES parameter to associate SerDe properties with the SerDe class. For example, if you want to eventually load HBase data into a Hive table, create the table by using the WITH SERDEPROPERTIES and the hbase.columns.mapping parameter.

To use a SerDe JAR file, complete the following steps to copy the JAR file to the appropriate libraries:
  1. Stop the Big SQL server.
  2. Copy the JAR file to $BIGSQL_HOME/userlib and to $HIVE_HOME/lib.
  3. Restart the Big SQL server.
STORED AS file-format
Specifies the type of file in which data is to be stored. Big SQL can read most file storage formats that are available, either as explicitly defined in the syntax, or with installed interfaces. For more information about the supported file formats, see File formats supported by Big SQL.
You can also specify how the data is stored as Java input and output classes.
INPUTFORMAT
Specify the name of a Java class, such as hadoop.input.format.class.name, in the Big SQL server CLASSPATH that implements the Hadoop input format interface, such as org.apache.hadoop.mapred.InputFormat.
Note: In Hadoop, a record is defined by some input format in a data source. An input format refers to a piece of code that specifies how the files are to be read. When a Big SQL table is created, the server chooses an appropriate input format that is based on how the table is created. An input format can also be explicitly specified by naming a Java class that contains an implementation of the InputFormat interface of Hadoop. The Java class can turn a record into a row, which can determine how the data within a record corresponds to the columns in the definition of a table.
OUTPUTFORMAT
Specify the name of a Java class, such as hadoop.output.format.class.name, in the Big SQL server CLASSPATH that implements the Hadoop output format interface, such as org.apache.hadoop.mapred.OutputFormat.
STORED BY storage-handler

Specifies the name of a Java class in the Big SQL server CLASSPATH that implements the storage handler interface. For example, you can use org.apache.hadoop.hive.ql.metadata.HiveStorageHandler for the Hive interface. Use org.apache.hadoop.hive.hbase.HBaseStorageHandler for the HBase interface.

LOCATION string-constant

Indicates that a specific distributed file system directory must be used to store data files.

Access to the path specified in the LOCATION clause is controlled by the HDFS file permissions. Access can also be granted. If the files are not readable by the user, then the CREATE statement fails. You must have one of the following authorizations to use the LOCATION clause:
  • DATAACCESS authority.
  • Privileges on the files and the directory in HDFS (and not Ranger):
    • READ and WRITE privileges on the directory and all files in the directory, including files in any subdirectories that might be recursively specified in the LOCATION clause.
    • EXECUTE privileges on all the directories and subdirectories in the specified LOCATION clause.
If impersonation is not enabled, the bigsql user needs the same permissions that the Hive user needs for this operation. For example, the bigsql user must have READ, WRITE, and EXECUTE privileges on the directory specified in the LOCATION clause, and READ and EXECUTE privileges on the parent directory.

Big SQL provides configuration parameters to enforce the requirement that the LOCATION clause must be used with the EXTERNAL clause. The action of these configuration parameters ensures that the DROP statement does not accidentally drop data that is outside of the hive/warehouse directory.

For more information, see Using S3 Object Storage with Big SQL tables.

TBLPROPERTIES string-constant=string-constant
Defines job properties that can configure input or output formats, and that you can use to specify more detailed information about the table. The properties are passed as is to the underlying InputFormat and OutputFormat classes. Properties that are not implemented by these classes are ignored. Big SQL does not validate the properties.
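For example, a minimal sketch; the skip.header.line.count property is a Hive table property and is shown for illustration only:

CREATE HADOOP TABLE events (
  id INT, payload VARCHAR(200))
  STORED AS TEXTFILE
  TBLPROPERTIES ('skip.header.line.count'='1');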
CACHED

Enables caching on the entire created table. When you add data to the table, or when you create partitions, they are automatically set to be cached. By using the CACHED IN clause, Big SQL adds a directive for the table location to be cached in the pool-name that you specify.

If you create a partitioned table, directives to cache the individual partitions are added by the INSERT or LOAD statements when they create a partition. The cached property of the table location is inherited.

If you create partitions outside of the INSERT or LOAD statements, these partitions are not automatically cached. You can use the hdfs cacheadmin command to create cache directives for those partitions.
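For example, a minimal sketch, assuming that an HDFS cache pool named pool1 already exists (for example, one created with hdfs cacheadmin -addPool pool1):

CREATE HADOOP TABLE hot_lookup (
  k INT, v VARCHAR(50))
  CACHED IN pool1;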

(column-name) AS (full-select) WITH NO DATA

Creates a table that is based on a SELECT statement. The fullselect cannot reference a Hadoop table that contains an ARRAY or ROW column (SQLSTATE 428H2).

You can optionally rename the columns in the target table.

By using the WITH NO DATA clause, you can define the target table without populating the table.

Usage notes

  • The maximum number of Hadoop tables that can be created in Big SQL is 65536.
  • By default, the HADOOP keyword is a required element of the CREATE TABLE statement. However, if you enable the SYSHADOOP.COMPATIBILITY_MODE global variable, and the HADOOP keyword is not included on the CREATE TABLE statement, the type of table that is created is determined by looking for specific Hadoop table clauses. The following clauses are specific Hadoop table clauses:
    • EXTERNAL
    • LOCATION
    • COMMENT
    • PARTITIONED BY
    • ROW FORMAT
    • STORED AS
    • STORED BY
    • TBLPROPERTIES
    • WITH HINTS
    • STRING
    • BINARY
    If the HADOOP keyword is not used, and SYSHADOOP.COMPATIBILITY_MODE is not enabled, and the query is ambiguous, then the source table determines the table type.
  • By default, a result of the CREATE HADOOP TABLE statement is that GRANT statements are applied for a connected user in Hive. If Hive impersonation is disabled, you must do the following steps to change the hive.security.metastore.authorization.manager property in the hive-site.xml file:
    1. In the Ambari dashboard, click the Hive service and then open the Configs tab.
    2. Click the Advanced tab.
    3. Expand the Advanced hive-site property.
    4. Scroll to find the hive.security.metastore.authorization.manager property.
    5. Type the following value into the field:
      org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
    6. Click Save at the top of the Configs page. When the Save Configuration window opens, type a description of your change, and click Save, and then click OK at the confirmation window.
    7. Click Restart > Restart All Affected to restart the Hive service and the related components. Then click Confirm Restart All.
  • If the current schema, which might have been created with either the USE or SET SCHEMA command, does not exist in the Hive metastore, an attempt is made to automatically create the schema in Hive.
  • All referential integrity constraints and nullability indicators are advisory and are not enforced at data load time.
  • When you use data types such as STRING and BINARY, you can cause the SQL processor to assume that it needs to manipulate 32K of data in a column all the time. This result can impact performance. Instead, use data types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways:
    • Run the SET HADOOP PROPERTY command.
    • Set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
  • If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(9), which is compatible with a maximum precision of a TIMESTAMP in Big SQL 1.0 or Hive.
  • If you create a table in a schema that does not exist, the schema is automatically created with the default name. The default name is your User ID.
  • The columns and data types for an Avro table are fixed at the time that you run the CREATE HADOOP TABLE statement. If you create the table with an EXTERNAL schema by using the avro.schema.url configuration property, and then change the schema, the change is not immediately recognized. This change can result in query execution errors.
    To ensure that your changes are recognized, you must drop the local table from the Big SQL catalogs, and then re-import the table from the Hive catalogs. Use the following commands to drop and re-import the table:
    
    SET SYSHADOOP.CATALOG_SYNC_MODE=0;
    DROP TABLE avro_table;
    SET SYSHADOOP.CATALOG_SYNC_MODE=1;  
    CALL HCAT_SYNC_OBJECTS('schema1', 'avro_table');
  • The following table shows the mapping of the data type that is provided in this CREATE HADOOP TABLE syntax to the data type that is defined in the Hive catalogs, and the native SQL type as which Big SQL treats values at run time:
    Table 1. Data type mapping
    Data type in CREATE TABLE | Hive data type  | SQL data type
    TINYINT                   | TINYINT         | SMALLINT
    SMALLINT                  | SMALLINT        | SMALLINT
    INT                       | INT             | INT
    BIGINT                    | BIGINT          | BIGINT
    DECIMAL                   | DECIMAL         | DECIMAL
    REAL                      | FLOAT           | REAL
    FLOAT                     | DOUBLE          | FLOAT
    DOUBLE                    | DOUBLE          | DOUBLE
    CHAR                      | CHAR            | CHAR
    VARCHAR                   | VARCHAR         | VARCHAR
    STRING                    | STRING          | VARCHAR(max)
    TIMESTAMP                 | TIMESTAMP       | TIMESTAMP(9)
    TIMESTAMP(n)              | TIMESTAMP       | TIMESTAMP(n)
    DATE                      | TIMESTAMP/DATE* | DATE
    BOOLEAN                   | BOOLEAN         | SMALLINT
    Note: *When you use DATE STORED AS DATE, the Hive representation is DATE.
    Data type in CREATE TABLE
    This column is the data type that you use in the CREATE HADOOP TABLE table definition statement.
    Hive data type
    This column is the data type by which the column is defined in the Hive metastore. It specifies the format by which values for the column are physically stored in the underlying data files for the table. For example, a column that is created as DATE is defined within the Hive catalogs as a TIMESTAMP type. Therefore, values in the underlying data files for the table must conform to the Hive (or SerDe) rules for a TIMESTAMP.
    SQL data type
    This column is the actual native Big SQL data type for the column. When a column is referenced within SQL, such as in a SELECT or INSERT statement, it is treated as this type.
  • The following syntax alternatives are supported for compatibility with previous versions of this product and with other database products (a combined sketch follows this list):
    • You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type. For example, you can specify ARRAY<VARCHAR(8)> instead of VARCHAR(8) ARRAY[ ].
    • You can use the MAP syntax that Hive uses to define an associative ARRAY data type. For example, you can specify MAP<VARCHAR(10), DOUBLE> instead of DOUBLE ARRAY [VARCHAR(10)].
    • You can use the STRUCT syntax that Hive uses to define a ROW data type. For example, you can specify STRUCT<F1:INTEGER, F2:DOUBLE> instead of ROW(F1 INTEGER, F2 DOUBLE).
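    A combined sketch (names are illustrative) that declares one column in each Hive-style form:

    CREATE HADOOP TABLE syn_demo (
      a1 ARRAY<VARCHAR(8)>,             -- same as: VARCHAR(8) ARRAY[]
      m1 MAP<VARCHAR(10), DOUBLE>,      -- same as: DOUBLE ARRAY[VARCHAR(10)]
      r1 STRUCT<F1:INTEGER, F2:DOUBLE>  -- same as: ROW(F1 INTEGER, F2 DOUBLE)
    );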

Restrictions

The ENFORCED constraint attribute cannot be applied to Big SQL Hadoop tables and results in an error (SQLSTATE 42858).

The table name must not already exist in the same schema with the same name in the Hive catalogs, regardless of case differences. For example, the second table definition in the following query is not valid:

CREATE HADOOP TABLE "Foo" (c1 INT); 
CREATE HADOOP TABLE "FOO" (c1 INT);

If you use the LIKE clause and you are defining a local database table, but you enabled the SYSHADOOP.COMPATIBILITY_MODE global variable, you cannot include the LOCATION, TBLPROPERTIES, or HINTS clause.

The following restrictions exist for the ARRAY or ROW data types:
  1. Do not use a ROW or ARRAY data type in a constraint or key definition (SQLSTATE 42962).
  2. Do not reference a table that contains a column of type ARRAY or ROW in the AS or LIKE clause (SQLSTATE 428H2).
  3. There is a limit of 32,763 ARRAY or ROW definitions per database (SQLSTATE 54035). This limit also includes any user-defined types that are created with the CREATE TYPE statement.
  4. Comparisons or assignments between ARRAY or ROW variables, or the use of ARRAY or ROW data types as parameters to stored procedures, are not supported with ARRAY columns in a Hadoop table.
  5. Do not use an ARRAY data type as a type within an ARRAY type for Hadoop tables.
  6. The result type of a column in a full-select cannot be of type ARRAY.

The use of scrollable cursors is not supported for Hadoop tables.

Examples

  1. Simple create statement:
    Create a simple HADOOP table.
    CREATE HADOOP TABLE t1 (
       c1 INT, c2 VARCHAR(20), c3 DOUBLE, c4 REAL);
  2. TEXTFILE=delimited:
    The following statement shows an example table definition with a delimited text file:
    
    CREATE HADOOP TABLE t (
      i int, s VARCHAR(10))
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
    
  3. Create a table with a BINARY and a TEXT SEQUENCEFILE:
    
    CREATE HADOOP TABLE SEQ_TAB (
      C1 INT NOT NULL, C2 VARCHAR(20))
    STORED AS BINARY SEQUENCEFILE;
    
    CREATE HADOOP TABLE T (
      i int, s VARCHAR(10))
      STORED AS TEXT SEQUENCEFILE;
  4. Use the LazySimpleSerDe SERDE class to create the table:
    The CREATE TABLE in the following example uses the LazySimpleSerDe SERDE class to create the table:
    
    CREATE HADOOP TABLE T (
      i int, s VARCHAR(10))
      ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS SEQUENCEFILE;
    
    
  5. Create a table with RC files:
    The following statement is an example RC file CREATE:
    
    CREATE HADOOP TABLE T (
      i int, s VARCHAR(10))
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
      STORED AS RCFILE;
  6. Create a table with Parquet files:
    The following statement is an example CREATE for a Parquet file:
    
    CREATE HADOOP TABLE T (
      i int, s VARCHAR(10))
      STORED AS PARQUETFILE;
  7. Create a table with an Avro file:
    The following statement is an example CREATE for an Avro file:
    
    CREATE EXTERNAL TABLE bs_rev_profit_by_campaign (
      Revenue DOUBLE, GrossProfit DOUBLE,
      CampaignName VARCHAR(10)
      )
    COMMENT "A table backed by Avro data with the Avro schema
      embedded in the CREATE TABLE statement" 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'  
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION 
      '/user/biadmin/avro/sheet/'
    TBLPROPERTIES (
      'avro.schema.literal'='{"type":"record",
         "name": "TUPLE_3",
         "fields": 
          [ { "name": "Revenue","type": [ "null",
          "double" ],
          "doc": "autogenerated from Pig Field Schema"},
         {"name":"GrossProfit",
          "type": 
          [ "null", "double" ],
          "doc": "autogenerated from Pig Field Schema"}, 
         {"name": "CompaignName", 
          "type": [ "null", "varchar(10)" ],
          "doc":"autogenerated from Pig Field Schema"} 
         ]
       }
      ')
    ;
  8. Create a table with a partitioning column:
    
    CREATE HADOOP TABLE 
      orders(orderkey INT not null, 
      o_comment varchar(20) comment 'comment column' not null
      ) 
    PARTITIONED BY (orderdate varchar(10));
  9. Create a table with a partition that is based on a partitioning expression on a column:
    
    CREATE HADOOP TABLE 
      orders(orderkey INT not null, 
      o_comment varchar(20) comment 'comment column' not null,
      order_date date
      ) 
    PARTITIONED BY (MONTH(order_date) AS order_month);
  10. Creating a table AS another table (CTAS):
    You can use the CREATE HADOOP TABLE AS... statement to convert data to BINARY SEQUENCEFILE, which can improve performance. Assume that your original table has the following structure:
    
    CREATE HADOOP TABLE employee
     (empno INT, fname VARCHAR(100), 
    lname VARCHAR(100), dept VARCHAR(100))
    ROW FORMAT DELIMITED
     FIELDS TERMINATED BY ','
     LINES TERMINATED BY '\n';
          
    Then, you create a new table based on the original table, but store the data as BINARY SEQUENCEFILE.
    
    CREATE HADOOP TABLE employeeBINSEQ
      (empno, fname, lname, dept)
    STORED AS BINARY SEQUENCEFILE
    AS (SELECT * FROM employee);
  11. Creating a table with ARRAY and STRUCT data types:
    The following example creates a table with complex data types:
    CREATE HADOOP TABLE rf_del1
      (id VARCHAR(10), c1 STRUCT<f1:varchar(10), f2:varchar(10)>,
      c2 ARRAY<varchar(10)>)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '/'
      LINES TERMINATED BY '\n'
    ;
    
  12. Creating a Hadoop table that will be able to ingest HBase data:
    
    CREATE HADOOP TABLE staff_hbase ( 
      id SMALLINT, name VARCHAR(100), dept SMALLINT, 
      job VARCHAR(10), years SMALLINT, salary VARCHAR(10), comm VARCHAR(10)
    ) 
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      'hbase.columns.mapping' = ':key,info:name,info:dept,role:job,role:years,money:salary,money:comm');
    See the Loading HBase data into a Hive table example for the complete SQL with LOAD.
  13. Create a simple table with an ARRAY data type:
    This example creates a table with a column name COL1 of type ARRAY of integers. In this example, the maximum cardinality of the array is 10.
    CREATE HADOOP TABLE mytable (COL1 INT ARRAY[10]);
    You can select the value that is assigned for the second element in the array.
    SELECT COL1[2] FROM mytable;
    You can return all values based on a particular value in the COL1 ARRAY.
    SELECT * FROM mytable WHERE COL1[8]=12;
  14. Create a simple table with a STRUCT data type:
    This example creates a table with a column COL1 of type STRUCT. The STRUCT data type contains two elements. The first element, c1, is an INTEGER, and the second element, c2, is a DOUBLE.
    CREATE HADOOP TABLE mytable (COL1 STRUCT<c1:int, c2:double>);
    You can select the element, C1, from the COL1 STRUCT.
    SELECT COL1.C1 FROM mytable;
    You can return all values based on a particular value in the COL1 STRUCT type.
    SELECT * FROM mytable WHERE COL1.C1 = 5;
  15. Creating a table with a ROW data type:
    This example creates a table with two columns, an INTEGER column and a ROW column. The ROW column contains two fields, an INTEGER column, and a DOUBLE column.
    CREATE HADOOP TABLE mytable
              (id INT, col1 ROW(c1 int,c2 double));
    In the SELECT example, you select the field, c1, from the row type column, col1:
    SELECT col1.c1 FROM mytable;
  16. This example creates a table with a column name col1 of type MAP, an associative array, of integer type values which can be indexed by a VARCHAR(20) type key.
    
    CREATE HADOOP TABLE mytable
       (id INT, col1 INT ARRAY[VARCHAR(20)]);
    The following statement returns the element value from the MAP where the index value is 'MY_STRING':
    SELECT col1['MY_STRING'] FROM mytable;
  17. Creating a view from Hadoop tables created with an ARRAY, STRUCT, or MAP:
    
    CREATE HADOOP TABLE testview01( 
      id INT, arr TINYINT ARRAY[5]
    )
    ;
    
    CREATE VIEW view001(i,c1) AS 
      SELECT id, arr FROM testview01 
    WHERE arr[1] < 8
    ;
    
    
    SELECT c1[1] FROM view001
    ;
  18. The ARRAY or ROW column cannot be compared directly with another ARRAY or ROW column (SQLSTATE 42818). Only the elements in the ARRAY or ROW can be compared to another expression of a primitive type. Assume that a and b are ARRAY columns. This statement is not allowed:
    ...where a = b...
    This statement is allowed if the element type is a primitive type and not a ROW type:
    ...where a[1] = b[1]...
    Assume that c and d are ROW columns. This statement is allowed:
    where c.field1 = d.field1
  19. Create a table using the structure of another table, but using none of the data from the source table:
    
    CREATE HADOOP TABLE T1 (C1, C2) 
      AS (SELECT X1, X2 FROM T2) WITH NO DATA;
  20. CMX compression is supported in Big SQL. The following example uses CMX by setting the compression properties before issuing the CREATE TABLE AS... statement:
    
    SET HADOOP PROPERTY 'mapred.output.compress'='true';
    SET HADOOP PROPERTY 'mapred.output.compression.type'='BLOCK';
    SET HADOOP PROPERTY 'mapred.output.compression.codec'='com.ibm.biginsights.compress.CmxCodec';
    CREATE HADOOP TABLE t2 AS 
      SELECT * FROM t1;
  21. A date data type is stored as a date.
    CREATE HADOOP TABLE t5 (key INT,dt DATE STORED AS DATE, dt2 DATE);
    If you run a DESCRIBE command in your Hive shell, you see the storage of column dt is DATE because of the column option:
    
    hive>  describe formatted t5;
    col_name    data_type     comment
    key          int
    dt           date
    dt2          timestamp    /*@type=date*/