CREATE TABLE (HADOOP) statement
Use the CREATE HADOOP TABLE statement to create Hive tables for the Hadoop environment.
Invocation
This statement is an executable statement that cannot be dynamically prepared.
Syntax
- 1 The HADOOP keyword is required unless you enable the SYSHADOOP.COMPATIBILITY_MODE global variable and use specific Hadoop table clauses. For more information about this requirement, see the Hadoop keyword usage note.
- 2 You do not need to specify the element-list during a CREATE TABLE AS operation because the column details can be inferred from the query. You also do not need to specify the element-list when you create a table by using a SerDe that dynamically determines the column list from an external data source, such as an Avro schema.
- 3 For compatibility with Hive, when you use the LIKE clause, you can specify only the LOCATION, TBLPROPERTIES, and HINTS clauses.
- 4 Each clause can only be used once.
- 5 The Big SQL database manager natively supports the data types that are described in Data types. However, there are some data types that are syntactically supported for compatibility with Hive, but are physically treated as different data types within the Big SQL runtime. For more information see the mapping of data types that are used in the CREATE TABLE statement.
- 6 If you enable COMPATIBILITY_MODE, CHAR acts as a synonym for VARCHAR when you create the table. Then, the system catalogs show the VARCHAR data type.
- 7 Because the STRING data type has no specific size associated with it, the SQL processor might assume that it needs to manipulate 32K of data in a column all the time. Be aware that performance can be impacted. If performance is a concern, use types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways: issue the SET HADOOP PROPERTY command or set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
Description
- EXTERNAL
-
The EXTERNAL parameter indicates that the data in the table is not managed by the database manager. When you DROP the table, the definition of the table is removed from the database manager and Hive catalogs, but the data remains unaffected. This is typically used in conjunction with the LOCATION keyword to define a table with pre-existing data.
- HADOOP
- The HADOOP keyword is required to define a Hadoop table unless you enable the SYSHADOOP.COMPATIBILITY_MODE global variable. You can verify that you have defined a HADOOP table by querying the PROPERTY column in SYSCAT.TABLES.
SELECT SUBSTRING(PROPERTY,22,1) FROM SYSCAT.TABLES WHERE TABNAME='<table-name>';
- table-name
- Specifies the name for the Big SQL table that is to be created.
If the table name exists, this statement fails. However, if the table name exists and IF NOT EXISTS is specified, no error message is issued.
The table name in the table definition must be a valid identifier.
Use the RENAME statement to rename a Hadoop table. When you rename a Hadoop table, both the name in the local database catalogs and the Hive catalogs are altered. When you rename a managed Hadoop table, which is a table that is created without the EXTERNAL keyword, the distributed file system storage directory is also changed.
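As a sketch (the table names are illustrative), renaming a managed Hadoop table updates both catalogs and the storage directory:

```sql
-- Rename a Hadoop table; both the local database catalogs and the
-- Hive catalogs are updated. For a managed table (created without
-- EXTERNAL), the DFS storage directory is renamed as well.
-- (Table names here are illustrative.)
RENAME TABLE myschema.orders TO orders_archive;
```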
- column-definition
- The name, type, constraints, and comments of the column.
Note: column-definition must be an identifier containing only ASCII characters.
- column-name
- You must specify the list of columns that are to be created as non-partitioned columns in the
table. Each column is specified as a column name and data type. All valid Big SQL
data types are allowed.
The column name in the table definition must be a valid identifier.
- data-type
- Specifies the data type of the column.
- primitive-type
- The built-in data type. Use one of the following types:
- TINYINT
- A 1-byte integer data type, with the range of -128 to 127. Hadoop automatically converts the TINYINT to a larger integer type, such as SMALLINT, INT, or BIGINT, or a floating-point type such as FLOAT or DOUBLE.
- SMALLINT
- A small integer is a 2-byte integer with a precision of 5 digits. The range of small integers is -32,768 to 32,767
- INTEGER or INT
- A 4-byte integer with a precision of 10 digits. The range of large integers is -2,147,483,648 to +2,147,483,647.
- BIGINT
- An 8-byte integer data type with a precision of 19 digits. The range is -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
- FLOAT(integer)
- A single-precision floating-point number, which is a 32-bit approximation of a real number. The number can be zero or can range from -3.4028234663852886e+38 to -1.1754943508222875e-38, or from 1.1754943508222875e-38 to 3.4028234663852886e+38. A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308. The REAL data type is a single-precision floating-point number. The DOUBLE data type is a double-precision floating-point number.
- DECIMAL (precision-integer, scale-integer) or DEC(precision-integer, scale-integer)
- A decimal number. The first integer is the precision of the number; that is, the total number of digits; it can range from 1 to 31. The second integer is the scale of the number; that is, the number of digits to the right of the decimal point; it can range from 0 to the precision of the number. The default is precision 10 and scale 0. The words NUMERIC and NUM can be used as synonyms for DECIMAL and DEC.
- CHARACTER or CHAR
- A character string is a sequence of characters. The length of the string is the number of characters in the sequence. If the length is zero, the value is called the empty string, which is not the NULL value.
- STRING
- The STRING data type has no specific size associated with it. The default is a VARCHAR 32K.
- VARCHAR
- A VARCHAR value can be up to 32,672 bytes long. You can also use CHARACTER VARYING or CHAR VARYING.
- TIMESTAMP
- A TIMESTAMP is a six- or seven-part value (year, month, day, hour, minute, second, and optional fractional seconds) that designates a date and time. The optional integer in TIMESTAMP(integer) must be between 0 and 9 and specifies the precision of fractional seconds, from 0 (seconds) to 9 (nanoseconds).
- DATE
- A DATE is a three-part value (year, month, and day) that designates a date. The range of the year part is 0001 to 9999. The range of the month part is 1 to 12. The range of the day part is 1 to x, where x depends on the month.
- DATE STORED AS TIMESTAMP
- Indicates that the DATE data type is mapped and stored as a Hive TIMESTAMP data type. This is the default.
- DATE STORED AS DATE
- Indicates that the DATE data type is mapped and stored as a Hive DATE data type. For optimal performance when using this type, be sure to use the ORC file format, because the Java™ I/O interface is used to access the DATE data type (see Big SQL readers and writers).
- BINARY or VARBINARY
- A binary value is a variable-length sequence of bytes. The length of the value is the number of bytes that it contains. A varying-length binary value is specified as VARBINARY(length) or BINARY(length) and can be up to 32,672 bytes long (VARBINARY is an alias for BINARY, and they are functionally identical in every way). If no length is provided, the length is assumed to be 32,672.
However, it is recommended that an explicit length be specified. The Big SQL engine works most efficiently when the table definition enables a row to fit within 32 KB of memory. When the calculated row size exceeds 32 KB, some queries might see a performance degradation.
With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.
- array-type, row-type
-
You can define only 32,763 ARRAY or ROW types in the Big SQL database (SQLSTATE 54035). Each instance of ARRAY or ROW consumes one of these entries; therefore, a column of an ARRAY of ROW types consumes two entries. If you exceed the limit, you must drop a HADOOP table that contains a ROW or ARRAY defined as a column type before you can define more.
Each ARRAY or ROW type can be used in one of two ways: the SQL standard version or the Hive version. The SQL standard version, which is displayed in the syntax, is the recommended usage. When you declare the length of the data type (the maximum number of elements) as part of the table declaration, the database manager can compute the maximum size of such a column, which allows it to optimize the column.
With this data type, the Java I/O interface is used for all data access (both insert and select operations), and the ORC file format is recommended. For more information, see File formats supported by Big SQL.
For information about the usage restrictions, see Array values and Row values.
- ARRAY
-
An ARRAY is an indexable list. The elements in the array must all be of the same type. Assuming that column C1 is declared as an ARRAY with two or more elements, you can access the elements in the array by the following notation:
...WHERE C1[2]=1...
- primitive-type or row-type
- Specifies the data type of the field.
- primitive-type
- Specifies a built-in, or primitive data type. Built-in types include the data types that are described in primitive-types
- ROW
- Specifies the ROW data type. An ARRAY of ROW types is valid.
- ROW
-
A ROW object contains a value for each attribute of the SQL structured type that it represents (java.sql.Struct). A ROW is equivalent to the STRUCT type in Hive tables. The ROW type contains field definitions that contain the field name and the data type.
For each field in a ROW definition, an entry is created in the SYSIBM.SYSATTRIBUTES table.
- field-name
- Specifies the name of a field within the ROW type. The name cannot be the same as any other field of this ROW type (SQLSTATE 42711).
- data-type
- Specifies the data type of the field.
- primitive-type
- Specifies a built-in, or primitive data type. Built-in types include the data types that are described in primitive-types
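A minimal sketch that combines both types, using the SQL standard spelling; the table and column names are illustrative:

```sql
-- A column declared as an ARRAY of ROW values. Each ARRAY or ROW
-- instance consumes one of the per-database type entries, so this
-- column consumes two. (Table and column names are illustrative.)
CREATE HADOOP TABLE contacts (
  id     INT,
  phones ROW(kind VARCHAR(10), num VARCHAR(20)) ARRAY[5]
);
```

Declaring the maximum number of elements (ARRAY[5] here) lets the database manager compute the maximum size of the column.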
- column-options
- You can define column options that include comments, nullability, or constraints on the column.
- COMMENT column-comment
-
Each column can contain a COMMENT to provide more information about the column.
- NULL or NOT NULL
-
Each column can specify NOT NULL. The NOT NULL parameter signifies to the Big SQL query optimizer that the column will not contain NULL values. Because the HADOOP table is external to any local database server, Big SQL enforces nullability at query execution time (SQL5104N).
NULL and NOT NULL options are used in the following ways:
- The nullability of a column is used during certain query rewrite optimizations.
- The nullability of a column is enforced during the INSERT or LOAD USING statement. Do not assign a NULL value to a column defined as NOT NULL (SQLSTATE 23502).
- CONSTRAINT constraint-name
-
You can optionally identify each constraint with a name. For all constraints, the default is NOT ENFORCED.
- constraint-attributes
- These constraints can improve query performance. The default for Hadoop tables is NOT ENFORCED.
For primary key constraints, unique constraints, and check constraints, you can specify that the constraint can be trusted. For referential integrity constraints, if the constraint is not enforced, you can further specify whether the constraint can be trusted. A not-enforced and not-trusted constraint is also known as a statistical referential integrity constraint. You can specify whether a referential integrity constraint or check constraint is to be used for query optimization.
- NOT ENFORCED or ENFORCED
-
Specifies whether the constraint is enforced by the database manager during normal operations such as insert, update, or delete.
- NOT ENFORCED
-
This parameter is the default for Hadoop tables. The constraint is not enforced by the database manager. A primary key constraint or unique constraint cannot be NOT ENFORCED if there is a dependent ENFORCED referential constraint.
- TRUSTED
-
The data can be trusted to conform to the constraint. TRUSTED must be used only if the data in the table is independently known to conform to the constraint. Query results might be unpredictable if the data does not actually conform to the constraint. This is the default option.
- NOT TRUSTED
-
The data cannot be trusted to conform to the constraint. NOT TRUSTED is intended for cases where the data conforms to the constraint for most rows, but it is not independently known that all the rows or future additions will conform to the constraint. If a constraint is NOT TRUSTED and enabled for query optimization, then it will not be used to perform optimizations that depend on the data conforming completely to the constraint. NOT TRUSTED can be specified only for referential integrity constraints (SQLSTATE 42613).
- ENFORCED
-
The ENFORCED keyword is syntactically available for compatibility with the syntax for local tables (created with CREATE TABLE). However the presence of ENFORCED results in an error (SQLSTATE 42858) that indicates the operation is not supported on Hadoop tables.
- ENABLE QUERY OPTIMIZATION or DISABLE QUERY OPTIMIZATION
-
Specifies whether the constraint or functional dependency can be used for query optimization under appropriate circumstances. The default is ENABLE QUERY OPTIMIZATION.
- ENABLE QUERY OPTIMIZATION
-
The constraint is assumed to be true and can be used for query optimization.
- DISABLE QUERY OPTIMIZATION
-
The constraint cannot be used for query optimization. DISABLE QUERY OPTIMIZATION cannot be specified for primary key and unique constraints (SQLSTATE 42613).
- unique-constraint
-
Provides a shorthand method of defining a unique key. This rule forbids duplicate values in one or more columns within a table. A table can have multiple unique keys. The columns that are specified in a unique constraint must be defined as NOT NULL.
- referential-constraint
- Defines a referential constraint.
- CONSTRAINT constraint-name
- Names the referential constraint.
- FOREIGN KEY (column-name, ...)
-
Defines a referential constraint with the specified constraint-name.
The foreign key of the referential constraint is composed of the identified columns. Each name in the list of column names must identify a column of the object table and the same column must not be identified more than once. There must be the same number of foreign key columns as there are in the parent key and the data types of the corresponding columns must be compatible (SQLSTATE 42830).
- check-constraint
-
Defines a check constraint. A check-constraint is a search-condition that must evaluate to not false or a functional dependency that is defined between columns.
- constr-name
-
Names the check constraint.
- check-condition
-
Defines a check constraint. The search-condition must be true or unknown for every row of the table.
- LIKE table-name
- Specifies that the columns of the table have exactly the same name and description as the columns of the identified table (table-name), view (view-name), or nickname (nickname). The name specified after LIKE must identify a table, view, or nickname that exists in the catalog, or a declared temporary table.
You cannot specify a typed table or typed view (SQLSTATE 428EC). The Hadoop table referenced in table-name cannot contain an ARRAY or ROW column (SQLSTATE 428H2).
The table name can refer only to another Hadoop table. Creating a Hadoop table LIKE a local table results in an error (SQLSTATE 42858).
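A sketch of the clause; the table names are illustrative:

```sql
-- Create a new Hadoop table whose columns have exactly the same
-- names and descriptions as an existing Hadoop table. No data is
-- copied. (Table names are illustrative.)
CREATE HADOOP TABLE sales_copy LIKE sales;
```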
- COMMENT string-constant
-
A table comment as part of a table definition can be used to provide information about the table.
- PARTITIONED BY partitioning-element-list
-
- column-definition
- Specifies one or more columns by which data that is inserted into the table is partitioned. The
action of including a PARTITIONED BY clause creates a directory in the distributed file system (DFS)
that contains data for each unique value in the column that is being partitioned. At query time, Big
SQL tries to scan only those directories, or partitions, that contain data that match the search
criteria of the query. The following data types cannot be used as partitioning columns:
- ARRAY
- BINARY
- BOOLEAN
- ROW
- TIMESTAMP
- VARBINARY
- partition-expression
- Specifies one or more partitioning expressions by which data that is inserted into the table is
partitioned. The action of including a PARTITIONED BY clause creates a directory in the distributed
file system (DFS) that contains data for each unique value in the partition expression. At query
time, Big SQL tries to scan only those directories, or partitions, that contain data that match the
search criteria of the query. The expression can reference one or more columns in the columns list.
The AS clause specifies a name for the column that is derived from the expression. The data type of
this identifier matches the result type of the expression. If a query references the column that is
derived from the expression, a new predicate is automatically injected into the query so that Big
SQL scans only the qualifying partitions. The data type of the result of the expression cannot be
one of the following types:
- ARRAY
- BINARY
- BOOLEAN
- ROW
- TIMESTAMP
- VARBINARY
The partitioning expression cannot contain any of the following items (SQLSTATE 42621):
- Subqueries
- Column functions
- Dereference operations or DEREF functions
- User-defined or built-in functions that are non-deterministic
- User-defined functions that use the EXTERNAL ACTION option
- User-defined functions that are not defined with NO SQL
- Host variables or parameter markers
- Special registers or built-in functions that depend on the value of a special register
- Global variables
- An expression on a partitioning expression column
Note: A LOAD HADOOP operation does not automatically generate column values when a table is partitioned by a partition-expression. When you use the LOAD HADOOP statement on a table that is partitioned by a partition-expression, ensure that the column values are generated in advance and stored in a file. The INSERT statement does automatically generate column values for the partition-expression. Therefore, if values for the new column must be automatically generated, use the INSERT statement instead of the LOAD HADOOP statement. If all column values are generated in advance, use the LOAD HADOOP statement.
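For instance, with a table partitioned by an expression such as MONTH(order_date) (as in the examples later in this topic), INSERT derives the partition value automatically; the names below are illustrative:

```sql
-- INSERT computes the derived partition column (order_month) from
-- the partitioning expression; LOAD HADOOP would not.
-- (Table and column names are illustrative.)
CREATE HADOOP TABLE orders_by_month (
  orderkey   INT,
  order_date DATE
) PARTITIONED BY (MONTH(order_date) AS order_month);

INSERT INTO orders_by_month (orderkey, order_date)
  VALUES (1, DATE '2021-03-15');  -- stored under partition order_month=3
```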
- ROW FORMAT delimited-row-format
- Specifies the format of data rows. If ROW FORMAT SERDE is not specified, any ROW FORMAT DELIMITED options that are not explicitly specified assume their default values.
- DELIMITED
- Specifies a delimiter at the table level. You can make more than one choice from the delimiter items. The variables that you provide with the FIELDS TERMINATED BY, ESCAPED BY, and LINES TERMINATED BY parameters can be specified as the following characters:
- A literal single byte character, such as '|' (pipe) or ',' (comma).
- \NNN, where NNN is the octal value of a character. For example, \001 is the ASCII character CTRL-A.
- \xYY, where YY is a two-digit hexadecimal value of a character. This representation is retained for compatibility with Big SQL v1, however the literal character or octal representations are recommended.
- FIELDS TERMINATED BY
- Specifies a delimiter for structured fields. You can include any single character, but the default is '\001', which is the CTRL-A character. The field terminator can exist within your data if you specify the ESCAPED BY clause and those characters are properly escaped.
- LINES TERMINATED BY
- Specifies a delimiter for rows. You can include a newline character only ('\n'). Newlines must not exist within the data itself, and they cannot be escaped.
- NULL DEFINED AS
- Defines the value that represents the NULL value. The default value for NULL is \N; that is, a literal backslash character followed by an uppercase N.
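Putting the delimiter options together, as a sketch (the table name and delimiters are illustrative):

```sql
-- Comma-delimited fields, backslash escaping so that commas can
-- appear inside field values, newline row terminators, and a
-- custom NULL marker. (Names and delimiters are illustrative.)
CREATE HADOOP TABLE log_lines (
  ts  VARCHAR(30),
  msg VARCHAR(200)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'NULL'
STORED AS TEXTFILE;
```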
- ROW FORMAT serde-row-format
-
Specifies the format of data rows as a SerDe class. You specify the name of a Java class in the Big SQL server CLASSPATH that implements the Hive SerDe interface org.apache.hadoop.hive.serde2.SerDe.
- WITH SERDEPROPERTIES
-
Use the WITH SERDEPROPERTIES parameter to associate SerDe properties with the SerDe class. For example, if you want to eventually load HBase data into a Hive table, create the table by using the WITH SERDEPROPERTIES and the hbase.columns.mapping parameter.
To use a SerDe JAR file, do the following steps to copy the JAR file to the appropriate libraries:
- Stop the Big SQL server.
- Copy the JAR file to $BIGSQL_HOME/userlib and to $HIVE_HOME/lib.
- Restart the Big SQL server.
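As a sketch of the WITH SERDEPROPERTIES clause for the HBase case described above (the column mapping value and table names are illustrative):

```sql
-- Map table columns to HBase column families through the standard
-- Hive hbase.columns.mapping property. (The table names and the
-- mapping value are illustrative.)
CREATE HADOOP TABLE hbase_backed (
  rowkey VARCHAR(20),
  cf_val VARCHAR(100)
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:val');
```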
- STORED AS file-format
- Specifies the type of file in which data is to be stored. Big SQL can read most file storage formats that are available, either as explicitly defined in the syntax, or with installed interfaces. For more information about the supported file formats, see File formats supported by Big SQL
- STORED BY storage-handler
-
Specifies the name of a Java class in the Big SQL server CLASSPATH that implements the storage handler interface. For example, you can use org.apache.hadoop.hive.ql.metadata.HiveStorageHandler for the Hive interface. Use org.apache.hadoop.hive.hbase.HBaseStorageHandler for the HBase interface.
- LOCATION string-constant
-
Indicates that a specific distributed file system directory must be used to store data files.
Access to the path specified in the LOCATION clause is controlled by the HDFS file permissions. Access can also be granted. If the files are not readable by the user, then the CREATE statement fails. You must have one of the following authorizations to use the LOCATION clause:- DATAACCESS authority.
- Privileges on the files and the directory in HDFS (and not Ranger):
- READ and WRITE privileges on the directory and all files in the directory, including files in any subdirectories that might be recursively specified in the LOCATION clause.
- EXECUTE privileges on all the directories and subdirectories in the specified LOCATION clause.
Big SQL provides configuration parameters to enforce the requirement that the LOCATION clause must be used with the EXTERNAL clause. These configuration parameters ensure that the DROP statement does not accidentally drop data that is outside of the hive/warehouse directory.
For more information, see Using S3 Object Storage with Big SQL tables.
- TBLPROPERTIES string-constant=string-constant
- Defines job properties that can configure input or output formats, and that you can use to specify more detailed information about the table. The properties are passed as-is to the underlying InputFormat and OutputFormat classes. Properties that are not implemented by these classes are ignored. Big SQL does not validate the properties.
- CACHED
-
Enables caching on the entire created table. When you add data to the table, or when you create partitions, they are automatically set to be cached. By using the cached clause, Big SQL adds a directive with the table location that is to be cached in the pool_name that you specify.
If you create a partitioned table, directives to cache the individual partitions are added by the INSERT or LOAD statements when they create a partition. The cached property of the table location is inherited.
If you create partitions outside of the INSERT or LOAD statements, these partitions are not automatically cached. You can use the hdfs cacheadmin command to create cache directives for those partitions.
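As a sketch, assuming the pool name is supplied with the clause (the pool name 'bigsql_pool' and the table are hypothetical):

```sql
-- Cache the table's data in an existing HDFS cache pool; partitions
-- created later by INSERT or LOAD inherit the cache directive.
-- (The pool name 'bigsql_pool' and the table are hypothetical.)
CREATE HADOOP TABLE hot_lookups (
  k INT,
  v VARCHAR(50)
) CACHED IN 'bigsql_pool';
```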
- (column-name) AS (full-select) WITH NO DATA
-
Creates a table that is based on a SELECT statement. The table name cannot refer to a Hadoop table that contains an ARRAY or ROW column (SQLSTATE 428H2).
You can optionally rename the columns in the target table.
By using the WITH NO DATA clause, you can define the target table without populating the table.
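A sketch of the clause, with renamed target columns; the names are illustrative:

```sql
-- Define a new table with the shape of the query result, renaming
-- the columns, but without populating it. (Names are illustrative.)
CREATE HADOOP TABLE recent_orders (orderkey, total)
  AS (SELECT orderkey, price * qty FROM orders) WITH NO DATA;
```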
Usage notes
- The maximum number of Hadoop tables that can be created in Big SQL is 65536.
- By default, the HADOOP keyword is a required element of the CREATE TABLE statement. However, if you enable the SYSHADOOP.COMPATIBILITY_MODE global variable, and the HADOOP keyword is not included on the CREATE TABLE statement, the type of table that is created is determined by looking for specific Hadoop table clauses. The following clauses are specific Hadoop table clauses:
- EXTERNAL
- LOCATION
- COMMENT
- PARTITIONED BY
- ROW FORMAT
- STORED AS
- STORED BY
- TBLPROPERTIES
- WITH HINTS
- STRING
- BINARY
- By default, a result of the CREATE HADOOP TABLE statement is that GRANT statements are applied for the connected user in Hive. If Hive impersonation is disabled, you must do the following steps to change the hive.security.metastore.authorization.manager property in the hive-site.xml file:
- In the Ambari dashboard, click the Hive service and then open the Configs tab.
- Click the Advanced tab.
- Expand the Advanced hive-site property.
- Scroll to find the hive.security.metastore.authorization.manager property.
- Type the following value into the field:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
- Click Save at the top of the Configs page. When the Save Configuration window opens, type a description of your change, click Save, and then click OK at the confirmation window.
- Click Restart, and then click Confirm Restart All to restart the Hive service and the related components.
- If the current schema, which might have been created with either the USE or SET SCHEMA command, does not exist in the Hive metastore, an attempt is made to automatically create the schema in Hive.
- All referential integrity constraints and nullability indicators are advisory and are not enforced at data load time.
- When you use data types such as STRING and BINARY, you can cause the SQL processor to assume that it needs to manipulate 32K of data in a column all the time. This can impact performance. Instead, use data types that contain a specific length. You can also set the bigsql.string.size property to default the STRING mapping to a smaller VARCHAR size. Set this property in one of two ways:
- Run the SET HADOOP PROPERTY command.
- Set the value globally by updating the $BIGSQL_HOME/conf/bigsql-conf.xml configuration file.
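For example, assuming a 4 KB default suits your data (the size and table are illustrative):

```sql
-- Map STRING columns to a smaller VARCHAR for tables created after
-- this point, instead of the 32K default. (The size value and the
-- table are illustrative.)
SET HADOOP PROPERTY 'bigsql.string.size' = 4096;

CREATE HADOOP TABLE notes (id INT, body STRING);
-- body is now treated as VARCHAR(4096) rather than VARCHAR(32672).
```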
- If no precision is included on the TIMESTAMP definition, the default is TIMESTAMP(9), which is compatible with a maximum precision of a TIMESTAMP in Big SQL 1.0 or Hive.
- If you create a table in a schema that does not exist, the schema is automatically created with the default name. The default name is your User ID.
- The columns and data types for an Avro table are fixed at the time that you run
the CREATE HADOOP TABLE statement. If you create the table with an EXTERNAL
schema by using the avro.schema.url configuration property,
and then change the schema, the change is not immediately recognized. This
change can result in query execution errors. To ensure that your changes are recognized, you must drop the local table from the Big SQL catalogs, and then re-import the table from the Hive catalogs. Use the following commands to drop and re-import the table:
SET SYSHADOOP.CATALOG_SYNC_MODE=0;
DROP TABLE avro_table;
SET SYSHADOOP.CATALOG_SYNC_MODE=1;
CALL HCAT_SYNC_OBJECTS('schema1', 'avro_table');
- The following table shows the mapping of the data type that is
provided in this CREATE HADOOP TABLE syntax to the data type that will be
defined in the Hive catalogs, and the native SQL type to which values will be
treated by Big SQL at runtime:
Table 1. Data type mapping

| Data type in CREATE TABLE | Hive data type | SQL data type |
|---------------------------|----------------|---------------|
| TINYINT | TINYINT | SMALLINT |
| SMALLINT | SMALLINT | SMALLINT |
| INT | INT | INT |
| BIGINT | BIGINT | BIGINT |
| DECIMAL | DECIMAL | DECIMAL |
| REAL | FLOAT | REAL |
| FLOAT | DOUBLE | FLOAT |
| DOUBLE | DOUBLE | DOUBLE |
| CHAR | CHAR | CHAR |
| VARCHAR | VARCHAR | VARCHAR |
| STRING | STRING | VARCHAR(max) |
| TIMESTAMP | TIMESTAMP | TIMESTAMP(9) |
| TIMESTAMP(n) | TIMESTAMP | TIMESTAMP(n) |
| DATE | TIMESTAMP/DATE* | DATE |
| BOOLEAN | BOOLEAN | SMALLINT |

Note: *When you use DATE STORED AS DATE, the Hive representation is DATE.

- Data type in CREATE TABLE
- This column is the data type that you use in the CREATE HADOOP TABLE table definition statement.
- Hive type
- This column is the data type by which the column is defined in the Hive metastore. It specifies the format by which values for the column are physically stored in the underlying data files for the table. For example, a column that is created as DATE is defined within the Hive catalogs as a TIMESTAMP type. Therefore, values in the underlying data files for the table must conform to the Hive (or SerDe) rules for a TIMESTAMP.
- SQL type
- This column is the actual native Big SQL data type for the column. When these columns interact with the columns from within SQL, such as in a SELECT or INSERT statement, the column must be treated as this type.
- The following syntax alternatives are supported for compatibility
with previous versions of this product and with other database products:
- You can use the ARRAY syntax that Hive uses to define an ordinary ARRAY data type. For example, you can specify ARRAY<VARCHAR(8)> instead of VARCHAR(8) ARRAY[ ].
- You can use the MAP syntax that Hive uses to define an associative ARRAY data type. For example, you can specify MAP<VARCHAR(10), DOUBLE> instead of DOUBLE ARRAY[VARCHAR(10)].
- You can use the STRUCT syntax that Hive uses to define a ROW data type. For example, you can specify STRUCT<F1:INTEGER, F2:DOUBLE> instead of ROW(F1 INTEGER, F2 DOUBLE).
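The two spellings produce equivalent declarations; a sketch (table and column names are illustrative):

```sql
-- Equivalent declarations: Hive syntax in the first table,
-- SQL standard syntax in the second. (Names are illustrative.)
CREATE HADOOP TABLE t_hive_style (
  tags  ARRAY<VARCHAR(8)>,
  score MAP<VARCHAR(10), DOUBLE>,
  pt    STRUCT<F1:INTEGER, F2:DOUBLE>
);

CREATE HADOOP TABLE t_sql_style (
  tags  VARCHAR(8) ARRAY[ ],
  score DOUBLE ARRAY[VARCHAR(10)],
  pt    ROW(F1 INTEGER, F2 DOUBLE)
);
```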
Restrictions
The ENFORCED constraint attribute cannot be applied to Big SQL Hadoop tables and results in an error (SQLSTATE 42858).
Hive object names are not case-sensitive, so two Hadoop tables whose delimited names differ only in case conflict with each other. For example, the second of the following statements fails because both names resolve to the same table in Hive:
CREATE HADOOP TABLE "Foo" (c1 INT);
CREATE HADOOP TABLE "FOO" (c1 INT);
If you use the LIKE clause and you are defining a local database table, but you enabled the SYSHADOOP.COMPATIBILITY_MODE global variable, you cannot include the LOCATION, TBLPROPERTIES, or HINTS clause.
- Do not use a ROW or ARRAY data type in a constraint or key definition (SQLSTATE 42962).
- Do not reference a table that contains a column of type ARRAY or ROW in the AS or LIKE clause (SQLSTATE 428H2).
- There is a limit of 32,763 ARRAY or ROW definitions per database (SQLSTATE 54035). This limit also includes any user-defined types that are created with the CREATE TYPE statement.
- Comparisons or assignments between ARRAY or ROW variables, and the use of ARRAY or ROW data types as parameters to stored procedures, are not supported with ARRAY columns in a Hadoop table.
- Do not use an ARRAY data type as a type within an ARRAY type for Hadoop tables.
- The result type of a column in a full-select cannot be of type ARRAY.
The use of scrollable cursors is not supported for Hadoop tables.
Examples
- Simple create statement: Create a simple HADOOP table.
CREATE HADOOP TABLE t1 ( c1 INT, c2 VARCHAR(20), c3 DOUBLE, c4 REAL);
- TEXTFILE (delimited): The following statement shows an example table definition with a delimited text file:
CREATE HADOOP TABLE t ( i int, s VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- Create a table with a BINARY and a TEXT SEQUENCEFILE:
CREATE HADOOP TABLE SEQ_TAB ( C1 INT NOT NULL, C2 VARCHAR(20)) STORED AS BINARY SEQUENCEFILE;
CREATE HADOOP TABLE T ( i int, s VARCHAR(10)) STORED AS TEXT SEQUENCEFILE;
- Use the LazySimpleSerDe SERDE class to create the table: The CREATE TABLE in the following example uses the LazySimpleSerDe SERDE class to create the table:
CREATE HADOOP TABLE T ( i int, s VARCHAR(10)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS SEQUENCEFILE;
- Create a table with RC files:The following statement is an example RC file CREATE:
CREATE HADOOP TABLE T ( i int, s VARCHAR(10)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE;
- Create a table with Parquet files:The following statement is an example CREATE for a Parquet file:
CREATE HADOOP TABLE T ( i int, s VARCHAR(10)) STORED AS PARQUETFILE;
- Create a table with an Avro file:The following statement is an example CREATE for an Avro file:
CREATE EXTERNAL TABLE bs_rev_profit_by_campaign ( Revenue DOUBLE, GrossProfit DOUBLE, CampaignName VARCHAR(10) ) COMMENT "A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement" ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/user/biadmin/avro/sheet/' TBLPROPERTIES ( 'avro.schema.literal'='{"type":"record", "name": "TUPLE_3", "fields": [ { "name": "Revenue", "type": [ "null", "double" ], "doc": "autogenerated from Pig Field Schema"}, {"name": "GrossProfit", "type": [ "null", "double" ], "doc": "autogenerated from Pig Field Schema"}, {"name": "CampaignName", "type": [ "null", "string" ], "doc": "autogenerated from Pig Field Schema"} ] }') ;
- Create a table with a partitioning column:
CREATE HADOOP TABLE orders(orderkey INT not null, o_comment varchar(20) comment 'comment column' not null ) PARTITIONED BY (orderdate varchar(10));
- Create a table with a partition that is based on a partitioning expression on a column:
CREATE HADOOP TABLE orders(orderkey INT not null, o_comment varchar(20) comment 'comment column' not null, order_date date ) PARTITIONED BY (MONTH(order_date) AS order_month);
- Creating a table AS another table (CTAS):You can use the CREATE HADOOP TABLE AS... statement to convert data to BINARY SEQUENCEFILE, which can improve performance. Assume that your original table has the following structure:
CREATE HADOOP TABLE employee (empno INT, fname VARCHAR(100), lname VARCHAR(100), dept VARCHAR(100)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Then, you create a new table based on the original table, but store the data as BINARY SEQUENCEFILE.CREATE HADOOP TABLE employeeBINSEQ (empno, fname, lname, dept) STORED AS BINARY SEQUENCEFILE AS (SELECT * FROM employee);
- Creating a table with ARRAY and STRUCT data types:The following example creates a table with complex data types:
CREATE HADOOP TABLE rf_del1 (id VARCHAR(10), c1 STRUCT<f1:varchar(10), f2:varchar(10)>, c2 ARRAY<varchar(10)>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '/' LINES TERMINATED BY '\n' ;
- Creating a Hadoop table that can ingest HBase data:See the Loading HBase data into a Hive table example for the complete SQL with LOAD.
CREATE HADOOP TABLE staff_hbase ( id SMALLINT, name VARCHAR(100), dept SMALLINT, job VARCHAR(10), years SMALLINT, salary VARCHAR(10), comm VARCHAR(10) ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping' = ':key,info:name,info:dept,role:job,role:years,money:salary,money:comm');
- Create a simple table with an ARRAY data type:This example creates a table with a column named COL1 of type ARRAY of integers. In this example, the maximum cardinality of the array is 10.
CREATE HADOOP TABLE mytable (COL1 INT ARRAY[10]);
You can select the value that is assigned for the second element in the array.SELECT COL1[2] FROM mytable;
You can return all values based on a particular value in the COL1 ARRAY.SELECT * FROM mytable WHERE COL1[8]=12;
- Create a simple table with a STRUCT data type:This example creates a table with a column COL1 of type STRUCT. The STRUCT data type contains two elements. The first element, c1, is an INTEGER, and the second element, c2, is a DOUBLE.
CREATE HADOOP TABLE mytable (COL1 STRUCT(c1 int,c2 double));
You can select the element, C1, from the COL1 STRUCT.SELECT COL1.C1 FROM mytable;
You can return all values based on a particular value in the COL1 STRUCT type.SELECT * FROM mytable WHERE COL1.C1 = 5;
- Creating a table with a ROW data type:This example creates a table with two columns, an INTEGER column and a ROW column. The ROW column contains two fields: an INTEGER field, c1, and a DOUBLE field, c2.
CREATE HADOOP TABLE mytable (id INT, col1 ROW(c1 int,c2 double));
In the SELECT example, you select the field, c1, from the row type column, col1:SELECT col1.c1 FROM mytable;
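As a sketch that follows the same pattern as the STRUCT example above (it assumes the mytable definition that precedes it), you can also return all rows based on a particular value of the c2 field in the col1 ROW column:
SELECT * FROM mytable WHERE col1.c2 = 2.5;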
- Create a table with a MAP (associative array) data type:This example creates a table with a column named col1 of type MAP, an associative array of integer values that is indexed by a VARCHAR(20) key.
CREATE HADOOP TABLE mytable (id INT, col1 INT ARRAY[VARCHAR(20)]);
The following statement returns the element value from the MAP where the index value is 'MY_STRING':SELECT col1['MY_STRING'] FROM mytable;
- Creating a view from Hadoop tables created with an ARRAY, STRUCT, or MAP:
CREATE HADOOP TABLE testview01( id INT, arr TINYINT ARRAY[5] );
CREATE VIEW view001(i,c1) AS SELECT id, arr FROM testview01 WHERE arr[1] < 8;
SELECT c1[1] FROM view001;
- The ARRAY or ROW column cannot be compared directly with another ARRAY or ROW column (SQLSTATE 42818). Only the elements in the ARRAY or ROW can be compared to another expression of a primitive type. Assume that a and b are ARRAY columns. This statement is not allowed:...where a = b...
This statement is allowed if the element type is a primitive type and not a ROW type:...where a[1] = b[1]...
Assume that c and d are ROW columns. This statement is allowed:...where c.field1 = d.field1
- Create a table using the structure of another table, but using none of the data from the source table:
CREATE HADOOP TABLE T1 (C1, C2) AS (SELECT X1, X2 FROM T2) WITH NO DATA;
- CMX compression is supported in Big SQL. The following example uses CMX by setting the compression properties before issuing the CREATE TABLE AS... statement:
SET HADOOP PROPERTY 'mapred.output.compress'='true';
SET HADOOP PROPERTY 'mapred.output.compression.type'='BLOCK';
SET HADOOP PROPERTY 'mapred.output.compression.codec'='org.apache.hadoop.io.compress.CmxCodec';
CREATE HADOOP TABLE t2 AS SELECT * FROM t1;
- A DATE data type is stored as a date only when you specify the STORED AS DATE column option; otherwise, it is stored as a timestamp. If you run a DESCRIBE command in your Hive shell, you see that the storage of column dt is DATE because of the column option, while column dt2 is stored as a timestamp:
CREATE HADOOP TABLE t5 (key INT, dt DATE STORED AS DATE, dt2 DATE);
hive> describe formatted t5;
col_name   data_type   comment
key        int
dt         date
dt2        timestamp   /*@type=date*/