LOAD HADOOP statement
Extracts data from external data sources and loads that data into a Hadoop or HBase table. Use this statement to load data from relational databases or delimited files that are in an external or local file system.
Authorization
To LOAD into a table with the APPEND option, you must have the INSERT privilege on the table. To LOAD into a table with the OVERWRITE option, you must have both the INSERT and DELETE privileges. The instance owner and the table creator already have the appropriate INSERT and DELETE privileges.
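For example, a minimal sketch of the grants that would allow a hypothetical user (user1) to load into the table myschema.SALES with either option:
-- Sufficient for LOAD ... APPEND (INSERT only)
GRANT INSERT ON myschema.SALES TO USER user1;
-- Sufficient for LOAD ... OVERWRITE (INSERT and DELETE)
GRANT INSERT, DELETE ON myschema.SALES TO USER user1;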
Syntax
- 1 The HADOOP keyword is required unless the COMPATIBILITY_MODE global variable is enabled.
- 2 The JDBC driver of the source database must be installed and Db2® Big SQL must be restarted before loading data. Alternatively, you can provide the JDBC driver JAR file location (see Adding new JDBC drivers). The JDBC JAR files are distributed to the data nodes when the Hadoop job is run.
- 3 Each data node must have connectivity to the data source server. To verify, ping the data source server from each of the cluster data nodes.
- 4 All property key names and values need to be single quoted. For example, PROPERTIES('field.delimiter'=',')
Description
By using the LOAD HADOOP USING command, you can import data from external data sources into target Db2 Big SQL tables. You can LOAD into HBase tables that are created by using the Db2 Big SQL CREATE HBASE TABLE syntax, or into Hadoop tables that are defined with the Hive HBase storage handler.
- JDBC CONNECTION source-url
-
Defines the connection as a JDBC connection and includes the connection information that is required to connect to your data source. The source-url specifies the type of remote data source, the host name or IP address of the source, the port, and the name of the database to which you want to connect. Refer to the specific database documentation for examples of JDBC connection information.
Each data node must have connectivity to the data source. Try to ping the data source server from each node of the cluster.
The following example shows a DB2® connection:
Surround the value for URL with single quotes:
'jdbc:db2://hostname.company.com:50000/data_base_name'
The URL can refer to any database that supports a JDBC connection. The type of connector is based on the source-url prefix. For example,
'jdbc:oracle:thin:@abc.com:port:...'
implies that this connection is an Oracle connection. Before you run the LOAD HADOOP USING command with a JDBC CONNECTION source-url, ensure that the system administrator has installed the JDBC driver JAR files into Sqoop, or that you provide the JDBC driver JAR file location (see Adding new JDBC drivers). For more information about setting up and using Sqoop, see the Sqoop User Guide. See also the WITH PARAMETERS 'driver' parameter.
The following list contains the currently supported databases:
- DB2
- Netezza®
- Teradata
- MS-SQL
- Oracle
- Informix®
Restriction: You cannot load from a database source into tables that are defined with ARRAY or ROW data types.
- FILE source-url
-
Defines a path to the source delimited file. The input path can be a single file or a directory that contains the files. The file system scheme of the URI that is provided, such as FTP or SFTP, must reference one of the file systems that is configured in the Hadoop configuration. If the scheme and authority section of the URI is omitted, the default file system that is defined in the Hadoop configuration is used.
Access to the source files is controlled by the HDFS file permissions, and access can also be granted explicitly. If the files are not readable by the user, the LOAD fails. You must have one of the following authorizations to access the source files:
- DATAACCESS authority.
- Privileges on the files and the directory in HDFS:
- READ privileges on the directory and all files in the directory, including files in any subdirectories that might be recursively specified in the FILE URL clause.
- EXECUTE privileges on all the directories and subdirectories in the specified FILE URL clause.
When the source-url path refers to a directory, all of the directories and files in that directory are accessed recursively. The scheme of the source-url determines the type of file access. The following types of file access can be used:
- FTP
-
Defines the source-url as an FTP path to access your source data.
The authentication information is specified as part of the source-url:
LOAD HADOOP USING FILE URL 'ftp://biadmin:password@myserver.abc.com:22/home/biadmin/mydir/abc.csv' ...
- SFTP
-
Defines the source-url as an SFTP path to access your source data.
You can specify the authentication information as part of the source-url:
LOAD HADOOP USING FILE URL 'sftp://bigsql:password@myserver.abc.com:22/home/bigsql/mydir/abc.csv' ...
Note: SFTP source is supported for Db2 Big SQL on Hortonworks Data Platform 2.6.2 and above.
- HDFS
-
Defines the source-url as an HDFS path to access your local or remote source data.
If you omit the hdfs:// scheme in the URI, the LOAD HADOOP USING command assumes that it is an HDFS path:
LOAD HADOOP USING FILE URL '/user/bigsql/mydir/abc.csv'
You can access files on a cluster that does not require authentication:
LOAD HADOOP USING FILE URL 'hdfs://myserver.abc.com:9000/user/bigsql/mydir/sample.csv' ...
- WITH PARAMETERS connection-parameters
-
Specifies parameters that are used to connect to your data source. An alternate syntax is WITH CONNECTION PROPERTIES connection-parameters.
Each parameter is a parameter name and parameter value pairing, as shown in the following example. The key must be surrounded with single quotes. Surround string values with single quotes. No quotes are needed for numerical values.
connection-parameter-name = connection-parameter-value
The following are valid parameters:
- 'credential.file'='fullpath'
- Specifies the full absolute distributed file system (DFS) path and the name of the file that
contains the credentials. This property can be used for JDBC database sources only.
The bigsql user ID and the user that runs LOAD both need to have READ access to this file.
You can use a credential file instead of the user and password parameters. The contents specify the following required name=value pairs:
- JDBC credential file
-
user=myname
password=mypass
- 'credential.file.passphrase'='passphrase'
- Specifies the passphrase to decrypt the credential file. This property must be used with the
'credential.file' property. It can be used for JDBC database sources only.
Use the credential file and passphrase with these steps:
- Create a local properties file that contains the user and password properties.
- Run the utility file-encryption.sh to encrypt the file.
- Upload the encrypted file to HDFS.
/usr/ibmpacks/current/bigsql/bigsql/libexec/file-encryption.sh -encrypt -textfile sample.properties -encryptedfile sample_encrypted.properties -passphrase testit
hadoop fs -put sample_encrypted.properties /tmp/credstore
LOAD HADOOP USING ... WITH PARAMETERS ( 'credential.file'= '/tmp/credstore/sample_encrypted.properties', 'credential.file.passphrase'='testit')
- 'user'='username'
- Specifies a user ID to connect to the data source, for example:
'user'='john'
- 'password'='password'
- Specifies a user password to connect to the data source, for example:
'password'='passwd123'
The password property can be encoded.
- 'driver'='com.somedb.jdbc.somedbDriver'
- Optionally specify the JDBC driver class to use. When you specify the JDBC driver class, the
Sqoop generic connection manager is used.
When you do not specify a JDBC driver class, a specific Apache Sqoop connection manager is selected by default. The selection is based on the connection URL database type. For example, to use the Informix database, the 'driver' value might be 'com.informix.jdbc.IfxDriver'. Note: The Informix database must have transaction logging enabled to use the LOAD HADOOP USING command with the Sqoop connection manager. You must also copy the JDBC driver JAR file for the specific database into the driver directory (see Adding new JDBC drivers). You do not need to specify the JDBC driver class for the following types of databases:
- DB2
- Teradata
- Netezza
- Oracle
- MS SQL Server
- 'driver.location'='sourcepath'
- Optionally specify the path to a JDBC driver JAR file that is needed for the LOAD HADOOP USING statement. The path must be a directory. For more information, see Adding new JDBC drivers.
- 'any.jdbc.prop1'='value'
- You can add as many JDBC connection properties as you need.
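For example, a minimal sketch that passes an extra JDBC connection property along with the standard parameters; the property name fetchsize and the table names are placeholders and depend on your source JDBC driver:
LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://hostname.company.com:50000/data_base_name' WITH PARAMETERS ( 'user'='myname', 'password'='mypass', 'fetchsize'='10000' ) FROM TABLE MYSOURCETABLE SPLIT COLUMN ID INTO TABLE mytable APPEND;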
- FROM SQL QUERY select-statement
-
Specifies the SQL query that is used to get data from the data source. The columns in the select clause must match both the names and the types of the columns in the target table definition. The query must have a WHERE clause that contains a $CONDITIONS marker. The $CONDITIONS marker is replaced when the LOAD command is run with a condition expression that is unique for each map task. In this case, the SPLIT COLUMN must be specified:
Surround the value for QUERY with single quotes. If a single quote is needed inside the value, then use two single quotes.
FROM SQL QUERY 'SELECT cust.id, cust.name, ords.itemname FROM customer cust, orders ords WHERE cust.id = ords.custid AND $CONDITIONS' SPLIT COLUMN ords.custid INTO TABLE orders_q
… FROM SQL QUERY 'SELECT ID, NAME, STATE, CITY FROM CITIES WHERE CITY=''San Jose'' AND $CONDITIONS' ...
If the SQL QUERY clause is specified, the query must contain a $CONDITIONS marker. Example queries:
Q-1: 'select * from t where c1 = 1 AND $CONDITIONS'
Q-2: 'select * from t where $CONDITIONS'
- FROM TABLE table-name COLUMNS (column-list) WHERE search-condition
-
Instead of using the SQL QUERY clause, you can specify TABLE, COLUMNS, and the WHERE clause separately. In this case, the LOAD command generates the SQL query. The table name and the column names are case sensitive. Type them in the same case that the database is expecting.
IBM® Db2 Big SQL uses <table name>, <columns>, and <search-condition> to generate a SQL statement such as the following code:
The columns list must match the order of the target table columns.
SELECT columns FROM table-name WHERE search-condition
- table-name
- Specifies the name of the source table in the exact same case as defined in
the source database.
The table name can be either a 1- or 2-part name.
- column-list
- Specifies the subset of columns, in the same case as defined in the source database, to select from the source data. If this property is not specified when a table-name is specified, all columns in the table are imported. The column names must be separated by commas:
...COLUMNS(id, name, city)...
- search-condition
- Specifies the search condition that filters the source data. Surround the value for WHERE with single quotes. If a single quote is needed inside the value, then use two single quotes.
... WHERE 'id > 99' ...
… WHERE ' CITY=''San Jose'' ' …
- SPLIT COLUMN column-name
- This option specifies the column to use to distribute data across all tasks that are
involved in the data loading process. It is not required if the number of mappers is set to 1. This
option can be omitted when using the FROM TABLE clause if the source table has a primary
key.
This property is valid for database tables only. The split column must not be of type BIGINT.
...SPLIT COLUMN customers.id...
Specifying this option helps to ensure that load operations are done in parallel. By default, the primary key column is used for splitting the data. Use the split column to select a range of values to attempt to uniformly distribute data across mappers. However, be aware that selecting a split column with few unique values might restrict the LOAD operation to using as many tasks as there are unique values in the column. The performance and efficiency of the split operation is influenced by the variety of the values.
SPLIT COLUMN is required when the number of mappers is greater than one. If SQL QUERY is specified, the split column must be one of the columns that are specified in the select column list. If SPLIT COLUMN is not specified, a primary key column is used. If there is no primary key, an error is reported and the LOAD fails.
SPLIT COLUMN is ignored when source data is extracted from DB2 DPF (Database Partitioning Feature) or Netezza, and you specify TABLE-COLUMN-WHERE. For DB2 DPF and Netezza, the data is split by the database partitions instead of by value. However, for DB2 z/OS® DPF, the data is split by value, not by database partition. In all other cases where the number of mappers is greater than one, you must specify SPLIT COLUMN.
- WITH SOURCE PROPERTIES (attribute=value)
-
These properties describe how to interpret the data in the source delimited files. Specify all properties with single quotation marks. For example: PROPERTIES('field.delimiter'=',')
These properties are valid for loading from files only.
The following properties are supported:
- 'file.charset'='<character_set_encoding_type>'
- The standard Java™ character set
encodings are supported. The default is
'UTF-8'.
...'file.charset'='UTF-16'...
- 'date.time.format' = '<format_type1>'
- You can repeat this property to define all of the accepted datetime formats
for your input file. For example,
...WITH SOURCE PROPERTIES( 'field.delimiter'=',', 'date.time.format'='yyyy-MM-dd-HH.mm.ss.S', 'date.time.format'='yyyy-MM-dd HH.mm.ss.S', 'date.time.format'='yyyy-MM-dd') INTO ...
These formats are attempted, in order, for each field that corresponds to a target TIMESTAMP or DATE column. Use the following guidelines:
- Ensure that the pattern characters follow the rules for the Java DateTimeFormatter class.
- Do not mix ambiguous patterns. For example, mixing the date patterns 'dd-MM-yy' and 'yy-MM-dd' is ambiguous, because date strings like 02-02-02 can be interpreted with either pattern. If you specify a mix of ambiguous patterns, you might see incorrect values.
- Specify the longest patterns first, because the first applicable pattern is used, and if the data is longer than the pattern, the load operation might ignore information that is located beyond the pattern.
- Specify the most detailed formats first. For example, specify a pattern that starts with a four-digit year (yyyy) before a pattern that starts with a two-digit month (MM); otherwise, when the two-digit pattern is parsed, it might be misinterpreted as a short two-digit year.
- If you have values of various lengths and styles, be sure to include multiple patterns to cover them. For example, if a value has fractional seconds, which are parsed by using the 'S' character, ensure that the number of 'S' characters in the pattern matches the number of fractional digits in the value.
- 'field.delimiter'='char'
- Defines the character that you use to separate the source fields. The
default character is a comma (,). The following examples show ways of defining the field.delimiter character:
- char
- 'field.delimiter'='|'
- octal number
- 'field.delimiter'='\001'
- escaped char
- 'field.delimiter'='\t'
- 'quote.char'='char'
- Defines the character that encloses a field that might contain delimiter characters. The default character is the double quotation mark (").
You can use this property to quote the input data string that maps to the ARRAY or ROW data type column. Do not use this property to quote ARRAY or ROW data type field values.
- 'escape.char'='char'
- Specify an escape character that you use in your source data to make the
quote character part of a string data sequence.
You can use this property with ARRAY or ROW data types. If a field value contains the quote character or the field delimiter character as part of the string data, you must escape those characters.
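For example, a minimal sketch of source properties for input in which field values are quoted and can themselves contain the quote or delimiter character; the file path, table name, and the backslash escape character are assumptions for illustration:
LOAD HADOOP USING FILE URL '/tmp/data/quoted_fields.csv' WITH SOURCE PROPERTIES ( 'field.delimiter'=',', 'quote.char'='"', 'escape.char'='\' ) INTO TABLE mytable APPEND;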
- 'skip.lines.count'='int'
- Defines the number of lines from the top of each source file to skip before reading the data.
- 'field.indexes' = '<1>,...,<N>'
- Use this property with a comma-separated list of field indexes to specify
the fields that are to be included in the LOAD. Each input field is designated
by an index number, beginning with one. The field indexes are mapped to the
target columns by position. You can use this property to select a subset of the fields and reorder the fields to match the column order of the target table.
...'field.indexes' = '1, 5, 3, 6'...
- 'allow.multiline.record'='true/false'
- Determines whether newline characters can appear within a field. Fields that contain newline characters must be enclosed in the quote character. When you use this option, if an end quote is not found, the entire file is read as a single field. The value can be either true or false. The default value is false.
When the value is set to true, the source files are not split, even if they are larger than the block size. Each file is processed by one map task.
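For example, a minimal sketch for a file in which quoted field values can span multiple lines; the file path and table name are placeholders. Remember that with this setting each source file is processed by a single map task:
LOAD HADOOP USING FILE URL '/tmp/data/multiline_notes.csv' WITH SOURCE PROPERTIES ( 'quote.char'='"', 'allow.multiline.record'='true' ) INTO TABLE mytable APPEND;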
- 'ignore.extra.fields'='true/false'
- Defines whether a file can contain more fields than target columns. The
value can be either true or false. The
default is false. If this property is set to false, any
record with more fields than target columns is rejected.
Do not use this property for the elements of ARRAY or ROW data types. If there is additional data, it is discarded for ROW and associative ARRAY data types. For ordinary ARRAY data types, you can specify the array.overflow property to handle extra items in the input.
- 'replace.with.null'='<string>'
- If your source file contains field values that map to non-string data type target columns that you want to change to null, use this parameter to specify a character string to be replaced by a null value during the LOAD process. For example, if you specify
...WITH SOURCE PROPERTIES ('replace.with.null'='NA')...
then any field value of 'NA' that is found in the source file mapping to a non-string type column is changed to null in the target table. This property applies to all non-string type columns, such as INTEGER, SMALLINT, TINYINT, DOUBLE, FLOAT, DATE, and TIMESTAMP. This property applies to loading from files.
- 'replace.string.with.null'='<string>'
- If your source file contains field values that map to string data type target columns that you want to change to null, use this parameter to specify a character string to be replaced by a null value during the LOAD process. For example, if you specify ...WITH SOURCE PROPERTIES ('replace.string.with.null'='NA')..., then any field value of 'NA' that is found in the source file mapping to a string type column is changed to null in the target table. When this property is specified, empty input fields and empty quoted strings are treated as empty strings; only an input string that matches the specified value is treated as a null value. If this property is not specified, then empty fields and empty quoted strings are treated as null values. This property applies to all string type columns, such as CHAR, CHARACTER, VARCHAR, and STRING. This property applies to loading from files. If the source file contains a specific string to represent null values for columns of all data types, such as the string 'NULL', then specify ...WITH SOURCE PROPERTIES ('replace.with.null'='NULL', 'replace.string.with.null'='NULL').... For example, assume that 'replace.string.with.null'='NA':
Input text | Db2 Big SQL table | Comments |
---|---|---|
"1","NA","Mike" | 1,NULL,Mike | Recognized as a Db2 Big SQL NULL value |
"2",,"John" | 2,,John | Recognized as an empty string |
"3","","Smith" | 3,,Smith | Recognized as an empty string |
- 'field.binary.encoding'
- Specifies the encoding of a binary field in the input file. The valid values for this property are base64 or hex. The default is base64.
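For example, a minimal sketch for input in which binary field values are written as hexadecimal strings instead of the default base64; the file path and table name are placeholders:
LOAD HADOOP USING FILE URL '/tmp/data/binary_hex.csv' WITH SOURCE PROPERTIES ( 'field.delimiter'=',', 'field.binary.encoding'='hex' ) INTO TABLE mytable APPEND;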
- 'collection.items.delimiter'='<char>'
- Defines the character that you use to separate the collection item fields of
an ARRAY or ROW data type. This property can be repeated to specify delimiters
for nested levels. LOAD supports as many levels of nesting as are supported by the CREATE TABLE statement. The order of the characters that are defined must match the nested levels of the ARRAY or ROW data type column. The default values are '\u0002', '\u0003', and '\u0004'.
- Simple ARRAY and ROW data types
- The default delimiter = '\u0002'.
- Associative ARRAY
- The default delimiter of map items = '\u0002'.
- Nested ordinary ARRAY and ROW data types
- In an ARRAY of ROWs, the default delimiter of ARRAY items = '\u0002', and the default delimiter of the ROW fields = '\u0003'.
- 'map.keys.delimiter'='<char>'
- Defines the character that you use to separate the map key and the value. This delimiter is an alias for the second-level delimiter for collection items of an ARRAY or ROW data type. The default value is '\u0003'.
- INTO TABLE table-name
- The
table-name parameter specifies the target table into which your
source data is loaded. You can
optionally qualify the table-name with the schema
(schema-name.table-name). If a schema is not specified as part of
the table-name, the current schema is used. An error is returned if
the table-name does not exist. Create the target table before you run
the LOAD HADOOP USING statement.
The table type can be any Hadoop table. For example, you can load data into an HBase table that is defined by using a Hive HBase storage handler.
- PARTITION (partitions)
- A partition is a list of paired column names and values. A partition column name must
match a partition name in the target table. You specify a partition column value as you would a
predicate specification. The following examples illustrate a partition-value:
- For an integer
-
age = 25
- For a variable character
-
name = 'John Doe'
Tables are partitioned by using the CREATE and ALTER TABLE statements. PARTITION is applicable only for a partitioned Hadoop target table.
- If you specify none or only some of the target table partition columns as LOAD partitions, Db2 Big SQL LOAD discovers the missing partition names, assigns values from the source data, and creates new partitions for the missing partitions.
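For example, a minimal sketch that loads one file into a single partition of a hypothetical Hadoop table sales_by_year that is partitioned by an INTEGER column named year:
LOAD HADOOP USING FILE URL '/tmp/data/sales_2020.csv' INTO TABLE sales_by_year PARTITION ( year = 2020 ) APPEND;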
- APPEND or OVERWRITE
- Specifies whether to APPEND source data to the data in the target
table or to OVERWRITE the data in the target table with the source
data. The rules for APPEND and OVERWRITE depend on
several factors:
- When loading into HBase tables, if the row with the key already exists, it is overwritten with new data. If the row does not exist, the new row is added to the table. For HBase tables, the OVERWRITE option does not apply. An invalid option error is issued.
- For Hadoop tables that are not partitioned, you can specify APPEND or OVERWRITE.
- For Hadoop tables that are partitioned, if partition values are specified for all of the partition keys, then the specified partition must be empty.
- If you specify no partition values, or do not specify partition values for all of the partition keys, you can specify APPEND or OVERWRITE. For tables with partitions in default locations (under the table path), Db2 Big SQL LOAD discovers missing partition names and values and then creates the required partitions. However, a Hadoop table can have partitions in locations other than the table location. In this case, when partitions have non-default locations (not under the table path), do not specify the OVERWRITE clause.
- WITH TARGET TABLE PROPERTIES (attribute=value)
- These properties can apply to loading from all sources. The key must be surrounded
with single quotes. Surround string values with single quotes. No quotes are needed for
numerical values.
- Properties that are common to all table types
-
- 'column.names'='col1,...,colN'
-
This target table property applies to all types of tables including HBase tables.
This property defines the list of target columns to be loaded. When this property is set, the source and target columns are matched by position. The number of source columns selected to import should match the number of target columns that are specified here. This option can be used to specify a subset of the columns in the target table. Omitted columns are set to null. This property can apply to files or tables.
This option can be used with the 'field.indexes' option to map specific fields in a file to specific columns in the target table.
...'column.names'='NAME, DEPT, ID'...
For files, you can specify ARRAY or ROW data type columns as target columns. But, you cannot specify the individual elements of the ARRAY or ROW data type column.
- 'array.overflow'= 'keep|reject|truncate'
- Use this property to specify the LOAD actions if the input has more items
than the target ARRAY data type length. The value can be
keep, reject, or
truncate.
- keep
- If you specify 'array.overflow'='keep', then all of the items are kept.
- reject
- If you specify 'array.overflow'='reject', then the row is rejected if the input contains more items than the ARRAY length.
- truncate
- If you specify 'array.overflow'='truncate', extra items in the input are discarded. These items are not loaded into the table.
- 'reject.nulls'='true|false'
-
By using this property, you can reject records that have a NULL value for a target column that does not allow NULL values. The value can be true or false. The default is true. If you specify 'reject.nulls'='true', then a row that contains such a NULL value is rejected. Disable the property by setting 'reject.nulls'='false'.
This applies to NULL values, not values that contain the text "NULL".
- 'decimal.overflow'='reject|round_up|round_down'
-
By using this property, you can enforce the precision and scale for DECIMAL target data types. The value can be reject, round_up, or round_down. The default is reject.
- reject
- If you specify 'decimal.overflow'='reject', then the row is rejected if the precision or scale does not match the target.
- round_up | round_down
- If you specify 'decimal.overflow'='round_up' or 'decimal.overflow'='round_down', the input value is rounded up or down to match the target precision and scale.
After rounding, if the value still does not fit the target precision and scale, then the record is rejected.
You can use this property for ARRAY and ROW fields. For example, if a ROW type field is defined as DECIMAL, and the input value does not fit into the defined scale, the row is rejected if you specify 'decimal.overflow'='reject'.
- 'binary.overflow'='reject|keep'
- Specifies whether the row is accepted or rejected if the length exceeds that of the target.
- reject
- Specifies that the row is rejected if the length exceeds that of the target. This is the default.
- keep
- Specifies that the row is accepted even if the length exceeds that of the target. However, the column value is truncated at read time, and a warning message is written to the bigsql.log file during select processing.
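For example, a minimal sketch that keeps rows whose binary values are longer than the target column definition; the file path and table name are placeholders:
LOAD HADOOP USING FILE URL '/tmp/data/images.csv' INTO TABLE mytable APPEND WITH TARGET TABLE PROPERTIES ( 'binary.overflow'='keep' );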
- Properties that are specific to Hadoop tables
-
- 'drop.delimiter.chars'='true'|'false'
- Indicates that the characters \n, \r, and \001 are to be dropped from string data. This property is applicable only for tables that use the TEXTFILE storage format. The LOAD HADOOP USING statement fails if you use this property with any storage format other than TEXTFILE.
You can use this property with ARRAY or ROW data types.
- 'replace.delimiter.chars'='<char>'
- Indicates that the characters \n, \r, and \001 in the string data are to be replaced by the specified character. This property and drop.delimiter.chars are mutually exclusive. Warning: If you do not specify either property, and the data includes \n, \r, or \001 as delimiters, the data can become unreadable as part of the LOAD. This property is applicable only for tables that use the TEXTFILE storage format. The LOAD HADOOP USING statement fails if you use this property with any storage format other than TEXTFILE.
You can use this property with ARRAY or ROW data types.
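For example, a minimal sketch that replaces any embedded \n, \r, or \001 characters in string data with a space while loading into a hypothetical TEXTFILE table; the replacement character, file path, and table name are assumptions for illustration:
LOAD HADOOP USING FILE URL '/tmp/data/comments.csv' INTO TABLE t_textfile APPEND WITH TARGET TABLE PROPERTIES ( 'replace.delimiter.chars'=' ' );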
- Properties that are specific to HBase tables
-
- hbase.load.method
- The possible values are put or bulkload.
The default value is put. This property can apply to files
or tables.
When bulkload is used, Db2 Big SQL outputs the data in the internal data format of HBase. It then directly loads the generated files. If you import into an empty table and the amount of data is large, consider pre-splitting the data for performance. A temporary directory is created under the bigsql user directory in the distributed file system (DFS). You can change this directory by using the hbase.bulkload.workdir property.
If the put option is used, then individual records are inserted into the HBase table. With this option you can also specify the hbase.disable.wal parameter.
- hbase.disable.wal
- This parameter determines whether write-ahead logging for HBase puts is disabled. By disabling write-ahead logging, you increase the performance of write operations, but you can incur data loss if the region servers fail. Your client application is responsible for ensuring data consistency when you use this option. The value is either true or false. The default is false.
- hbase.bulkload.workdir
- This optional property specifies a temporary directory for creating hfiles during the load into HBase tables using the bulkload option. If this property is not used, a temporary directory is created under the bigsql user directory.
- hbase.timestamp
- This property can be used to specify the version for all the source rows. The specified value must be a long integer. By default, the system uses the current server time in milliseconds.
- WITH LOAD PROPERTIES properties
- The key must be surrounded with single quotes. Surround string values with single
quotes. No quotes are needed for numerical values.
- 'rejected.records.dir'
- This property defines a target location for records that are not imported or
loaded into the target table. The directory can be defined in the HDFS file system
only. The default is that no directory is defined. This property can apply to
files or tables.
For example, records might be rejected because the data types do not match between the source and the target.
When the LOAD completes, and your source table contains records that cannot be loaded, an informational message is issued that specifies the number of records that were rejected. If the directory is specified, then the reject record files are created in the directory.
- 'num.map.tasks'
-
The maximum number of map tasks to create. If not specified, the default is 4. However, if the source files are local to the cluster, the default number of map tasks is 80% of the cluster map task capacity.
- 'num.reduce.tasks'
-
The number of reduce tasks that are used to perform the LOAD operation. If not specified, the default is the same as the number of map tasks that are specified or the default number of map tasks. For a file source that requires reducer tasks, and no value for num.reduce.tasks is specified, then 80% of the cluster reducer task capacity is used.
- 'max.rejected.records'
- Maximum number of rejected records that are allowed before the load process ends with an error. The default is 100.
Usage notes
- For a guide to diagnosing and correcting typical LOAD HADOOP problems, see How to diagnose and correct LOAD HADOOP problems.
- If you are using the Db2 Command Line Processor (CLP) to run the LOAD statement, you must submit the LOAD by using the EXECUTE IMMEDIATE statement. In that case, you must escape each single quote in the LOAD statement by prefixing it with another single quote. For example, the following statement shows how to submit the LOAD statement in example 20 through the CLP:
begin execute immediate 'LOAD HADOOP USING FILE URL ''tmp/example.csv'' WITH SOURCE PROPERTIES (''replace.with.null''=''NULL'') INTO HIVE TABLE t1 APPEND'; end
- The SQLCA count is not set by LOAD, so the rows-affected count that is listed in the SQLCA always indicates 0.
- When the LOAD completes successfully, you receive an informational message (SQL5108) with the number of rows loaded in the target, the number of input records, the number of rejected records, and the number of lines skipped. The message contains the Hadoop JOB ID for reference. You can use the JOB ID to find information about the LOAD Hadoop job, and then refer to the map reduce tasks to find information about the rejected records, and additional details about the job in the bigsql.log file. You can use the log entry identifier [BSL-x-xxxxxxxxx] to refer to the load failure in the bigsql.log.
- When the LOAD
completes with errors, you receive a message that indicates the problem. If the error occurred after
the Hadoop job started, the JOB ID is shown. You can use the JOB ID to retrieve the map task logs,
which have additional information.
If the maximum number of rejected records is exceeded, the Hadoop job terminates and no data is loaded (SQL5109N). The actual number of rejected records might be more than the maximum, depending on the time at which the job terminates.
Because the number of rejected records is checked periodically, it is possible that the job will complete successfully before it can be terminated. In this case, SQL5108 is returned with the number of rejected records.
- When you use the LOAD HADOOP USING statement, the number of columns that are specified in the source table must be less than or equal to the number of columns in the target table. In general, the columns in the target table should exactly match the columns in the source relational table, such that they have the same name and they are in the same order.
- If 'num.map.tasks' is greater than one, and if you specify the SQL QUERY clause, then you must also specify SPLIT COLUMN. If 'num.map.tasks' is more than 1 and if you specify the TABLE clause, then you must also specify either a SPLIT COLUMN or the table must have a primary key. Split columns and primary keys are used by Db2 Big SQL to parallelize the load operation.
- When you specify table names and column names, the case (upper, lower, mixed) of each character of the name must match the corresponding character with which the name is specified in the source database.
- The use of single or double quotations:
- Properties key-value pairs
- The key needs to be surrounded with single quotes. For string values, surround
with single quotes. For numerical values, no quotes are
needed.
'num.map.tasks' = 20 'date.time.format'='dd-MM-yy'
- Identifiers for table or column names
- Quotations are optional for table and column names. Each part of the name can be surrounded with double quotes.
- String values (URL, QUERY, WHERE)
- Use single quotes surrounding the value for the URL, QUERY and WHERE. If a
single quote is needed inside the value, then use two single
quotes.
... WHERE ' CITY=''San Jose''' ... … SQL QUERY 'SELECT ID, NAME, STATE, CITY FROM CITIES WHERE CITY=''San Jose'' AND $CONDITIONS' ...
- Interpreting data values
- BOOLEAN
- When you load data from a file, the following case-insensitive Boolean string values are interpreted as true:
- "1"
- "yes"
- "true"
- "on"
- DATE and TIMESTAMP
- When you load data from a file, the date.time.format pattern can be used to interpret the format of the data.
- When you load source files that are in a compression format, the files cannot be split. If the file is not compressed and is on the cluster, then the file can be split by block size. When data is loaded from a remote file system, the source files are not split even if they are larger than the block size. The source files are distributed across the map tasks by file size to evenly distribute the work.
- Target table data is compressed based on the following Hadoop compression properties:
- mapreduce.output.fileoutputformat.compress
- mapreduce.output.fileoutputformat.compress.codec
Set these properties before you run the LOAD HADOOP USING command by using the following Db2 Big SQL commands:
SET HADOOP PROPERTY mapreduce.output.fileoutputformat.compress='true'; SET HADOOP PROPERTY mapreduce.output.fileoutputformat.compress.codec='com.ibm.biginsights.compress.CmxCodec';
- After a LOAD completes successfully, the ANALYZE command is run automatically on the table. With the auto analyze feature turned on, which is the default, the following command is run:
ANALYZE TABLE schema.table COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM(10);
If you want to modify the auto analyze after LOAD feature, set the configuration property biginsights.stats.auto.analyze.post.load to ONCE, NEVER, or ALWAYS in the bigsql-conf.xml file, or set a session property by using SET HADOOP PROPERTY. For more information about the auto analyze feature of ANALYZE, see Enabling or disabling analyze after load.
Restrictions
- If only the /apps/hbase/data directory is encrypted, INSERT ... SELECT statements and bulk LOAD operations into HBase tables do not work.
- The LOAD HADOOP USING statement populates Hadoop and HBase tables only. It cannot populate native database tables.
- If global impersonation is enabled, a user <username> can run LOAD HADOOP successfully only if the following conditions are true:
- The HDFS directory, /user/<username>/, must exist with READ, WRITE, and EXECUTE permissions for the user.
- <username> must exist on each node of the cluster.
- <schemaName>, <tableName>, and <colName> must be specified in the exact case as defined in the database.
- You cannot issue concurrent LOAD HADOOP USING statements to load a single table. Running multiple LOAD HADOOP USING statements in parallel against one table results in an error.
- A load Hadoop operation does not automatically generate values for the partition-expression column when a table is created with the PARTITIONED BY partition-expression clause.
Adding new JDBC drivers
You can make the JDBC driver for a data source available to LOAD in any of the following ways:
- By copying the JDBC driver JAR file into Sqoop and restarting the Db2 Big SQL service.
- By setting the bigsql.load.jdbc.jars parameter in the
bigsql-conf.xml file. For
example:
<property> <name>bigsql.load.jdbc.jars</name> <value>/tmp/jdbcdrivers</value> <description>Directory to store JDBC driver JAR files for LOAD</description> </property>
- By setting the driver.location parameter on the LOAD HADOOP statement. For
example:
LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://hostname.company.com:50000/data_base_name' WITH PARAMETERS ( 'user'='myname', 'password'='mypass', 'driver.location'='/tmp/myjdbcjars' ) FROM TABLE myDB2Table SPLIT COLUMN COLA INTO TABLE myBITable APPEND
Mapping source JDBC type to the target Db2 Big SQL type
The following table shows the mapping from the source JDBC type to the target Db2 Big SQL type. The native data type to JDBC type is JDBC driver dependent.
Source JDBC type | Db2 Big SQL data type |
---|---|
INTEGER | INTEGER |
VARCHAR, LONGVARCHAR, NVARCHAR, LONGNVARCHAR | VARCHAR |
CHAR, NCHAR | CHAR |
NUMERIC, TIME | STRING |
DECIMAL | DECIMAL |
BOOLEAN, BIT | BOOLEAN |
TINYINT | TINYINT |
SMALLINT | SMALLINT |
BIGINT | BIGINT |
FLOAT, DOUBLE | DOUBLE |
REAL | FLOAT |
TIMESTAMP, DATE | TIMESTAMP |
BINARY | BINARY |
VARBINARY | VARBINARY |
BLOB, CLOB, LONGVARBINARY, ROWID, REF, DATALINK, ARRAY, STRUCT, MAP, and database-specific JDBC types such as INTERVAL and SERIAL | Not supported |
Access restrictions to rejected records
Some details about accessing rejected records:
- If the rejected records folder exists prior to running LOAD, both you and the bigsql user must have READ-WRITE-EXECUTE access to the folder; otherwise, LOAD fails. If you can access the folder, the rejected records file is created with the same permissions as those of the existing folder.
- With impersonation, the ACL entries on the rejected records file always grant exclusive READ-WRITE-EXECUTE access to you only (that is, the user who initiates the LOAD).
- Without impersonation (disabled), the ACL entries on the rejected records file grant exclusive READ-WRITE-EXECUTE access to the bigsql user only. The ACL entries on the rejected records folder remain unchanged.
- If the rejected records folder does not exist:
- Without impersonation, when you run LOAD it creates the rejected records folder (specified in rejected.records.dir) with exclusive READ-WRITE-EXECUTE permissions for the bigsql user only. No other user has access to that file or folder. If you require access, you must create the folder manually (outside of LOAD) and set RWX permissions for both yourself and the bigsql user (for example, by setting group access) for the folder. The new rejected records file always inherits the folder permissions and is created with RWX permissions for the bigsql user only.
- With impersonation, when you run LOAD it creates the rejected records folder with READ-WRITE-EXECUTE permissions exclusively for you. No other user has access to the folder and file. The new rejected records file always inherits permissions from the folder. With impersonation, the rejected records file is accessible to you after LOAD runs.
With impersonation, the ACL entries on the rejected records folder and file are always set with exclusive READ-WRITE-EXECUTE access for you only (the user who initiates the LOAD). Without impersonation (disabled), the ACL entries on the rejected records folder and file are set with exclusive READ-WRITE-EXECUTE access for the bigsql user only.
When you run LOAD and specify rejected.records.dir, the rejected records (column values) are not written into the map-reduce logs. The rejected records are only written into the logs when rejected.records.dir is not specified.
A common practice is to create a rejected records folder with RWX permissions for both yourself and the group, and to have yourself and the bigsql user in the same group.
Examples
- Loading from a DB2 table. This example shows the SQL QUERY clause.
CREATE HADOOP TABLE STAFF_Q ( ID SMALLINT, NAME VARCHAR(9), DEPT SMALLINT, JOB VARCHAR(5), YEARS SMALLINT, SALARY STRING, COMM STRING); LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://myhost:50000/SAMPLE' WITH PARAMETERS ( 'user' = 'myuser','password'='mypassword') FROM SQL QUERY 'SELECT * FROM STAFF WHERE YEARS > 5 AND $CONDITIONS' SPLIT COLUMN ID INTO TABLE STAFF_Q APPEND ;
- Loading from a DB2 table into a partitioned Db2 Big SQL table. This example shows the SPLIT COLUMN clause. It uses target table properties and multiple load properties.
CREATE HADOOP TABLE staff ( id SMALLINT , name VARCHAR (9), job VARCHAR(5)) partitioned by (years SMALLINT ); LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://myhost:50000/SAMPLE' FROM TABLE STAFF COLUMNS(ID,NAME,JOB,YEARS) WHERE 'YEARS is not null' SPLIT COLUMN ID INTO TABLE staff APPEND WITH TARGET TABLE PROPERTIES ( 'delimited.drop.loaddelims'='true') WITH LOAD PROPERTIES ( 'max.rejected.records'=50, 'num.map.tasks' = 2) ;
- Load from a file that is on the distributed file system in the directory /tmp/data/:
If no protocol is used in the URL value, the path is assumed to be a distributed file system (DFS) file.
CREATE HADOOP TABLE STAFF_F ( ID SMALLINT, NAME VARCHAR(9), DEPT SMALLINT, JOB VARCHAR(5), YEARS SMALLINT, SALARY VARCHAR(20), COMM VARCHAR(20)) ; LOAD HADOOP USING FILE URL '/tmp/data/staff.csv' WITH SOURCE PROPERTIES('field.delimiter'=':') INTO TABLE TEST.STAFF_F APPEND ;
- Loading from a Teradata table. This example shows how to load a table from Teradata. The example assumes that the table country_info exists.
LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:teradata://myhost/database=GOSALES' WITH PARAMETERS ( 'user'='myuser','password'='mypass') FROM SQL QUERY 'SELECT * FROM COUNTRY WHERE SALESCOUNTRYCODE > 6 AND $CONDITIONS' SPLIT COLUMN SALESCOUNTRYCODE INTO TABLE country_info APPEND;
- Load from the contents of a directory. This example shows how to load the contents of files in a directory that is part of your local file system into a Db2 Big SQL table:
LOAD HADOOP USING FILE URL 'sftp://biadmin:biadmin@myserver.abc.com:22/home/biadmin/mydir' WITH SOURCE PROPERTIES ( 'field.delimiter'=' ', 'file.charset'='UTF-8', 'skip.lines.count'='1') INTO TABLE staff_table PARTITION ( years='10') APPEND ;
- Load from a database source that does not have its drivers installed. In this example, you must explicitly specify the driver name.
LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:informix-sqli://myhost:1550/TESTDB:INFORMIXSERVER=infxservername;DELIMIDENT=y;' WITH PARAMETERS ( 'user'='dbuser','password'='dbpw', 'driver'='com.informix.jdbc.IfxDriver' ) FROM SQL QUERY 'select * from a_table WHERE $CONDITIONS' SPLIT COLUMN a_key INTO TABLE a_table_infx APPEND;
- Load from a file and include a path for the rejected records.
LOAD HADOOP USING FILE URL 'sftp://user1:password1@host1/path/to/file.data' WITH SOURCE PROPERTIES ( 'date.time.format' = 'yyyy-MM-dd HH:mm:ss', 'date.time.format' = 'yyyy-MM-dd', 'field.delimiter'=',', 'skip.lines.count'=1) INTO TABLE base_staff1 APPEND WITH TARGET TABLE PROPERTIES ('column.names' = 'c1,c2') WITH LOAD PROPERTIES ('num.map.tasks' = 1, 'max.rejected.records'=500, 'rejected.records.dir' = '/temp/reject_records') ;
- Load from a file on the distributed file system and replace a character string with the null value.
LOAD HADOOP USING FILE URL '/tmp/input/staff_na.csv' WITH SOURCE PROPERTIES ( 'replace.with.null'='NA', 'ignore.extra.fields'='true') INTO TABLE staff_file OVERWRITE WITH LOAD PROPERTIES ('num.map.tasks' = 1);
- Granting privileges to LOAD. If you do not have database administrative authority, you cannot load data into a table. If you created the table, you automatically have the proper authority. If you are the DBADMIN, you can grant authority to other users by using the following command:
GRANT INSERT, SELECT, DELETE ON myschema.SALES TO USER user1
- Dynamically loading into a partitioned table. This example adjusts both the number of map tasks and the number of reduce tasks.
LOAD HADOOP USING FILE URL '/tmp/lineitem/' WITH SOURCE PROPERTIES ( 'field.delimiter'='|', 'ignore.extra.fields'='true', 'field.indexes'='1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,11' ) INTO TABLE lineitem_parquet APPEND WITH LOAD PROPERTIES ( 'num.map.tasks'='20', 'num.reduce.tasks' = '10' ) ;
- Load from a directory into an HBASE table by using the PUT method.
In this example, you use the PUT load method, which is compatible with the property to disable write-ahead logging. You specify the column.names property to indicate where the data is loaded in the target table. By using multiple date.time.format source properties, LOAD can recognize different date format values in the input.
LOAD HADOOP USING FILE URL '/input/orderdata' WITH SOURCE PROPERTIES ( 'date.time.format' = 'yyyy-MM-dd hh:mm:ss', 'date.time.format' = 'yy-MM-dd' ) INTO TABLE staff WITH TARGET TABLE PROPERTIES ( 'hbase.load.method' = 'put', 'hbase.disable.wal' = 'true', 'column.names' = 'id,name,purchasedate' ) WITH LOAD PROPERTIES ( 'num.map.tasks' = 10, 'max.rejected.records'=500, 'rejected.records.dir' = '/tmp/reject_records' ) ;
- Load from a Teradata table into an HBase table by using target table properties and load properties. This example shows how to load a table from Teradata. The statement specifies a list of target table properties and load properties for HBase.
CREATE HBASE TABLE country_hbase ( salescountrycode INTEGER, country VARCHAR(50),currencyname VARCHAR(50)) COLUMN MAPPING ( KEY MAPPED BY (salescountrycode), cf:country MAPPED BY (country), cf:currencyname MAPPED BY (currencyname) ); LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:teradata://myhost/database=GOSALES' WITH PARAMETERS ( 'user'='myuser','password'='mypassword') FROM TABLE 'GOSALES'.'COUNTRY' COLUMNS (SALESCOUNTRYCODE, COUNTRY, CURRENCYNAME) WHERE 'SALESCOUNTRYCODE > 9' SPLIT COLUMN 'SALESCOUNTRYCODE' INTO TABLE country_hbase WITH TARGET TABLE PROPERTIES( 'hbase.timestamp' = 1366221230799, 'hbase.load.method'='bulkload' ) WITH LOAD PROPERTIES ( 'num.map.tasks' = 10, 'num.reduce.tasks'=5 ) ;
- Load HBase data into a Hive table that was defined with the HBase storage handler. This example begins by creating a table with the HBase storage handler and the hbase.columns.mapping parameters:
CREATE HADOOP TABLE staff_hbase ( id SMALLINT, name VARCHAR(100), dept SMALLINT, job STRING, years SMALLINT, salary STRING, comm STRING ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping' = ':key,info:name,info:dept,role:job,role:years,money:salary,money:comm');
Then, you can load the HBase data by using the hbase.load.method parameter:
LOAD HADOOP USING FILE URL '/testdata/staff' WITH SOURCE PROPERTIES ( 'field.delimiter' = ':' ) INTO TABLE staff_hbase WITH TARGET TABLE PROPERTIES ( 'hbase.load.method' = 'bulkload' ) ;
- Load from a file and include a control for matching the precision and scale of DECIMAL values:
LOAD HADOOP USING FILE URL '/tmp/file.data' WITH SOURCE PROPERTIES ( 'date.time.format' = 'yyyy-MM-dd HH:mm:ss', 'date.time.format' = 'yyyy-MM-dd', 'field.delimiter'=',', 'skip.lines.count'=1) INTO TABLE base_staff1 APPEND WITH TARGET TABLE PROPERTIES ('column.names' = 'c1,c2', 'decimal.overflow'='round_up') WITH LOAD PROPERTIES ('num.map.tasks' = 1, 'max.rejected.records'=500, 'rejected.records.dir' = '/temp/reject_records') ;
- Load data into a table that contains an ARRAY and a ROW data type. Define the table with three columns. Column id is an INTEGER. Column c1 is an array of INTEGERs. Column c2 is a ROW type that contains an INTEGER data type and a VARCHAR data type. The LOCATION keyword defines a table with pre-existing data.
CREATE HADOOP TABLE t1( id INT, c1 INT ARRAY[4], c2 ROW(f1 int, f2 VARCHAR(8)) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' LOCATION '/users/user1/tables/t1' ;
The data is in the file table1.csv and contains the following contents:
0,0|1|1|2,25|Data0025
1,1|1|2|3,30|Data0030
2,2|3|5|8,50|Data0050
Run the LOAD statement to load the data into the appropriate columns:
LOAD HADOOP USING FILE URL 'sftp://biadmin:biadmin@myserver.com:22//home/user1/data/table1.csv' WITH SOURCE PROPERTIES ( 'field.delimiter' = ',', 'file.charset' ='UTF-8' ) INTO HIVE TABLE t1 APPEND WITH LOAD PROPERTIES ( 'num.map.tasks' = 1 ) ;
Get the results from the id column, the fourth entry in the array from the column c1, and the field called f2 from the column c2:
SELECT id, c1[4], c2.f2 FROM t1
- Load a table by using the "round_up" property to match the target precision and scale.
CREATE HADOOP TABLE hvalidate_data_tunc_roundup_nn ( DEC52_COL DECIMAL(5,2), VAR3_COL VARCHAR(3), CHAR1NN_COL VARCHAR(1) NOT NULL, DEC100_COL DECIMAL ); LOAD HADOOP USING FILE URL '/tmp/data/validate_data.txt' INTO TABLE hvalidate_data_tunc_roundup_nn APPEND WITH TARGET TABLE PROPERTIES( 'decimal.overflow'='round_up', 'reject.nulls'='true' ) ;
- Load from a delimited data file into a table that is defined with a ROW data type. The fields of the ROW are delimited by a colon (:). Create the table:
CREATE HADOOP TABLE customers( name VARCHAR(10), address ROW( street VARCHAR(30), city VARCHAR(20), zip INT ) ) ;
The data to input looks like the following text:
John Doe, "555, Bailey Avenue:San Jose:95148"
Daniel Smith , "1001 First Avenue:New York:10022"
Then, load the data:
LOAD HADOOP USING FILE URL '/testdata/customers.csv' WITH SOURCE PROPERTIES ( 'collection.items.delimiter' = ':' ) INTO TABLE customers;
- Load from a delimited data file into a table defined using an ARRAY data type. Create the three-column table. The third column is an ARRAY of VARCHAR data.
CREATE HADOOP TABLE product_usage( customer_name VARCHAR(100), product_name VARCHAR(100), versions VARCHAR(4) ARRAY[3]);
The following is the data that will be loaded:
Aerospace systems, BigInsights,2.1 | 3.0 | 3.1
Advanced Technologies, DB2,10.1 | 10.2 | 9.1
Load the data:
LOAD HADOOP USING FILE URL '/testdata/product_usage.csv' WITH SOURCE PROPERTIES ( 'collection.items.delimiter' = '|' ) INTO TABLE product_usage WITH TARGET TABLE PROPERTIES ( 'array.overflow' = 'keep' ) ;
- Load from a delimited data file into a table. The table is defined with a ROW data type nested inside of an associative ARRAY data type. Create the table with two columns. The first column is a VARCHAR(100), and the second column is an ARRAY of ROWs.
CREATE HADOOP TABLE social_data ( personname VARCHAR(100), FRIENDS ROW(CITY STRING, ZIP INT, EMAIL STRING) ARRAY[VARCHAR(20)]);
The following text is the data that is loaded:
John, Mike:New York#10001#mike@example.com| Smith: Denver#80123#smith@yahoo.com
David, Jennifer: San Francisco#94102#jenn@example.com
The personname value in the first row is John. The following flow describes the data in terms of the SOURCE PROPERTIES that will be used in the LOAD statement:
- field.delimiter
- The columns or fields in the data are separated by a comma for the purposes of the LOAD statement.
- collection.items.delimiter
- The second column contains the FRIENDS of John. This second column is a collection of friends that includes a collection of information about the friends. Each item or friend is separated by the '|' symbol for the purposes of the LOAD statement.
- map.keys.delimiter
- Each item or friend begins with a key, because this table defines the collection as an associative ARRAY. The data shows that John has 2 friends: Mike and Smith. The keys are separated from the remainder of the information by a colon (:).
- collection.items.delimiter
- The second collection contains items within it that provide information about the friends. The items in the second collection, defined as CITY, ZIP, and EMAIL, are separated from each other by a pound sign (#).
Load the data. Notice the order of the SOURCE PROPERTIES, where the first property is the outermost reference, and the last property is the innermost reference.
LOAD HADOOP USING FILE URL '/testdata/social_data.csv' WITH SOURCE PROPERTIES( 'field.delimiter' = ',' , 'collection.items.delimiter' = '|' , 'map.keys.delimiter' = ':' , 'collection.items.delimiter' = '#' ) INTO TABLE social_data ;
Remember that you cannot select a whole ARRAY or ROW column, but you can select results in several different ways:
- Return the elements of the ROW:
SELECT personname, FRIENDS[1].CITY, FRIENDS[1].ZIP, FRIENDS[1].EMAIL FROM social_data;
- Return all of the elements in the
ARRAY:
SELECT personname, FRIEND.CITY, FRIEND.ZIP, FRIEND.EMAIL FROM social_data, UNNEST(FRIENDS) AS F(FRIEND);
- Return elements based on
predicates:
SELECT personname, FRIENDS[1].CITY, FRIENDS[1].ZIP, FRIENDS[1].EMAIL FROM social_data WHERE FRIENDS[1].CITY = 'New York';
- CMX compression is supported on Db2 Big SQL LOAD. The following example uses CMX on the LOAD after setting the compression properties:
SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress'='true'; SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress.type'='BLOCK'; SET HADOOP PROPERTY 'mapreduce.output.fileoutputformat.compress.codec'='org.apache.hadoop.io.compress.CmxCodec';
LOAD HADOOP USING FILE URL 'tmp/example.csv' WITH SOURCE PROPERTIES ('replace.with.null'='NULL') INTO HIVE TABLE t1 APPEND ;
- Change the auto analyze feature, so that after a successful LOAD statement, ANALYZE is never
run:
SET HADOOP PROPERTY biginsights.stats.auto.analyze.post.load=NEVER;