Tips to help load successfully

Use the information in this topic to help load data successfully.

Choose a distribution key and declare columns correctly

Before you create your table, perform the following tasks:
  • Choose a distribution key. If you know the primary key or a column that is used frequently in joins, use that one. Use a distribution key with the highest selectivity. For more information about distribution keys, see the IBM® Netezza® System Administrator’s Guide.
  • Declare any column that does not contain null data (or that should not contain null data) as not null. The system processes not null columns more quickly.
  • Check whether you have numeric fields. Did you declare them as byteint, smallint, int4, int8, or numeric(p,s)? For large tables, the smaller the storage that each column requires, the better.
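
For example, a table created along these lines follows the preceding guidelines; the database, table, and column names are hypothetical:

nzsql -d dev -c "
CREATE TABLE loan (
    loan_id  int8          NOT NULL,  -- primary key; also the distribution key
    cust_id  int4          NOT NULL,  -- joined frequently, declared not null
    term     smallint      NOT NULL,  -- smallest integer type that fits the data
    amount   numeric(12,2) NOT NULL
)
DISTRIBUTE ON (loan_id);"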

Determine your data format

Check the following items when you are determining the format of your data:
  • Check the number of data fields in each input line of the data file. Does the target table definition have the same number of columns?
    • If the table has fewer columns than the file has fields, is it acceptable to extend the schema with filler columns? If not, the load will not succeed.
    • If the table has more columns than the file has fields, is it acceptable to insert null values into the extra columns? If it is, specify the -fillRecord option.
  • Check the field delimiter, which is the character that separates one field value from the next. The delimiter must not appear inside any field value, especially in a char or varchar string. To specify the field delimiter, use the -delim option.
  • Check whether there are any null values in the data source and how a null value is represented in the data file. The RDBMS industry convention is to use the string “null” to represent a null value. If the data file uses a different representation, use the -nullValue option to override the default. The new value can be an empty string or a string of as many as four characters in the range a - z or A - Z.
  • Check whether there are any date, time, time with time zone, or timestamp data types in the table schema. If there are, what format do the date and time values use? The format of these values must be consistent throughout the data file.
  • Check the handling of string fields for char() or varchar() data types:
    • Does the longest value fit into the storage of the char() or varchar() declaration? If not, can you alter the schema to accommodate the longest string? If you cannot alter the schema, is truncating the string acceptable? If truncation is acceptable, specify the -truncString option. If you cannot alter the schema or truncate the string, the nzload command treats the record with the long string as an error record. The nzload command discards the record into the nzbad file and logs an error with the record and column numbers in the nzlog file.
    • See whether there are any special characters in the string fields, for example, CR, CRLF, or a character that matches the field delimiter, which violates the rule that the delimiter must not appear in field values. If there are special characters, can you regenerate the data file to add an escape character before them? If so, use the -escapeChar '\\' option to process the strings. If you cannot regenerate the data file, the load will contain incomplete and invalid records. For a sketch that combines several of these options, see the example after this list.
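
The following sketch shows how several of the preceding options might combine on one command line; the database, table, file name, and null representation are hypothetical:

# hypothetical load that combines the options discussed above
nzload -db dev -t loan -df /stage/loan.dat \
  -delim '|' -nullValue 'NA' -fillRecord -truncString -escapeChar '\'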

Consider the load source

If you are using pipes, are they from another local feed or from across a network? The preferred method is to read from a named pipe, rather than to read from stdin.
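
A minimal named-pipe sketch, assuming a gzip-compressed source file and hypothetical database, table, and path names:

mkfifo /tmp/loan_pipe                             # create the named pipe
gunzip -c /stage/loan.dat.gz > /tmp/loan_pipe &   # a local feed writes into the pipe
nzload -db dev -t loan -df /tmp/loan_pipe         # nzload reads from the pipe, not stdin
rm /tmp/loan_pipe                                 # remove the pipe when the load completes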

Is the file on an NFS-mounted directory? If so, remember that your load performance is constrained by the speed of the network.

Run the job

If you are running on a production system, make a copy of the table that you are loading into before you start the load. Making a copy within the system is fast and is preferable to restoring from a backup later. An example of the syntax for making a copy is as follows:
CREATE TABLE loan_backup AS SELECT * FROM loan;
Stage the data before you move it to a production system. Create a table, load it, validate it, and then use the ALTER TABLE command to swap the staged table into production. An example follows:
ALTER TABLE loan RENAME TO loan_lastmonth; 
ALTER TABLE loan_stage RENAME TO loan;

If you are running multiple nzload jobs to load into a table, use unique names for your nzbad files. The nzload command generates the default file name by using the <tablename>.<databasename> format and appending the extension .nzbad. For example, by default, loading into the data table of the dev database uses the data.dev.nzbad file name. Each instance of the nzload command overwrites the existing nzbad file. If you want to preserve the bad records that are stored in this file, use the -bf <file name> option to specify a different name for each nzload job. Keep in mind that if your default system case is uppercase, the system displays lowercase table names as uppercase in nzlog file names, for example, DATA.DEV.nzlog and DATA.DEV.nzbad.
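
For example, two concurrent nzload jobs that load into the same table might keep their bad records separate as follows; the file names are illustrative:

# each job names its own bad file so that neither job overwrites the other
nzload -db dev -t data -df /stage/data_part1.dat -bf data.dev.job1.nzbad &
nzload -db dev -t data -df /stage/data_part2.dat -bf data.dev.job2.nzbad &
wait   # block until both loads finish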

To monitor CPU resources, run the Linux® top command on the host. If resources are available, consider running more loads concurrently.

Troubleshoot

If you see the error message Too many data fields for table, use the Linux head -1 command against the data file to get the first row, which might contain the extracted names of the columns. Compare these names to the names in the DDL statement of the table that you created to see whether their physical positions match.
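
For example, with a hypothetical data file and table, and assuming that nzsql accepts backslash commands with the -c option as psql does:

head -1 /stage/loan.dat     # first row of the file, which might hold the extracted column names
nzsql -d dev -c '\d loan'   # column names and their order in the table definition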

If you see the error message Data type mismatch on column n, use the Linux cut -d'^' -f n inputfile | more command, where ^ is the field delimiter of the file and n is the column number from the message, to look at the individual data values in the source file. Compare these values to the DDL statement of the table that you created to see whether the data types of those values match the data types of the corresponding columns in the table.

Handle exceptions

Repeat the load by using the nzbad file. If there are many exceptions, fix the source and re-extract the data from the source system. If there are only a few exceptions, use a text editor to correct the data. To make large substitutions, use the Linux sed or awk command.
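
For example, a sed substitution over the bad-records file before reloading it; the file names and the substitution itself are illustrative:

# rewrite an unexpected null marker, then reload only the corrected records
sed 's|N/A|NA|g' data.dev.nzbad > data.dev.fixed
nzload -db dev -t data -df data.dev.fixed -nullValue 'NA'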

Validate the results

After the load is complete, validate the results by comparing them with the source system. Count the number of rows, and select the minimum, maximum, and sum of each numeric column in the table and the minimum and maximum of each date column in the table.
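
A validation query of this kind might look like the following, with hypothetical table and column names; run the equivalent query on the source system and compare the two result sets:

nzsql -d dev -c "
SELECT COUNT(*)       AS row_count,   -- compare the row counts first
       MIN(amount)    AS min_amount,  -- then each numeric column
       MAX(amount)    AS max_amount,
       SUM(amount)    AS sum_amount,
       MIN(loan_date) AS min_date,    -- and each date column
       MAX(loan_date) AS max_date
FROM loan;"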

Generate statistics

Remember to run the GENERATE STATISTICS command on your tables or databases after you load new data.
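
For example, to generate statistics for one table or for every table in the current database:

nzsql -d dev -c "GENERATE STATISTICS ON loan;"   # one table (hypothetical name)
nzsql -d dev -c "GENERATE STATISTICS;"           # all tables in the database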

Test performance

If your data is evenly distributed, you should see peak loading performance of at least 75% CPU utilization on the host; you can monitor utilization by running the Linux top command during the load. Lower CPU utilization means either that the data is skewed or that the parser is waiting for data. If the input data is skewed, that is, most records are being sent to only a few SPUs, those SPUs become the performance bottleneck (a query for checking the distribution appears after these steps). If CPU utilization is less than 75% and the data is evenly distributed, you might have a streaming problem. Take the following steps:
  • If the load is running from the local host, determine the source of the data. Look for other concurrent database activities such as activities that are SPU-to-SPU broadcast intensive or SPU disk I/O intensive.
  • If the data is not locally staged or is on a SAN or NFS mount, determine whether the bottleneck is the remote source of the data or the network.

    The performance of the system depends on the number of SPUs. If, however, data is being streamed across an external network, the performance is limited by the speed of the network.

    Test the network by using the FTP command to send a file between the source and the local host, and measure the transfer rate. Under optimal conditions, a Gig-E network transfers data at a rate of approximately 1000 Mb/second, which is approximately 125 MB/second, or approximately 450 GB/hour.
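
To check whether a load distributed rows evenly, one common approach is to count the rows on each data slice by using the datasliceid column that the system maintains for user tables; the database and table names here are hypothetical:

nzsql -d dev -c "
SELECT datasliceid, COUNT(*) AS rows_per_slice  -- rows on each data slice
FROM loan
GROUP BY datasliceid
ORDER BY rows_per_slice DESC;"

If a few data slices hold most of the rows, the distribution key is skewed and those SPUs are the bottleneck.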