Tips for using VARCHAR and VARGRAPHIC data types in databases

Variable-length column (VARCHAR or VARGRAPHIC) support allows you to define any number of columns in a table as variable length. If you use VARCHAR or VARGRAPHIC support, the size of a table can typically be reduced.

Data in a variable-length column is stored internally in two areas: a fixed-length or ALLOCATE area and an overflow area. If a default value is specified, the allocated length is at least as large as the value. The following points help you determine the best way to use your storage area.

When you define a table with variable-length data, you must decide the width of the ALLOCATE area. If the primary goal is:

  • Space saving: use ALLOCATE(0).
  • Performance: the ALLOCATE area must be wide enough to incorporate at least 90% to 95% of the values for the column.

It is possible to balance space savings and performance. In the following example of an electronic telephone book, the following data is used:

  • 8600 names that are identified by: last, first, and middle name
  • The Last, First, and Middle columns are variable length.
  • The shortest last name is two characters; the longest is 22 characters.

This example shows how space can be saved by using variable-length columns. The fixed-length column table uses the most space. The table with the carefully calculated allocate sizes uses less disk space. The table that was defined with no allocate size (with all the data stored in the overflow area) uses the least disk space.

Table 1. Disk space used with variable-length columns
Variety of Support Last Name Max/Alloc First Name Max/Alloc Middle Name Max/Alloc Total Physical File Size Number of Rows in Overflow Space
Fixed Length 22 22 22 567 K 0
Variable Length 40/10 40/10 40/7 408 K 73
Variable-Length Default 40/0 40/0 40/0 373 K 8600

In many applications, performance must be considered. If you use the default ALLOCATE(0), it doubles the disk unit traffic. ALLOCATE(0) requires two reads; one to read the fixed-length portion of the row and one to read the overflow space. A variable-length implementation with a carefully chosen ALLOCATE minimizes overflow and space and maximizes performance. The size of the table is 28% smaller than the fixed-length implementation. Because 1% of rows are in the overflow area, the access requiring two reads is minimized. The variable-length implementation performs about the same as the fixed-length implementation.

Start of changeVarying length columns with a length less than 30 may be implemented by the database using a non-zero allocate value. As a result, changing the allocated length for these columns may not impact the disk I/O performance efficiency. End of change

To create the table using the ALLOCATE keyword:

CREATE TABLE PHONEDIR
         (LAST    VARCHAR(40) ALLOCATE(10),
          FIRST   VARCHAR(40) ALLOCATE(10),
          MIDDLE  VARCHAR(40) ALLOCATE(7))

If you are using host variables to insert or update variable-length columns, use variable length host variables. Because blanks are not truncated from fixed-length host variables, using fixed-length host variables can cause more rows to spill into the overflow space. This increases the size of the table.

In this example, fixed-length host variables are used to insert a row into a table:

01  LAST-NAME PIC X(40).
   …
    MOVE "SMITH" TO LAST-NAME.
    EXEC SQL
      INSERT INTO PHONEDIR
       VALUES(:LAST-NAME, :FIRST-NAME, :MIDDLE-NAME, :PHONE)
    END-EXEC.

The host-variable LAST-NAME is not variable length. The string “SMITH”, followed by 35 blanks, is inserted into the VARCHAR column LAST. The value is longer than the allocate size of 10. 30 of 35 trailing blanks are in the overflow area.

In this example, variable-length host variables are used to insert a row into a table:

01  VLAST-NAME.
    49 LAST-NAME-LEN PIC S9(4) BINARY.
    49 LAST-NAME-DATA PIC X(40).
   …
    MOVE "SMITH" TO LAST-NAME-DATA.
    MOVE 5 TO LAST-NAME-LEN.
    EXEC SQL
      INSERT INTO PHONEDIR
     VALUES(:VLAST-NAME, :VFIRST-NAME, :VMIDDLE-NAME, :PHONE)
    END-EXEC.

The host variable VLAST-NAME is variable length. The actual length of the data is set to 5. The value is shorter than the allocated length. It can be placed in the fixed portion of the column.

Start of changeFor columns which have column statistics collected, information about the length of the actual data in the columns is available to help guide the setting of the allocated length. This information is available from the QSYS2.SYSCOLUMNSTAT view or from the QdbstListStatistics API. These interfaces will report the maximum and average lengths of the data, the number of rows for which the column’s data is stored in the overflow space, and the allocated length that would enable 90% of the values in the column to be stored in the fixed-length area. In many cases, the system will automatically collect and maintain these column statistics as part of the query optimization process. However, if statistics have not been automatically collected or are out of date, you can manually update them for the columns of interest: Manually collecting and refreshing statistics. Note that altering a column to change the allocated length requires an exclusive lock on the table and can take a significant amount of time.End of change

Running the Reorganize Physical File Member (RGZPFM) command against tables that contain variable-length columns can improve performance. The fragments in the overflow area that are not in use are compacted by the RGZPFM command. This technique reduces the read time for rows that overflow, increases the locality of reference, and produces optimal order for serial batch processing.

Choose the appropriate maximum length for variable-length columns. Selecting lengths that are too long increases the process access group (PAG). A large PAG slows performance. A large maximum length makes SEQONLY(*YES) less effective. Variable-length columns longer than 2000 bytes are not eligible as key columns.

Using LOBs and VARCHAR in the same table

Storage for LOB columns is allocated in the same manner as for VARCHAR columns. When a column stored in the overflow storage area is referenced, all the columns in that area are paged into memory. A reference to a "smaller" VARCHAR column that is in the overflow area can potentially force extra paging of LOB columns. For example, A VARCHAR(256) column retrieved by an application can have a side effect of paging in two 5 MB BLOB columns that are in the same row. In order to prevent this side effect, you might want to use the ALLOCATE keyword to ensure that only LOB columns are stored in the overflow area.