Defining UTF-8 data items

Enterprise COBOL provides native support for defining, moving, and comparing UTF-8 data items.

Three different types of UTF-8 data items are supported in Enterprise COBOL for z/OS®:

Fixed character-length UTF-8 data items
This type of UTF-8 data item is defined when the PICTURE clause contains one or more 'U' characters, or a single 'U' character followed by a repetition factor, and neither the BYTE-LENGTH phrase of the PICTURE clause nor the DYNAMIC LENGTH clause is specified.

Each 'U' character in the PICTURE clause corresponds to one UTF-8 character, which in Enterprise COBOL is treated as the equivalent of a single Unicode code point. The UTF-8 encoding of a character varies in length and it is always between one and four bytes.
The following code fragment illustrates two different fixed character-length UTF-8 data item definitions:
```
01 u1 pic u(10).  *> fixed character-length UTF-8 data item holding 10 UTF-8 characters (40 bytes reserved)
01 u2 pic uuuu.   *> fixed character-length UTF-8 data item holding 4 UTF-8 characters (16 bytes reserved)
```
For fixed character-length UTF-8 data items, the number of bytes reserved for the data item in memory is 4 × n, where n is the number of characters specified in the definition of the item. Note that, due to the varying length nature of the UTF-8 encoding, even after moving n characters to a UTF-8 data item of length n, it is not necessarily the case that all 4 × n reserved bytes are needed to hold the data. It depends on the size of each character in the data.

During moves, the fixed character-length UTF-8 data items are always padded with UTF-8 blanks (x'20') to the maximum byte-length of the data item. When truncation is performed on the fixed character-length UTF-8 data item, it is done on a character boundary.

Whenever a fixed character-length UTF-8 data item is used as a sender, the byte-length of the item is computed at run time based on its known fixed-character length so that the number of characters used in the operation is the same as the number of characters indicated in the item definition.
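
The padding and truncation rules above can be sketched in a small program. This is a minimal, uncompiled sketch: it assumes a compiler release that supports the 'U' picture symbol and fixed-format source, and the program name and data names (`utf8fix`, `u-src`, `u-short`, `u-long`) are hypothetical.

```cobol
       identification division.
       program-id. utf8fix.
       data division.
       working-storage section.
       01 u-src   pic u(6) value u'abcdef'.
       01 u-short pic u(3).    *> 3 characters, 12 bytes reserved
       01 u-long  pic u(10).   *> 10 characters, 40 bytes reserved
       procedure division.
      *> Receiver longer than sender: padded with UTF-8 blanks.
           move u-src to u-long
      *> Receiver shorter than sender: truncated on a character
      *> boundary, so only whole characters are kept.
           move u-src to u-short
           if u-short = u'abc'
               display 'u-short holds the first three characters'
           end-if
           goback.
```

Because every character in this sample is one byte long, only 6 of the 40 bytes reserved for `u-long` carry data from `u-src`; the rest is padding.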
Fixed byte-length UTF-8 data items
This type of UTF-8 data item is defined when the PICTURE clause contains a single 'U' character and the BYTE-LENGTH phrase of the PICTURE clause is specified. The phrase BYTE-LENGTH n indicates that the data item is a fixed byte-length UTF-8 item consisting of exactly n bytes of valid UTF-8 data. Note that, due to the varying length nature of the UTF-8 encoding, the number of characters in the data item at any time is variable and depends on the size of each character, but should always be in the range [ceil(n/4), n], where ceil is the mathematical function that returns the least integer greater than or equal to its argument. For example, an item defined with BYTE-LENGTH 10 holds between 3 and 10 characters.
The following code fragment illustrates a fixed byte-length UTF-8 data item definition:
```
01 u1 pic u byte-length 10.  *> fixed byte-length UTF-8 data item
```
When truncation is needed for a fixed byte-length UTF-8 data item, it is done at a character boundary, and the data item is always padded with UTF-8 spaces (x'20') to the specified byte length of n.

Whenever a fixed byte-length UTF-8 data item is used as a sender, the byte-length of the item is always taken to be n.
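
The character-boundary rule matters most when the data contains multi-byte characters. The sketch below is uncompiled and makes two assumptions that are not stated in this topic: that the compiler accepts the hexadecimal UTF-8 literal notation `ux'...'`, and that fixed-format source is used. The program name and data names (`utf8byt`, `u-src`, `u-bytes`) are hypothetical.

```cobol
       identification division.
       program-id. utf8byt.
       data division.
       working-storage section.
      *> Three 2-byte characters (U+00E9), 6 bytes of data.
      *> ASSUMPTION: ux'...' hexadecimal UTF-8 literals are accepted.
       01 u-src   pic u(3) value ux'C3A9C3A9C3A9'.
       01 u-bytes pic u byte-length 5.
       procedure division.
      *> Only two whole characters (4 bytes) fit in 5 bytes; the
      *> remaining byte is expected to be a UTF-8 space (x'20').
           move u-src to u-bytes
           if u-bytes = ux'C3A9C3A920'
               display 'truncated on a character boundary'
           end-if
           goback.
```

A third character would need bytes 5 and 6, so it is dropped whole rather than split, and the item is still exactly 5 bytes long.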

Fixed byte-length UTF-8 data items are provided for compatibility with the Db2®, DFSORT, and MQ products, all of which provide support for fixed byte-length UTF-8 data strings. For example, to use a COBOL UTF-8 data item as a Db2 host variable that corresponds to a CHAR(n) column in a Unicode table, the data item must contain the BYTE-LENGTH phrase of the PICTURE clause. Similarly, DFSORT supports only UTF-8 sort key parts that have a fixed byte length, so fixed byte-length UTF-8 data items in COBOL are strongly recommended for sort and merge keys.
Dynamic-length UTF-8 data items
UTF-8 data items can be declared with the DYNAMIC LENGTH clause, which is a natural fit for UTF-8 data items since they vary in byte length. In this case, there is no restriction on the number of characters in the data item and the number of bytes is only limited when there is a LIMIT phrase of the DYNAMIC LENGTH clause.

When a UTF-8 data item defined with the DYNAMIC LENGTH clause requires truncation due to the LIMIT phrase, truncation is performed at a character boundary.

Whenever a dynamic-length UTF-8 data item is used as a sender, the byte-length of the item is always the current, runtime byte-length of the item.
For example:
```
01 u1 pic u dynamic length limit 10.  *> dynamic-length UTF-8 data item
```
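
A slightly fuller sketch shows the LIMIT phrase taking effect. It is uncompiled, assumes fixed-format source and a compiler release that supports the 'U' picture symbol, and uses hypothetical names (`utf8dyn`, `u-src`, `u-dyn`).

```cobol
       identification division.
       program-id. utf8dyn.
       data division.
       working-storage section.
       01 u-src pic u(12) value u'hello, world'.
       01 u-dyn pic u dynamic length limit 10.
       procedure division.
      *> Short value: well under the 10-byte limit.
           move u'hi' to u-dyn
      *> 12 single-byte characters exceed the limit, so the value
      *> is truncated at a character boundary.
           move u-src to u-dyn
           if u-dyn = u'hello, wor'
               display 'truncated to the LIMIT'
           end-if
           goback.
```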

The following rules apply to data items declared with the 'U' pic symbol:

UTF-8 data items can be elementary data items.
UTF-8 data items can appear in groups, including file records, except where the data item is dynamic length, which is not supported in file records.
Group items can be UTF-8 items via the UTF-8 phrase of the GROUP-USAGE clause.
Note: The groups defined with the GROUP-USAGE UTF-8 clause can only contain UTF-8 items defined with the BYTE-LENGTH phrase of the PICTURE clause. No other classes of data items are permitted.
Condition variables associated with condition names (level 88 items) can be UTF-8.
The VALUE clause for UTF-8 data items accepts alphanumeric, national and UTF-8 literals. In the alphanumeric case, the literal is automatically converted from EBCDIC to UTF-8.
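
The rules above can be combined in one sketch: an elementary item initialized from an alphanumeric literal, a UTF-8 group that contains only BYTE-LENGTH items, and condition names on a UTF-8 item. The program is uncompiled, assumes fixed-format source, and all names in it (`utf8grp`, `u-title`, `u-rec`, `u-name`, `u-city`, `u-flag`, `flag-yes`, `flag-no`) are hypothetical.

```cobol
       identification division.
       program-id. utf8grp.
       data division.
       working-storage section.
      *> Elementary item; the alphanumeric literal is converted
      *> from EBCDIC to UTF-8.
       01 u-title pic u(8) value 'Customer'.
      *> UTF-8 group: only BYTE-LENGTH items are allowed inside.
       01 u-rec group-usage utf-8.
          05 u-name pic u byte-length 20.
          05 u-city pic u byte-length 12.
      *> Condition names on a UTF-8 conditional variable.
       01 u-flag pic u(1).
          88 flag-yes value u'Y'.
          88 flag-no  value u'N'.
       procedure division.
           move u'Ada' to u-name
           move u'Paris' to u-city
           set flag-yes to true
           if flag-yes
               display 'flag is set'
           end-if
           goback.
```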

UTF-8 support does not include support for the following types of data items:

UTF-8 edited, UTF-8 numeric-edited
UTF-8 decimal
UTF-8 external float

Note: The USAGE UTF-8 clause can only appear in data definitions for data items declared with the 'U' pic symbol, i.e., numeric items cannot be defined with USAGE UTF-8.

The following code fragment illustrates valid and invalid UTF-8 data item definitions:

Examples of valid UTF-8 data item definitions:

01 u1 pic u(10).                              *> fixed character-length item, usage utf-8 implied
01 u2 pic u(10) usage utf-8.                  *> fixed character-length item, usage utf-8 specified explicitly
77 u3 pic uuu usage utf-8.                    *> fixed character-length item
77 u4 pic u dynamic length usage utf-8.       *> dynamic length item 
01 u5 pic u(5) value u'abcde'.                *> fixed character-length item, VALUE clause applied
01 u6 pic u byte-length 5.                    *> fixed byte-length

Examples of invalid UTF-8 data item definitions:

01 u1 pic uuu,uuu,uuu usage utf-8.            *> utf-8 edited not supported
01 u2 pic 9(9) usage utf-8.                   *> utf-8 decimal not supported 
77 u3 pic 999,999 usage utf-8.                *> utf-8 numeric edited not supported 
77 u4 pic 999e+99 usage utf-8.                *> utf-8 external float not supported 
01 u5 pic u(10) byte-length 20.               *> cannot include repetition factor and byte-length phrase
01 u6 pic uuu byte-length 5.                  *> cannot include multiple 'U' symbols and byte-length phrase
01 u7 pic u byte-length 5 dynamic length.     *> only one of byte-length and dynamic length allowed