Defining UTF-8 data items
Enterprise COBOL provides native support for defining, moving, and comparing UTF-8 data items.
- Fixed character-length UTF-8 data items
This type of UTF-8 data item is defined when the PICTURE clause contains one or more 'U' characters, or a single 'U' character followed by a repetition factor, and neither the BYTE-LENGTH phrase of the PICTURE clause nor the DYNAMIC LENGTH clause is specified.
Each 'U' character in the PICTURE clause corresponds to one UTF-8 character, which in Enterprise COBOL is treated as the equivalent of a single Unicode code point. The UTF-8 encoding of a character varies in length and is always between one and four bytes.
The following code fragment illustrates two different fixed character-length UTF-8 data item definitions:
    01 u1 pic u(10). *> fixed character-length UTF-8 data item holding 10 UTF-8 characters (40 bytes reserved)
    01 u2 pic uuuu.  *> fixed character-length UTF-8 data item holding 4 UTF-8 characters (16 bytes reserved)
For fixed character-length UTF-8 data items, the number of bytes reserved for the data item in memory is 4 × n, where n is the number of characters specified in the definition of the item. Note that, due to the varying length nature of the UTF-8 encoding, even after moving n characters to a UTF-8 data item of length n, it is not necessarily the case that all 4 × n reserved bytes are needed to hold the data. It depends on the size of each character in the data.
During moves, the fixed character-length UTF-8 data items are always padded with UTF-8 blanks (x'20') to the maximum byte-length of the data item. When truncation is performed on the fixed character-length UTF-8 data item, it is done on a character boundary.
Whenever a fixed character-length UTF-8 data item is used as a sender, the byte-length of the item is computed at run time based on its known fixed-character length so that the number of characters used in the operation is the same as the number of characters indicated in the item definition.
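The truncation and padding rules above can be sketched in a small program. This is a minimal illustration, not compiler-verified output: the program name and data names are hypothetical, and it assumes a compiler level that supports UTF-8 data items and that a shorter comparison operand is treated as padded with UTF-8 blanks.

```cobol
       identification division.
       program-id. u8fixed.
       data division.
       working-storage section.
       01 u-src    pic u(5) value u'abcde'.
       01 u-short  pic u(3).   *> 3 characters, 12 bytes reserved
       01 u-two    pic u(2) value u'ab'.
       01 u-long   pic u(4).   *> 4 characters, 16 bytes reserved
       procedure division.
      *> Five characters into a three-character item: the move
      *> truncates on a character boundary.
           move u-src to u-short
      *> Two characters into a four-character item: the rest of
      *> the receiver is padded with UTF-8 blanks (x'20').
           move u-two to u-long
           if u-short = u'abc'
               display 'u-short kept the first three characters'
           end-if
           if u-long = u'ab'
               display 'u-long matches once blank padding applies'
           end-if
           goback.
```

Both receivers reserve 4 × n bytes, but with single-byte characters like these only part of that storage carries data; the remainder is blank padding.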
- Fixed byte-length UTF-8 data items
This type of UTF-8 data item is defined when the PICTURE clause contains a single 'U' character and the BYTE-LENGTH phrase of the PICTURE clause is specified. This phrase indicates that the data item is a fixed byte-length UTF-8 item consisting of exactly n valid UTF-8 bytes, where n is the value specified in the BYTE-LENGTH phrase. Note that, due to the varying length nature of the UTF-8 encoding, the number of characters in the data item at any time is variable and depends on the size of each character, but should always be in the range [ceil(n/4), n], where ceil is the mathematical function that returns the least integer greater than or equal to its argument. For example, with n = 10 the item holds between 3 and 10 characters.
The following code fragment illustrates a fixed byte-length UTF-8 data item definition:
    01 u1 pic u byte-length 10. *> fixed byte-length UTF-8 data item
When truncation is needed for a fixed byte-length UTF-8 data item, it is done at a character boundary, and the data item is always padded out with UTF-8 spaces (x'20') to the specified byte length of n.
Whenever a fixed byte-length UTF-8 data item is used as a sender, the byte-length of the item is always taken to be n.
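As a sketch of how character-boundary truncation interacts with a fixed byte length, the following program moves two 3-byte characters into a 4-byte item. The program name and data names are hypothetical. The hexadecimal UTF-8 literal notation (ux'...') is an assumption not shown in this section, as is the treatment of the shorter comparison operand as blank padded.

```cobol
       identification division.
       program-id. u8bytes.
       data division.
       working-storage section.
      *> Two euro signs (U+20AC); each is x'E282AC' in UTF-8,
      *> so the sender supplies 6 bytes of data.
       01 u-src  pic u(2) value ux'E282ACE282AC'.
       01 u-key  pic u byte-length 4.
       procedure division.
           move u-src to u-key
      *> Only one whole euro sign fits in 4 bytes. Per the rules
      *> above, the fourth byte is filled with a UTF-8 space
      *> instead of the first byte of the second character.
           if u-key = ux'E282AC'
               display 'u-key holds one euro sign plus padding'
           end-if
           goback.
```

Here the 4-byte item ends up holding two characters (the euro sign and one space), which lies within the range [ceil(4/4), 4] = [1, 4].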
Fixed byte-length UTF-8 data items are provided for compatibility with the Db2®, DFSORT, and MQ products, all of which provide support for fixed byte-length UTF-8 data strings. For example, to use a COBOL UTF-8 data item as a Db2 host variable that corresponds to a CHAR(n) column in a Unicode table, the data item must be defined with the BYTE-LENGTH phrase of the PICTURE clause. Similarly, DFSORT only supports UTF-8 sort key parts that have a fixed byte-length; therefore, fixed byte-length UTF-8 data items in COBOL are strongly recommended for sort and merge keys.
- Dynamic-length UTF-8 data items
UTF-8 data items can be declared with the DYNAMIC LENGTH clause, which is a natural fit for UTF-8 data items since they vary in byte length. In this case, there is no restriction on the number of characters in the data item, and the number of bytes is limited only when the LIMIT phrase of the DYNAMIC LENGTH clause is specified.
When a UTF-8 data item defined with the DYNAMIC LENGTH clause requires truncation due to the LIMIT phrase, truncation is done at a character boundary.
Whenever a dynamic-length UTF-8 data item is used as a sender, the byte-length of the item is always the current, runtime byte-length of the item.
For example:
    01 u1 pic u dynamic length limit 10. *> dynamic-length UTF-8 data item
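The effect of the LIMIT phrase can be sketched as follows. The program name and data names are hypothetical. The ux'...' hexadecimal UTF-8 literal notation is an assumption not shown in this section, and the comparison result is inferred from the truncation rule described above rather than from a verified run.

```cobol
       identification division.
       program-id. u8dyn.
       data division.
       working-storage section.
      *> Four euro signs (U+20AC), 3 bytes each: 12 bytes of data.
       01 u-src  pic u(4) value ux'E282ACE282ACE282ACE282AC'.
       01 u-dyn  pic u dynamic length limit 10.
       procedure division.
           move u-src to u-dyn
      *> 12 bytes exceed the 10-byte limit. Truncating at a
      *> character boundary should leave three euro signs
      *> (9 bytes), not three and a fragment of the fourth.
           if u-dyn = ux'E282ACE282ACE282AC'
               display 'u-dyn truncated to three whole characters'
           end-if
           goback.
```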
- UTF-8 data items can be elementary data items.
- UTF-8 data items can appear in groups, including file records, except where the data item is dynamic length, which is not supported in file records.
- Group items can be UTF-8 items via the UTF-8 phrase of the GROUP-USAGE clause.
Note: Groups defined with the GROUP-USAGE UTF-8 clause can contain only UTF-8 items defined with the BYTE-LENGTH phrase of the PICTURE clause. No other classes of data items are permitted.
- Condition variables associated with condition names (level 88 items) can be UTF-8.
- The VALUE clause for UTF-8 data items accepts alphanumeric, national, and UTF-8 literals. In the alphanumeric case, the literal is automatically converted from EBCDIC to UTF-8.
- UTF-8 support does not include support for the following types of data items:
  - UTF-8 edited, UTF-8 numeric-edited
  - UTF-8 decimal
  - UTF-8 external float
Note: The USAGE UTF-8 clause can only appear in data definitions for data items declared with the 'U' PICTURE symbol; that is, numeric items cannot be defined with USAGE UTF-8.
The following code fragments illustrate valid and invalid UTF-8 data item definitions:
Examples of valid UTF-8 data item definitions:
    01 u1 pic u(10).                     *> fixed character-length item, usage utf-8 implied
    01 u2 pic u(10) usage utf-8.         *> fixed character-length item, usage utf-8 specified explicitly
    77 u3 pic uuu usage utf-8.           *> fixed character-length item
    77 u4 pic u dynamic length usage utf-8. *> dynamic length item
    01 u5 pic u(5) value u'abcde'.       *> fixed character-length item, VALUE clause applied
    01 u6 pic u byte-length 5.           *> fixed byte-length
Examples of invalid UTF-8 data item definitions:
    01 u1 pic uuu,uuu,uuu usage utf-8.   *> utf-8 edited not supported
    01 u2 pic 9(9) usage utf-8.          *> utf-8 decimal not supported
    77 u3 pic 999,999 usage utf-8.       *> utf-8 numeric edited not supported
    77 u4 pic 999e+99 usage utf-8.       *> utf-8 external float not supported
    01 u5 pic u(10) byte-length 20.      *> cannot include repetition factor and byte-length phrase
    01 u6 pic uuu byte-length 5.         *> cannot combine multiple 'U' symbols with the byte-length phrase
    01 u7 pic u byte-length 5 dynamic length. *> only one of byte-length and dynamic length allowed
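The group, condition-name, and VALUE clause rules listed earlier can be combined in one record layout. The following is a sketch only: the program name, record name, and field names are hypothetical, and the placement of VALUE clauses and a level 88 entry inside a GROUP-USAGE UTF-8 group is an assumption based on the rules above.

```cobol
       identification division.
       program-id. u8group.
       data division.
       working-storage section.
      *> Every item in a GROUP-USAGE UTF-8 group is defined with
      *> the BYTE-LENGTH phrase, as the note above requires.
       01 cust-rec group-usage utf-8.
          05 cust-id    pic u byte-length 8.
      *> Alphanumeric literal: converted from EBCDIC to UTF-8.
          05 cust-flag  pic u byte-length 1 value 'N'.
             88 cust-active value u'Y'.
      *> UTF-8 literal: shorter than the item, so blank padded.
          05 cust-name  pic u byte-length 40 value u'unknown'.
       procedure division.
           move u'A0000001' to cust-id
           set cust-active to true
           if cust-active
               display 'cust-flag set through the condition-name'
           end-if
           goback.
```

A layout like this keeps every field at a known byte offset, which is the property the fixed byte-length form provides for interoperating with products that expect fixed byte-length UTF-8 strings.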