Unicode and the encoding of language characters

Enterprise COBOL provides basic runtime support for Unicode, which can handle tens of thousands of characters that cover all commonly used characters and symbols in the world.

A character set is a defined set of characters that is not associated with any coded representation. A coded character set (also referred to in this documentation as a code page) is a set of unambiguous rules that relate the characters of the set to their coded representation. Each code page has a name and works like a table that assigns each character of the set a unique bit pattern, or code point. Each code page also has a coded character set identifier (CCSID), which is a 16-bit value from 1 to 65,535.

Unicode has several encoding schemes, called Unicode Transformation Format (UTF), such as UTF-8, UTF-16, and UTF-32. Enterprise COBOL uses UTF-16 (CCSID 1200) in big-endian format as the representation for national literals and data items that have USAGE NATIONAL.

UTF-8 represents ASCII invariant characters a-z, A-Z, 0-9, and certain special characters such as ' @ , . + - = / * ( ) the same way that they are represented in ASCII. UTF-16 represents these characters as NX'00nn', where X'nn' is the representation of the character in ASCII.

For example, the string 'ABC' is represented in UTF-16 as NX'004100420043'. In UTF-8, 'ABC' is represented as X'414243'.
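
The byte patterns in this example can be reproduced outside COBOL. The following is a minimal sketch in Python, used here only for illustration; the variable names are hypothetical, and the codec name "utf-16-be" is Python's label for big-endian UTF-16 without a byte order mark.

```python
# Encode the sample string in both Unicode encoding forms.
text = "ABC"

utf16_be = text.encode("utf-16-be")  # big-endian UTF-16, no byte order mark
utf8 = text.encode("utf-8")

print(utf16_be.hex().upper())  # 004100420043
print(utf8.hex().upper())      # 414243

# For ASCII invariant characters, each UTF-16 unit is X'00' followed by
# the same byte that UTF-8 (and ASCII) uses for that character.
for i, ascii_byte in enumerate(utf8):
    assert utf16_be[2 * i] == 0x00
    assert utf16_be[2 * i + 1] == ascii_byte
```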

One or more encoding units are used to represent a character from a coded character set. For UTF-16, an encoding unit takes 2 bytes of storage. Any character defined in any EBCDIC, ASCII, or EUC code page is represented in one UTF-16 encoding unit when the character is converted to the national data representation.
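
To make the notion of an encoding unit concrete, the sketch below counts 2-byte units after converting EBCDIC data to UTF-16. It is an illustration in Python, not the conversion mechanism that Enterprise COBOL itself uses. The helper name utf16_units is hypothetical, and "cp037" is Python's codec name for EBCDIC code page 037, chosen here as an assumed example code page.

```python
def utf16_units(text: str) -> int:
    """Return the number of 2-byte UTF-16 encoding units needed for text."""
    return len(text.encode("utf-16-be")) // 2


# X'C1C2C3' is 'ABC' in EBCDIC code page 037.
ebcdic_bytes = bytes.fromhex("C1C2C3")
decoded = ebcdic_bytes.decode("cp037")

# Convert to big-endian UTF-16: each EBCDIC character becomes one unit.
national = decoded.encode("utf-16-be")
print(national.hex().upper())  # 004100420043
print(utf16_units(decoded))    # 3

# A character outside the EBCDIC, ASCII, and EUC repertoires can need
# more than one unit. U+1D11E (a musical symbol) takes two.
print(utf16_units("\U0001D11E"))  # 2
```

Three EBCDIC bytes become three encoding units, or 6 bytes of storage, which is why a national data item needs two bytes for each character position.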

Cross-platform considerations: Enterprise COBOL and COBOL for AIX® support UTF-16 in big-endian format in national data. COBOL for Windows supports UTF-16 in little-endian format (UTF-16LE) in national data. If you are porting Unicode data that is encoded in UTF-16LE representation to Enterprise COBOL from another platform, you must convert that data to UTF-16 in big-endian format to process the data as national data.
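
The byte-order conversion described above could be done before the data reaches Enterprise COBOL. The following is a minimal sketch in Python, assuming the input is well-formed UTF-16LE; the function name utf16le_to_utf16be is hypothetical and is not part of any COBOL product.

```python
def utf16le_to_utf16be(data: bytes) -> bytes:
    """Re-encode UTF-16LE bytes as big-endian UTF-16 bytes."""
    # Decoding validates the input; encoding writes each unit high byte first.
    return data.decode("utf-16-le").encode("utf-16-be")


little_endian = bytes.fromhex("410042004300")  # 'ABC' as UTF-16LE
big_endian = utf16le_to_utf16be(little_endian)
print(big_endian.hex().upper())  # 004100420043
```

Decoding and re-encoding, rather than swapping bytes blindly, has the side benefit of raising an error on truncated or malformed input instead of silently passing it through.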

Related references
Storage of character data
Character sets and code pages (Enterprise COBOL for z/OS® Language Reference)