Character sets

A character set is an element of internationalization that maps the characters used in a particular language (its alphabet) to numeric values. A character set is made up of a series of code points; a code point is the numeric representation of a character. For example, the code point for the letter A in International EBCDIC is 0xC1. A character set can also be called a coded character set, a code set, a code page, or an encoding. Examples of character sets include International EBCDIC, Latin 1, and Unicode. A character set is chosen on the basis of the letters and symbols that an application requires.
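The code point example above can be checked with a short sketch. It uses Python's built-in codecs as a stand-in for the character sets, which is an assumption of this illustration and not part of the z/TPF system; `cp500` is Python's codec name for International EBCDIC (IBM500), and `code_point` is a hypothetical helper name.

```python
# Illustrative only: Python's built-in codecs stand in for the character sets.
# "cp500" is Python's codec name for International EBCDIC (IBM500).

def code_point(char: str, charset: str) -> int:
    """Return the single-byte code point of char in the named character set."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {charset}")
    return encoded[0]

ebcdic_a = code_point("A", "cp500")      # International EBCDIC
latin1_a = code_point("A", "iso8859-1")  # Latin 1
print(f"EBCDIC: 0x{ebcdic_a:02X}, Latin 1: 0x{latin1_a:02X}")
# prints "EBCDIC: 0xC1, Latin 1: 0x41"
```

The same letter has a different code point in each character set, which is why data must be translated when it moves between them.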

Character sets are referred to by a name or by an integer identifier called the coded character set identifier (CCSID). For example, Latin 1 might be called ISO-8859-1 or CCSID 819. The CCSID determines the character set name that is used with the iconv functions. A CCSID table associates the CCSID with the character set name. The entries in the CCSID table must conform to the standards outlined in the Character Data Representation Architecture Reference and Registry. See Add coded character set identifiers for more information about the CCSIDs, and The iconv functions for more information about the iconv functions.
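The relationship between a CCSID and a character set name can be sketched as a simple lookup. The table below is hypothetical: the actual z/TPF CCSID table format is not shown here, and the helper names `charset_name` and `convert_by_ccsid` are invented for illustration. The CCSID values themselves (37, 500, 819, 1208) are standard IBM assignments, and Python's codecs stand in for the iconv functions.

```python
import codecs

# Hypothetical CCSID table for illustration only; this is not the z/TPF
# table format. Each entry associates a CCSID with a character set name.
CCSID_TABLE = {
    37: "IBM037",
    500: "IBM500",
    819: "ISO8859-1",
    1208: "UTF-8",
}

def charset_name(ccsid: int) -> str:
    """Look up the character set name that is associated with a CCSID."""
    try:
        return CCSID_TABLE[ccsid]
    except KeyError:
        raise LookupError(f"CCSID {ccsid} is not in the table") from None

def convert_by_ccsid(data: bytes, from_ccsid: int, to_ccsid: int) -> bytes:
    """Translate data by resolving both CCSIDs to names, then converting."""
    text = data.decode(charset_name(from_ccsid))
    return text.encode(charset_name(to_ccsid))

# The letter A: International EBCDIC (CCSID 500) to Latin 1 (CCSID 819).
print(convert_by_ccsid(b"\xc1", 500, 819))  # prints b'A'

# Different names for the same character set resolve to one codec.
print(codecs.lookup("IBM819").name == codecs.lookup("ISO8859-1").name)  # prints True
```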

The default character set definitions on the z/TPF system support all aliases for the character sets that the GNU C library (glibc) supports. However, not all glibc translations are included on the z/TPF system.

The z/TPF system supports multibyte character sets that use shift-out and shift-in control characters (shift-out leaves the regular character set mode; shift-in returns to it). Wide characters are 4 bytes long and are encoded in the UCS-4 character set, a Unicode-based character set whose first 128 code points match ASCII. For more information about multibyte character sets and wide characters, see the GNU website.
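The shift-out and shift-in mechanism can be pictured with a minimal sketch that splits a mixed byte stream into regular and shifted runs. It assumes the conventional control codes X'0E' (shift-out) and X'0F' (shift-in); the function names are hypothetical, the sample double-byte values are placeholders, and the scan does not validate the characters inside a run. The second helper shows a 4-byte wide character, assuming big-endian byte order.

```python
from typing import List, Tuple

SO = 0x0E  # shift-out: leave the regular character set mode
SI = 0x0F  # shift-in: return to the regular character set mode

def split_shift_runs(data: bytes) -> List[Tuple[str, bytes]]:
    """Split a byte stream into ("regular", run) and ("shifted", run) pieces."""
    runs = []
    mode = "regular"
    start = 0
    for i, byte in enumerate(data):
        if byte == SO and mode == "regular":
            if i > start:
                runs.append((mode, data[start:i]))
            mode, start = "shifted", i + 1
        elif byte == SI and mode == "shifted":
            if i > start:
                runs.append((mode, data[start:i]))
            mode, start = "regular", i + 1
    if start < len(data):
        runs.append((mode, data[start:]))
    return runs

def to_wide(text: str) -> bytes:
    """Encode text as 4-byte wide characters (big-endian is assumed here)."""
    return text.encode("utf-32-be")

# Two regular bytes, four placeholder bytes in shifted mode, one regular byte.
mixed = b"\xc1\xc2" + bytes([SO]) + b"\x44\x81\x44\x82" + bytes([SI]) + b"\xc3"
for mode, run in split_shift_runs(mixed):
    print(mode, run.hex())

print(to_wide("A"))  # prints b'\x00\x00\x00A'
```

In the wide-character form of the letter A, the low-order byte is the ASCII value X'41' and the other three bytes are zero.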

The z/TPF system uses the iconv functions to support translations among the character sets listed in Table 1. These character sets are supported by glibc and are single-byte character sets (SBCSs) unless otherwise noted.

Table 1. z/TPF-supported character sets
Character set	Description	Encoding
ANSI_X3.4-1968	Standard 7-bit ASCII	ASCII (X'00'-X'7F')
CP1250	MS Windows Latin 2	ASCII
CP1252	MS Windows Latin 1	ASCII
EUC-JP	Japanese characters	ASCII
GB18030	Chinese multibyte	ASCII
IBM037	US/Canada Latin 1	EBCDIC
IBM290	Japanese Katakana	EBCDIC
IBM500	Multinational	EBCDIC
IBM819	Alias for ISO8859-1	ASCII
IBM850	Latin 1 PC Data	ASCII
IBM875	Greek	EBCDIC
IBM924	IBM500/IBM1047 with euro	EBCDIC
IBM930	Japanese Katakana/Kanji multibyte character set	EBCDIC
IBM932	Japanese PC Data	ASCII
IBM939	Japanese Latin/Kanji multibyte character set	EBCDIC
IBM1026	Turkey Latin 5	EBCDIC
IBM1047	Open Systems Latin 1	EBCDIC
IBM1140	Latin 1; IBM037 with euro for US	EBCDIC
IBM1141	Latin 1; IBM273 with euro for Austria/Germany	EBCDIC
IBM1142	Latin 1; IBM277 with euro for Denmark/Norway	EBCDIC
IBM1143	Latin 1; IBM278 with euro for Finland/Sweden	EBCDIC
IBM1144	Latin 1; IBM280 with euro for Italy	EBCDIC
IBM1145	Latin 1; IBM284 with euro for Spain	EBCDIC
IBM1146	Latin 1; IBM285 with euro for UK	EBCDIC
IBM1147	Latin 1; IBM297 with euro for France	EBCDIC
IBM1148	Latin 1; IBM500 with euro for Belgium/Canada/Switzerland (Multinational)	EBCDIC
IBM1149	Latin 1; IBM871 with euro for Iceland	EBCDIC
ISO8859-1	Latin 1, Standard 8-bit	ASCII
ISO8859-2	Latin 2	ASCII
ISO8859-3	Latin 3	ASCII
ISO8859-4	Latin 4	ASCII
ISO8859-9	Latin 5, Turkey/Western Europe	ASCII
ISO8859-10	Latin 6, Baltic/Scandinavian	ASCII
ISO8859-15	Latin 9, ISO8859-1 with euro	ASCII
UCS-2	2-byte normalized Unicode	Unicode
UCS-4	4-byte normalized Unicode	Unicode
UTF-8	Multibyte Unicode (a range of 1-6 bytes per character)	Unicode
UTF-16	Multibyte Unicode (2 or 4 bytes per character)	Unicode
UTF-32	Fixed-width Unicode (4 bytes per character)	Unicode
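As a rough illustration of translating between character sets such as those in Table 1, the following sketch converts data by way of Unicode text and reports how many bytes each character occupies in a given encoding. It relies on Python's built-in codecs and Python's codec names (`cp037`, `cp1140`, `iso8859-15`), not on the z/TPF iconv functions, and the helper names `transcode` and `bytes_per_character` are hypothetical.

```python
from typing import List

def transcode(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Translate bytes from one character set to another through Unicode."""
    return data.decode(from_charset).encode(to_charset)

def bytes_per_character(text: str, charset: str) -> List[int]:
    """Return the encoded length, in bytes, of each character in text."""
    return [len(ch.encode(charset)) for ch in text]

# "Hello" in IBM037 (EBCDIC) translated to ISO8859-1 (ASCII based).
hello_ebcdic = b"\xc8\x85\x93\x93\x96"
print(transcode(hello_ebcdic, "cp037", "iso8859-1"))  # prints b'Hello'

# The euro sign: X'9F' in IBM1140 becomes X'A4' in ISO8859-15.
print(transcode(b"\x9f", "cp1140", "iso8859-15"))  # prints b'\xa4'

# Letter A, euro sign, and a character outside the Basic Multilingual Plane.
sample = "A\u20ac\U0001F600"
print(bytes_per_character(sample, "utf-8"))      # prints [1, 3, 4]
print(bytes_per_character(sample, "utf-16-be"))  # prints [2, 2, 4]
print(bytes_per_character(sample, "utf-32-be"))  # prints [4, 4, 4]
```

The big-endian codec variants are used so that no byte order mark is added to the encoded length of each character.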