Character sets
A character set is an element of internationalization that maps an alphabet, that is, the characters that are used in a particular language, to numeric values. A character set is made up of a series of code points; a code point is the numeric representation of a character. For example, the code point for the letter A in International EBCDIC is 0xC1. A character set can also be called a coded character set, a code set, a code page, or an encoding. Examples of character sets include International EBCDIC, Latin 1, and Unicode. A character set is chosen on the basis of the letters and symbols that an application requires.
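As a small illustration of code points, the following Python sketch looks up the letter A in two character sets. It uses Python's built-in codecs rather than anything on the z/TPF system, and it assumes that the Python codec `cp500` corresponds to International EBCDIC (IBM500); the helper name `code_point` is hypothetical.

```python
def code_point(char: str, codec: str) -> int:
    """Return the single-byte code point of `char` in the given Python codec."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single-byte character in {codec}")
    return encoded[0]


# The same letter has a different code point in each character set.
ebcdic_a = code_point("A", "cp500")    # assumed stand-in for International EBCDIC
latin1_a = code_point("A", "latin-1")  # Latin 1 (ISO8859-1)

print(f"EBCDIC: 0x{ebcdic_a:02X}, Latin 1: 0x{latin1_a:02X}")
# prints "EBCDIC: 0xC1, Latin 1: 0x41"
```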
Character sets are referred to by a name or by an integer identifier called the coded character set identifier (CCSID). For example, Latin 1 might be called ISO-8859-1 or CCSID 819. The CCSID determines the character set name that is used with the iconv functions. A CCSID table associates the CCSID with the character set name. The entries in the CCSID table must conform to the standards outlined in the Character Data Representation Architecture Reference and Registry. See Add coded character set identifiers for more information about the CCSIDs, and The iconv functions for more information about the iconv functions.
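The relationship between a CCSID and a character set name can be pictured as a simple lookup. The sketch below is illustrative only: the dictionary is not the format of the z/TPF CCSID table, the four entries are a hand-picked subset, and `charset_name`, `encode_for_ccsid`, and `PYTHON_CODECS` are hypothetical names. Python codecs stand in for the iconv functions.

```python
# Illustrative subset of a CCSID table: CCSID -> character set name.
CCSID_TABLE = {
    37: "IBM037",
    500: "IBM500",
    819: "ISO8859-1",
    1208: "UTF-8",
}

# Assumed mapping from character set names to Python codec names.
PYTHON_CODECS = {
    "IBM037": "cp037",
    "IBM500": "cp500",
    "ISO8859-1": "latin-1",
    "UTF-8": "utf-8",
}


def charset_name(ccsid: int) -> str:
    """Resolve a CCSID to the character set name used for conversion."""
    try:
        return CCSID_TABLE[ccsid]
    except KeyError:
        raise LookupError(f"CCSID {ccsid} is not in the table") from None


def encode_for_ccsid(text: str, ccsid: int) -> bytes:
    """Encode text in the character set that the CCSID identifies."""
    return text.encode(PYTHON_CODECS[charset_name(ccsid)])


print(charset_name(819))               # ISO8859-1
print(encode_for_ccsid("A", 37).hex())   # c1
print(encode_for_ccsid("A", 819).hex())  # 41
```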
The default character set definitions on the z/TPF system support all aliases for the character sets that are supported by the GNU C library (glibc). However, not all glibc translations are included on the z/TPF system.
The z/TPF system supports multibyte character sets that use shift-out and shift-in: shift-out leaves the regular character set mode, and shift-in returns to it. Wide characters are 4 bytes long and are encoded in UCS-4, a Unicode-based character set whose first 128 code points match ASCII. For more information about multibyte character sets and wide characters, see the GNU website.
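To make the shift-out and shift-in mechanism concrete, here is a minimal Python sketch that splits a byte stream into regular and shifted segments. It assumes the EBCDIC control codes X'0E' (shift-out) and X'0F' (shift-in), as used by stateful character sets such as IBM930 and IBM939, and it assumes that those two values never occur inside a shifted segment. The function name `split_shift_segments` and the sample bytes are hypothetical; the double-byte values are placeholders that were not checked against a real code page.

```python
SHIFT_OUT = 0x0E  # leave regular mode (assumed EBCDIC SO control code)
SHIFT_IN = 0x0F   # return to regular mode (assumed EBCDIC SI control code)


def split_shift_segments(data: bytes) -> list[tuple[str, bytes]]:
    """Split data into ("regular" | "shifted", payload) segments.

    The shift control bytes themselves are dropped from the payloads.
    """
    segments = []
    mode = "regular"
    current = bytearray()
    for byte in data:
        if byte in (SHIFT_OUT, SHIFT_IN):
            # A control byte ends the current segment and sets the next mode.
            if current:
                segments.append((mode, bytes(current)))
                current = bytearray()
            mode = "shifted" if byte == SHIFT_OUT else "regular"
        else:
            current.append(byte)
    if current:
        segments.append((mode, bytes(current)))
    return segments


# One regular byte, two placeholder double-byte characters, one regular byte.
sample = bytes([0xC1, 0x0E, 0x44, 0x81, 0x44, 0x82, 0x0F, 0xC2])
for mode, payload in split_shift_segments(sample):
    print(mode, payload.hex())
```

A real converter would go on to decode each shifted segment two bytes at a time; this sketch only shows where the mode changes.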
Using the iconv functions, the z/TPF system supports translations among the character sets listed in Table 1. These character sets are supported by glibc and are single-byte character sets (SBCSs) unless otherwise noted.
| Character set | Description | Encoding |
|---|---|---|
| ANSI_X3.4-1968 | Standard 7-bit ASCII | ASCII (X'00'-X'7F') |
| CP1250 | MS Windows Latin 2 | ASCII |
| CP1252 | MS Windows Latin 1 | ASCII |
| EUC-JP | Japanese characters, multibyte | ASCII |
| GB18030 | Chinese multibyte | ASCII |
| IBM037 | US/Canada Latin 1 | EBCDIC |
| IBM290 | Japanese Katakana | EBCDIC |
| IBM500 | Multinational | EBCDIC |
| IBM819 | Alias for ISO8859-1 | ASCII |
| IBM850 | Latin 1 PC Data | ASCII |
| IBM875 | Greek | EBCDIC |
| IBM924 | IBM500/IBM1047 with euro | EBCDIC |
| IBM930 | Japanese Katakana/Kanji multibyte character set | EBCDIC |
| IBM932 | Japanese PC Data | ASCII |
| IBM939 | Japanese Latin/Kanji multibyte character set | EBCDIC |
| IBM1026 | Turkey Latin 5 | EBCDIC |
| IBM1047 | Open Systems Latin 1 | EBCDIC |
| IBM1140 | Latin 1; IBM037 with euro for US | EBCDIC |
| IBM1141 | Latin 1; IBM273 with euro for Austria/Germany | EBCDIC |
| IBM1142 | Latin 1; IBM277 with euro for Denmark/Norway | EBCDIC |
| IBM1143 | Latin 1; IBM278 with euro for Finland/Sweden | EBCDIC |
| IBM1144 | Latin 1; IBM280 with euro for Italy | EBCDIC |
| IBM1145 | Latin 1; IBM284 with euro for Spain | EBCDIC |
| IBM1146 | Latin 1; IBM285 with euro for UK | EBCDIC |
| IBM1147 | Latin 1; IBM297 with euro for France | EBCDIC |
| IBM1148 | Latin 1; IBM500 with euro for Belgium/Canada/Switzerland (Multinational) | EBCDIC |
| IBM1149 | Latin 1; IBM871 with euro for Iceland | EBCDIC |
| ISO8859-1 | Latin 1, Standard 8-bit | ASCII |
| ISO8859-2 | Latin 2 | ASCII |
| ISO8859-3 | Latin 3 | ASCII |
| ISO8859-4 | Latin 4 | ASCII |
| ISO8859-9 | Latin 5, Turkey/Western Europe | ASCII |
| ISO8859-10 | Latin 6, Baltic/Scandinavian | ASCII |
| ISO8859-15 | Latin 9, ISO8859-1 with euro | ASCII |
| UCS-2 | 2-byte normalized Unicode | Unicode |
| UCS-4 | 4-byte normalized Unicode | Unicode |
| UTF-8 | Multibyte Unicode (a range of 1-6 bytes per character) | Unicode |
| UTF-16 | Multibyte Unicode (2 or 4 bytes per character) | Unicode |
| UTF-32 | Fixed-width Unicode (4 bytes per character) | Unicode |
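
A translation between two of these character sets can be sketched in Python as a decode followed by an encode. This is not the iconv API and does not run on the z/TPF system; it only mirrors the idea of naming a source and a target character set. The `CODECS` mapping from table names to Python codec names is an assumption that covers a small subset of the table, the function name `convert` is hypothetical, and the big-endian, no-byte-order-mark variants chosen for UTF-16 and UTF-32 are a simplification.

```python
# Assumed mapping from character set names to Python codec names (subset).
CODECS = {
    "IBM037": "cp037",
    "IBM500": "cp500",
    "IBM1140": "cp1140",
    "ISO8859-1": "latin-1",
    "ISO8859-15": "iso8859-15",
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-be",  # big-endian, no byte order mark
    "UTF-32": "utf-32-be",  # big-endian, no byte order mark
}


def convert(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Translate bytes from one character set to another.

    Raises KeyError for a character set outside the subset above and
    UnicodeError when a character cannot be represented in the target.
    """
    text = data.decode(CODECS[from_charset])
    return text.encode(CODECS[to_charset])


# "Hello" in IBM037 (EBCDIC) translated to ISO8859-1 (ASCII-based).
ebcdic_hello = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])
latin1_hello = convert(ebcdic_hello, "IBM037", "ISO8859-1")
print(latin1_hello)  # b'Hello'

# The euro sign exists in IBM1140 and ISO8859-15 but not in ISO8859-1.
euro_1140 = "\u20ac".encode(CODECS["IBM1140"])
euro_utf8 = convert(euro_1140, "IBM1140", "UTF-8")

# Byte length of the euro sign in each Unicode encoding form.
widths = {
    name: len(convert(euro_utf8, "UTF-8", name))
    for name in ("UTF-8", "UTF-16", "UTF-32")
}
print(widths)  # {'UTF-8': 3, 'UTF-16': 2, 'UTF-32': 4}
```

Translating the euro sign to ISO8859-1 with this sketch raises an error, because that character set has no code point for it.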