Wide character data representation

The wide character code was developed so that the system can process multibyte characters more efficiently internally. A multibyte character representation is converted into a uniform internal representation, the wide character code, in which all characters have the same length. In this internal form, characters can be processed independently of the code set.
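The conversion from a multibyte representation to wide character code can be sketched with the standard C function mbstowcs. This is a minimal illustration, not a definitive implementation: the helper name to_wide is hypothetical, and the sketch assumes the caller has already selected the locale (for example, with setlocale(LC_ALL, "")) if the input contains characters outside the portable character set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string to wide character code.
 * Returns the number of wide characters stored, not counting the
 * terminating null, or (size_t)-1 if the input is not valid in the
 * current locale or does not fit in the output buffer. */
size_t to_wide(const char *mb, wchar_t *out, size_t out_len)
{
    assert(mb != NULL && out != NULL && out_len > 0);

    size_t n = mbstowcs(out, mb, out_len);
    if (n == (size_t)-1) {
        return (size_t)-1;      /* invalid multibyte sequence */
    }
    if (n == out_len) {
        out[0] = L'\0';         /* no room left for the terminator */
        return (size_t)-1;
    }
    return n;                   /* out is null-terminated here */
}
```

After a successful call, every element of the output buffer holds exactly one character, so the string can be indexed and scanned without regard to how many bytes each character occupied in the multibyte form.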

The wchar_t data type is used to represent the wide character code of a character. The wchar_t data type is a typedef that is defined in the ctype.h, stddef.h, and stdlib.h files, and its size is implementation-specific. Programs must not assume a particular size for the wchar_t data type, so that they can run under implementations that use different sizes.
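One way to avoid depending on a particular wchar_t size is to compute every buffer size with sizeof instead of a hard-coded byte count. The following is a minimal sketch of that idea; the helper name wide_dup is hypothetical and is not part of any system library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: duplicate a wide character string without
 * assuming how many bytes a wchar_t occupies.
 * Returns NULL if memory cannot be allocated; the caller frees the copy. */
wchar_t *wide_dup(const wchar_t *s)
{
    assert(s != NULL);

    size_t len = wcslen(s) + 1;             /* include the terminator */
    if (len > SIZE_MAX / sizeof(wchar_t)) {
        return NULL;                        /* byte count would overflow */
    }

    /* sizeof *copy adapts to a 16-bit or a 32-bit wchar_t. */
    wchar_t *copy = malloc(len * sizeof *copy);
    if (copy != NULL) {
        wmemcpy(copy, s, len);
    }
    return copy;
}
```

Because the element size comes from sizeof, the same source compiles correctly whether the implementation uses a 16-bit or a 32-bit wchar_t.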

On the AIX® operating system, the wchar_t data type is 32-bit in the 64-bit environment and 16-bit in the 32-bit environment. The locale methods are standardized so that, in most locales, the value that is stored in the wchar_t data type for a particular character is its Unicode value. Applications that are intended to run only on AIX can therefore handle wchar_t data in a consistent fashion, even if the underlying code set is unknown. All locales use Unicode for their wide character code values (process code), except locales that are based on the IBM-eucTW code set. The IBM-eucTW code set (LANG=zh_TW) contains many characters that are not in the Unicode standard, so these characters cannot be represented with a Unicode wide character value. Applications that require Unicode-based wchar_t data for Traditional Chinese must use the Zh_TW locale (big5 code set) instead.

Do not assume that the char data type is either signed or unsigned; this is platform-specific. If the system defines char as signed, comparisons against a full 8-bit quantity yield incorrect results, because any byte with the high-order bit set is treated as a negative value. Similarly, a signed char value that is used to index an array might yield a negative index. Because all 8 bits are used to encode a character, declare 8-bit characters as unsigned char wherever necessary to make programs portable.
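The effect of char signedness on comparisons and array indexing can be sketched as follows. The helper names is_non_ascii and count_bytes are hypothetical; the point of the sketch is the conversion to unsigned char before the value is compared or used as an index.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: test whether a byte has its high-order bit set.
 * Without the cast, a signed char holding 0xE9 is negative, and the
 * comparison against 0x80 would be false. */
int is_non_ascii(char ch)
{
    return (unsigned char)ch >= 0x80;
}

/* Hypothetical helper: count occurrences of each byte value in s.
 * The caller supplies a zero-initialized table with one slot for
 * each possible unsigned char value. */
void count_bytes(const char *s, size_t counts[UCHAR_MAX + 1])
{
    assert(s != NULL && counts != NULL);

    for (; *s != '\0'; ++s) {
        /* Convert first: indexing with a signed char could produce
         * a negative index for bytes with the high-order bit set. */
        unsigned char c = (unsigned char)*s;
        counts[c]++;
    }
}
```

Both helpers behave the same way whether the platform defines char as signed or unsigned, because the value is always interpreted as unsigned char before it is used.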