Unicode in the 16-bit UTF-16 form has no prescribed endian orientation
for interchange. This requires communication processes to evaluate
the endian orientation correctly. To aid in this, the character U+FEFF
ZERO WIDTH NO-BREAK SPACE can be used as a Byte Order Mark (BOM).
When it is interpreted in the incorrect endian orientation, its two
bytes are swapped and it evaluates to U+FFFE, which is defined as a
noncharacter, so the mismatch can be detected.
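The byte swap described above can be checked with a small sketch. It uses only the Python standard library; the helper name swapped_code_unit is hypothetical and chosen for illustration.

```python
BOM = "\ufeff"  # U+FEFF, used as the Byte Order Mark

# Encode the BOM in each byte order. The explicit -be/-le codecs
# do not add a mark of their own, so only the two BOM bytes appear.
big = BOM.encode("utf-16-be")     # bytes FE FF
little = BOM.encode("utf-16-le")  # bytes FF FE


def swapped_code_unit(two_bytes: bytes) -> int:
    """Read one 16-bit code unit with its two bytes in reversed order."""
    return int.from_bytes(two_bytes[::-1], "big")


# Reading the big-endian BOM with swapped bytes gives U+FFFE.
print(hex(swapped_code_unit(big)))
```

A reader that sees U+FFFE as the first code unit can therefore conclude that it chose the wrong byte order.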
In UTF-8, the BOM is the sequence of bytes EF BB BF. As a byte-oriented
encoding, UTF-8 has no endian issues, but some applications (primarily
on Windows systems) write the BOM to the start of a UTF-8 encoded file.
An IBM® Netezza® system does not load the BOM code point; you can use
the -bom switch of the nzconvert command to remove an initial BOM
code point.
You can remove a BOM from the start of a UTF-8 file by using the
nzconvert command, as in the following example:
nzconvert -f utf8 -t utf8 -bom -df input_file -of output_file
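To make the effect of removing an initial UTF-8 BOM concrete, here is a minimal Python sketch of the same idea. It is not how nzconvert is implemented; the function name strip_utf8_bom is hypothetical, and only the standard codecs module is used.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Only a BOM at the very start is removed; the rest is left untouched.
with_bom = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")
print(strip_utf8_bom(with_bom))
```

The check is limited to the first three bytes, which matches the description of removing an initial BOM code point rather than every occurrence of U+FEFF.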
When you are converting from or to UTF-16, you can use one of three
converters, UTF16, UTF16be, or UTF16le, as the input (-f option)
and output (-t option):
- UTF16
  As input, Netezza checks for a BOM to determine endianness; if no BOM
  is present, Netezza interprets the input as big-endian. As output,
  Netezza writes a BOM and outputs in the native endianness of the
  machine. When converting from UTF-16 to any other encoding, such as
  UTF-8, the BOM is removed.
- UTF16le
  As input, Netezza interprets all input as little-endian. As output,
  Netezza outputs as little-endian without a BOM. Any BOM in the input
  is treated as data and is converted to the output encoding, such as
  UTF-8.
- UTF16be
  As input, Netezza interprets all input as big-endian. As output,
  Netezza outputs as big-endian without a BOM. Any BOM in the input
  is treated as data and is converted to the output encoding, such as
  UTF-8.
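The input-side rules for the three converters can be sketched in Python as follows. This is an illustration of the behavior described above under stated assumptions, not the Netezza implementation; the function name decode_utf16 and its converter parameter are hypothetical. BOM detection is written out explicitly so that the big-endian fallback is visible in the code.

```python
import codecs


def decode_utf16(data: bytes, converter: str = "UTF16") -> str:
    """Decode UTF-16 bytes following the input rules described above."""
    name = converter.lower()
    if name == "utf16le":
        # Fixed byte order: a leading BOM is kept as ordinary data.
        return data.decode("utf-16-le")
    if name == "utf16be":
        return data.decode("utf-16-be")
    if name != "utf16":
        raise ValueError(f"unknown converter: {converter}")
    # UTF16: a BOM selects the byte order and is dropped from the result.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to big-endian.
    return data.decode("utf-16-be")


# With UTF16le, the BOM survives as U+FEFF and becomes EF BB BF in UTF-8.
text = decode_utf16(b"\xff\xfeA\x00", "UTF16le")
print(text.encode("utf-8"))
```

The last two lines show why a BOM that is treated as data reappears in the converted output: U+FEFF is a valid code point, so it is re-encoded like any other character.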