IBM PureData System for Analytics, Version 7.1

Byte Order Mark

UTF-16, the 16-bit encoding form of Unicode, has no prescribed byte order (endianness) for interchange, so a process that receives UTF-16 data must determine the byte order correctly. To help with this, the character U+FEFF ZERO WIDTH NO-BREAK SPACE can be placed at the start of the data as a Byte Order Mark (BOM). When the BOM is read in the wrong byte order, it evaluates to U+FFFE, which Unicode defines as a noncharacter, so the mismatch can be detected.
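The byte-swap behavior can be illustrated with a minimal Python sketch. It is not part of the Netezza software; the function name detect_utf16_order is hypothetical and is used only for this illustration.

```python
import codecs
from typing import Optional

# U+FEFF encoded in each byte order.
bom_be = "\ufeff".encode("utf-16-be")   # bytes FE FF
bom_le = "\ufeff".encode("utf-16-le")   # bytes FF FE

# Reading the big-endian BOM bytes as little-endian yields U+FFFE,
# the noncharacter that signals a byte-order mismatch.
swapped = int.from_bytes(bom_be, "little")


def detect_utf16_order(data: bytes) -> Optional[str]:
    """Return "big" or "little" if data starts with a UTF-16 BOM, else None.

    Hypothetical helper for illustration only.
    """
    if data[:2] == codecs.BOM_UTF16_BE:
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:
        return "little"
    return None
```

For example, detect_utf16_order(bom_le + "A".encode("utf-16-le")) returns "little", and data without a BOM returns None, in which case the reader has to rely on a convention or on out-of-band information.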

In UTF-8, the BOM is the byte sequence EF BB BF. Because UTF-8 is a byte-oriented encoding, it has no endian issues, but some applications, primarily on Windows systems, still write a BOM to the start of a UTF-8 encoded file. An IBM® Netezza® system does not load the BOM code point; you can use the -bom switch of the nzconvert command to remove an initial BOM code point.

You can remove a BOM from the start of a UTF-8 file by using the nzconvert command, as in the following example:

nzconvert -f utf8 -t utf8 -bom -df input_file -of output_file
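To show the effect of removing an initial BOM, the following Python sketch strips the bytes EF BB BF from the start of UTF-8 data. It is only an illustration of the result and makes no claim about how nzconvert is implemented; the function name strip_utf8_bom is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a single leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration only. A BOM sequence that
    appears later in the data is left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample record with a BOM written by the producing application.
with_bom = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")
cleaned = strip_utf8_bom(with_bom)
```

Data that has no leading BOM passes through unchanged, so the function can be applied to every input file regardless of which application produced it.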

When you are converting from or to UTF-16, you can specify one of three converters, UTF16, UTF16le, or UTF16be, as the input encoding (-f option) or the output encoding (-t option):

UTF16: As input, Netezza checks for a BOM to determine endianness; if no BOM is present, Netezza interprets the input as big-endian. As output, Netezza writes a BOM and outputs in the native endianness of the machine. When converting from UTF-16 to any other encoding, such as UTF-8, the BOM is removed.
UTF16le: As input, Netezza interprets all input as little-endian. As output, Netezza outputs little-endian data without a BOM. Any BOM in the input is treated as data and is converted to the target encoding, such as UTF-8.
UTF16be: As input, Netezza interprets all input as big-endian. As output, Netezza outputs big-endian data without a BOM. Any BOM in the input is treated as data and is converted to the target encoding, such as UTF-8.
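The difference between the three converters can be approximated with Python's standard codecs. This is a sketch by analogy, not the Netezza converters themselves: the function decode_like_utf16 is hypothetical, and it hard-codes the big-endian fallback described above because Python's own "utf-16" codec falls back to the native byte order instead.

```python
import codecs


def decode_like_utf16(data: bytes) -> str:
    """Approximate the UTF16 input rule: honor a BOM, else assume big-endian.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")   # BOM consumed, not kept as data
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")           # no BOM: big-endian default


le_with_bom = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# UTF16-style input: the BOM selects the byte order and is removed.
auto = decode_like_utf16(le_with_bom)

# UTF16le-style input: the byte order is fixed, so the BOM is kept as the
# data character U+FEFF and becomes EF BB BF when converted to UTF-8.
fixed = le_with_bom.decode("utf-16-le")
fixed_as_utf8 = fixed.encode("utf-8")

# Output: the explicit-endian codecs write no BOM, while the generic
# "utf-16" codec writes a BOM followed by native-endian data.
out_be = "Hi".encode("utf-16-be")
out_generic = "Hi".encode("utf-16")
```

In this sketch, auto is the two-character string "Hi", while fixed has a leading U+FEFF. That is why the fixed-endian converters suit data that is known to have no BOM, and why a leading BOM in such input ends up in the converted output.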