Thai and Unicode collation algorithm differences
The collation algorithm used in a Thai Industrial Standard (TIS) TIS620-1 (code page 874) Thai database with the NLSCHAR collation option is similar to, but not identical to, the collation algorithm used in a Unicode database with the locale-sensitive UCA-based collation option that specifies the Thai locale attribute, such as CLDR181_LTH.
The differences are as follows:
- When sorting TIS620-1 data, each character has only one weight, and that weight is compared with the weight of another character during collation. When sorting Unicode data, each character has several weights, and all the weights of that character can be used during collation.
- When sorting TIS620-1 data, the space character X'20', hyphen character X'2D', and full stop character X'2E' all have smaller weights than all the Thai characters. When sorting Unicode data, however, those three characters are treated as punctuation marks and are used for comparison only when all other characters in the two strings being compared are equal.
- The Paiyannoi character X'CF' and the Maiyamok character X'E6' in a TIS620-1 database are treated as punctuation marks when they follow other Thai characters, and as normal characters, with their own weights, when they appear at the beginning of a string. The same two characters in a Unicode database, U+0E2F and U+0E46, are always treated as punctuation marks, and are used for comparison only when all other characters in the two strings being compared are equal.
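The effect of the punctuation difference can be illustrated with a small Python sketch. This is a toy model, not the database's collation code: the function names `single_weight_key` and `punctuation_last_key` are hypothetical, and code points stand in for the real weight tables, which are an assumption made only to keep the example short. The sample strings use Thai consonants only, so that vowel handling does not affect the result.

```python
# Toy model only: code points stand in for real collation weights.
PUNCTUATION = {" ", "-", "."}


def single_weight_key(s):
    """One weight per character; punctuation sorts below every other character."""
    return [(0, ord(c)) if c in PUNCTUATION else (1, ord(c)) for c in s]


def punctuation_last_key(s):
    """Compare non-punctuation characters first; use the full string,
    punctuation included, only to break ties."""
    primary = [ord(c) for c in s if c not in PUNCTUATION]
    tie_breaker = [ord(c) for c in s]
    return (primary, tie_breaker)


words = ["กข", "ก ค", "กค", "ก-ข"]

# Single-weight style: the space and hyphen pull their strings to the front.
print(sorted(words, key=single_weight_key))
# order: "ก ค", "ก-ข", "กข", "กค"

# Punctuation-last style: strings group by their letters first.
print(sorted(words, key=punctuation_last_key))
# order: "ก-ข", "กข", "ก ค", "กค"
```

In the first ordering, a string that contains a space or hyphen sorts ahead of strings that have a Thai character in the same position. In the second, the punctuation only decides the order between strings whose remaining characters are identical.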
More information about Thai characters can be found in the Southeast Asian Scripts chapter of The Unicode Standard book.