Locale-sensitive collations are based on the full Unicode
Collation Algorithm (UCA) specification and provide full cultural
correctness.
Strings are ordered according to the Unicode Collation
Algorithm. The collation can be tailored to account for features such
as language or case and accent insensitivity. For more information
about UCA, see Unicode Collation Algorithm based collations.
This algorithm uses multiple weights per character as well as extra
processing to handle special cases such as contractions and combining
accents. The complexity of the algorithm adds significant processing
time.
Substring matching is done using the collation. Substrings
are matched in a linguistically meaningful manner.
- Advantages
  - Full support of the UCA, including contractions and combining accents.
  - Provides support for case and accent insensitive collations.
  - Handles all Unicode code points.
  - Allows collations to be tailored to suit different languages.
  - Same order for character and graphic types.
  - Substring matching is done using the collation.
- Disadvantages
  - Substantial performance penalty.
Locale-sensitive UCA-based collations are suitable
when fully linguistic ordering is needed and the extra processing
time can be tolerated.
Example
To demonstrate the behavior of this collation,
the following list of Czech words is used.
- chleb
- Čech
- C◌̌ech (2)
- Jana
- hlava
- Jaroslav
- holub
- cena
- jaro
- čas
- c◌̌as (3)
The database with the locale-sensitive collation
was created using the following command:
CREATE DATABASE TESTDB COLLATE USING CLDR181_LCS
Sorting:
SELECT WORD FROM TESTDATA ORDER BY WORD
WORD
----------
cena
čas
c◌̌as
Čech
C◌̌ech
hlava
holub
chleb
Jana
jaro
Jaroslav
In the results of the ORDER BY query, notice:
- The result is linguistically correct.
- Case and accent differences are treated as less significant than
the base character.
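- A letter followed by a combining accent is treated as equal to the
equivalent precomposed accented character.
- The word chleb is correctly ordered after the word holub, because
Czech treats ch as a single letter that sorts after h.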
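The ordering above can be approximated with a small Python sketch. It is a toy two-level sort key, not the UCA itself and not how the database implements the collation. The `ALPHABET` table, the `collation_elements` and `sort_key` names, and the choice to cover only the letters that occur in the word list are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical primary order: only the letters used in the word list,
# with "č" (U+010D) after "c" and the digraph "ch" after "h", as in Czech.
ALPHABET = ["a", "b", "c", "\u010d", "e", "h", "ch", "j",
            "l", "n", "o", "r", "s", "u", "v"]
PRIMARY = {letter: rank for rank, letter in enumerate(ALPHABET)}

# The word list from the example, in its original order. Precomposed and
# combining forms are written with escapes so they stay distinguishable.
WORDS = ["chleb", "\u010cech", "C\u030cech", "Jana", "hlava", "Jaroslav",
         "holub", "cena", "jaro", "\u010das", "c\u030cas"]


def collation_elements(word):
    """Split a word into letters, treating 'ch' as a single contraction."""
    # NFC composes a base letter plus combining caron into one code point.
    text = unicodedata.normalize("NFC", word)
    elements = []
    i = 0
    while i < len(text):
        if text[i:i + 2].lower() == "ch":
            elements.append(text[i:i + 2])
            i += 2
        else:
            elements.append(text[i])
            i += 1
    return elements


def sort_key(word):
    """Compare base letters first; case only breaks ties."""
    elements = collation_elements(word)
    primary = tuple(PRIMARY[e.lower()] for e in elements)
    case = tuple(0 if e == e.lower() else 1 for e in elements)
    return (primary, case)


# sorted() is stable, so words with equal keys keep their input order.
ordered = sorted(WORDS, key=sort_key)
print(ordered)
```

In this sketch the precomposed and combining spellings of a word produce identical keys, which mirrors the observation that the two forms compare as equal.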
Substring matching:
SELECT WORD FROM TESTDATA WHERE WORD LIKE 'c%'
WORD
----------
cena
In the results of the LIKE query, notice:
- Neither c◌̌as nor chleb is selected, even though both begin with
the code point U+0063, because linguistically they start with the
letters č and ch, not with the letter c.
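The difference between matching on code points and matching on letters can be sketched in Python. This is only an illustration under simplifying assumptions: `first_letter` and `starts_with_letter` are hypothetical helpers, the only contraction handled is ch, and the real LIKE predicate evaluates patterns through the collation rather than through this logic.

```python
import unicodedata

# The word list from the example; escapes keep the precomposed and
# combining spellings distinguishable.
WORDS = ["chleb", "\u010cech", "C\u030cech", "Jana", "hlava", "Jaroslav",
         "holub", "cena", "jaro", "\u010das", "c\u030cas"]


def first_letter(word):
    """Return the first linguistic letter, treating 'ch' as one letter."""
    # NFC turns "c" + combining caron into the single letter "č".
    text = unicodedata.normalize("NFC", word)
    return text[:2] if text[:2].lower() == "ch" else text[:1]


def starts_with_letter(word, letter):
    """Case-sensitive test on the first letter only."""
    return first_letter(word) == letter


# Naive code point prefix test: also matches chleb and c + combining caron.
code_point_matches = [w for w in WORDS if w.startswith("c")]

# Letter-based test: only words whose first letter is really "c".
linguistic_matches = [w for w in WORDS if starts_with_letter(w, "c")]

print(len(code_point_matches), len(linguistic_matches))
```

The letter-based test keeps only cena, which agrees with the query result shown above, while the code point test would also return chleb and the combining-accent spelling of čas.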
(2) In Unicode, the accented character Č can be entered as a single
Unicode code point, U+010C (Latin capital letter C with caron), or as
two code points, U+0043 U+030C (Latin capital letter C, combining
caron). The two representations appear the same on a computer screen
or a printout, but they have different internal representations. For
the purposes of the examples, however, the characters are drawn
differently: U+010C is drawn as Č and U+0043 U+030C is drawn as C◌̌.
To demonstrate combining accents, both forms are included in the word
list.
(3) In Unicode, the accented character č can be entered as a single
Unicode code point, U+010D (Latin small letter c with caron), or as
two code points, U+0063 U+030C (Latin small letter c, combining
caron). The two representations appear the same on a computer screen
or a printout, but they have different internal representations. For
the purposes of the examples, however, the characters are drawn
differently: U+010D is drawn as č and U+0063 U+030C is drawn as c◌̌.
To demonstrate combining accents, both forms are included in the word
list.
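The two internal representations described in these notes can be inspected with Python's standard `unicodedata` module. The sketch below is independent of any database. It only shows that the precomposed and combining forms are different strings that become identical after Unicode normalization.

```python
import unicodedata

precomposed = "\u010d"    # single code point: c with caron
decomposed = "c\u030c"    # two code points: c followed by combining caron

# The strings differ at the code point level.
different_strings = precomposed != decomposed
lengths = (len(precomposed), len(decomposed))

# NFC composes the pair into one code point; NFD splits it again.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)

# Official Unicode character names for the code points involved.
names = [unicodedata.name(ch) for ch in precomposed + decomposed]

print(different_strings, lengths, composed == precomposed, split == decomposed)
print(names)
```

A plain binary comparison would treat the two spellings as different values, which is why the example word list includes both forms to show how the collation handles them.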