Character
Watson NLP uses the Unicode Standard (kept in sync with ISO/IEC 10646) to represent characters. The standard is maintained by the Unicode Consortium, of which IBM is a member. This section describes points to be aware of regarding character indices and lengths.
Overview
To calculate the length and position of characters, Watson NLP internally uses UTF-16 code units; the API returns values in those units (e.g., Annotation.length(), Annotation.beginIndex(), Annotation.endIndex()).
For historical reasons, most Java libraries count UTF-16 code units, while Python 3 primarily counts code points. The two counts differ whenever the text contains a code point above U+FFFF.
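Because the offsets are expressed in UTF-16 code units, slicing a Python str directly with them can select the wrong characters once the text contains a code point above U+FFFF. The following is a minimal sketch of one way to convert between the two; the helper names utf16_len and utf16_to_codepoint_index are hypothetical and are not part of the Watson NLP API.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    # A code point above U+FFFF needs a surrogate pair (2 code units).
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def utf16_to_codepoint_index(text: str, utf16_offset: int) -> int:
    """Map a UTF-16 code unit offset to a Python (code point) index.

    An offset that falls inside a surrogate pair maps to the next code point.
    """
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


text = "a\U0001F389b"                      # "a", PARTY POPPER, "b"
print(len(text))                           # length in code points
print(utf16_len(text))                     # length in UTF-16 code units
# "b" starts at UTF-16 offset 3 but at Python index 2.
print(utf16_to_codepoint_index(text, 3))
```

A linear scan like this is fine for short texts; for many lookups on a long document, precomputing a table of offsets once would avoid rescanning.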
Character, Code Point, and Code Unit
In Unicode:
- A character (or grapheme) is assigned a unique code point (or a unique code point sequence).
- A code point is an identification number. It is 21 bits wide (U+000000 – U+10FFFF). The U+ prefix indicates that a value is a code point.
- A code unit is a value that encodes a code point in 8, 16, or 32 bits so that Unicode text can be stored and transmitted efficiently on a computer.
Unicode defines 3 major encoding forms from code point to code unit.
Encoding form | Code unit size | Code units per code point |
---|---|---|
UTF-32 | 32-bit | One 32-bit code unit: U+000000 – U+10FFFF (1 unit) |
UTF-16 | 16-bit | One or two 16-bit code units: U+000000 – U+00FFFF (1 unit); U+010000 – U+10FFFF (2 units, a surrogate pair) |
UTF-8 | 8-bit | One to four 8-bit code units: U+000000 – U+00007F (1 unit); U+000080 – U+0007FF (2 units); U+000800 – U+00FFFF (3 units); U+010000 – U+10FFFF (4 units) |
The following table shows an example for the character a. In this case, the length is 1 in every representation.
Representation | Value | Length |
---|---|---|
Character (grapheme) | a | 1 |
Code point (21-bit) | U+0061 (LATIN SMALL LETTER A) | 1 |
Code unit (UTF-32) | 0x00000061 | 1 |
Code unit (UTF-16) | 0x0061 | 1 |
Code unit (UTF-8) | 0x61 | 1 |
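The code unit counts for each encoding form can be checked with Python's built-in codecs. This is a small sketch; code_unit_counts is a hypothetical helper name chosen for illustration.

```python
def code_unit_counts(text: str) -> dict:
    """Count the code units of text in each Unicode encoding form."""
    return {
        # The "-le" codecs are used so that no byte order mark is prepended.
        "UTF-32": len(text.encode("utf-32-le")) // 4,
        "UTF-16": len(text.encode("utf-16-le")) // 2,
        "UTF-8": len(text.encode("utf-8")),
    }


# One sample code point from each UTF-8 length class.
for ch in ["a", "\u00e4", "\u9089", "\U0001F389"]:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```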
Examples
As Unicode has grown, the mapping between characters (graphemes), code points, and code units has become complicated.
Latin Alphabet
In general, a letter of the Latin alphabet is represented by a single code point. Unicode can also combine a Latin letter with diacritical marks to represent a single character.
For example, ä U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) is canonically equivalent to a U+0061 (LATIN SMALL LETTER A) + ̈ U+0308 (COMBINING DIAERESIS).
Precomposed form (a single code point):
Representation | Value | Length |
---|---|---|
Character (grapheme) | ä | 1 |
Code point | U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) | 1 |
Code unit (UTF-32) | 0x000000E4 | 1 |
Code unit (UTF-16) | 0x00E4 | 1 |
Code units (UTF-8) | [0xC3, 0xA4] | 2 |

Decomposed form (base letter followed by a combining mark):
Representation | Value | Length |
---|---|---|
Character (grapheme) | ä | 1 |
Code points | [a U+0061 (LATIN SMALL LETTER A), ̈ U+0308 (COMBINING DIAERESIS)] | 2 |
Code units (UTF-32) | [0x00000061, 0x00000308] | 2 |
Code units (UTF-16) | [0x0061, 0x0308] | 2 |
Code units (UTF-8) | [0x61, 0xCC, 0x88] | 3 |
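The two spellings of ä render identically but are different code point sequences, so they compare unequal and have different lengths. The sketch below uses Python's standard unicodedata module to convert between them. Whether Watson NLP normalizes its input is not stated here, so this only illustrates the equivalence itself.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# Same grapheme, different code point sequences.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# NFC composes to the single code point; NFD decomposes to base + mark.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```

Normalizing both sides to the same form before comparing offsets or lengths avoids mismatches caused purely by composition.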
Emoji
Unicode defines a large number of emoji characters. Most of their code points are above U+FFFF, so they do not fit in a single 16-bit code unit.
Representation | Value | Length |
---|---|---|
Character (grapheme) | 🎉 | 1 |
Code point | U+1F389 (PARTY POPPER) | 1 |
Code unit (UTF-32) | 0x0001F389 | 1 |
Code units (UTF-16) | [0xD83C, 0xDF89] | 2 |
Code units (UTF-8) | [0xF0, 0x9F, 0x8E, 0x89] | 4 |
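The two UTF-16 code units of a code point above U+FFFF follow from simple bit arithmetic. The sketch below shows the calculation; to_surrogate_pair is a hypothetical helper name used only for illustration.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F389)   # PARTY POPPER
print(f"0x{high:04X}, 0x{low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F389".encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```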
Emoji With Modifier
Several emoji characters have variants. For example, 👏 U+1F44F (CLAPPING HANDS SIGN) has 6 variants: the default form plus five skin tones.
- 👏 Clapping Hands
- 👏🏻 Clapping Hands: Light Skin Tone
- 👏🏼 Clapping Hands: Medium-Light Skin Tone
- 👏🏽 Clapping Hands: Medium Skin Tone
- 👏🏾 Clapping Hands: Medium-Dark Skin Tone
- 👏🏿 Clapping Hands: Dark Skin Tone
The skin tone variants are represented by the code point sequence 👏 U+1F44F (CLAPPING HANDS SIGN) + an EMOJI MODIFIER (U+1F3FB – U+1F3FF) that specifies the skin tone.
Base emoji (no modifier):
Representation | Value | Length |
---|---|---|
Character (grapheme) | 👏 | 1 |
Code point | U+1F44F (CLAPPING HANDS SIGN) | 1 |
Code unit (UTF-32) | 0x0001F44F | 1 |
Code units (UTF-16) | [0xD83D, 0xDC4F] | 2 |
Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0x8F] | 4 |

With a skin tone modifier:
Representation | Value | Length |
---|---|---|
Character (grapheme) | 👏🏽 | 1 |
Code points | [👏 U+1F44F (CLAPPING HANDS SIGN), U+1F3FD (EMOJI MODIFIER FITZPATRICK TYPE-4)] | 2 |
Code units (UTF-32) | [0x0001F44F, 0x0001F3FD] | 2 |
Code units (UTF-16) | [0xD83D, 0xDC4F, 0xD83C, 0xDFFD] | 4 |
Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0x8F, 0xF0, 0x9F, 0x8F, 0xBD] | 8 |
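As a small illustration, the six variants can be generated by appending each modifier code point to the base emoji. The names BASE, MODIFIERS, and is_emoji_modifier below are hypothetical and chosen only for this sketch.

```python
import unicodedata

BASE = "\U0001F44F"  # CLAPPING HANDS SIGN
# The five emoji modifiers occupy U+1F3FB through U+1F3FF.
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]


def is_emoji_modifier(ch: str) -> bool:
    """True if ch is one of the five skin tone modifiers."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


# The default form plus one sequence per skin tone.
variants = [BASE] + [BASE + modifier for modifier in MODIFIERS]
for variant in variants:
    names = [unicodedata.name(ch) for ch in variant]
    print(len(variant), names)
```

Each modified variant is still one grapheme to the reader, but Python's len() reports 2 because it counts code points.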
Emoji With Zero Width Joiner
Unicode has a special control character, U+200D (ZERO WIDTH JOINER), that combines several characters into one. It is used in emoji characters too. For example, 👨‍👩‍👦‍👦 (Family: Man, Woman, Boy, Boy) is assigned the code point sequence 👨 U+1F468 (MAN) + U+200D (ZWJ) + 👩 U+1F469 (WOMAN) + U+200D (ZWJ) + 👦 U+1F466 (BOY) + U+200D (ZWJ) + 👦 U+1F466 (BOY).
Representation | Value | Length |
---|---|---|
Character (grapheme) | 👨‍👩‍👦‍👦 (Family: Man, Woman, Boy, Boy) | 1 |
Code points | [👨 U+1F468 (MAN), U+200D (ZWJ), 👩 U+1F469 (WOMAN), U+200D (ZWJ), 👦 U+1F466 (BOY), U+200D (ZWJ), 👦 U+1F466 (BOY)] | 7 |
Code units (UTF-32) | [0x0001F468, 0x0000200D, 0x0001F469, 0x0000200D, 0x0001F466, 0x0000200D, 0x0001F466] | 7 |
Code units (UTF-16) | [0xD83D, 0xDC68, 0x200D, 0xD83D, 0xDC69, 0x200D, 0xD83D, 0xDC66, 0x200D, 0xD83D, 0xDC66] | 11 |
Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0xA8, 0xE2, 0x80, 0x8D, 0xF0, 0x9F, 0x91, 0xA9, 0xE2, 0x80, 0x8D, 0xF0, 0x9F, 0x91, 0xA6, 0xE2, 0x80, 0x8D, 0xF0, 0x9F, 0x91, 0xA6] | 25 |
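The sequence can be assembled and taken apart with ordinary string operations, which makes the three different lengths easy to reproduce. This is a minimal sketch using only built-in Python; counting the result as a single grapheme would need a text segmentation library such as ICU, which is not shown here.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# MAN, WOMAN, BOY, BOY joined into a single family emoji.
members = ["\U0001F468", "\U0001F469", "\U0001F466", "\U0001F466"]
family = ZWJ.join(members)

print(len(family))                            # code points
print(len(family.encode("utf-16-le")) // 2)   # UTF-16 code units
print(len(family.encode("utf-8")))            # UTF-8 code units

# Splitting on ZWJ recovers the individual emoji.
print([f"U+{ord(part):04X}" for part in family.split(ZWJ)])
```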
Note that not all family emoji use ZWJ. 👪 (Family: Man, Woman, Boy) is assigned a single code point, U+1F46A (FAMILY), without ZWJ.
Representation | Value | Length |
---|---|---|
Character (grapheme) | 👪 (Family: Man, Woman, Boy) | 1 |
Code point | U+1F46A (FAMILY) | 1 |
Code unit (UTF-32) | 0x0001F46A | 1 |
Code units (UTF-16) | [0xD83D, 0xDC6A] | 2 |
Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0xAA] | 4 |
Ideographic Character
Ideographic Variation Selector
Chinese ideographic characters have many variants. It is important to distinguish them because they are often used in proper nouns (e.g., person names, place names).
For example, 邉 U+9089 has 16 variants. Each is represented by the code point sequence 邉 U+9089 + an ideographic variation selector (U+E0100 – U+E01EF).
Base character (no variation selector):
Representation | Value | Length |
---|---|---|
Character (grapheme) | 邉 | 1 |
Code point | U+9089 | 1 |
Code unit (UTF-32) | 0x00009089 | 1 |
Code unit (UTF-16) | 0x9089 | 1 |
Code units (UTF-8) | [0xE9, 0x82, 0x89] | 3 |

With a variation selector:
Representation | Value | Length |
---|---|---|
Character (grapheme) | 邉󠄄 | 1 |
Code points | [U+9089, U+E0104] | 2 |
Code units (UTF-32) | [0x00009089, 0x000E0104] | 2 |
Code units (UTF-16) | [0x9089, 0xDB40, 0xDD04] | 3 |
Code units (UTF-8) | [0xE9, 0x82, 0x89, 0xF3, 0xA0, 0x84, 0x84] | 7 |
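A variation selector is invisible in most fonts, so two strings that look the same can differ in length and fail an equality check. The sketch below shows how to detect this in Python. strip_variation_selectors is a hypothetical helper, and removing selectors discards exactly the distinction described above, so it suits loose matching only.

```python
def strip_variation_selectors(text: str) -> str:
    """Remove ideographic variation selectors (U+E0100 to U+E01EF) from text."""
    return "".join(ch for ch in text if not 0xE0100 <= ord(ch) <= 0xE01EF)


base = "\u9089"                 # base ideograph
variant = "\u9089\U000E0104"    # same ideograph + variation selector

print(base == variant)                              # differ as code point sequences
print(strip_variation_selectors(variant) == base)   # equal once the selector is removed
print(len(variant),                                 # code points
      len(variant.encode("utf-16-le")) // 2,        # UTF-16 code units
      len(variant.encode("utf-8")))                 # UTF-8 code units
```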
Python Strings
Java and Python store strings differently, which matters when you create test cases to estimate memory consumption. Python uses the code point (21 bits) as the base unit for representing characters. Because Python has no 21-bit data type, a string could simply be stored as an array of 32-bit values, but that is not memory efficient. For example, most Latin characters are in the range U+00 – U+FF, so only the lower 8 bits are ever non-zero. To address this, Python 3.3 introduced the flexible string representation (PEP 393), which stores a string as an array of 8-bit, 16-bit, or 32-bit values.
The width is determined by the largest code point in the string. The following examples illustrate this.
An empty string consumes 49 bytes. This is the baseline. (Exact sizes vary with the CPython version and platform.)
>>> import sys
>>> sys.getsizeof("")
49
Latin character a (U+0061) consumes 1 byte (8 bits) for each code point.
>>> sys.getsizeof("a")
50
>>> sys.getsizeof("aa")
51
>>> sys.getsizeof("aaa")
52
Emoji character 🎉 (U+1F389) consumes 4 bytes (32 bits) for each code point. The baseline for such a string is 76 bytes.
>>> sys.getsizeof("🎉")
80
>>> sys.getsizeof("🎉🎉")
84
>>> sys.getsizeof("🎉🎉🎉")
88
Emoji character 👏🏽 (U+1F44F, U+1F3FD) consumes 8 bytes (32 bits * 2) for each occurrence because it consists of 2 code points.
>>> sys.getsizeof("👏🏽")
84
>>> sys.getsizeof("👏🏽👏🏽")
92
>>> sys.getsizeof("👏🏽👏🏽👏🏽")
100
If a string contains both Latin a (U+0061) and emoji such as 🎉 (U+1F389) or 👏🏽 (U+1F44F, U+1F3FD), every code point in it consumes 4 bytes (32 bits), including the Latin ones.
>>> sys.getsizeof("a🎉")
84
>>> sys.getsizeof("a🎉👏🏽")
92
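The width selection rule can be written as a short function. This is a sketch of the rule as described in PEP 393, not CPython's actual implementation; bytes_per_code_point is a hypothetical name, and the fixed header overhead is deliberately left out because it differs between string kinds and Python versions.

```python
def bytes_per_code_point(text: str) -> int:
    """Per-code-point storage width that the PEP 393 rule selects for text."""
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        return 1      # 8-bit storage
    if largest <= 0xFFFF:
        return 2      # 16-bit storage
    return 4          # 32-bit storage


for sample in ["aaa", "\u9089", "a\U0001F389"]:
    width = bytes_per_code_point(sample)
    # Payload only: code points times width, excluding the object header.
    print(repr(sample), width, len(sample) * width)
```

The last sample shows why a single emoji makes a long, otherwise Latin-only string roughly four times larger in memory.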
References
- Unicode Codepoints
- Unicode Emoji
- Emojipedia
- Unicode Normalization Forms
- Unicode Ideographic Variation Database
- String API in Java SDK
- International Components for Unicode (Reference implementation of Unicode specification)
- What's the difference between a character, a code point, a glyph and a grapheme?