Character

Watson NLP uses the Unicode Standard (kept in sync with ISO/IEC 10646) to represent characters. The standard is maintained by the Unicode Consortium, of which IBM is a member. This section describes what to be aware of regarding character indexes and lengths.

Overview

To calculate the length and position of characters, Watson NLP internally counts UTF-16 code units, and the API returns values in that unit (e.g., Annotation.length(), Annotation.beginIndex(), Annotation.endIndex()). Most Java libraries have historically used UTF-16 code units, while Python 3 primarily uses code points. The two counts differ whenever the text contains a code point above U+FFFF.
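The mismatch can be seen in plain Python. The following is a minimal sketch; `utf16_length` is a hypothetical helper name introduced here for illustration and is not part of any Watson NLP API.

```python
def utf16_length(text: str) -> int:
    """Return the number of UTF-16 code units needed to encode text."""
    # Encode without a byte order mark; every code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2


text = "a\U0001F389"       # LATIN SMALL LETTER A followed by PARTY POPPER
print(len(text))           # 2 (code points, as counted by Python 3)
print(utf16_length(text))  # 3 (UTF-16 code units)
```

A length or index reported in UTF-16 code units therefore cannot be used directly as a Python string index once the text contains such a character.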

Character, Code Point, and Code Unit

In Unicode,

  • A character (or grapheme) is assigned a unique code point (or a unique code point sequence).

  • A code point is an identification number. It is 21-bit (U+000000 - U+10FFFF). The U+ prefix indicates that the value is a code point.

  • A code unit is an 8-bit, 16-bit, or 32-bit value. Each code point is encoded as one or more code units so that Unicode text can be stored and transmitted efficiently on a computer.

Unicode defines three major encoding forms that map code points to code units.

| Encoding form | Code unit size | Code units |
|---|---|---|
| UTF-32 | 32-bit | A code point is encoded with one 32-bit code unit: U+000000 - U+10FFFF (1 unit) |
| UTF-16 | 16-bit | A code point is encoded with one or two 16-bit code units: U+000000 - U+00FFFF (1 unit); U+010000 - U+10FFFF (2 units, surrogate pair) |
| UTF-8 | 8-bit | A code point is encoded with one to four 8-bit code units: U+000000 - U+00007F (1 unit); U+000080 - U+0007FF (2 units); U+000800 - U+00FFFF (3 units); U+010000 - U+10FFFF (4 units) |
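The ranges in the table above can be written as a small lookup. This is a sketch only; `code_unit_counts` is a hypothetical name chosen for illustration.

```python
def code_unit_counts(code_point: int) -> dict:
    """Return how many code units each encoding form needs for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    # UTF-16: one unit up to U+FFFF, otherwise a surrogate pair.
    utf16 = 1 if code_point <= 0xFFFF else 2
    # UTF-8: one to four units depending on the range.
    if code_point <= 0x7F:
        utf8 = 1
    elif code_point <= 0x7FF:
        utf8 = 2
    elif code_point <= 0xFFFF:
        utf8 = 3
    else:
        utf8 = 4
    return {"UTF-32": 1, "UTF-16": utf16, "UTF-8": utf8}


for cp in (0x61, 0xE4, 0x9089, 0x1F389):
    print(f"U+{cp:04X}", code_unit_counts(cp))
```

The results can be cross-checked against Python's own encoders, for example `len(chr(0x1F389).encode("utf-8"))`.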

The following table shows the letter a in each representation. In this case, the length is 1 in all of them.

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | a | 1 |
| Code point (21-bit) | U+0061 (LATIN SMALL LETTER A) | 1 |
| Code unit (UTF-32) | 0x00000061 | 1 |
| Code unit (UTF-16) | 0x0061 | 1 |
| Code unit (UTF-8) | 0x61 | 1 |

Examples

As Unicode has grown, the mapping between characters (graphemes), code points, and code units has become complicated.

Latin Alphabet

In general, a letter of the Latin alphabet is represented by a single code point. In addition, Unicode can combine a Latin letter with diacritical marks to represent a single character.

For example, ä U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) is equivalent to a U+0061 (LATIN SMALL LETTER A) + ̈ U+0308 (COMBINING DIAERESIS).

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | ä | 1 |
| Code point | U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) | 1 |
| Code unit (UTF-32) | 0x000000E4 | 1 |
| Code unit (UTF-16) | 0x00E4 | 1 |
| Code units (UTF-8) | [0xC3, 0xA4] | 2 |

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | ä | 1 |
| Code points | [a U+0061 (LATIN SMALL LETTER A), ̈ U+0308 (COMBINING DIAERESIS)] | 2 |
| Code units (UTF-32) | [0x00000061, 0x00000308] | 2 |
| Code units (UTF-16) | [0x0061, 0x0308] | 2 |
| Code units (UTF-8) | [0x61, 0xCC, 0x88] | 3 |
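The two forms of ä compare as different strings in Python until they are normalized. The sketch below uses the standard library `unicodedata.normalize` function; the variable names are chosen for illustration.

```python
import unicodedata

composed = "\u00E4"      # ä as a single code point
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS

print(len(composed), len(decomposed))                        # 1 2
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Normalizing text to one form before computing lengths or offsets keeps the two spellings from producing different results.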

Emoji

Unicode defines a large number of Emoji characters. Most of their code points are above U+FFFF, so they do not fit in 16 bits.

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | 🎉 | 1 |
| Code point | U+1F389 (PARTY POPPER) | 1 |
| Code unit (UTF-32) | 0x0001F389 | 1 |
| Code units (UTF-16) | [0xD83C, 0xDF89] | 2 |
| Code units (UTF-8) | [0xF0, 0x9F, 0x8E, 0x89] | 4 |
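The two UTF-16 code units in the table above follow from the standard surrogate pair arithmetic. A minimal sketch, with `to_surrogate_pair` as a hypothetical helper name:

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000 - U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F389)
print(hex(high), hex(low))  # 0xd83c 0xdf89
```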

Emoji With Modifier

Several Emoji characters have variants. For example, 👏 U+1F44F (CLAPPING HANDS SIGN) has 6 variants according to skin tone.

  1. 👏 Clapping Hands
  2. 👏🏻 Clapping Hands: Light Skin Tone
  3. 👏🏼 Clapping Hands: Medium-Light Skin Tone
  4. 👏🏽 Clapping Hands: Medium Skin Tone
  5. 👏🏾 Clapping Hands: Medium-Dark Skin Tone
  6. 👏🏿 Clapping Hands: Dark Skin Tone

The skin tone variants are represented by the code point sequence 👏 U+1F44F (CLAPPING HANDS SIGN) + an EMOJI MODIFIER (U+1F3FB - U+1F3FF) that specifies the skin tone.

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | 👏 | 1 |
| Code point | U+1F44F (CLAPPING HANDS SIGN) | 1 |
| Code unit (UTF-32) | 0x0001F44F | 1 |
| Code units (UTF-16) | [0xD83D, 0xDC4F] | 2 |
| Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0x8F] | 4 |

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | 👏🏽 | 1 |
| Code points | [👏 U+1F44F (CLAPPING HANDS SIGN), 🏽 U+1F3FD (EMOJI MODIFIER FITZPATRICK TYPE-4)] | 2 |
| Code units (UTF-32) | [0x0001F44F, 0x0001F3FD] | 2 |
| Code units (UTF-16) | [0xD83D, 0xDC4F, 0xD83C, 0xDFFD] | 4 |
| Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0x8F, 0xF0, 0x9F, 0x8F, 0xBD] | 8 |
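Because 👏🏽 occupies 4 UTF-16 code units but only 2 Python code points, an offset counted in UTF-16 code units has to be converted before it is used to slice a Python string. The following is one possible sketch. `utf16_to_codepoint_index` is a hypothetical helper, and the offsets are made-up values for illustration, not actual API output.

```python
def utf16_to_codepoint_index(text: str, utf16_index: int) -> int:
    """Map an offset counted in UTF-16 code units to a Python str index."""
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_index:
            return index
        # Code points above U+FFFF take two UTF-16 code units.
        units += 2 if ord(char) > 0xFFFF else 1
    # The offset points at (or past) the end of the text.
    return len(text)


text = "ok \U0001F44F\U0001F3FD!"   # "ok " + clapping hands + skin tone + "!"
begin, end = 3, 7                   # hypothetical offsets in UTF-16 code units
start = utf16_to_codepoint_index(text, begin)
stop = utf16_to_codepoint_index(text, end)
print(start, stop)                  # 3 5
print(text[start:stop] == "\U0001F44F\U0001F3FD")  # True
```

An offset that falls in the middle of a surrogate pair is rounded up to the next code point in this sketch; a stricter version could raise an error instead.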

Emoji With Zero Width Joiner

Unicode has a special control character, U+200D (ZERO WIDTH JOINER, ZWJ), that combines several characters into one. It is used in Emoji characters too. For example, 👨‍👩‍👦‍👦 (Family: Man, Woman, Boy, Boy) is assigned the code point sequence 👨 U+1F468 (MAN) + U+200D (ZWJ) + 👩 U+1F469 (WOMAN) + U+200D (ZWJ) + 👦 U+1F466 (BOY) + U+200D (ZWJ) + 👦 U+1F466 (BOY).

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | 👨‍👩‍👦‍👦 (Family: Man, Woman, Boy, Boy) | 1 |
| Code points | [👨 U+1F468 (MAN), U+200D (ZWJ), 👩 U+1F469 (WOMAN), U+200D (ZWJ), 👦 U+1F466 (BOY), U+200D (ZWJ), 👦 U+1F466 (BOY)] | 7 |
| Code units (UTF-32) | [0x0001F468, 0x0000200D, 0x0001F469, 0x0000200D, 0x0001F466, 0x0000200D, 0x0001F466] | 7 |
| Code units (UTF-16) | [0xD83D, 0xDC68, 0x200D, 0xD83D, 0xDC69, 0x200D, 0xD83D, 0xDC66, 0x200D, 0xD83D, 0xDC66] | 11 |
| Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0xA8, 0xE2, 0x80, 0x8D, 0xF0, 0x9F, 0x91, 0xA9, 0xE2, 0x80, 0x8D, 0xF0, 0x9F, 0x91, 0xA6, 0xE2, 0x80, 0x8D, 0xF0, 0x9F, 0x91, 0xA6] | 25 |
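The lengths in the table above can be reproduced by building the sequence from its parts. A short sketch using only built-in string operations; the variable names are chosen for illustration.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# MAN, WOMAN, BOY, BOY joined by ZWJ
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F466", "\U0001F466"])

print(len(family))                           # 7 code points
print(len(family.encode("utf-16-le")) // 2)  # 11 UTF-16 code units
print(len(family.encode("utf-8")))           # 25 UTF-8 code units

# Splitting on ZWJ recovers the component characters.
print([f"U+{ord(c):04X}" for c in family.split(ZWJ)])
```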

Note that not all family Emoji characters use ZWJ. 👪 U+1F46A (FAMILY) is assigned a single code point without ZWJ.

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | 👪 | 1 |
| Code point | U+1F46A (FAMILY) | 1 |
| Code unit (UTF-32) | 0x0001F46A | 1 |
| Code units (UTF-16) | [0xD83D, 0xDC6A] | 2 |
| Code units (UTF-8) | [0xF0, 0x9F, 0x91, 0xAA] | 4 |

Ideographic Character

Ideographic Variation Selector

Chinese ideographic characters have many variants. It is important to distinguish them because they are often used in proper nouns (e.g., person names, place names).

For example, 邉 U+9089 has 16 variants as follows. Each variant is represented by the code point sequence 邉 U+9089 + a variation selector (U+E0100 - U+E01EF).

[Figure: IVD variants of 邉 U+9089]

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | 邉 | 1 |
| Code point | U+9089 | 1 |
| Code unit (UTF-32) | 0x00009089 | 1 |
| Code unit (UTF-16) | 0x9089 | 1 |
| Code units (UTF-8) | [0xE9, 0x82, 0x89] | 3 |

| Representation | Value | Length |
|---|---|---|
| Character (grapheme) | 邉󠄄 | 1 |
| Code points | [U+9089, U+E0104] | 2 |
| Code units (UTF-32) | [0x00009089, 0x000E0104] | 2 |
| Code units (UTF-16) | [0x9089, 0xDB40, 0xDD04] | 3 |
| Code units (UTF-8) | [0xE9, 0x82, 0x89, 0xF3, 0xA0, 0x84, 0x84] | 7 |
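A selector in the U+E0100 - U+E01EF range is invisible in most editors, so it can help to detect it explicitly. The following is a minimal sketch; `is_variation_selector` is a hypothetical helper name.

```python
def is_variation_selector(char: str) -> bool:
    """Return True if char lies in the U+E0100 - U+E01EF range."""
    return 0xE0100 <= ord(char) <= 0xE01EF


ivs = "\u9089\U000E0104"   # U+9089 followed by the selector U+E0104

print(len(ivs))                                 # 2 code points
print(len(ivs.encode("utf-16-le")) // 2)        # 3 UTF-16 code units
print(len(ivs.encode("utf-8")))                 # 7 UTF-8 code units
print([is_variation_selector(c) for c in ivs])  # [False, True]

# Dropping the selectors leaves the base character only.
base = "".join(c for c in ivs if not is_variation_selector(c))
print(base == "\u9089")                         # True
```

Whether to drop the selector depends on the use case: it loses the distinction between variants, which matters for proper nouns.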

Python Strings

Strings are stored differently in Java and Python. Keep this in mind when creating test cases to estimate memory consumption.

Python uses the code point (21 bits) as the base unit to represent characters. Because Python has no 21-bit data type, a string could be stored as an array of 32-bit values, but this is not memory efficient. For example, most Latin characters are in the range U+00 - U+FF, so only the lower 8 bits have non-zero values and the remaining bits are always zero.

To address this issue, Python 3.3 introduced the flexible string representation (PEP 393). It stores a string as an array of 8-bit, 16-bit, or 32-bit values; the width is determined by the largest code point in the string. The following examples show the effect.

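An empty string consumes 49 bytes. This is the baseline. The absolute sizes shown here vary between Python versions and builds; the per-code-point increments are what matter.

>>> import sys
>>> sys.getsizeof("")
49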

The Latin character a (U+0061) consumes 1 byte (8 bits) per code point.

>>> sys.getsizeof("a")
50
>>> sys.getsizeof("aa")
51
>>> sys.getsizeof("aaa")
52

The Emoji character 🎉 (U+1F389) consumes 4 bytes (32 bits) per code point. The baseline for such a string is 76 bytes.

>>> sys.getsizeof("🎉")
80
>>> sys.getsizeof("🎉🎉")
84
>>> sys.getsizeof("🎉🎉🎉")
88

The Emoji character 👏🏽 (U+1F44F, U+1F3FD) consumes 8 bytes (32 bits * 2) per character because it consists of 2 code points.

>>> sys.getsizeof("👏🏽")
84
>>> sys.getsizeof("👏🏽👏🏽")
92
>>> sys.getsizeof("👏🏽👏🏽👏🏽")
100

If a string mixes the Latin character a (U+0061) with the Emoji characters 🎉 (U+1F389) and 👏🏽 (U+1F44F, U+1F3FD), every code point in it consumes 4 bytes (32 bits).

>>> sys.getsizeof("a🎉")
84
>>> sys.getsizeof("a🎉👏🏽")
92

References

  1. Unicode Codepoints
  2. Unicode Emoji
  3. Emojipedia
  4. Unicode Normalization Forms
  5. Unicode Ideographic Variation Database
  6. String API in Java SDK
  7. International Components for Unicode (Reference implementation of Unicode specification)
  8. What's the difference between a character, a code point, a glyph and a grapheme?