Tokenization and Word Segmentation

As a general principle, Watson NLP follows the tokenization of the UD (Universal Dependencies) corpora. However, tokenization and word segmentation are not trivial in many languages: they depend on the language, and sometimes on the context, domain, and application. This section gives implementation notes on how Watson NLP tokenizes various cases.

Multiword Tokens and Compound Word Tokens

In UD, words are the basic units that carry syntactic features such as part of speech (UPOS) and dependency relations (DEPREL, HEAD). Words and tokens are equivalent in most languages, including English, but not in all. For example, the Spanish contraction del must be split into de (ADP) + el (DET) before PoS and dependency relations can be assigned correctly. Likewise, the Spanish verb-plus-clitic form inspirándose must be split into inspirándo (VERB) + se (PRON). The surface forms del and inspirándose may not have syntactic features, whereas the components de, el, inspirándo, and se always have them. UD refers to surface forms such as del and inspirándose as "multiword tokens" (MWT).

Watson NLP also has a related concept, the "compound word token" (CWT). It is a Watson NLP-specific extension that represents words formed of two or more components, where the components need not have syntactic features. For example, German Apfelsaft (apple juice) is formed of Apfel (apple) and Saft (juice). The compound word token Apfelsaft always has syntactic features; its components, Apfel and saft, may not.

The following table shows examples of "multiword token" and "compound word token" from various languages.

| Language | Multiword token (e.g. contraction, clitic); components always have syntactic features | Compound word token; components may not have syntactic features |
|---|---|---|
| Afrikaans | N/A | lidstate (member states) -> lid + state |
| Arabic | N/A | اللغة -> ال + لغة (DET + NOUN) |
| Chinese | N/A | 闭幕辞 (closing speech) -> 闭幕 + 辞 |
| Danish | N/A | matematiklæreren (mathematics teacher) -> matematik + læreren; sportshallen (sports hall) -> sports + hallen |
| Dutch | N/A | koffiekop (coffee mug) -> koffie + kop; werkplaats (workshop) -> werk + plaats |
| English | N/A | co-founder -> co + - + founder; mid-May -> mid + - + May |
| Finnish | ettei -> ett (SCONJ) + ei (AUX); miksei -> miks (ADV) + ei (AUX) | urheiluauto (sports car) -> urheilu + auto |
| French | du -> de (ADP) + le (DET); au -> à (ADP) + le (DET) | N/A |
| German | im -> in (ADP) + dem (DET); vom -> von (ADP) + dem (DET) | Apfelsaft (apple juice) -> Apfel + saft; Autobahnanschlussstelle (motorway junction) -> Autobahn + anschluss + stelle |
| Greek | στην -> σ (ADP) + την (DET) | N/A |
| Italian | negli -> in (ADP) + gli (DET) | N/A |
| Norwegian (Bokmål) | N/A | møterom (meeting room) -> møte + rom; prosjektleder (project manager) -> prosjekt + leder |
| Portuguese | à -> a (ADP) + a (DET); deles -> de (ADP) + eles (PRON) | N/A |
| Spanish | del -> de (ADP) + el (DET); al -> a (ADP) + el (DET); inspirándose -> inspirándo (VERB) + se (PRON) | N/A |
| Swedish | N/A | telefonkort (phone card) -> telefon + kort; bordslampa (table lamp) -> bord + lampa |
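
To make the multiword token idea concrete, the following is a minimal sketch of splitting a contraction into its syntactic words with a lookup table. It is not how Watson NLP is implemented; the class `MwtLookupSketch`, the `Word` type, and the `split` method are hypothetical names introduced only for illustration, and the table holds just two Spanish entries.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative lookup table for Spanish multiword tokens; not Watson NLP code. */
class MwtLookupSketch {

    /** One syntactic word obtained by splitting a multiword token. */
    static final class Word {
        final String form;
        final String upos;

        Word(String form, String upos) {
            this.form = form;
            this.upos = upos;
        }

        @Override
        public String toString() {
            return form + "/" + upos;
        }
    }

    // Entries copied from the Spanish contraction examples in this section.
    static final Map<String, List<Word>> SPANISH_MWT = Map.of(
        "del", List.of(new Word("de", "ADP"), new Word("el", "DET")),
        "al", List.of(new Word("a", "ADP"), new Word("el", "DET")));

    /** Returns the component words, or the token itself with UPOS "_" when it is not listed. */
    static List<Word> split(String token) {
        List<Word> words = SPANISH_MWT.get(token.toLowerCase(Locale.ROOT));
        return words != null ? words : List.of(new Word(token, "_"));
    }

    public static void main(String[] args) {
        System.out.println(split("del"));
        System.out.println(split("casa"));
    }
}
```

A real analyzer also has to handle open-ended cases such as verb-plus-clitic forms (inspirándose), which a fixed table cannot enumerate.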

Example of Outputs

Multiword Token in JSON

Watson NLP emits a token annotation for the multiword token (e.g. del) and a component annotation for each component (e.g. de and el), as shown below. The token annotation has three additional properties:

  1. ud:mwt: flag to indicate this is a multiword token
  2. contraction: flag to indicate this is a contracted form
  3. components: reference to the components

The same information can be retrieved using a Java API:

  {
    "id" : "e5a0dbf9-fa6a-355a-ab15-3fc6ae33c7f0",
    "type" : "token",               // `token` annotation
    "text" : "del",                 // input text is `del`
    "beginIndex" : 0,
    "endIndex" : 3,
    "properties" : {
      "components" : [ "6f9dad27-a769-31ac-a61d-7d8b5c204bd7", "397a7b5b-1927-3088-8e9d-cd5155a9681f" ], // this token has references to 2 components `de` and `el`
      "contraction" : true,         // this token is a contraction
      "locale" : "es",
      "ud:mwt" : true               // this token is a multiword token
    }
  }, {
    "id" : "6f9dad27-a769-31ac-a61d-7d8b5c204bd7",
    "type" : "component",           // `component` annotation for the first element
    "text" : "del",                 // this is a part of contraction, so the surface text is associated with the whole span `del` for simplicity
    "beginIndex" : 0,
    "endIndex" : 3,
    "properties" : {
      "locale" : "es",
      "ud:children" : [ "397a7b5b-1927-3088-8e9d-cd5155a9681f" ], // this component has dependency relation
      "ud:lemma" : "de",            // the lemma is `de`
      "ud:pos" : "ADP"              // the PoS is `ADP`
    }
  }, {
    "id" : "397a7b5b-1927-3088-8e9d-cd5155a9681f",
    "type" : "component",           // `component` annotation for the second element
    "text" : "del",                 // this is a part of contraction, so the surface text is associated with the whole span `del` for simplicity
    "beginIndex" : 0,
    "endIndex" : 3,
    "properties" : {
      "annotatedBy" : "com.ibm.nlp.izumo.es",
      "locale" : "es",
      "ud:lemma" : "el",            // the lemma is `el`
      "ud:parent" : "6f9dad27-a769-31ac-a61d-7d8b5c204bd7", // this component has dependency relation
      "ud:pos" : "DET",             // the PoS is `DET`
      "ud:relation" : "dep"
    }
  }
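
The `components` property holds IDs rather than nested objects, so a consumer has to look each ID up among the other annotations. The sketch below shows one way to do that with plain Java collections. It does not use the Watson NLP Java API; `ComponentResolverSketch`, `Annotation`, and the shortened IDs `t1`, `c1`, `c2` are all hypothetical stand-ins for the JSON objects and UUIDs above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model of the JSON annotations above; not the Watson NLP Java API. */
class ComponentResolverSketch {

    /** Minimal stand-in for one annotation object. */
    static final class Annotation {
        final String id;
        final String type;              // "token" or "component"
        final String lemma;             // value of "ud:lemma", or null if absent
        final List<String> components;  // IDs from "components", possibly empty

        Annotation(String id, String type, String lemma, List<String> components) {
            this.id = id;
            this.type = type;
            this.lemma = lemma;
            this.components = components;
        }
    }

    /** Follows the "components" references of a token and collects the lemmas in order. */
    static List<String> componentLemmas(Annotation token, Map<String, Annotation> byId) {
        List<String> lemmas = new ArrayList<>();
        for (String id : token.components) {
            Annotation component = byId.get(id);
            if (component != null && "component".equals(component.type)) {
                lemmas.add(component.lemma);
            }
        }
        return lemmas;
    }

    /** Builds the `del` example with made-up short IDs in place of the UUIDs. */
    static Map<String, Annotation> delExample() {
        Map<String, Annotation> byId = new HashMap<>();
        byId.put("t1", new Annotation("t1", "token", null, List.of("c1", "c2")));
        byId.put("c1", new Annotation("c1", "component", "de", List.of()));
        byId.put("c2", new Annotation("c2", "component", "el", List.of()));
        return byId;
    }

    public static void main(String[] args) {
        Map<String, Annotation> byId = delExample();
        System.out.println(componentLemmas(byId.get("t1"), byId));
    }
}
```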

Multiword Token in CoNLL-U

# text = del
1-2	del	_	_	_	_	_	_	_	_
1	de	de	ADP	_	_	0	root	_	_
2	el	el	DET	_	_	1	dep	_	_

Compound Word Token in JSON

Watson NLP emits a token annotation for the compound word token (e.g. Apfelsaft) and a component annotation for each component (e.g. Apfel and saft), as shown below.

The same information can be retrieved using a Java API:

  {
    "id" : "6f71d36c-b995-3182-8cc0-76de762bf31a",
    "type" : "token",                // `token` annotation
    "text" : "Apfelsaft",            // input text is `Apfelsaft`
    "beginIndex" : 0,
    "endIndex" : 9,
    "properties" : {
      "components" : [ "3fd0b193-cbf0-3f9c-9d5b-040d94ebfd2b", "a1f3620c-8448-3dc8-b02a-1969de18fd27" ], // the token has references to 2 components `Apfel` and `saft`
      "locale" : "de",
      "ud:relation" : "root",        // this token has dependency relation
      "unknown" : true
    }
  }, {
    "id" : "3fd0b193-cbf0-3f9c-9d5b-040d94ebfd2b",
    "type" : "component",            // `component` annotation for the first element
    "text" : "Apfel",
    "beginIndex" : 0,
    "endIndex" : 5,
    "properties" : {
      "locale" : "de",
      "ud:lemma" : "Apfel",          // the lemma is `Apfel` (optional)
      "ud:pos" : "NOUN"              // the PoS is `NOUN` (optional)
    }
  }, {
    "id" : "a1f3620c-8448-3dc8-b02a-1969de18fd27",
    "type" : "component",            // `component` annotation for the second element
    "text" : "saft",
    "beginIndex" : 5,
    "endIndex" : 9,
    "properties" : {
      "locale" : "de",
      "ud:lemma" : "Saft",           // the lemma is `Saft` (optional)
      "ud:pos" : "NOUN"              // the PoS is `NOUN` (optional)
    }
  }

Compound Word Token in CoNLL-U

This output deviates from the CoNLL-U format, which does not define anything for compound word tokens, so it is disabled by default. To enable it, call CoNLLU.withComponents(true) explicitly.

// Needs: import java.util.Arrays;
String conllu = new CoNLLU()
  .withComments(Arrays.asList(COMMENT.TEXT))
  .withComponents(true)   // include components of compound word tokens (off by default)
  .toString(result);

The output then contains the components of compound word tokens. For these components, the first column (ID) uses the n:m format, as follows.

# text = Apfelsaft
1	Apfelsaft	Apfelsaft	PROPN	PROPN	_	0	root	_	_
1:1	Apfel	Apfel	NOUN	_	_	_	_	_	_
1:2	saft	Saft	NOUN	_	_	_	_	_	_
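
A reader of this output therefore meets three ID shapes: a plain integer for a word, an n-m range for a multiword token, and the Watson NLP-specific n:m form for a component of a compound word token. The sketch below tells them apart with regular expressions. `ConlluIdSketch` and its methods are hypothetical helper names, not part of any Watson NLP or CoNLL-U library, and other ID forms are simply reported as OTHER.

```java
/** Illustrative classifier for the first (ID) column of the outputs shown above. */
class ConlluIdSketch {

    enum Kind { WORD, MULTIWORD_RANGE, COMPONENT, OTHER }

    /** Classifies one ID value such as "1", "1-2", or "1:2". */
    static Kind classify(String id) {
        if (id.matches("\\d+")) {
            return Kind.WORD;               // e.g. 1
        }
        if (id.matches("\\d+-\\d+")) {
            return Kind.MULTIWORD_RANGE;    // e.g. 1-2 (multiword token line)
        }
        if (id.matches("\\d+:\\d+")) {
            return Kind.COMPONENT;          // e.g. 1:2 (component of a compound word token)
        }
        return Kind.OTHER;
    }

    /** Classifies a whole line; comment lines and blank lines are OTHER. */
    static Kind classifyLine(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return Kind.OTHER;
        }
        return classify(line.split("\t", 2)[0]);
    }

    public static void main(String[] args) {
        System.out.println(classifyLine("# text = Apfelsaft"));
        System.out.println(classifyLine("1\tApfelsaft\tApfelsaft\tPROPN\tPROPN\t_\t0\troot\t_\t_"));
        System.out.println(classifyLine("1:1\tApfel\tApfel\tNOUN\t_\t_\t_\t_\t_\t_"));
    }
}
```

A consumer that expects standard CoNLL-U can use such a check to skip the n:m lines.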

Locale Specific Expressions

Dates, Times, and Numbers

Watson NLP recognizes a date, time, or number as a single token when it is written in digits and symbols. Many languages use their own formats. The following table shows examples.

| Language | Date | Time | Number |
|---|---|---|---|
| Arabic | 31/01/2020, 31/1/2020 | 12:30, 12:30:00 | 1,234,567.00 |
| Danish | 31.01.2020 | 12.30, 12.30.00, 12:30, 12:30:00 | 1.234.567,00 |
| Dutch | 31-01-2020 | 12:30, 12:30:00 | 1.234.567,00 |
| English | 1/31/2020, 1/31/20 | 12:30, 12:30:00 | 1,234,567.00 |
| Finnish | 31.1.2020, 31.1.-20 | 12.30, 12.30.00, 12:30, 12:30:00 | 1 234 567,00 |
| French | 31/01/2020 | 12:30, 12:30:00 | 1 234 567,00 |
| German | 31.01.2020, 31.01.20 | 12:30, 12:30:00 | 1.234.567,00 |
| Hindi | 31/1/2020 | 12:30, 12:30:00 | 12,34,567.00 |
| Japanese | 2020/01/31 | 12:30, 12:30:00 | 1,234,567.00 |
| Swedish | 2020-01-31 | 12.30, 12.30.00, 12:30, 12:30:00 | 1 234 567,00 |
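
The sketch below illustrates why the locale matters: the same string shape can be a date in one language and nothing in another. It covers only the English and German examples listed here, using hand-written regular expressions. `LocaleExpressionSketch` and its patterns are assumptions made for illustration and are not the rules Watson NLP actually applies.

```java
import java.util.regex.Pattern;

/** Illustrative recognizer for the English and German formats listed above. */
class LocaleExpressionSketch {

    static final Pattern TIME = Pattern.compile("\\d{1,2}:\\d{2}(:\\d{2})?");
    static final Pattern EN_DATE = Pattern.compile("\\d{1,2}/\\d{1,2}/(\\d{4}|\\d{2})");
    static final Pattern DE_DATE = Pattern.compile("\\d{2}\\.\\d{2}\\.(\\d{4}|\\d{2})");
    static final Pattern EN_NUMBER = Pattern.compile("\\d{1,3}(,\\d{3})*(\\.\\d+)?");
    static final Pattern DE_NUMBER = Pattern.compile("\\d{1,3}(\\.\\d{3})*(,\\d+)?");

    /** Returns "Date", "Time", "Number", or "Other" for a candidate token in locale "en" or "de". */
    static String classify(String token, String locale) {
        boolean german = "de".equals(locale);
        // Dates are tested first because 31.01.20 would otherwise look number-like.
        if ((german ? DE_DATE : EN_DATE).matcher(token).matches()) {
            return "Date";
        }
        if (TIME.matcher(token).matches()) {
            return "Time";
        }
        if ((german ? DE_NUMBER : EN_NUMBER).matcher(token).matches()) {
            return "Number";
        }
        return "Other";
    }

    public static void main(String[] args) {
        System.out.println(classify("1/31/2020", "en"));
        System.out.println(classify("31.01.2020", "de"));
        System.out.println(classify("1.234.567,00", "de"));
        System.out.println(classify("1/31/2020", "de"));
    }
}
```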

Example of Outputs

In the JSON output, Watson NLP additionally adds a term annotation for each of these dates, times, and numbers.

The same information can be retrieved using a Java API. For example:

  {
    "id" : "42c43cfc-96cd-32b6-abf8-5fdf98aa5d9f",
    "type" : "term",                 // `term` annotation
    "text" : "1/31/2020",            // input text is `1/31/2020`
    "beginIndex" : 0,
    "endIndex" : 9,
    "properties" : {
      "g:name" : "org.unicode.cldr.time.Date", // this `term` is `o.u.c.time.Date`
      "g:properties" : {
        "locale" : "en"              // locale is English
      }
    }
  }, {
    "id" : "7c47f0c9-04e2-332e-958b-f16e09e4bf31",
    "type" : "term",                 // `term` annotation
    "text" : "12:30",                // input text is `12:30`
    "beginIndex" : 10,
    "endIndex" : 15,
    "properties" : {
      "g:name" : "org.unicode.cldr.time.Time", // this `term` is `o.u.c.time.Time`
      "g:properties" : {
        "locale" : "en"              // locale is English
      }
    }
  }, {
    "id" : "9b7ef354-4483-3dc6-9b27-d24823a701dc",
    "type" : "term",                 // `term` annotation
    "text" : "1,234,567.00",         // input text is `1,234,567.00`
    "beginIndex" : 16,
    "endIndex" : 28,
    "properties" : {
      "g:name" : "org.unicode.cldr.number.Decimal", // this `term` is `o.u.c.number.Decimal`
      "g:properties" : {
        "locale" : "en"              // locale is English
      }
    }
  }

Ordinal, Cardinal Numbers

  • Danish and Norwegian mark ordinal numbers with a trailing period (.).

  • Finnish uses a colon followed by a suffix for the declension of cardinal (e.g. 3:n) and ordinal (e.g. 3:nnen) numbers.

| Language | Ordinal, cardinal number |
|---|---|
| Danish, Norwegian | 1., 2., 3., ... |
| Finnish | 1:n, 1 234:n, 1 234:n, 1 234:nnen |
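
The two notations can be described with short regular expressions, sketched below. `OrdinalSketch` and both patterns are assumptions for illustration only. In particular, the sketch does not try to decide whether a trailing period is an ordinal marker or the end of a sentence.

```java
import java.util.regex.Pattern;

/** Illustrative patterns for the ordinal and cardinal notations listed above. */
class OrdinalSketch {

    // Danish and Norwegian: digits followed by a period, e.g. 3.
    static final Pattern DOT_ORDINAL = Pattern.compile("\\d+\\.");

    // Finnish: digits (optionally grouped by spaces), a colon, then a suffix, e.g. 1 234:nnen
    static final Pattern FINNISH_INFLECTED = Pattern.compile("\\d{1,3}( \\d{3})*:\\p{L}+");

    static boolean isDotOrdinal(String token) {
        return DOT_ORDINAL.matcher(token).matches();
    }

    static boolean isFinnishInflected(String token) {
        return FINNISH_INFLECTED.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDotOrdinal("3."));
        System.out.println(isFinnishInflected("1 234:nnen"));
        System.out.println(isFinnishInflected("12:30"));
    }
}
```

Requiring letters after the colon keeps a time such as 12:30 from being read as an inflected Finnish number.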

Letters, Numbers, and Symbols

URI, E-mail Addresses, Host Names, File Names, Hashtags, and Mentions

Watson NLP recognizes each of these as a single token.

| Type | Example |
|---|---|
| URI | http://www.ibm.com |
| URI | http://www.ibm.com/index.htm |
| E-mail address | john@ibm.com, ジョン@ibm.com |
| Host name | ibm.com |
| File name (without space and path) | doc.pdf |
| Hashtag | #ibm |
| Mention | @John |
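
As a rough illustration of how such tokens differ in shape, the sketch below classifies a whitespace-delimited string as a URI, e-mail address, hashtag, or mention. The class `NetTokenSketch` and its deliberately loose patterns are assumptions, not Watson NLP's rules. Host names and file names are left as "Other" here because their shapes overlap with ordinary words and abbreviations.

```java
import java.util.regex.Pattern;

/** Illustrative shape-based classifier for the token types listed above. */
class NetTokenSketch {

    static final Pattern URI = Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*://\\S+");
    static final Pattern EMAIL = Pattern.compile("[^\\s@]+@[^\\s@]+\\.[^\\s@]+");
    static final Pattern HASHTAG = Pattern.compile("#[\\p{L}\\p{N}_]+");
    static final Pattern MENTION = Pattern.compile("@[\\p{L}\\p{N}_]+");

    /** Returns "URI", "EmailAddress", "HashTag", "Mention", or "Other". */
    static String classify(String token) {
        if (URI.matcher(token).matches()) {
            return "URI";
        }
        if (EMAIL.matcher(token).matches()) {
            return "EmailAddress";
        }
        if (HASHTAG.matcher(token).matches()) {
            return "HashTag";
        }
        if (MENTION.matcher(token).matches()) {
            return "Mention";
        }
        return "Other";
    }

    public static void main(String[] args) {
        String[] samples = {"http://www.ibm.com/index.htm", "john@ibm.com", "#ibm", "@John", "ibm.com"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```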

Example of Output

In the JSON output, Watson NLP additionally adds a term annotation for URIs, e-mail addresses, hashtags, and mentions. It does not add one for host names and file names because they are sometimes ambiguous.

The same information can be retrieved using a Java API. For example:

  {
    "id" : "42c43cfc-96cd-32b6-abf8-5fdf98aa5d9f",
    "type" : "term",                 // `term` annotation
    "text" : "http://www.ibm.com/index.htm",    // input text is `http://www.ibm.com/index.htm`
    "beginIndex" : 0,
    "endIndex" : 28,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.URI" // this is `c.i.nlp.c.net.URI`
    }
  }, {
    "id" : "7c47f0c9-04e2-332e-958b-f16e09e4bf31",
    "type" : "term",                 // `term` annotation
    "text" : "john@ibm.com",         // input text is `john@ibm.com`
    "beginIndex" : 29,
    "endIndex" : 41,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.EmailAddress" // this is `c.i.nlp.c.net.EmailAddress`
    }
  }, {
    "id" : "ce5613eb-1e46-32ff-92c4-433b97ab02f6",
    "type" : "term",                 // `term` annotation
    "text" : "#ibm",                 // input text is `#ibm`
    "beginIndex" : 62,
    "endIndex" : 66,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.HashTag" // this is `c.i.nlp.c.net.HashTag`
    }
  }, {
    "id" : "92686813-6e3f-30c2-8950-a789a35bc787",
    "type" : "term",                 // `term` annotation
    "text" : "@John",                // input text is `@John`
    "beginIndex" : 67,
    "endIndex" : 72,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.Mention" // this is `c.i.nlp.c.net.Mention`
    }
  }

Roman Numerals

Watson NLP recognizes Roman numerals from 1 to 21 when they are written in upper case. This mainly supports proper nouns that contain them (e.g., Queen Elizabeth II, Dragon Quest XII, Century XXI).

| Value | Roman numeral |
|---|---|
| 1 | I |
| 2 | II |
| 3 | III |
| ... | ... |
| 20 | XX |
| 21 | XXI |

Limitations:

  • It supports only small numbers, because larger numbers are infrequent and in many cases ambiguous (e.g., XXX (30) = hidden word, XL (40) = extra large, L (50) = liter, LV (55) = Louis Vuitton, CC (200) = carbon copy).

  • It may incorrectly recognize the following nouns as Roman numerals when written in all upper case:

    • ii (2) in Romanian (traditional Romanian embroidered blouse)

    • vi (6) in Catalan (wine)

    • vii (7) in Romanian (1. vineyard 2. vine)
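
The range and case restrictions described above can be sketched as follows. `RomanNumeralSketch` is a hypothetical class written for illustration. It generates the numerals for 1 to 21 and compares them with the token exactly, so lower-case forms and larger values such as XXX or XL are rejected.

```java
/** Illustrative recognizer for upper-case Roman numerals from 1 to 21. */
class RomanNumeralSketch {

    private static final String[] ONES = {"", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"};

    /** Builds the Roman numeral for n; only correct for 1 <= n <= 39. */
    static String toRoman(int n) {
        return "X".repeat(n / 10) + ONES[n % 10];
    }

    /** Returns the value 1..21 of an upper-case Roman numeral, or -1 if the token is not one. */
    static int value(String token) {
        for (int n = 1; n <= 21; n++) {
            if (toRoman(n).equals(token)) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(value("II"));
        System.out.println(value("XXI"));
        System.out.println(value("XXX"));
        System.out.println(value("ii"));
    }
}
```

A purely form-based check like this one cannot tell the numeral VI from the Catalan noun written in capitals, which is the ambiguity noted in the limitations.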

Example Output

In the JSON output, Watson NLP additionally adds a term annotation for Roman numerals.

The same information can be retrieved using a Java API. For example:

{
  "id" : "1c89af51-b855-327a-b202-187b96084768",
  "type" : "term",                    // `term` annotation
  "text" : "II",                      // input text is `II`
  "beginIndex" : 10,
  "endIndex" : 12,
  "properties" : {
    "g:name" : "com.ibm.nlp.izumo.resources.RomanNumeral", // this is `c.i.nlp.izumo.resources.RomanNumeral`
    "g:properties" : {
      "g:normalForm" : "II",
      "value" : 2                     // numeral value is `2`
    },
    "locale" : "en"
  }
}

Hyphenated Words

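Watson NLP splits out-of-vocabulary words at hyphens. In the examples below, " / " marks a token boundary, and entries shown without it remain a single token.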

Out-of-vocabulary word in Watson NLP
e-mail
non-fiction
twelve / - / year / - / old
three / - / quarters

Note:

  • English does not split words with the following prefix modifiers: anti-, co-, counter-, cross-, e-, ex-, mid-, multi-, non-, over-, post-, pre-, pro-, re-, semi-, sub-, vice-.

  • Finnish does not always split words at hyphens.
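
The English behavior can be sketched as a split at hyphens with an exception list of prefixes. `HyphenSplitSketch` is a hypothetical class for illustration. It models only the prefix rule from the note and does not model the vocabulary lookup that decides whether a word is out-of-vocabulary in the first place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative hyphen splitter with the English prefix exceptions from the note above. */
class HyphenSplitSketch {

    static final Set<String> KEPT_PREFIXES = Set.of(
        "anti", "co", "counter", "cross", "e", "ex", "mid", "multi", "non",
        "over", "post", "pre", "pro", "re", "semi", "sub", "vice");

    /** Splits a word at hyphens, keeping each hyphen as its own token. */
    static List<String> split(String word) {
        int firstHyphen = word.indexOf('-');
        if (firstHyphen > 0) {
            String prefix = word.substring(0, firstHyphen).toLowerCase(Locale.ROOT);
            if (KEPT_PREFIXES.contains(prefix)) {
                return List.of(word);   // e.g. non-fiction stays one token
            }
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder part = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (c == '-') {
                if (part.length() > 0) {
                    tokens.add(part.toString());
                    part.setLength(0);
                }
                tokens.add("-");
            } else {
                part.append(c);
            }
        }
        if (part.length() > 0) {
            tokens.add(part.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("twelve-year-old"));
        System.out.println(split("non-fiction"));
    }
}
```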

Abbreviations

Watson NLP splits off the trailing period from out-of-vocabulary words.

Out-of-vocabulary word in Watson NLP
Dr.
Sun.
Jan.
a.m.
etc.
arrv / .
dept / .
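
The examples above suggest a simple rule: a known abbreviation keeps its period, and any other word ending in a period has the period split off. The sketch below implements that rule with a tiny hard-coded vocabulary taken from the examples. `AbbreviationSketch` and its vocabulary set are assumptions for illustration and do not reflect Watson NLP's actual dictionaries.

```java
import java.util.List;
import java.util.Set;

/** Illustrative period handling with a tiny, hard-coded abbreviation vocabulary. */
class AbbreviationSketch {

    // In-vocabulary abbreviations taken from the examples above.
    static final Set<String> VOCABULARY = Set.of("Dr.", "Sun.", "Jan.", "a.m.", "etc.");

    /** Keeps known abbreviations whole; otherwise splits off one trailing period. */
    static List<String> tokenize(String word) {
        boolean endsWithPeriod = word.length() > 1 && word.endsWith(".");
        if (VOCABULARY.contains(word) || !endsWithPeriod) {
            return List.of(word);
        }
        return List.of(word.substring(0, word.length() - 1), ".");
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Dr."));
        System.out.println(tokenize("dept."));
    }
}
```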

Others

Watson NLP splits other out-of-vocabulary words at character category boundaries.

Out-of-vocabulary word in Watson NLP
+ / 852
US / $
100 / M
ABC / 123 / - / 456 / : / 789
b / / / c
wo / /

Watson NLP splits sequences of symbols into single characters.

Out-of-vocabulary word in Watson NLP
. / ?
. / /
! / .

Note:

  • English recognizes each of the following symbol sequences as a single token:

    • Double quotation (e.g., ’’, ‘’)

    • Horizontal rule (e.g., -----, =====, ******)

    • Comparator (e.g., <=, <<, ==, !=)

    • Arrow (e.g., <--, -->)

    • Ellipsis (e.g., ...)
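
Splitting by character category, as described in this section, can be sketched as follows: runs of letters and runs of digits each form one token, and every other character becomes a token of its own. `CategorySplitSketch` is a hypothetical class for illustration. It works on Java `char` values, so it assumes text in the Basic Multilingual Plane, and it does not implement the English exceptions for multi-character symbols listed in the note above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative splitter that breaks text at character category boundaries. */
class CategorySplitSketch {

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            char c = text.charAt(i++);
            if (Character.isWhitespace(c)) {
                continue;   // whitespace only separates tokens
            }
            if (Character.isLetter(c)) {
                while (i < text.length() && Character.isLetter(text.charAt(i))) {
                    i++;
                }
            } else if (Character.isDigit(c)) {
                while (i < text.length() && Character.isDigit(text.charAt(i))) {
                    i++;
                }
            }
            // Any other character (symbol or punctuation) stays a one-character token.
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("ABC123-456:789"));
        System.out.println(split("US$"));
        System.out.println(split(".?"));
    }
}
```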