Tokenization and Word Segmentation

As a general principle, Watson NLP follows the tokenization of UD (Universal Dependencies) corpora. However, tokenization and word segmentation are not trivial tasks in many languages: they depend on the language, and sometimes on the context, domain, and application. This section gives implementation notes on how Watson NLP tokenizes various cases.

Multiword Token and Compound Word Token

In UD, words are the basic units that hold syntactic features such as part of speech (UPOS) and dependency relations (DEPREL, HEAD). Words and tokens are equivalent in most languages, including English, but not in all. For example, the Spanish contraction del needs to be split into de (ADP) + el (DET) before PoS and dependency relations can be assigned correctly. Likewise, the Spanish verb-plus-clitic form inspirándose needs to be split into inspirándo (VERB) + se (PRON). The surface forms del and inspirándose may not have syntactic features, while their components de, el, inspirándo, and se always have them. UD refers to surface forms such as del and inspirándose as "multiword tokens" (MWT).

Similar to the "multiword token", Watson NLP has another concept, the "compound word token" (CWT). It is a Watson NLP-specific extension that represents a word formed of two or more components, where the components themselves need not carry syntactic features. For example, Apfelsaft (apple juice) in German is formed of Apfel (apple) and saft (juice). The compound word token Apfelsaft always has syntactic features; the components Apfel and saft may not.

The following table shows examples of "multiword token" and "compound word token" from various languages.

Language	Multiword token (e.g. contraction, clitic): components always have syntactic features	Compound word token: components may not have syntactic features
Afrikaans	N/A	`lidstate` (member states) -> `lid` + `state`
Arabic	N/A	`اللغة` -> `ال` + `لغة` (DET + NOUN)
Chinese	N/A	`闭幕辞` (closing speech) -> `闭幕` + `辞`
Danish	N/A	`matematiklæreren` (mathematics teacher) -> `matematik` + `læreren`; `sportshallen` (sports hall) -> `sports` + `hallen`
Dutch	N/A	`koffiekop` (coffee mug) -> `koffie` + `kop`; `werkplaats` (workshop) -> `werk` + `plaats`
English	N/A	`co-founder` -> `co` + `-` + `founder`; `mid-May` -> `mid` + `-` + `May`
Finnish	`ettei` -> `ett` (SCONJ) + `ei` (AUX); `miksei` -> `miks` (ADV) + `ei` (AUX)	`urheiluauto` (sports car) -> `urheilu` + `auto`
French	`du` -> `de` (ADP) + `le` (DET); `au` -> `à` (ADP) + `le` (DET)	N/A
German	`im` -> `in` (ADP) + `dem` (DET); `vom` -> `von` (ADP) + `dem` (DET)	`Apfelsaft` (apple juice) -> `Apfel` + `saft`; `Autobahnanschlussstelle` (motorway junction) -> `Autobahn` + `anschluss` + `stelle`
Greek	`στην` -> `σ` (ADP) + `την` (DET)	N/A
Italian	`negli` -> `in` (ADP) + `gli` (DET)	N/A
Norwegian (Bokmål)	N/A	`møterom` (meeting room) -> `møte` + `rom`; `prosjektleder` (project manager) -> `prosjekt` + `leder`
Portuguese	`ao` -> `a` (ADP) + `o` (DET); `deles` -> `de` (ADP) + `eles` (PRON)	N/A
Spanish	`del` -> `de` (ADP) + `el` (DET); `al` -> `a` (ADP) + `el` (DET); `inspirándose` -> `inspirándo` (VERB) + `se` (PRON)	N/A
Swedish	N/A	`telefonkort` (phone card) -> `telefon` + `kort`; `bordslampa` (table lamp) -> `bord` + `lampa`

Example of Outputs

Multiword Token in JSON

Watson NLP uses a token annotation for the multiword token (e.g. del) and component annotations for its components (e.g. de and el), as follows. The token annotation has three additional properties:

ud:mwt: flag to indicate this is a multiword token
contraction: flag to indicate this is a contracted form
components: reference to the components

The same information can be retrieved using a Java API. For example:

  {
    "id" : "e5a0dbf9-fa6a-355a-ab15-3fc6ae33c7f0",
    "type" : "token",               // `token` annotation
    "text" : "del",                 // input text is `del`
    "beginIndex" : 0,
    "endIndex" : 3,
    "properties" : {
      "components" : [ "6f9dad27-a769-31ac-a61d-7d8b5c204bd7", "397a7b5b-1927-3088-8e9d-cd5155a9681f" ], // this token has references to 2 components `de` and `el`
      "contraction" : true,         // this token is a contraction
      "locale" : "es",
      "ud:mwt" : true               // this token is a multiword token
    }
  }, {
    "id" : "6f9dad27-a769-31ac-a61d-7d8b5c204bd7",
    "type" : "component",           // `component` annotation for the first element
    "text" : "del",                 // this is a part of contraction, so the surface text is associated with the whole span `del` for simplicity
    "beginIndex" : 0,
    "endIndex" : 3,
    "properties" : {
      "locale" : "es",
      "ud:children" : [ "397a7b5b-1927-3088-8e9d-cd5155a9681f" ], // this component has dependency relation
      "ud:lemma" : "de",            // the lemma is `de`
      "ud:pos" : "ADP"              // the PoS is `ADP`
    }
  }, {
    "id" : "397a7b5b-1927-3088-8e9d-cd5155a9681f",
    "type" : "component",           // `component` annotation for the second element
    "text" : "del",                 // this is a part of contraction, so the surface text is associated with the whole span `del` for simplicity
    "beginIndex" : 0,
    "endIndex" : 3,
    "properties" : {
      "annotatedBy" : "com.ibm.nlp.izumo.es",
      "locale" : "es",
      "ud:lemma" : "el",            // the lemma is `el`
      "ud:parent" : "6f9dad27-a769-31ac-a61d-7d8b5c204bd7", // this component has dependency relation
      "ud:pos" : "DET",             // the PoS is `DET`
      "ud:relation" : "dep"
    }
  }

Multiword Token in CoNLLU

# text = del
1-2	del	_	_	_	_	_	_	_	_
1	de	de	ADP	_	_	0	root	_	_
2	el	el	DET	_	_	1	dep	_	_
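
The range ID `1-2` in the first column is what marks `del` as a multiword token covering words 1 and 2. As a minimal sketch of how a consumer might read that back, the following pairs each range line with the word lines it covers. The class and method names are hypothetical and are not part of Watson NLP; only the tab-separated column layout shown above is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: pair each multiword token line (ID "n-m") with the words it covers. */
class MwtReader {

    /** Returns strings such as "del = de + el" for every range line in the input. */
    static List<String> multiwordTokens(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String[] cols = lines.get(i).split("\t");
            // Only lines whose ID looks like "1-2" are multiword tokens.
            if (cols.length < 2 || !cols[0].matches("\\d+-\\d+")) {
                continue;
            }
            int dash = cols[0].indexOf('-');
            int first = Integer.parseInt(cols[0].substring(0, dash));
            int last = Integer.parseInt(cols[0].substring(dash + 1));
            List<String> parts = new ArrayList<>();
            // The covered words follow the range line within the same sentence.
            for (int j = i + 1; j < lines.size(); j++) {
                if (lines.get(j).isEmpty()) {
                    break; // a blank line ends the sentence
                }
                String[] word = lines.get(j).split("\t");
                if (word.length < 2 || !word[0].matches("\\d+")) {
                    continue; // not a plain word line
                }
                int wordId = Integer.parseInt(word[0]);
                if (wordId > last) {
                    break;
                }
                if (wordId >= first) {
                    parts.add(word[1]);
                }
            }
            result.add(cols[1] + " = " + String.join(" + ", parts));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# text = del",
            "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_",
            "1\tde\tde\tADP\t_\t_\t0\troot\t_\t_",
            "2\tel\tel\tDET\t_\t_\t1\tdep\t_\t_");
        System.out.println(multiwordTokens(lines)); // prints [del = de + el]
    }
}
```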

Compound Word Token in JSON

Watson NLP uses a token annotation for the compound word token (e.g. Apfelsaft) and component annotations for its components (e.g. Apfel and saft), as follows.

The same information can be retrieved using a Java API. For example:

  {
    "id" : "6f71d36c-b995-3182-8cc0-76de762bf31a",
    "type" : "token",                // `token` annotation
    "text" : "Apfelsaft",            // input text is `Apfelsaft`
    "beginIndex" : 0,
    "endIndex" : 9,
    "properties" : {
      "components" : [ "3fd0b193-cbf0-3f9c-9d5b-040d94ebfd2b", "a1f3620c-8448-3dc8-b02a-1969de18fd27" ], // the token has references to 2 components `Apfel` and `saft`
      "locale" : "de",
      "ud:relation" : "root",        // this token has dependency relation
      "unknown" : true
    }
  }, {
    "id" : "3fd0b193-cbf0-3f9c-9d5b-040d94ebfd2b",
    "type" : "component",            // `component` annotation for the first element
    "text" : "Apfel",
    "beginIndex" : 0,
    "endIndex" : 5,
    "properties" : {
      "locale" : "de",
      "ud:lemma" : "Apfel",          // the lemma is `Apfel` (optional)
      "ud:pos" : "NOUN"              // the PoS is `NOUN` (optional)
    }
  }, {
    "id" : "a1f3620c-8448-3dc8-b02a-1969de18fd27",
    "type" : "component",            // `component` annotation for the second element
    "text" : "saft",
    "beginIndex" : 5,
    "endIndex" : 9,
    "properties" : {
      "locale" : "de",
      "ud:lemma" : "Saft",           // the lemma is `Saft` (optional)
      "ud:pos" : "NOUN"              // the PoS is `NOUN` (optional)
    }
  }

Compound Word Token in CoNLLU

The CoNLLU format does not define a representation for compound word tokens, so emitting their components is a deviation from the format and is disabled by default. To enable it, call CoNLLU.withComponents(true) explicitly.

String conllu = new CoNLLU()
  .withComments(Arrays.asList(COMMENT.TEXT))
  .withComponents(true)
  .toString(result);

The output then contains the components of compound word tokens. Note that their ID (first column) uses the n:m format, as follows.

# text = Apfelsaft
1	Apfelsaft	Apfelsaft	PROPN	PROPN	_	0	root	_	_
1:1	Apfel	Apfel	NOUN	_	_	_	_	_	_
1:2	saft	Saft	NOUN	_	_	_	_	_	_
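
Because the `n:m` lines are an extension, a downstream tool that expects standard CoNLLU may need them removed again. A minimal sketch of such a filter follows. The class and method names are hypothetical, and the only assumption is the `n:m` shape of the ID column shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: drop compound word token component lines (ID "n:m") from the output. */
class ComponentFilter {

    /** True when the ID column has the n:m shape used for components. */
    static boolean isComponentLine(String line) {
        int tab = line.indexOf('\t');
        String id = tab < 0 ? line : line.substring(0, tab);
        return id.matches("\\d+:\\d+");
    }

    /** Returns the lines that are not component lines, in their original order. */
    static List<String> withoutComponents(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isComponentLine(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# text = Apfelsaft",
            "1\tApfelsaft\tApfelsaft\tPROPN\tPROPN\t_\t0\troot\t_\t_",
            "1:1\tApfel\tApfel\tNOUN\t_\t_\t_\t_\t_\t_",
            "1:2\tsaft\tSaft\tNOUN\t_\t_\t_\t_\t_\t_");
        // Keeps the comment line and the token line only.
        for (String line : withoutComponents(lines)) {
            System.out.println(line);
        }
    }
}
```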

Locale Specific Expressions

Dates, Times, and Numbers

Watson NLP recognizes dates, times, and numbers as a single token when they are written in digits and symbols. Many languages use their own formats; the following table shows examples.

Language	Date	Time	Number
Arabic	`31/01/2020`, `31/1/2020`	`12:30`, `12:30:00`	`1,234,567.00`
Danish	`31.01.2020`	`12.30`, `12.30.00`, `12:30`, `12:30:00`	`1.234.567,00`
Dutch	`31-01-2020`	`12:30`, `12:30:00`	`1.234.567,00`
English	`1/31/2020`, `1/31/20`	`12:30`, `12:30:00`	`1,234,567.00`
Finnish	`31.1.2020`, `31.1.-20`	`12.30`, `12.30.00`, `12:30`, `12:30:00`	`1 234 567,00`
French	`31/01/2020`	`12:30`, `12:30:00`	`1 234 567,00`
German	`31.01.2020`, `31.01.20`	`12:30`, `12:30:00`	`1.234.567,00`
Hindi	`31/1/2020`	`12:30`, `12:30:00`	`12,34,567.00`
Japanese	`2020/01/31`	`12:30`, `12:30:00`	`1,234,567.00`
Swedish	`2020-01-31`	`12.30`, `12.30.00`, `12:30`, `12:30:00`	`1 234 567,00`
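
To illustrate why these strings have to be interpreted per locale, the following sketch parses two rows of the table with the JDK's own `java.text.NumberFormat` and `java.time` classes. This is not how Watson NLP recognizes these tokens. The class name is hypothetical, and the date patterns `M/d/yyyy` and `dd.MM.yyyy` are assumptions read off the English and German rows.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Sketch: the same value is written differently depending on the locale. */
class LocaleFormats {

    /** Parses a number written with the grouping and decimal symbols of the locale. */
    static double parseNumber(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number in " + locale + ": " + text, e);
        }
    }

    /** Parses a date with an explicit pattern (the pattern is the caller's assumption). */
    static LocalDate parseDate(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        // Both calls yield the same numeric value.
        System.out.println(parseNumber("1,234,567.00", Locale.US));
        System.out.println(parseNumber("1.234.567,00", Locale.GERMANY));
        // Both calls yield the same calendar date.
        System.out.println(parseDate("1/31/2020", "M/d/yyyy"));
        System.out.println(parseDate("31.01.2020", "dd.MM.yyyy"));
    }
}
```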

Example of Outputs

In the JSON outputs, Watson NLP additionally adds a term annotation for those dates, times, and numbers.

The same information can be retrieved using a Java API. For example:

  {
    "id" : "42c43cfc-96cd-32b6-abf8-5fdf98aa5d9f",
    "type" : "term",                 // `term` annotation
    "text" : "1/31/2020",            // input text is `1/31/2020`
    "beginIndex" : 0,
    "endIndex" : 9,
    "properties" : {
      "g:name" : "org.unicode.cldr.time.Date", // this `term` is `o.u.c.time.Date`
      "g:properties" : {
        "locale" : "en"              // locale is English
      }
    }
  }

  {
    "id" : "7c47f0c9-04e2-332e-958b-f16e09e4bf31",
    "type" : "term",                 // `term` annotation
    "text" : "12:30",                // input text is `12:30`
    "beginIndex" : 10,
    "endIndex" : 15,
    "properties" : {
      "g:name" : "org.unicode.cldr.time.Time", // this `term` is `o.u.c.time.Time`
      "g:properties" : {
        "locale" : "en"              // locale is English
      }
    }
  }

  {
    "id" : "9b7ef354-4483-3dc6-9b27-d24823a701dc",
    "type" : "term",                 // `term` annotation
    "text" : "1,234,567.00",         // input text is `1,234,567.00`
    "beginIndex" : 16,
    "endIndex" : 28,
    "properties" : {
      "g:name" : "org.unicode.cldr.number.Decimal", // this `term` is `o.u.c.number.Decimal`
      "g:properties" : {
        "locale" : "en"              // locale is English
      }
    }
  }

Ordinal, Cardinal Numbers

Danish and Norwegian use a trailing period (.) to represent ordinal numbers.
Finnish uses a colon followed by a suffix for the declension of cardinal (e.g. 3:n) and ordinal (e.g. 3:nnen) numbers.

Language	Ordinal, cardinal number
Danish, Norwegian	`1.`, `2.`, `3.`, ...
Finnish	`1:n`, `1 234:n`, `1 234:n`, `1 234:nnen`

Letters, Numbers, and Symbols

URI, E-mail Addresses, Host Names, File Names, Hashtags, and Mentions

Watson NLP recognizes each of these as a single token.

	Example
URI	`http://www.ibm.com`
URI	`http://www.ibm.com/index.htm`
e-mail address	`john@ibm.com`, `ジョン@ibm.com`
Host name	`ibm.com`
File name (without space and path)	`doc.pdf`
Hashtag	`#ibm`
Mention	`@John`
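
As a rough illustration of how such tokens can be told apart by shape, the sketch below classifies a token with four regular expressions. The patterns and the class name are simplified assumptions for illustration only, and they are not the rules Watson NLP uses. Host names and file names fall through to "other" here.

```java
import java.util.regex.Pattern;

/** Sketch: classify a token by shape using simplified, hypothetical patterns. */
class NetTokenKind {

    private static final Pattern URI = Pattern.compile("https?://\\S+");
    private static final Pattern EMAIL = Pattern.compile("[^\\s@]+@[^\\s@]+\\.[^\\s@]+");
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private static final Pattern MENTION = Pattern.compile("@\\w+");

    /** Returns "URI", "e-mail address", "hashtag", "mention", or "other". */
    static String classify(String token) {
        if (URI.matcher(token).matches()) {
            return "URI";
        }
        if (EMAIL.matcher(token).matches()) {
            return "e-mail address";
        }
        if (HASHTAG.matcher(token).matches()) {
            return "hashtag";
        }
        if (MENTION.matcher(token).matches()) {
            return "mention";
        }
        return "other";
    }

    public static void main(String[] args) {
        String[] samples = {"http://www.ibm.com/index.htm", "john@ibm.com", "#ibm", "@John", "doc.pdf"};
        for (String sample : samples) {
            System.out.println(sample + " : " + classify(sample));
        }
    }
}
```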

Example of Output

In the JSON outputs, Watson NLP additionally adds a term annotation for URIs, e-mail addresses, hashtags, and mentions. It does not add one for host names and file names because they are sometimes ambiguous.

The same information can be retrieved using a Java API. For example:

  {
    "id" : "42c43cfc-96cd-32b6-abf8-5fdf98aa5d9f",
    "type" : "term",                 // `term` annotation
    "text" : "http://www.ibm.com/index.htm",    // input text is `http://www.ibm.com/index.htm`
    "beginIndex" : 0,
    "endIndex" : 28,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.URI" // this is `c.i.nlp.c.net.URI`
    }
  }

  {
    "id" : "7c47f0c9-04e2-332e-958b-f16e09e4bf31",
    "type" : "term",                 // `term` annotation
    "text" : "john@ibm.com",         // input text is `john@ibm.com`
    "beginIndex" : 29,
    "endIndex" : 41,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.EmailAddress" // this is `c.i.nlp.c.net.EmailAddress`
    }
  }

  {
    "id" : "ce5613eb-1e46-32ff-92c4-433b97ab02f6",
    "type" : "term",                 // `term` annotation
    "text" : "#ibm",                 // input text is `#ibm`
    "beginIndex" : 62,
    "endIndex" : 66,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.HashTag" // this is `c.i.nlp.c.net.HashTag`
    }
  }

  {
    "id" : "92686813-6e3f-30c2-8950-a789a35bc787",
    "type" : "term",                 // `term` annotation
    "text" : "@John",                // input text is `@John`
    "beginIndex" : 67,
    "endIndex" : 72,
    "properties" : {
      "g:name" : "com.ibm.nlp.commons.net.Mention" // this is `c.i.nlp.c.net.Mention`
    }
  }

Roman Numerals

Watson NLP recognizes Roman numerals from 1 to 21 in upper case. This is mainly for proper nouns using them (e.g., Queen Elizabeth II, Dragon Quest XII, Century XXI).

	Example
1	`I`
2	`II`
3	`III`
...
20	`XX`
21	`XXI`

Limitations:

It supports only small numbers, because bigger numbers are infrequent and often ambiguous (e.g., XXX (30) = hidden word, XL (40) = extra large, L (50) = liter, LV (55) = Louis Vuitton, CC (200) = carbon copy).
It may incorrectly recognize the following nouns as Roman numerals when written in all upper case:
- ii (2) in Romanian (traditional Romanian embroidered blouse)
- vi (6) in Catalan (wine)
- vii (7) in Romanian (1. vineyard 2. vine)
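
The restriction to upper-case numerals from 1 to 21 can be sketched as a small lookup table. This is an illustration under that stated range only. The class and method names are hypothetical and do not reflect how `RomanNumeral` is implemented in Watson NLP.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: recognize upper-case Roman numerals from 1 to 21 via a lookup table. */
class SmallRomanNumerals {

    private static final String[] ONES =
        {"", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"};
    private static final Map<String, Integer> VALUES = new HashMap<>();

    static {
        for (int n = 1; n <= 21; n++) {
            VALUES.put(toRoman(n), n);
        }
    }

    /** Converts a small number to a Roman numeral; valid for 1 to 29 only. */
    static String toRoman(int n) {
        StringBuilder numeral = new StringBuilder();
        for (int i = 0; i < n / 10; i++) {
            numeral.append('X');
        }
        return numeral.append(ONES[n % 10]).toString();
    }

    /** Returns the numeric value, or -1 when the token is not in the supported range. */
    static int valueOf(String token) {
        Integer value = VALUES.get(token);
        return value == null ? -1 : value;
    }

    public static void main(String[] args) {
        System.out.println(valueOf("II"));  // prints 2
        System.out.println(valueOf("XXI")); // prints 21
        System.out.println(valueOf("XL"));  // prints -1 (out of the supported range)
        System.out.println(valueOf("ii"));  // prints -1 (lower case is not recognized)
    }
}
```

A table lookup keeps the ambiguous larger forms such as XL or CC out by construction, at the cost of also matching all-upper-case words like VI.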

Example of Output

In the JSON outputs, Watson NLP additionally adds a term annotation for Roman numerals.

The same information can be retrieved using a Java API. For example:

{
  "id" : "1c89af51-b855-327a-b202-187b96084768",
  "type" : "term",                    // `term` annotation
  "text" : "II",                      // input text is `II`
  "beginIndex" : 10,
  "endIndex" : 12,
  "properties" : {
    "g:name" : "com.ibm.nlp.izumo.resources.RomanNumeral", // this is `c.i.nlp.izumo.resources.RomanNumeral`
    "g:properties" : {
      "g:normalForm" : "II",
      "value" : 2                     // numeral value is `2`
    },
    "locale" : "en"
  }
}

Hyphenated Words

Watson NLP splits out-of-vocabulary words at hyphens.

	Out-of-vocabulary word in Watson NLP
`e-mail`
`non-fiction`
`twelve` / `-` / `year` / `-` / `old`	✓
`three` / `-` / `quarters`	✓

Note:

English does not split words with the following prefix modifiers: anti-, co-, counter-, cross-, e-, ex-, mid-, multi-, non-, over-, post-, pre-, pro-, re-, semi-, sub-, vice-.
Finnish never splits words by hyphen.
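
The hyphen rule can be sketched as a vocabulary check followed by a split that keeps each hyphen as its own piece. The vocabulary set below is a stand-in for illustration, the class name is hypothetical, and the language-specific exceptions in the note above are not modeled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: keep in-vocabulary words whole, split other words at hyphens. */
class HyphenSplitter {

    /** Splits before and after every hyphen, so each hyphen becomes its own piece. */
    static List<String> tokenize(String word, Set<String> vocabulary) {
        if (vocabulary.contains(word)) {
            return Collections.singletonList(word);
        }
        return Arrays.asList(word.split("(?<=-)|(?=-)"));
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary, taken from the unsplit rows of the table above.
        Set<String> vocabulary = new HashSet<>(Arrays.asList("e-mail", "non-fiction"));
        System.out.println(tokenize("e-mail", vocabulary));          // prints [e-mail]
        System.out.println(tokenize("twelve-year-old", vocabulary)); // prints [twelve, -, year, -, old]
    }
}
```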

Abbreviations

Watson NLP splits off the trailing period from out-of-vocabulary words.

	Out-of-vocabulary word in Watson NLP
`Dr.`
`Sun.`
`Jan.`
`a.m.`
`etc.`
`arrv` / `.`	✓
`dept` / `.`	✓

Others

Watson NLP splits other out-of-vocabulary words by character categories.

	Out-of-vocabulary word in Watson NLP
`+` / `852`	✓
`US` / `$`	✓
`100` / `M`	✓
`ABC` / `123` / `-` / `456` / `:` / `789`	✓
`b` / `/` / `c`	✓
`wo` / `/`	✓

Watson NLP splits sequences of symbols into single characters.

	Out-of-vocabulary word in Watson NLP
`.` / `?`	✓
`.` / `/`	✓
`!` / `.`	✓

Note:

English recognizes each of the following symbol sequences as a single token:
- Double quotation (e.g., ’’, ‘’)
- Horizontal rule (e.g., -----, =====, ******)
- Comparator (e.g., <=, <<, ==, !=)
- Arrow (e.g., <--, -->)
- Ellipsis (e.g., ...)
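
The character-category rule described in this section can be sketched as follows: runs of letters and runs of digits stay together, and every other character stands alone. This is a minimal illustration that ignores the vocabulary lookup and the English exceptions listed above. The class name and the three-way letter/digit/symbol classification are assumptions, not the categories Watson NLP actually uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split a word wherever the character category changes. */
class CategorySplitter {

    private static final int LETTER = 0;
    private static final int DIGIT = 1;
    private static final int SYMBOL = 2;

    private static int category(char c) {
        if (Character.isLetter(c)) {
            return LETTER;
        }
        if (Character.isDigit(c)) {
            return DIGIT;
        }
        return SYMBOL;
    }

    /** Groups consecutive letters and consecutive digits; each symbol is its own piece. */
    static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= word.length(); i++) {
            boolean boundary = i == word.length()
                || category(word.charAt(i)) != category(word.charAt(i - 1))
                || category(word.charAt(i)) == SYMBOL;
            if (boundary) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("ABC123-456:789")); // prints [ABC, 123, -, 456, :, 789]
        System.out.println(split("US$"));            // prints [US, $]
        System.out.println(split(".?"));             // prints [., ?]
    }
}
```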