Tokenization and Word Segmentation
As a general principle, Watson NLP follows the tokenization of UD corpora. However, tokenization and word segmentation are not trivial tasks in many languages. They depend on the language, and sometimes even on the content, domain, and application. This section gives implementation notes on how Watson NLP tokenization handles various cases.
Multiword Token and Compound Word Token
In UD, words are the basic units that hold syntactic features such as part of speech (`UPOS`) and dependency relations (`DEPREL`, `HEAD`). Words and tokens are equivalent in most languages, including English, but not in all of them. For example, the Spanish contraction `del` needs to be split into `de` (ADP) + `el` (DET) before PoS tags and dependency relations can be assigned correctly. Likewise, the Spanish clitic form `inspirándose` needs to be split into `inspirándo` (VERB) + `se` (PRON). The surface forms `del` and `inspirándose` may not have syntactic features, whereas the components `de`, `el`, `inspirándo`, and `se` always have them. Surface forms such as `del` and `inspirándose` are referred to as "multiword tokens" (MWT) in UD.
Watson NLP has a second, similar concept: the "compound word token" (CWT). It is a Watson NLP specific extension that represents words formed of two or more components, where the components need not carry syntactic features. For example, `Apfelsaft` (apple juice) in German is a word formed of `apfel` (apple) and `saft` (juice). The compound word token `Apfelsaft` always has syntactic features, whereas the components `apfel` and `saft` may not have them.
The following table shows examples of "multiword token" and "compound word token" from various languages.
Language | Multiword token (e.g. contraction, clitic): components always have syntactic features | Compound word token: components may not have syntactic features |
---|---|---|
Afrikaans | N/A | lidstate (member states) -> lid + state |
Arabic | N/A | اللغة -> ال + لغة (DET + NOUN) |
Chinese | N/A | 闭幕辞 (closing speech) -> 闭幕 + 辞 |
Danish | N/A | matematiklæreren (mathematics teacher) -> matematik + læreren; sportshallen (sports hall) -> sports + hallen |
Dutch | N/A | koffiekop (coffee mug) -> koffie + kop; werkplaats (workshop) -> werk + plaats |
English | N/A | co-founder -> co + - + founder; mid-May -> mid + - + May |
Finnish | ettei -> ett (SCONJ) + ei (AUX); miksei -> miks (ADV) + ei (AUX) | urheiluauto (sports car) -> urheilu + auto |
French | du -> de (ADP) + le (DET); au -> à (ADP) + le (DET) | N/A |
German | im -> in (ADP) + dem (DET); vom -> von (ADP) + dem (DET) | Apfelsaft (apple juice) -> Apfel + saft; Autobahnanschlussstelle (motorway junction) -> Autobahn + anschluss + stelle |
Greek | στην -> σ (ADP) + την (DET) | N/A |
Italian | negli -> in (ADP) + gli (DET) | N/A |
Norwegian (Bokmål) | N/A | møterom (meeting room) -> møte + rom; prosjektleder (project manager) -> prosjekt + leder |
Portuguese | à -> a (ADP) + o (DET); deles -> de (ADP) + eles (PRON) | N/A |
Spanish | del -> de (ADP) + el (DET); al -> a (ADP) + el (DET); inspirándose -> inspirándo (VERB) + se (PRON) | N/A |
Swedish | N/A | telefonkort (phone card) -> telefon + kort; bordslampa (table lamp) -> bord + lampa |
Example of Outputs
Multiword Token in JSON
Watson NLP uses a `token` annotation for the multiword token (e.g. `del`) and `component` annotations for its components (e.g. `de` and `el`), as follows. The `token` annotation has 3 additional properties:
- `ud:mwt`: flag to indicate this is a multiword token
- `contraction`: flag to indicate this is a contracted form
- `components`: reference to the components
The same information can also be retrieved using a Java API. The JSON output is:
{
  "id" : "e5a0dbf9-fa6a-355a-ab15-3fc6ae33c7f0",
  "type" : "token", // `token` annotation
  "text" : "del", // input text is `del`
  "beginIndex" : 0,
  "endIndex" : 3,
  "properties" : {
    "components" : [ "6f9dad27-a769-31ac-a61d-7d8b5c204bd7", "397a7b5b-1927-3088-8e9d-cd5155a9681f" ], // this token has references to 2 components `de` and `el`
    "contraction" : true, // this token is a contraction
    "locale" : "es",
    "ud:mwt" : true // this token is a multiword token
  }
}, {
  "id" : "6f9dad27-a769-31ac-a61d-7d8b5c204bd7",
  "type" : "component", // `component` annotation for the first element
  "text" : "del", // this is a part of contraction, so the surface text is associated with the whole span `del` for simplicity
  "beginIndex" : 0,
  "endIndex" : 3,
  "properties" : {
    "locale" : "es",
    "ud:children" : [ "397a7b5b-1927-3088-8e9d-cd5155a9681f" ], // this component has dependency relation
    "ud:lemma" : "de", // the lemma is `de`
    "ud:pos" : "ADP" // the PoS is `ADP`
  }
}, {
  "id" : "397a7b5b-1927-3088-8e9d-cd5155a9681f",
  "type" : "component", // `component` annotation for the second element
  "text" : "del", // this is a part of contraction, so the surface text is associated with the whole span `del` for simplicity
  "beginIndex" : 0,
  "endIndex" : 3,
  "properties" : {
    "annotatedBy" : "com.ibm.nlp.izumo.es",
    "locale" : "es",
    "ud:lemma" : "el", // the lemma is `el`
    "ud:parent" : "6f9dad27-a769-31ac-a61d-7d8b5c204bd7", // this component has dependency relation
    "ud:pos" : "DET", // the PoS is `DET`
    "ud:relation" : "dep"
  }
}
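To illustrate how the `components` references tie a `token` to its `component` annotations, here is a minimal Java sketch. It does not use the Watson NLP Java API: the `Annotation` class and the `componentLemmas` method are hypothetical names introduced only for this example, and the annotations are assumed to be already loaded into memory and indexed by `id`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch only: a hypothetical in-memory model, not the Watson NLP Java API. */
final class ComponentLookupSketch {

    /** Hypothetical view of one annotation from the JSON output. */
    static final class Annotation {
        final String id;
        final String type;             // "token" or "component"
        final String lemma;            // value of "ud:lemma", or null if absent
        final List<String> components; // ids listed under "components", may be empty

        Annotation(String id, String type, String lemma, List<String> components) {
            this.id = id;
            this.type = type;
            this.lemma = lemma;
            this.components = components;
        }
    }

    /** Returns the lemmas of the components a token refers to, in reference order. */
    static List<String> componentLemmas(Annotation token, Map<String, Annotation> byId) {
        List<String> lemmas = new ArrayList<>();
        for (String componentId : token.components) {
            Annotation component = byId.get(componentId);
            // Ignore dangling references and anything that is not a component.
            if (component != null && "component".equals(component.type)) {
                lemmas.add(component.lemma);
            }
        }
        return lemmas;
    }
}
```

For the `del` example above, looking up the two referenced ids would yield the lemmas `de` and `el`, in that order.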
Multiword Token in CoNLL-U
# text = del
1-2 del _ _ _ _ _ _ _ _
1 de de ADP _ _ 0 root _ _
2 el el DET _ _ 1 dep _ _
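In the output above, the multiword token occupies a range line (`1-2`) and its words follow with plain integer IDs. The following Java sketch shows one way a consumer might tell the two apart. The class and method names are hypothetical, and splitting on whitespace is an assumption that matches the sample as displayed here; real CoNLL-U files are tab-separated and forms may contain spaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: separates multiword-token range lines from word lines. */
final class ConlluMwtSketch {

    /** Collects the FORM column of every line whose ID column matches idRegex. */
    private static List<String> formsWithId(String conllu, String idRegex) {
        List<String> forms = new ArrayList<>();
        for (String line : conllu.split("\n")) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments such as "# text = del"
            }
            String[] columns = line.split("\\s+");
            if (columns.length >= 2 && columns[0].matches(idRegex)) {
                forms.add(columns[1]);
            }
        }
        return forms;
    }

    /** Surface forms of multiword tokens (ID of the form "n-m"). */
    static List<String> multiwordForms(String conllu) {
        return formsWithId(conllu, "\\d+-\\d+");
    }

    /** Forms of the syntactic words (plain integer ID). */
    static List<String> wordForms(String conllu) {
        return formsWithId(conllu, "\\d+");
    }
}
```

Applied to the sample, `multiwordForms` picks up `del` and `wordForms` picks up `de` and `el`.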
Compound Word Token in JSON
Watson NLP uses a `token` annotation for the compound word token (e.g. `Apfelsaft`) and `component` annotations for its components (e.g. `Apfel` and `saft`), as follows.
The same information can also be retrieved using a Java API. The JSON output is:
{
  "id" : "6f71d36c-b995-3182-8cc0-76de762bf31a",
  "type" : "token", // `token` annotation
  "text" : "Apfelsaft", // input text is `Apfelsaft`
  "beginIndex" : 0,
  "endIndex" : 9,
  "properties" : {
    "components" : [ "3fd0b193-cbf0-3f9c-9d5b-040d94ebfd2b", "a1f3620c-8448-3dc8-b02a-1969de18fd27" ], // the token has references to 2 components `Apfel` and `saft`
    "locale" : "de",
    "ud:relation" : "root", // this token has dependency relation
    "unknown" : true
  }
}, {
  "id" : "3fd0b193-cbf0-3f9c-9d5b-040d94ebfd2b",
  "type" : "component", // `component` annotation for the first element
  "text" : "Apfel",
  "beginIndex" : 0,
  "endIndex" : 5,
  "properties" : {
    "locale" : "de",
    "ud:lemma" : "Apfel", // the lemma is `Apfel` (optional)
    "ud:pos" : "NOUN" // the PoS is `NOUN` (optional)
  }
}, {
  "id" : "a1f3620c-8448-3dc8-b02a-1969de18fd27",
  "type" : "component", // `component` annotation for the second element
  "text" : "saft",
  "beginIndex" : 5,
  "endIndex" : 9,
  "properties" : {
    "locale" : "de",
    "ud:lemma" : "Saft", // the lemma is `Saft` (optional)
    "ud:pos" : "NOUN" // the PoS is `NOUN` (optional)
  }
}
Compound Word Token in CoNLL-U
This output deviates from the CoNLL-U format, which does not define anything for compound word tokens, so the feature is disabled by default. To enable it, set `CoNLLU.withComponents(true)` explicitly.
String conllu = new CoNLLU()
    .withComments(Arrays.asList(COMMENT.TEXT))
    .withComponents(true)
    .toString(result);
The output then contains the components of compound word tokens. Note that the first column, `ID`, uses the `n:m` format for them, as follows.
# text = Apfelsaft
1 Apfelsaft Apfelsaft PROPN PROPN _ 0 root _ _
1:1 Apfel Apfel NOUN _ _ _ _ _ _
1:2 saft Saft NOUN _ _ _ _ _ _
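Because the `n:m` IDs are not part of standard CoNLL-U, a downstream tool that expects the standard format may need to drop the component lines first. The following Java sketch shows one way to do that. The class and method names are hypothetical, and the ID is assumed to be the first whitespace-delimited field of each line.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: removes component lines ("n:m" IDs) from CoNLL-U text. */
final class ComponentLineFilter {

    /** True for IDs such as "1:1" or "12:3"; false for "1", "1-2", or "#". */
    static boolean isComponentId(String id) {
        return id.matches("\\d+:\\d+");
    }

    /** Returns the input with every component line removed. */
    static String stripComponents(String conllu) {
        List<String> kept = new ArrayList<>();
        for (String line : conllu.split("\n")) {
            String id = line.split("\\s+", 2)[0]; // first column only
            if (!isComponentId(id)) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }
}
```

For the `Apfelsaft` sample, this keeps the comment line and the token line and drops the `1:1` and `1:2` lines.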
Locale Specific Expressions
Dates, Times, and Numbers
Watson NLP recognizes dates, times, and numbers as single tokens when they are written in digits and symbols. Many languages use their own formats. The following table shows examples.
Language | Date | Time | Number |
---|---|---|---|
Arabic | `31/01/2020`, `31/1/2020` | `12:30`, `12:30:00` | `1,234,567.00` |
Danish | `31.01.2020` | `12.30`, `12.30.00`, `12:30`, `12:30:00` | `1.234.567,00` |
Dutch | `31-01-2020` | `12:30`, `12:30:00` | `1.234.567,00` |
English | `1/31/2020`, `1/31/20` | `12:30`, `12:30:00` | `1,234,567.00` |
Finnish | `31.1.2020`, `31.1.-20` | `12.30`, `12.30.00`, `12:30`, `12:30:00` | `1 234 567,00` |
French | `31/01/2020` | `12:30`, `12:30:00` | `1 234 567,00` |
German | `31.01.2020`, `31.01.20` | `12:30`, `12:30:00` | `1.234.567,00` |
Hindi | `31/1/2020` | `12:30`, `12:30:00` | `12,34,567.00` |
Japanese | `2020/01/31` | `12:30`, `12:30:00` | `1,234,567.00` |
Swedish | `2020-01-31` | `12.30`, `12.30.00`, `12:30`, `12:30:00` | `1 234 567,00` |
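The number column above falls into four grouping styles. The following Java sketch encodes those four shapes as regular expressions, purely to make the differences concrete. It is not how Watson NLP implements recognition: the class name, method name, and patterns are assumptions derived only from the table, they cover only grouped numbers with a decimal part, and they treat the group separator in the Finnish, French, and Swedish examples as a plain space.

```java
import java.util.regex.Pattern;

/** Sketch only: the four grouped-number shapes shown in the table. */
final class NumberFormatSketch {

    // 1,234,567.00 (Arabic, English, Japanese rows)
    private static final Pattern COMMA_GROUPS = Pattern.compile("\\d{1,3}(,\\d{3})*\\.\\d+");
    // 1.234.567,00 (Danish, Dutch, German rows)
    private static final Pattern DOT_GROUPS = Pattern.compile("\\d{1,3}(\\.\\d{3})*,\\d+");
    // 1 234 567,00 (Finnish, French, Swedish rows)
    private static final Pattern SPACE_GROUPS = Pattern.compile("\\d{1,3}( \\d{3})*,\\d+");
    // 12,34,567.00 (Hindi row): groups of two, then a final group of three
    private static final Pattern INDIAN_GROUPS = Pattern.compile("\\d{1,2}(,\\d{2})*,\\d{3}\\.\\d+");

    /** True if text has the number shape listed for the given language code. */
    static boolean isNumber(String language, String text) {
        Pattern pattern;
        switch (language) {
            case "ar": case "en": case "ja": pattern = COMMA_GROUPS; break;
            case "da": case "nl": case "de": pattern = DOT_GROUPS; break;
            case "fi": case "fr": case "sv": pattern = SPACE_GROUPS; break;
            case "hi": pattern = INDIAN_GROUPS; break;
            default: return false; // language not covered by this sketch
        }
        return pattern.matcher(text).matches();
    }
}
```

The same string can be a number in one locale and not in another: `1.234.567,00` matches the German shape but not the English one, which is why the locale has to be known before tokenizing.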
Example of Outputs
In the JSON outputs, Watson NLP additionally adds a `term` annotation for those dates, times, and numbers.
The same information can be retrieved using a Java API. For example:
{
  "id" : "42c43cfc-96cd-32b6-abf8-5fdf98aa5d9f",
  "type" : "term", // `term` annotation
  "text" : "1/31/2020", // input text is `1/31/2020`
  "beginIndex" : 0,
  "endIndex" : 9,
  "properties" : {
    "g:name" : "org.unicode.cldr.time.Date", // this `term` is `o.u.c.time.Date`
    "g:properties" : {
      "locale" : "en" // locale is English
    }
  }
}, {
  "id" : "7c47f0c9-04e2-332e-958b-f16e09e4bf31",
  "type" : "term", // `term` annotation
  "text" : "12:30", // input text is `12:30`
  "beginIndex" : 10,
  "endIndex" : 15,
  "properties" : {
    "g:name" : "org.unicode.cldr.time.Time", // this `term` is `o.u.c.time.Time`
    "g:properties" : {
      "locale" : "en" // locale is English
    }
  }
}, {
  "id" : "9b7ef354-4483-3dc6-9b27-d24823a701dc",
  "type" : "term", // `term` annotation
  "text" : "1,234,567.00", // input text is `1,234,567.00`
  "beginIndex" : 16,
  "endIndex" : 28,
  "properties" : {
    "g:name" : "org.unicode.cldr.number.Decimal", // this `term` is `o.u.c.number.Decimal`
    "g:properties" : {
      "locale" : "en" // locale is English
    }
  }
}
Ordinal, Cardinal Numbers
- Danish and Norwegian use `.` to represent ordinal numbers.
- Finnish uses "colons+suffixes" for the declension of cardinal (e.g. `3:n`) and ordinal (e.g. `3:nnen`) numbers.
Language | Ordinal, cardinal number |
---|---|
Danish, Norwegian | `1.`, `2.`, `3.`, ... |
Finnish | `1:n`, `1 234:n`, `1 234:nnen` |
Letters, Numbers, and Symbols
Example of Output
In the JSON outputs, Watson NLP additionally adds a `term` annotation for URIs, e-mail addresses, hashtags, and mentions. It does not add one for host names and file names because they are sometimes ambiguous.
The same information can be retrieved using a Java API. For example:
{
  "id" : "42c43cfc-96cd-32b6-abf8-5fdf98aa5d9f",
  "type" : "term", // `term` annotation
  "text" : "http://www.ibm.com/index.htm", // input text is `http://www.ibm.com/index.htm`
  "beginIndex" : 0,
  "endIndex" : 28,
  "properties" : {
    "g:name" : "com.ibm.nlp.commons.net.URI" // this is `c.i.nlp.c.net.URI`
  }
}, {
  "id" : "7c47f0c9-04e2-332e-958b-f16e09e4bf31",
  "type" : "term", // `term` annotation
  "text" : "john@ibm.com", // input text is `john@ibm.com`
  "beginIndex" : 29,
  "endIndex" : 41,
  "properties" : {
    "g:name" : "com.ibm.nlp.commons.net.EmailAddress" // this is `c.i.nlp.c.net.EmailAddress`
  }
}, {
  "id" : "ce5613eb-1e46-32ff-92c4-433b97ab02f6",
  "type" : "term", // `term` annotation
  "text" : "#ibm", // input text is `#ibm`
  "beginIndex" : 62,
  "endIndex" : 66,
  "properties" : {
    "g:name" : "com.ibm.nlp.commons.net.HashTag" // this is `c.i.nlp.c.net.HashTag`
  }
}, {
  "id" : "92686813-6e3f-30c2-8950-a789a35bc787",
  "type" : "term", // `term` annotation
  "text" : "@John", // input text is `@John`
  "beginIndex" : 67,
  "endIndex" : 72,
  "properties" : {
    "g:name" : "com.ibm.nlp.commons.net.Mention" // this is `c.i.nlp.c.net.Mention`
  }
}
Roman Numerals
Watson NLP recognizes Roman numerals from 1 to 21 in upper case. This is mainly for proper nouns that use them (e.g., `Queen Elizabeth II`, `Dragon Quest XII`, `Century XXI`).
Value | Roman numeral |
---|---|
1 | I |
2 | II |
3 | III |
... | ... |
20 | XX |
21 | XXI |
Limitations:
- It supports only small numbers, because bigger numbers are infrequent and ambiguous in many cases (e.g., `XXX` (30) = hidden word, `XL` (40) = extra large, `L` (50) = liter, `LV` (55) = Louis Vuitton, `CC` (200) = carbon copy).
- It may incorrectly recognize the following nouns as Roman numerals when written in all upper case:
  - `ii` (2) in Romanian (traditional Romanian embroidered blouse)
  - `vi` (6) in Catalan (wine)
  - `vii` (7) in Romanian (1. vineyard 2. vine)
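Since the supported range is small and fixed, the behavior described above can be pictured as a lookup table. The following Java sketch does exactly that. It is an illustration under that assumption, not the Watson NLP implementation, and the class and method names are hypothetical.

```java
/** Sketch only: upper-case Roman numerals from 1 to 21 as a lookup table. */
final class RomanNumeralSketch {

    // Index i holds the numeral for the value i + 1.
    private static final String[] NUMERALS = {
        "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
        "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII", "XIX", "XX",
        "XXI"
    };

    /** Returns 1..21 for a supported upper-case numeral, or -1 otherwise. */
    static int valueOf(String text) {
        for (int i = 0; i < NUMERALS.length; i++) {
            if (NUMERALS[i].equals(text)) {
                return i + 1;
            }
        }
        return -1; // lower case, out of range, or not a numeral at all
    }
}
```

With this table, `II` maps to 2 and `XXI` to 21, while `XL`, `XXX`, and lower-case `ii` are rejected. An upper-case `VI` is still accepted, which is the ambiguity described in the second limitation.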
Example Output
In the JSON outputs, Watson NLP additionally adds a `term` annotation for Roman numerals.
The same information can be retrieved using a Java API. For example:
{
  "id" : "1c89af51-b855-327a-b202-187b96084768",
  "type" : "term", // `term` annotation
  "text" : "II", // input text is `II`
  "beginIndex" : 10,
  "endIndex" : 12,
  "properties" : {
    "g:name" : "com.ibm.nlp.izumo.resources.RomanNumeral", // this is `c.i.nlp.izumo.resources.RomanNumeral`
    "g:properties" : {
      "g:normalForm" : "II",
      "value" : 2 // numeral value is `2`
    },
    "locale" : "en"
  }
}
Hyphenated Words
Watson NLP splits out-of-vocabulary words by hyphen.
Tokenization | Out-of-vocabulary word in Watson NLP |
---|---|
e-mail | |
non-fiction | |
twelve / - / year / - / old | ✓ |
three / - / quarters | ✓ |
Note:
- English does not split words with the following prefix modifiers: `anti-`, `co-`, `counter-`, `cross-`, `e-`, `ex-`, `mid-`, `multi-`, `non-`, `over-`, `post-`, `pre-`, `pro-`, `re-`, `semi-`, `sub-`, `vice-`.
- Finnish does not split by hyphen always.
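The English rule above can be sketched in Java as follows. This is only an illustration of the described behavior, not the Watson NLP implementation: the class and method names are hypothetical, the input is assumed to be a single English out-of-vocabulary word, and the prefix check looks only at the part before the first hyphen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch only: hyphen splitting with the English prefix exceptions. */
final class HyphenSplitSketch {

    // Prefix modifiers from the note above, without the trailing hyphen.
    private static final Set<String> ENGLISH_PREFIXES = Set.of(
        "anti", "co", "counter", "cross", "e", "ex", "mid", "multi", "non",
        "over", "post", "pre", "pro", "re", "semi", "sub", "vice");

    /** Splits an out-of-vocabulary word at hyphens unless it starts with a listed prefix. */
    static List<String> split(String word) {
        int hyphen = word.indexOf('-');
        if (hyphen > 0) {
            String prefix = word.substring(0, hyphen).toLowerCase(Locale.ROOT);
            if (ENGLISH_PREFIXES.contains(prefix)) {
                return List.of(word); // prefix modifier: keep the word whole
            }
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (c == '-') {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add("-"); // the hyphen becomes its own token
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this sketch, `twelve-year-old` becomes `twelve / - / year / - / old`, while `non-fiction` and `e-mail` stay whole.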
Abbreviations
Watson NLP splits the trailing period off out-of-vocabulary words.
Tokenization | Out-of-vocabulary word in Watson NLP |
---|---|
Dr. | |
Sun. | |
Jan. | |
a.m. | |
etc. | |
arrv / . | ✓ |
dept / . | ✓ |
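The table can be read as a vocabulary check followed by a split. The following Java sketch illustrates that reading. The vocabulary here is a hypothetical stand-in containing only the five in-vocabulary rows of the table, and the class and method names are assumptions for this example.

```java
import java.util.List;
import java.util.Set;

/** Sketch only: split the trailing period off words that are not in the vocabulary. */
final class AbbreviationSketch {

    // Hypothetical vocabulary: just the known abbreviations from the table.
    private static final Set<String> VOCABULARY = Set.of("Dr.", "Sun.", "Jan.", "a.m.", "etc.");

    /** Returns the word whole if it is known, otherwise splits off a trailing period. */
    static List<String> tokenize(String word) {
        boolean splittable = word.length() > 1 && word.endsWith(".");
        if (VOCABULARY.contains(word) || !splittable) {
            return List.of(word);
        }
        return List.of(word.substring(0, word.length() - 1), ".");
    }
}
```

Here `Dr.` stays a single token, while `dept.` becomes `dept / .`.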
Others
Watson NLP splits other out-of-vocabulary words by character categories.
Tokenization | Out-of-vocabulary word in Watson NLP |
---|---|
+ / 852 | ✓ |
US / $ | ✓ |
100 / M | ✓ |
ABC / 123 / - / 456 / : / 789 | ✓ |
b / / / c | ✓ |
wo / / | ✓ |
Watson NLP splits runs of symbols into single characters.
Tokenization | Out-of-vocabulary word in Watson NLP |
---|---|
. / ? | ✓ |
. / / | ✓ |
! / . | ✓ |
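The two rules above (split where the character category changes, and split symbols into single characters) can be sketched together in Java. This is an illustration only: the three categories used here (letter, digit, everything else) are an assumption, since the actual categories Watson NLP uses are not listed, and the class and method names are hypothetical. The input is assumed to contain no whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: split a word by character category, one token per symbol. */
final class CategorySplitSketch {

    private enum Category { LETTER, DIGIT, SYMBOL }

    private static Category categoryOf(int codePoint) {
        if (Character.isLetter(codePoint)) {
            return Category.LETTER;
        }
        if (Character.isDigit(codePoint)) {
            return Category.DIGIT;
        }
        return Category.SYMBOL; // assumption: everything else counts as a symbol
    }

    static List<String> split(String word) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Category previous = null;
        for (int i = 0; i < word.length(); ) {
            int codePoint = word.codePointAt(i);
            Category category = categoryOf(codePoint);
            // Start a new token when the category changes; symbols never merge.
            boolean boundary = category != previous || category == Category.SYMBOL;
            if (current.length() > 0 && boundary) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(codePoint);
            previous = category;
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

For example, `ABC123-456:789` becomes `ABC / 123 / - / 456 / : / 789` and `.?` becomes `. / ?`. The sketch does not model language-specific exceptions in which certain symbol sequences are kept together as one token.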
Note:
- English recognizes the following symbols as a single token:
  - Double quotation (e.g., `’’`, `‘’`)
  - Horizontal rule (e.g., `-----`, `=====`, `******`)
  - Comparator (e.g., `<=`, `<<`, `==`, `!=`)
  - Arrow (e.g., `<--`, `-->`)
  - Ellipsis (e.g., `...`)