Text segmentation splits text into individual words or tokens. Typically, this is done based on the whitespace characters between words. However, languages such as Chinese, Japanese, and Thai do not separate words with whitespace. Text parsing in these languages therefore requires identifying the word boundaries, which can be a very difficult task.
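The following minimal Python sketch illustrates why whitespace alone is not enough. The function name `whitespace_tokenize` and the sample sentences are illustrative assumptions, not part of the product.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Split text on runs of whitespace; a naive stand-in for segmentation."""
    return text.split()


# English words are separated by blanks, so the naive split works.
english = whitespace_tokenize("The customer called support")

# The Chinese sentence contains no blanks, so it comes back as a single
# unsegmented token; the word boundaries still have to be identified.
chinese = whitespace_tokenize("我喜欢自然语言处理")

print(english)
print(len(chinese))
```

For the English sentence the split yields one token per word, whereas the Chinese sentence is returned as one token, which is why rules and lexical dictionaries are needed for such languages.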
Text segmentation uses rules and language-specific lexical dictionaries for text parsing.
You can select the language of the text to be analyzed. Depending on the selected language, the appropriate lexical dictionary is used by the Dictionary Lookup operator. The words that are included in the lexical dictionary are called in-vocabulary words. Words that are not included in the lexical dictionary are called out-of-vocabulary words. Typically, product names or names of persons are out-of-vocabulary words.
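The distinction between in-vocabulary and out-of-vocabulary words can be sketched as a simple set lookup. The `LEXICON` set and the `classify` function below are hypothetical stand-ins for a language-specific lexical dictionary, and the case-insensitive lookup is an assumption; the Dictionary Lookup operator itself is not exposed this way.

```python
# Hypothetical, tiny stand-in for a language-specific lexical dictionary.
LEXICON = {"customer", "service", "representative", "write", "mouse"}


def classify(word: str, lexicon=LEXICON) -> str:
    """Label a word by whether the (lower-cased) word is in the lexicon."""
    if word.lower() in lexicon:
        return "in-vocabulary"
    return "out-of-vocabulary"


for word in ["customer", "Mouse", "ipod"]:
    print(word, "->", classify(word))
```

In this sketch a product name such as ipod is reported as out-of-vocabulary because it does not appear in the toy lexicon.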
You do not need to add upper-case and lower-case variants of a term to the domain dictionary. This behavior also applies to out-of-vocabulary words if you enter them in lower-case letters in your dictionary, for example, ipod.
The dictionary does not show the automatically detected upper-case and lower-case variations of a term. However, it shows whether the term is in-vocabulary or out-of-vocabulary for the selected language.
For in-vocabulary words, stem derivation finds derived forms of words based on grammatical inflection, for example, verb conjugations such as Write, Wrote, and Written, or noun forms such as Mouse and Mice.
Typically, names of persons are not inflected; however, other concepts such as product names might have inflections. To find inflections for out-of-vocabulary terms, you must add the inflections manually to the dictionary.
In compound words, stem derivation is applied to each single word of the compound word. This works only for in-vocabulary words.
If the dictionary contains the term customer service representative, the Dictionary Lookup operator also finds the terms customer service representatives and customers service representatives. You do not need to add these variants to the domain dictionary.
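The compound-word behavior described above can be sketched by stemming each word of the compound separately and comparing the results. The `toy_stem` rule (stripping a trailing plural "s") is a deliberately crude assumption for illustration; real stem derivation relies on the lexical dictionary, and `matches_compound` is a hypothetical helper, not a product function.

```python
def toy_stem(word: str) -> str:
    """Very rough plural stripping; real stem derivation is dictionary-based."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def matches_compound(entry: str, candidate: str) -> bool:
    """Compare two compound terms word by word after stemming each word."""
    entry_stems = [toy_stem(w) for w in entry.split()]
    candidate_stems = [toy_stem(w) for w in candidate.split()]
    return entry_stems == candidate_stems


ENTRY = "customer service representative"
for text in [
    "customer service representatives",
    "customers service representatives",
    "customer service manager",
]:
    print(text, "->", matches_compound(ENTRY, text))
```

Because every word of the compound is reduced to its stem before comparison, a single dictionary entry covers the inflected variants, while an unrelated compound such as customer service manager does not match.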
Base form | Variants |
---|---|
iPod | ipods, iPods, ipod, i-pod, i-Pod |
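As a rough illustration of how the variants in the table could map to one base form, the sketch below folds case, removes hyphens, and strips a trailing plural "s". These normalization rules are an assumption chosen only to reproduce the table; they are not the documented algorithm of the Dictionary Lookup operator.

```python
def normalize(term: str) -> str:
    """Fold case, drop hyphens, and strip a trailing plural 's' (toy rule)."""
    key = term.lower().replace("-", "")
    if key.endswith("s") and len(key) > 3:
        key = key[:-1]
    return key


BASE_FORM = "iPod"
variants = ["ipods", "iPods", "ipod", "i-pod", "i-Pod"]

# Keep every variant whose normalized key equals that of the base form.
matched = [v for v in variants if normalize(v) == normalize(BASE_FORM)]
print(matched)
```

Under these assumed rules, every variant listed in the table normalizes to the same key as the base form.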
The Dictionary editor shows the automatically detected inflections of the term for the selected language.