Natural language processing

Before the Dictionary Lookup operator performs its lookup based on the entries in the custom dictionary, InfoSphere™ Warehouse applies natural-language processing to the text.

Text segmentation

Text segmentation splits text into individual words or tokens. Typically, this is done based on the blank characters between words. However, languages such as Chinese, Japanese, and Thai do not separate words with blanks. Therefore, text parsing in these languages requires the identification of word boundaries, which can be a very difficult task.

Text segmentation uses rules and language-specific lexical dictionaries for text parsing.
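
The two situations described above can be sketched as follows. This is a minimal illustration, not the segmentation that InfoSphere Warehouse actually performs: the function names segment_whitespace and segment_max_match are hypothetical, and greedy longest-match against a word list is only one simple, well-known way to find word boundaries when blanks are missing.

```python
import re


def segment_whitespace(text):
    """Split text into tokens, treating blanks and punctuation as boundaries.

    Hyphens and apostrophes inside a word are kept (for example, MP3-players).
    """
    return re.findall(r"\w+(?:[-']\w+)*", text)


def segment_max_match(text, lexicon):
    """Greedy longest-match segmentation for text without blanks.

    `lexicon` is a hypothetical set of known words that stands in for a
    language-specific lexical dictionary.
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                pos += length
                break
    return tokens


print(segment_whitespace("IPods are an example of MP3-players"))
print(segment_max_match("customerservice", {"custom", "customer", "service"}))
```

The second call uses blank-free English as a stand-in for a language without word boundaries. A greedy strategy like this can choose the wrong boundary when several dictionary words overlap, which is one reason the task is difficult in practice.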

Case sensitivity

You can select the language of the text to be analyzed. Depending on the selected language, the appropriate lexical dictionary is used by the Dictionary Lookup operator. The words that are included in the lexical dictionary are called in-vocabulary words. Words that are not included in the lexical dictionary are called out-of-vocabulary words. Typically, product names or names of persons are out-of-vocabulary words.

The Dictionary Lookup operator finds the following spellings for in-vocabulary words:
  • Initial upper-case letter, for example, Customer
  • Upper-case letters, for example, CUSTOMER
  • Mixed-case letters, for example, cUstomer

You do not need to add these variants to the domain dictionary. This behavior also applies to out-of-vocabulary words if you type them in lower-case letters into your dictionary, for example, ipod.

If you enter an out-of-vocabulary word with upper-case letters in your dictionary, the word is found only where the text uses exactly the spelling that you entered. For example:
  • If you enter an out-of-vocabulary word in lower-case letters, for example, ipod, the Dictionary Lookup operator also finds the spellings iPod, IPod, and IPOD.
  • If you enter an out-of-vocabulary word in mixed case, for example, iPod, the Dictionary Lookup operator finds only the spelling iPod.

The dictionary does not show the automatically detected upper-case and lower-case variations of a term. However, it shows whether the term is in-vocabulary or out-of-vocabulary for the selected language.
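
The case rules in this section can be summarized in a short sketch. The function matches_entry and its in_vocabulary flag are hypothetical names used for illustration only. The sketch also assumes that an in-vocabulary entry is matched regardless of case however it is typed, which the text implies but does not state explicitly.

```python
def matches_entry(entry, token, in_vocabulary=False):
    """Return True if a text token matches a dictionary entry.

    Assumed rules, paraphrased from this section:
    - in-vocabulary entries match every case variant;
    - out-of-vocabulary entries typed in lower case match every case variant;
    - any other out-of-vocabulary entry matches its exact spelling only.
    """
    if in_vocabulary or entry.islower():
        return token.lower() == entry.lower()
    return token == entry


# Lower-case out-of-vocabulary entry: all case variants are found.
print(matches_entry("ipod", "IPod"))
# Mixed-case out-of-vocabulary entry: only the exact spelling is found.
print(matches_entry("iPod", "IPod"))
```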

Stem derivation

For in-vocabulary words, stem derivation finds derived forms of words based on grammatical inflection, for example, verb conjugations such as Write, Wrote, and Written, or noun forms such as Mouse and Mice.

Typically, names of persons are not inflected. However, other concepts such as product names might have inflections. To find inflections for out-of-vocabulary terms, you must add the inflections manually to the dictionary.

In compound words, stem derivation is applied to the single words of the compound word. This works only for in-vocabulary words.

If the dictionary contains the term customer service representative, the Dictionary Lookup operator also finds the terms customer service representatives and customers service representatives. You do not need to add these variants to the domain dictionary.
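
The compound-word behavior can be sketched as word-by-word stem comparison. This is a simplified illustration under stated assumptions: INFLECTIONS is a hypothetical toy table that stands in for a lexical dictionary, stem and find_term are invented names, and the sketch ignores case for brevity.

```python
# Hypothetical toy table: inflected form -> base form.
INFLECTIONS = {
    "customers": "customer",
    "services": "service",
    "representatives": "representative",
    "mice": "mouse",
    "wrote": "write",
    "written": "write",
}


def stem(word):
    """Return the base form of a word, or the word itself if it is unknown."""
    word = word.lower()
    return INFLECTIONS.get(word, word)


def find_term(term, text):
    """Find occurrences of a multi-word term by comparing stems word by word."""
    term_stems = [stem(word) for word in term.split()]
    tokens = text.split()
    # Strip trailing punctuation before stemming each text token.
    text_stems = [stem(token.strip(".,;:!?")) for token in tokens]
    size = len(term_stems)
    return [
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
        if text_stems[i:i + size] == term_stems
    ]


print(find_term("customer service representative",
                "Two customers service representatives called."))
```

Because every word of the compound is reduced to its base form before comparison, a single dictionary entry covers the inflected variants of each component word.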

For example, to find the term iPod in the text "IPods are an example of MP3-players", you must create a dictionary that includes variants of product names because product names are not typically included in lexical dictionaries. You might create the dictionary Products to include the base form of the product name iPod and its variants. The dictionary entry might look like this:
Table 1. Variants for the base form iPod in a dictionary that is called Products
Base form    Variants
iPod         ipods, iPods, ipod, i-pod, i-Pod

The Dictionary editor shows the automatically detected inflections of the term for the selected language.
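
As a rough sketch of how manually added variants resolve to a base form, the Products entry from Table 1 can be modeled as a mapping. The names PRODUCTS, spelling_matches, and lookup are hypothetical, and the matching logic only approximates the case rules for out-of-vocabulary words described earlier in this topic.

```python
# The Products dictionary from Table 1: base form -> manually added variants.
PRODUCTS = {"iPod": ["ipods", "iPods", "ipod", "i-pod", "i-Pod"]}


def spelling_matches(entry, token):
    """Lower-case entries match any case; other entries match exactly."""
    if entry.islower():
        return token.lower() == entry
    return token == entry


def lookup(text, dictionary):
    """Return (found word, base form) pairs for every dictionary hit in text."""
    hits = []
    for token in text.split():
        word = token.strip(".,;:!?")
        for base, variants in dictionary.items():
            if any(spelling_matches(entry, word) for entry in [base, *variants]):
                hits.append((word, base))
                break
    return hits


print(lookup("IPods are an example of MP3-players", PRODUCTS))
```

In this sketch, the spelling IPods is found through the lower-case variant ipods, which is why the variant list includes lower-case plural and hyphenated forms.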

Supported languages

InfoSphere Warehouse contains lexical dictionaries for the following languages:
  • Afrikaans
  • Arabic
  • Catalan
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • English (Australia)
  • English (Great Britain)
  • English (United States)
  • Finnish
  • French
  • French (France)
  • French (Canada)
  • German
  • German (Germany)
  • German (Swiss)
  • Greek
  • Italian
  • Japanese
  • Korean
  • Norwegian (Bokmal)
  • Norwegian (Nynorsk)
  • Polish
  • Portuguese
  • Portuguese (Portugal)
  • Portuguese (Brazil)
  • Russian
  • Spanish
  • Swedish

Inflection-lookup dictionaries

InfoSphere Warehouse contains inflection-lookup dictionaries for the following languages:

