Text indexing languages

Language codes identify the languages for which IBM® Content Search Services can index a document. Language analyzers identify the languages for which Elasticsearch can index a document.

Language codes for IBM Content Search Services

IBM Content Search Services uses language codes to identify the languages used for document indexing.

Important: Because IBM Content Search Services runs searches in all indexed languages, the number of indexed languages affects search performance. For example, searches might run 10% - 20% slower for three indexed languages than for a single indexed language.
Language Code
Afrikaans af
Arabic ar
Catalan ca
Chinese zh
Czech cs
Danish da
Dutch nl
English en
Finnish fi
French fr
German de
Greek el
Hebrew he
Italian it
Japanese ja
Korean ko
Norwegian (Bokmål) nb
Norwegian (Nynorsk) nn
Polish pl
Portuguese pt
Russian ru
Spanish es
Swedish sv
Turkish tr

To set the indexing languages for your object store, see topic Selecting text languages for an object store (Content Search Services).
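
As a minimal sketch, a list of language codes could be checked against the table above before it is configured. The set below is copied from the table; the function name validate_language_codes is hypothetical and is not part of any IBM Content Search Services API.

```python
# Language codes copied from the table above.
CSS_LANGUAGE_CODES = {
    "af", "ar", "ca", "zh", "cs", "da", "nl", "en",
    "fi", "fr", "de", "el", "he", "it", "ja", "ko",
    "nb", "nn", "pl", "pt", "ru", "es", "sv", "tr",
}


def validate_language_codes(codes):
    """Hypothetical helper: normalize codes and reject any that are not in the table."""
    normalized = [code.strip().lower() for code in codes]
    unknown = [code for code in normalized if code not in CSS_LANGUAGE_CODES]
    if unknown:
        raise ValueError(f"Unsupported language codes: {unknown}")
    return normalized


print(validate_language_codes(["en", "FR", " de "]))  # ['en', 'fr', 'de']
```

Because each additional indexed language can slow searches, a check like this is best paired with keeping the list as short as the content allows.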

Text analyzers for Elasticsearch

Available language analyzers

Analyzers are applied when objects are first indexed. The analyzers that the Content Platform Engine uses are set at the object store level and apply to all CBR-enabled classes in the object store. If the analyzer list changes, a reindex is required. Use the simple analyzer and one language analyzer for each language in which the documents that are ingested into the object store are written.

The available language analyzers are as follows:
  • Arabic
  • Armenian
  • Basque
  • Bengali
  • Brazilian
  • Bulgarian
  • Catalan
  • CJK
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • German
  • Greek
  • Hindi
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Latvian
  • Lithuanian
  • Norwegian
  • Persian
  • Portuguese
  • Romanian
  • Russian
  • Sorani
  • Spanish
  • Swedish
  • Turkish
  • Thai
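
The recommendation to use the simple analyzer plus one language analyzer for each document language can be sketched as follows. The lowercase analyzer names and the function recommended_analyzers are assumptions for illustration only; the names that the Content Platform Engine expects might differ.

```python
# Analyzer names taken from the list above, lowercased (an assumption).
AVAILABLE_LANGUAGE_ANALYZERS = {
    "arabic", "armenian", "basque", "bengali", "brazilian", "bulgarian",
    "catalan", "cjk", "czech", "danish", "dutch", "english", "estonian",
    "finnish", "french", "galician", "german", "greek", "hindi",
    "hungarian", "indonesian", "irish", "italian", "latvian", "lithuanian",
    "norwegian", "persian", "portuguese", "romanian", "russian", "sorani",
    "spanish", "swedish", "turkish", "thai",
}


def recommended_analyzers(document_languages):
    """Hypothetical helper: simple analyzer plus one analyzer per language."""
    selected = ["simple"]
    for language in document_languages:
        name = language.strip().lower()
        if name not in AVAILABLE_LANGUAGE_ANALYZERS:
            raise ValueError(f"No language analyzer is listed for: {language}")
        if name not in selected:  # skip duplicates, keep first-seen order
            selected.append(name)
    return selected


print(recommended_analyzers(["English", "German", "english"]))
# ['simple', 'english', 'german']
```

Because any change to the analyzer list requires a reindex, it can help to settle the full set of document languages before the first indexing run.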

Built-in analyzers

Standard analyzer

The Content Platform Engine always includes the standard analyzer. The standard analyzer divides text into terms on word boundaries, removes most punctuation, converts terms to lowercase, and supports removing stop words.

Available analyzers

The following other analyzers are available:
  • Simple analyzer - The simple analyzer breaks text into tokens at punctuation and other non-letter characters. Without the simple analyzer, text that contains punctuation but no spaces between words is not tokenized as expected. However, because digits also act as token boundaries, searches might not find strings that contain numbers. For example, ‘PO3025721’ is tokenized as just ‘po’, so a search for it matches far more documents than expected.
  • fncm_email_analyzer - The fncm_email_analyzer is a custom analyzer that is designed to handle information in emails.
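
The simple analyzer behavior that is described above can be approximated in a few lines. This is a rough emulation that assumes the analyzer splits at every non-letter character and lowercases the result; it is not the Elasticsearch implementation, and the function name simple_analyzer_tokens is hypothetical.

```python
import re


def simple_analyzer_tokens(text):
    """Approximate the simple analyzer: keep runs of letters, lowercase them."""
    # [^\W\d_]+ matches one or more word characters that are not digits
    # or underscores, which in practice means runs of letters.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


print(simple_analyzer_tokens("invoice,PO3025721;paid"))  # ['invoice', 'po', 'paid']
print(simple_analyzer_tokens("PO3025721"))               # ['po']
```

The second call shows why a search for ‘PO3025721’ can match every document that contains the token ‘po’: the digits are discarded at index time.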

To select the analyzers for your object store, see topic Selecting text analyzers for an object store (Elasticsearch).