Text indexing languages

Language codes identify the languages for which IBM® Content Search Services can index a document. Language analyzers identify the languages for which Elasticsearch can index a document.

Language codes for IBM Content Search Services

IBM Content Search Services uses language codes to identify the languages used for document indexing.

Important: Because IBM Content Search Services runs searches in all indexed languages, the number of indexed languages affects search performance. For example, searches might run 10% to 20% slower with three indexed languages than with a single indexed language.

Language	Code
Afrikaans	`af`
Arabic	`ar`
Catalan	`ca`
Chinese	`zh`
Czech	`cs`
Danish	`da`
Dutch	`nl`
English	`en`
Finnish	`fi`
French	`fr`
German	`de`
Greek	`el`
Hebrew	`he`
Italian	`it`
Japanese	`ja`
Korean	`ko`
Norwegian (Bokmål)	`nb`
Norwegian (Nynorsk)	`nn`
Polish	`pl`
Portuguese	`pt`
Russian	`ru`
Spanish	`es`
Swedish	`sv`
Turkish	`tr`

To set the indexing languages for your object store, see topic Selecting text languages for an object store (Content Search Services).

Text analyzers for Elasticsearch

Available language analyzers

Analyzers are applied when objects are first indexed. The analyzers that are used by the Content Platform Engine are set at the object store level and are applied to all CBR-enabled classes in the object store. If the analyzer list changes, a reindex is required. The recommendation is to use the simple analyzer and one language analyzer for each of the languages in which documents are written and ingested into the object store.

The list of available language analyzers is as follows:

Arabic
Armenian
Basque
Bengali
Brazilian
Bulgarian
Catalan
CJK
Czech
Danish
Dutch
English
Estonian
Finnish
French
Galician
German
Greek
Hindi
Hungarian
Indonesian
Irish
Italian
Latvian
Lithuanian
Norwegian
Persian
Portuguese
Romanian
Russian
Sorani
Spanish
Swedish
Turkish
Thai
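
Because changing the analyzer list requires a reindex, it can help to preview how candidate analyzers tokenize sample text before you commit to a selection. The following minimal Python sketch builds request bodies for the Elasticsearch `_analyze` API, one per analyzer. The helper name `build_analyze_requests`, the sample text, and the analyzer selection are illustrative assumptions. The sketch only prints the requests; it does not show how the Content Platform Engine itself configures analyzers.

```python
import json


def build_analyze_requests(sample_text, analyzers):
    """Hypothetical helper: return one (path, JSON body) pair per analyzer name.

    Each body follows the shape accepted by the Elasticsearch _analyze API:
    {"analyzer": <name>, "text": <sample text>}.
    """
    return [
        ("/_analyze", json.dumps({"analyzer": name, "text": sample_text}))
        for name in analyzers
    ]


# Assumed selection: the simple analyzer plus one language analyzer
# for each language in which documents are written.
candidates = ["simple", "english", "french"]

for path, body in build_analyze_requests("Purchase order PO3025721", candidates):
    # Send each request to a test cluster with the HTTP client of your choice.
    print("POST", path, body)
```

Comparing the tokens that each request returns can show whether a particular analyzer is worth adding before a reindex is started.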

Built-in analyzers

Standard analyzer

The Content Platform Engine always includes the standard analyzer. The standard analyzer divides text into terms on word boundaries, removes most punctuation, converts terms to lowercase, and supports removing stop words.

Available analyzers

The following analyzers are also available:

Simple analyzer - The simple analyzer breaks text into tokens at punctuation and other non-letter characters. Without the simple analyzer, text in which punctuation is not followed by a space is not tokenized as expected. However, the simple analyzer can prevent searches from finding strings that contain numbers. For example, ‘PO3025721’ is tokenized as just ‘po’, which causes the search to match far more documents than expected.
fncm_email_analyzer - The fncm_email_analyzer is a custom analyzer that is designed to handle information in emails.
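
The effect that the simple analyzer has on strings with numbers can be approximated in a few lines of Python. This is a rough sketch, not the Elasticsearch implementation: it assumes that the simple analyzer splits text at every non-letter character and lowercases the result, and the function name `simple_like_tokens` is hypothetical.

```python
import re


def simple_like_tokens(text):
    """Approximate the simple analyzer: keep runs of letters only, lowercased.

    The pattern [^\\W\\d_]+ matches word characters that are neither digits
    nor underscores, so digits and punctuation both act as separators.
    """
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


# Punctuation without surrounding spaces still separates tokens.
print(simple_like_tokens("Invoice,PO3025721.pdf"))  # ['invoice', 'po', 'pdf']

# Two different purchase order numbers collapse to the same single token.
print(simple_like_tokens("PO3025721"))  # ['po']
print(simple_like_tokens("PO9999999"))  # ['po']
```

Because every identifier that starts with the same letters reduces to the same token, a search for one purchase order number matches documents that contain any of them.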

To select the analyzers for your object store, see topic Selecting text analyzers for an object store (Elasticsearch).