Text indexing languages
Language codes identify the languages for which IBM® Content Search Services can index a document. Language analyzers identify the languages for which Elasticsearch can index a document.
Language codes for IBM Content Search Services
IBM Content Search Services uses language codes to identify the languages used for document indexing.
| Language | Code |
|---|---|
| Afrikaans | af |
| Arabic | ar |
| Catalan | ca |
| Chinese | zh |
| Czech | cs |
| Danish | da |
| Dutch | nl |
| English | en |
| Finnish | fi |
| French | fr |
| German | de |
| Greek | el |
| Hebrew | he |
| Italian | it |
| Japanese | ja |
| Korean | ko |
| Norwegian (Bokmål) | nb |
| Norwegian (Nynorsk) | nn |
| Polish | pl |
| Portuguese | pt |
| Russian | ru |
| Spanish | es |
| Swedish | sv |
| Turkish | tr |
To set the indexing languages for your object store, see topic Selecting text languages for an object store (Content Search Services).
Text analyzers for Elasticsearch
- Available language analyzers
Analyzers are applied when objects are first indexed. The analyzers that are used by the Content Platform Engine are set at the object store level and are applied to all CBR-enabled classes in the object store. If the analyzer list changes, a reindex is required. The recommendation is to use the simple analyzer and one language analyzer for each of the languages in which documents are written and ingested into the object store.
The available language analyzers are as follows:
- Arabic
- Armenian
- Basque
- Bengali
- Brazilian
- Bulgarian
- Catalan
- CJK
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Latvian
- Lithuanian
- Norwegian
- Persian
- Portuguese
- Romanian
- Russian
- Sorani
- Spanish
- Swedish
- Turkish
- Thai
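The recommendation above (the simple analyzer plus one language analyzer per document language) can be sketched as a small helper. This is only an illustration: the function name `recommended_analyzers` and the lowercase analyzer identifiers are assumptions made for the example, not names taken from the Content Platform Engine.

```python
# Language analyzers listed in this topic. The lowercase spellings are an
# assumption made for this sketch, not confirmed product identifiers.
AVAILABLE_LANGUAGE_ANALYZERS = {
    "arabic", "armenian", "basque", "bengali", "brazilian", "bulgarian",
    "catalan", "cjk", "czech", "danish", "dutch", "english", "estonian",
    "finnish", "french", "galician", "german", "greek", "hindi",
    "hungarian", "indonesian", "irish", "italian", "latvian", "lithuanian",
    "norwegian", "persian", "portuguese", "romanian", "russian", "sorani",
    "spanish", "swedish", "turkish", "thai",
}


def recommended_analyzers(document_languages):
    """Return the simple analyzer plus one language analyzer per language.

    Hypothetical helper: it only builds a list of names and does not
    call any product API.
    """
    selected = ["simple"]
    for language in document_languages:
        name = language.strip().lower()
        if name not in AVAILABLE_LANGUAGE_ANALYZERS:
            raise ValueError(f"No language analyzer listed for: {language}")
        if name not in selected:  # skip duplicates, keep first-seen order
            selected.append(name)
    return selected
```

For an object store that holds English and German documents, `recommended_analyzers(["English", "German"])` returns the simple analyzer followed by the two language analyzers.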
- Built-in analyzers
- Standard analyzer
- The Content Platform Engine always includes the standard analyzer. The standard analyzer divides text into terms on word boundaries, removes most punctuation, converts terms to lowercase, and supports removing stop words.
- Available analyzers
- The following other analyzers are available:
- Simple analyzer - The simple analyzer breaks tokens on punctuation. Without the simple analyzer, text that has no spaces around its punctuation is not tokenized as expected. However, the simple analyzer can cause problems for searches on strings that contain numbers. For example, ‘PO3025721’ is tokenized as just ‘po’, so a search for it matches far more documents than expected.
- fncm_email_analyzer - The fncm_email_analyzer is a custom analyzer that is designed to handle information in emails.
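The simple analyzer behavior described above can be approximated in a few lines of Python. This is a rough sketch for illustration, not the Elasticsearch implementation: it assumes the analyzer keeps only runs of letters and lowercases them, and the function name `simple_tokenize` is hypothetical.

```python
import re

# Runs of word characters that are neither digits nor underscores,
# which is approximately "runs of letters".
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_tokenize(text):
    """Approximate the simple analyzer: split on non-letters, lowercase."""
    return [token.lower() for token in _LETTER_RUN.findall(text)]


print(simple_tokenize("alpha,beta;gamma"))  # punctuation splits the terms
print(simple_tokenize("PO3025721"))         # digits are dropped, leaving 'po'
```

Because the digits are discarded, every purchase order number that starts with the same letters produces the same token, which is why such searches match more documents than expected.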
To select the analyzers for your object store, see topic Selecting text analyzers for an object store (Elasticsearch).