Automatic language detection

The Watson Explorer Content Analytics system can process documents in virtually any language.

The system can automatically detect all of the languages that are supported by the provided Language Identification annotator. If you know the language of your documents, you can specify the language when you configure a crawler instead of allowing the system to detect the language automatically.

When you create a collection, you can specify the languages of the documents in the collection and the order in which the automatic language detection process tries those languages. For example, if a collection includes many English documents and some French documents, you can configure the automatic language detection process to try English first. If the process determines that a document is not in English, it next checks whether the document is in French.
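
The ordered fallback described above can be illustrated with a minimal sketch. This is not the Language Identification annotator or any Watson Explorer API. The stopword lists, the threshold, and the names `looks_like` and `detect_in_order` are all hypothetical, chosen only to show how trying candidate languages in a configured order might behave.

```python
from typing import Optional, Sequence

# Toy stopword lists, for illustration only (assumption: a real
# annotator relies on much richer statistical models than this).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "les"},
}


def looks_like(lang: str, text: str, threshold: float = 0.2) -> bool:
    """Hypothetical check: True if enough words are stopwords of `lang`."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word in STOPWORDS[lang])
    return hits / len(words) >= threshold


def detect_in_order(text: str, candidates: Sequence[str]) -> Optional[str]:
    """Try each candidate language in the configured order.

    Returns the first language whose check succeeds, or None if no
    candidate matches.
    """
    for lang in candidates:
        if looks_like(lang, text):
            return lang
    return None


# English is tried first, then French, mirroring the example above.
order = ["en", "fr"]
english_result = detect_in_order("The cat is on the roof of the house", order)
french_result = detect_in_order("Le chat est sur le toit et la maison", order)
```

Putting the most common language first means that most documents are resolved by the first check, and the remaining candidates are consulted only for documents that fail it.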

If a collection contains documents in multiple languages, but you want to limit the search results to documents in a specific language, disable automatic language detection and specify the language when you configure the crawler. If documents in other languages still appear in the results (for example, a collection that contains primarily Portuguese documents returns documents that are in Slovenian), split the documents by creating a separate crawler for each language. Configure each crawler to add only documents in one language, specify that language as the document language, and disable automatic language detection.