Configuring break rules

You can configure a break rules dictionary to specify how Content Analytics Studio tokenizes text in documents.

About this task

Break rules determine how Content Analytics Studio splits documents into paragraphs, sentences, and tokens during lexical analysis of the document. A token is a basic unit of text, such as a word, punctuation symbol, number, or a string of symbols. For example, break rules can indicate whether to treat each line of text as a new paragraph.
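The paragraph example can be illustrated with a minimal Python sketch. It is not how Content Analytics Studio implements break rules, and the function name split_paragraphs and the flag newline_starts_paragraph are hypothetical names chosen for illustration. It only shows how one rule can change the paragraph boundaries found in the same text.

```python
import re


def split_paragraphs(text, newline_starts_paragraph):
    """Split text into paragraphs under one of two illustrative rules.

    newline_starts_paragraph is a hypothetical flag, not a product setting:
    - True:  every line break starts a new paragraph.
    - False: only a blank line separates paragraphs.
    """
    separator = r"\n+" if newline_starts_paragraph else r"\n\s*\n"
    # Drop empty pieces and surrounding whitespace.
    return [part.strip() for part in re.split(separator, text) if part.strip()]


sample = "Line one\nLine two\n\nLine three"
per_line = split_paragraphs(sample, newline_starts_paragraph=True)
per_blank_line = split_paragraphs(sample, newline_starts_paragraph=False)
```

With the first rule the sample yields three paragraphs; with the second it yields two, because the first two lines stay together.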

Most of the rules for splitting a document into components are standard and usually do not need to be configured. However, you might want to configure some rules depending on the document structure and your preferences. For example, by default Content Analytics Studio treats an alphanumeric sequence such as 2.5cm as a single token. You might instead want to split the sequence into multiple tokens, such as 2.5 and cm. Separating the numeric and alphabetic tokens lets you identify the units with a dictionary, or create a parsing rule or character rule to identify the numeric value.
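The effect of the 2.5cm example can be sketched in Python. This is an illustration under stated assumptions, not the tokenizer that Content Analytics Studio uses: the regular expressions, the function name tokenize, and the flag split_alphanumeric are all hypothetical and do not correspond to settings in a BREAKRULES file.

```python
import re

# Hypothetical pattern for the "single token" behavior: a run of letters
# and digits, optionally continued across internal periods (2.5cm), or any
# other single non-space symbol.
JOINED_PATTERN = re.compile(r"[^\W_]+(?:\.[^\W_]+)*|\S")

# Hypothetical pattern for the "split" behavior: a number with an optional
# decimal part, a run of letters, or any other single non-space symbol.
SPLIT_PATTERN = re.compile(r"\d+(?:\.\d+)?|[^\W\d_]+|\S")


def tokenize(text, split_alphanumeric):
    """Return the tokens of text under one of the two illustrative rules."""
    pattern = SPLIT_PATTERN if split_alphanumeric else JOINED_PATTERN
    return pattern.findall(text)


joined = tokenize("Length: 2.5cm.", split_alphanumeric=False)
separated = tokenize("Length: 2.5cm.", split_alphanumeric=True)
```

In the first result, 2.5cm remains one token. In the second, 2.5 and cm are separate tokens, so a units dictionary could match cm on its own.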

The source data for a break rules dictionary is stored in a BREAKRULES file. This file is then built into a dictionary (DIC) file that can be used in the lexical analysis stage of a UIMA pipeline.

If you do not configure a custom break rules dictionary, Content Analytics Studio uses the default break rules.

Restriction: Custom break rules files are not supported for Japanese, Chinese, and Korean.

Procedure

To configure a break rules dictionary:

1. In the Studio Explorer view, right-click the Resources/Break Rules directory in your project and click New > Break Rules File.
2. After you set the configuration parameters and save the break rules file, build the break rules dictionary. Right-click the new break rules file and click Build Studio Resource.
3. Configure your UIMA pipeline to use the custom break rules dictionary. In the lexical analysis stage of the UIMA pipeline configuration file, specify your file in the Break Rules area.