You can configure a break rules dictionary to specify how Content Analytics Studio tokenizes text in documents.
Break rules determine how Content Analytics Studio splits documents into paragraphs, sentences, and tokens during lexical analysis of the document. A token is a basic unit of text, such as a word, punctuation symbol, number, or a string of symbols. For example, break rules can indicate whether to treat each line of text as a new paragraph.
Most of the rules for splitting a document into components are standard and usually do not need to be configured. However, you might want to configure some rules depending on the document structure and your preferences. For example, by default Content Analytics Studio treats an alphanumeric sequence such as 2.5cm as a single token, but you might want to split the sequence into multiple tokens, such as 2.5 and cm. Separating the numeric and alphabetic tokens allows the units to be identified with a dictionary, and allows you to create a parsing rule or character rule to identify the numeric value.
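The difference between the two tokenizations of 2.5cm can be illustrated with a short Python sketch. This is not the BREAKRULES syntax and not how Content Analytics Studio implements lexical analysis; the regular expressions and the names DEFAULT_TOKEN, SPLIT_TOKEN, and tokenize are assumptions made up for this illustration only.

```python
import re

# Hypothetical stand-in for the default behavior: a number followed
# directly by letters (such as "2.5cm") is kept as one token.
DEFAULT_TOKEN = re.compile(r"\d+(?:\.\d+)?[A-Za-z]+|\d+(?:\.\d+)?|[A-Za-z]+|\S")

# Hypothetical stand-in for a custom break rule: the number and the
# letters that follow it become separate tokens.
SPLIT_TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|\S")


def tokenize(text, pattern):
    """Return the tokens that the given pattern finds in text, in order."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "Cut 2.5cm off."
    print(tokenize(sample, DEFAULT_TOKEN))  # ['Cut', '2.5cm', 'off', '.']
    print(tokenize(sample, SPLIT_TOKEN))    # ['Cut', '2.5', 'cm', 'off', '.']
```

With the second pattern, cm is available as its own token for a units dictionary lookup, and 2.5 is available as its own token for a rule that identifies numeric values.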
The source data for a break rules dictionary is stored in a BREAKRULES file. The BREAKRULES file is then built into a dictionary (DIC) file that can be used in the lexical analysis stage of a UIMA pipeline.
If you do not configure a custom break rules dictionary, Content Analytics Studio uses the default break rules.
To configure a break rules dictionary: