Stemming

Stemmers are simple programs that try to guess the root of a word, which can be extremely useful in clustering. For example, the Porter stemmer, a well-known stemming algorithm, finds that "university" and "universities" share the same stem and therefore belong to the same stem class. All of the more advanced customizations are based on stem classes.

Unfortunately, these simple stemming programs are not perfect. The Porter algorithm, for example, assigns "universal" the same stem as "university" and "universities", a grouping that may have a historical basis but no longer reflects how the words are used. The Porter stemmer also does not recognize that "theater" and "theatre" should belong to the same stem class. For reasons such as these, Watson Explorer Engine does not use the Porter stemmer as its English stemmer.

To provide additional flexibility and handle the inadequacies of many stemmers, Watson Explorer Engine enables you to add special stemming rules using the stem element.

Note: The examples in this section are already incorporated into the Watson Explorer Engine English knowledge base, which is used with Watson Explorer Engine's English stemmer. You do not have to enter them yourself. They are presented here to show the types of situations in which you might want to add custom stemming rules.

Consider the following:

<stem this="universal" as="#universal" />
<stem this="universally" as="#universal" />

These rules place the words "universal" and "universally" in their own stem class, "#universal". As seen here, prefixing the class name with the character "#" creates a new stem class. Compare this with the same rules written without the "#":

<stem this="universal" as="universal" />
<stem this="universally" as="universal" />

These rules could have the same effect, except that you would not necessarily be creating a new stem class: some other word may already be stemmed to "universal" and would therefore already be a member of the "universal" stem class. When creating new stem classes, you should always use the "#" notation. There may be situations in which you do not want to create a new class, but instead wish to add a word to an existing class:

<stem this="preparatory" as="prep" />

This rule ensures that "preparatory" is in the same stem class as "prep", but it does not affect the other words in the original stem class of "preparatory", such as "preparation". If you intend to change that original stem class as a whole, you may want to consider creating a Rephrase Rule instead.
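
The same pattern could be applied to the "theater" and "theatre" mismatch described earlier. The following rules are only a sketch: they assume that "theater" already belongs to the stem class you want to keep, and the English knowledge base may already cover this case.

<stem this="theatre" as="theater" />
<stem this="theatres" as="theater" />

Because the "as" value has no "#", these rules would add the alternate spellings to the existing class rather than create a new one.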

Stemming rules can only be added using the stem element in the XML.