Concept Root Derivation

The concept root derivation technique creates categories by taking a concept and finding other concepts whose components are morphologically related to its own. A component is a single word. The technique examines the ending (suffix) of each component in a concept and looks for other concepts whose components could be derived from the same root. The idea is that words derived from one another are likely to share a meaning or be close in meaning. Internal language-specific rules are used to identify the endings. For example, the concept opportunities to advance would be grouped with the concepts opportunity for advancement and advancement opportunity.

You can use concept root derivation on any sort of text. By itself, it produces relatively few categories, and each category tends to contain few concepts. The concepts in each category are either synonyms or situationally related. You may find this algorithm helpful even if you are building categories manually, because it may find synonyms of the concepts you are particularly interested in.

Note: You can prevent concepts from being grouped together by specifying them explicitly. See the topic Managing Link Exception Pairs for more information.

Term Componentization and De-inflecting

When the concept root derivation or the concept inclusion technique is applied, the concepts and their associated terms are loaded, each term is split into components (words) based on separators such as spaces, hyphens, and apostrophes, and the components are then de-inflected. For example, the term system administrator is split into the component set {administrator, system}.

However, some parts of the original term are ignored; these are referred to as stop words. In English, ignorable components include words such as a, and, as, by, for, from, in, of, on, or, the, to, and with.

For example, the term examination of the data has the component set {data, examination}, because both of and the are considered ignorable. Additionally, the order of the components is not significant in a component set. In this way, the following three terms are treated as equivalent: cough relief for child, child relief from a cough, and relief of child cough, since they all have the same component set {child, cough, relief}. Each time a pair of terms is identified as equivalent, the corresponding concepts are merged to form a new concept that references all of the terms.
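The component-set comparison described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name component_set is hypothetical, and the stop word list is simply the one quoted in this topic.

```python
import re

# Stop words as listed in this topic; the product's actual list is internal.
STOP_WORDS = {"a", "and", "as", "by", "for", "from", "in",
              "of", "on", "or", "the", "to", "with"}

def component_set(term):
    """Split a term on spaces, hyphens, and apostrophes, drop stop words,
    and return the remaining components as an unordered set."""
    parts = re.split(r"[ \-']+", term.lower())
    return frozenset(p for p in parts if p and p not in STOP_WORDS)

terms = [
    "cough relief for child",
    "child relief from a cough",
    "relief of child cough",
]

# Collect the distinct component sets; equivalent terms collapse to one set.
distinct_sets = {component_set(t) for t in terms}
print(distinct_sets)
```

Using a frozenset makes the comparison independent of component order, which is why the three terms collapse to a single entry.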

Additionally, since the components of a term may be inflected, language-specific rules are applied internally to identify equivalent terms regardless of inflectional variation, such as plural forms. In this way, the terms level of support and support levels can be identified as equivalent, since levels de-inflects to the singular form level and both terms then have the component set {level, support}.
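As a rough sketch of de-inflection, the Python below strips two common English plural endings before building the component set. The deinflect rules are a deliberately naive assumption made for illustration; the actual language-specific rules are internal and are not documented here.

```python
import re

# Stop words as listed in this topic; the product's actual list is internal.
STOP_WORDS = {"a", "and", "as", "by", "for", "from", "in",
              "of", "on", "or", "the", "to", "with"}

def deinflect(word):
    """Naive, hypothetical plural handling: 'studies' -> 'study',
    'levels' -> 'level'. Real de-inflection rules are far richer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def deinflected_components(term):
    """Split a term, drop stop words, and de-inflect each component."""
    parts = re.split(r"[ \-']+", term.lower())
    return frozenset(deinflect(p) for p in parts if p and p not in STOP_WORDS)

a = deinflected_components("level of support")
b = deinflected_components("support levels")
print(a == b)
```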

How Concept Root Derivation Works

After terms have been componentized and de-inflected (see previous section), the concept root derivation algorithm analyzes the component endings, or suffixes, to find the component root and then groups the concepts with other concepts that have the same or similar roots. The endings are identified using a set of linguistic derivation rules specific to the text language. For example, there is a derivation rule for English language text that states that a concept component ending with the suffix ical might be derived from a concept having the same root stem and ending with the suffix ic. Using this rule (and the de-inflection), the algorithm would be able to group the concepts epidemiologic study and epidemiological studies.
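The ical/ic rule can be illustrated with a minimal Python sketch. Only that single derivation rule comes from this topic; the function names, the naive plural handling, and the way the rule is applied (rewriting ical to ic) are assumptions made for the example.

```python
import re

# Stop words as listed in this topic; the product's actual list is internal.
STOP_WORDS = {"a", "and", "as", "by", "for", "from", "in",
              "of", "on", "or", "the", "to", "with"}

def deinflect(word):
    """Naive, hypothetical plural handling used only for this sketch."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def apply_derivation(word):
    """Apply the one rule named in the text: a component ending in 'ical'
    may be derived from a component ending in 'ic'."""
    if word.endswith("ical"):
        return word[:-2]  # 'epidemiological' -> 'epidemiologic'
    return word

def root_key(concept):
    """Component set after de-inflection and derivation, used as a grouping key."""
    parts = re.split(r"[ \-']+", concept.lower())
    return frozenset(apply_derivation(deinflect(p))
                     for p in parts if p and p not in STOP_WORDS)

print(root_key("epidemiologic study") == root_key("epidemiological studies"))
```

Two concepts that yield the same key would be candidates for the same category.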

Since terms are already componentized and the ignorable components (for example, in and of) have been identified, the concept root derivation algorithm would also be able to group the concept studies in epidemiology with epidemiological studies.

The set of component derivation rules has been chosen so that most of the concepts grouped by this algorithm are synonyms. For example, the concepts epidemiologic studies, epidemiological studies, and studies in epidemiology are all equivalent terms. To increase completeness, some derivation rules also allow the algorithm to group concepts that are situationally related. For example, the algorithm can group concepts such as empire builder and empire building.
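To show how such rules could produce both kinds of groups, the Python sketch below reduces each component to a root by stripping a suffix and then groups concepts that share the same set of roots. The suffix table (ical, ing, ic, er, y), the minimum root length, and all function names are hypothetical; they were chosen only so that the examples in this topic group as described, and they do not reflect the product's internal rules.

```python
import re
from collections import defaultdict

# Stop words as listed in this topic; the product's actual list is internal.
STOP_WORDS = {"a", "and", "as", "by", "for", "from", "in",
              "of", "on", "or", "the", "to", "with"}

# Hypothetical derivational suffixes, longest first so 'ical' wins over 'ic'.
SUFFIXES = ("ical", "ing", "ic", "er", "y")

def deinflect(word):
    """Naive, hypothetical plural handling used only for this sketch."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def root(word):
    """De-inflect, then strip the first matching suffix if at least
    three characters of root remain."""
    word = deinflect(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def group_concepts(concepts):
    """Group concepts whose components reduce to the same set of roots."""
    groups = defaultdict(list)
    for concept in concepts:
        parts = re.split(r"[ \-']+", concept.lower())
        key = frozenset(root(p) for p in parts if p and p not in STOP_WORDS)
        groups[key].append(concept)
    return list(groups.values())

concepts = [
    "epidemiologic studies",
    "epidemiological studies",
    "studies in epidemiology",
    "empire builder",
    "empire building",
]
groups = group_concepts(concepts)
for group in groups:
    print(group)
```

A suffix table this blunt would over-group on real text, which is one reason a production rule set has to be curated per language.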