Glossary

IBM® Content Classification terms and definitions.

action

An automatic classification process that is configured in a decision plan rule and initiated when the rule's trigger conditions are met. You can configure individual actions such as setting the value of a field or running matching on a knowledge base. You can also configure actions that are designed for use with IBM FileNet® Content Manager, IBM Content Manager, and file system content. For example, you can configure actions that file and unfile documents in IBM FileNet Content Manager folders, set document classes, and declare documents as records in IBM Enterprise Records.

category

A specifically defined division in a system of classification. A category is a label used to mark texts to indicate that they belong to a particular class of texts. Content items (for example, emails, documents, or other texts) are associated with categories to create a knowledge base. Categories can represent textual content, or indicate some other attribute of an item, such as its source. In general, each category has a specific use within an IBM Content Classification application. Classification Workbench creates categories in a knowledge base based on the categories indicated in the training set.

category scores

Content Classification analyzes the textual content and structure of content items to discern the intent of each item. For a given content item, Content Classification assigns each category a score that represents the confidence with which the item is assigned to that category, and ranks the categories accordingly. The score is a percentage value representing the likelihood that the category is correctly identified. For example, an item from a particular Internet source might contain a question about a product and a request to receive a catalog. Content Classification might assign high scores to three categories, relating to the Internet source, product info, and send catalog.

Category scores (percentage values from 0 to 100) correspond to the match values (from 0 to 1) used in Classification Workbench graphical analysis.
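
The correspondence between the two scales can be sketched in Python as follows. This is only an illustration that assumes a simple linear scaling by 100; the function names are hypothetical and are not part of the Content Classification API.

```python
def match_to_score(match_value: float) -> float:
    """Convert a match value (0 to 1) to a category score (0 to 100).

    Assumption: the two scales differ only by a factor of 100.
    """
    if not 0.0 <= match_value <= 1.0:
        raise ValueError("match value must be between 0 and 1")
    return match_value * 100.0


def score_to_match(score: float) -> float:
    """Convert a category score (0 to 100) to a match value (0 to 1)."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("category score must be between 0 and 100")
    return score / 100.0
```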

Classification Center

A web application that is used to manage the classification of IBM FileNet Content Manager, IBM Content Manager, and file system content. You can use the Classification Center to determine the content to be classified, specify classification options (such as the decision plan to use and various runtime preferences), monitor classification activity, and view the classification results. If necessary, you can also use the Classification Center to reclassify documents if you determine that different knowledge base categories or decision plan actions are more applicable.

Classification Quick Start Tool

An application that is used to evaluate IBM Content Classification and create an initial classification project. You can easily import documents from the file system and organize them into categories. Tune classification by completing tasks that are suggested by the Classification Quick Start Tool to improve classification performance. Then you can view the results, adjust the ratio between automation and accuracy, and export the project to Classification Workbench for further development.

Classification Workbench

An application that is used to create and analyze knowledge bases and decision plans. With Classification Workbench, you can also evaluate system performance by importing analysis data and viewing reports and graphical diagnostics.

clustering

A feature that can help users identify potential categories in an uncategorized (or partially categorized) content set. Content items are grouped together according to their semantic similarity. Clustering provides initial groupings ("clusters") of similar items. This feature can also be used to refine existing categorization, by helping you to identify new categories in a collection of categorized content items. The Taxonomy Proposer that is installed with Classification Workbench provides enhanced clustering workflows for category discovery.

concept

A basic unit of linguistic or quantitative information that is derived from input text (and its context) that can influence classification. The linguistic information can include semantic, contextual, and morphological data. The quantitative information can include various indicators derived from the text, such as its length. For example, a concept can be a word (typically in its stem form), a phrase, an indication of the input source, the length of the text, and so on.

Content Extractor

A command-line tool that is used to extract the content from an IBM FileNet Content Manager object store or IBM Content Manager repository. You can import the extracted content into Classification Workbench and use it to train a knowledge base or provide test data for a decision plan.

content set

A body of data (emails, documents, or other texts) supplied by your organization, imported into Classification Workbench, and used to create and analyze knowledge bases and decision plans. You use a training set to create a knowledge base and a testing set to analyze the knowledge base and generate results for evaluating knowledge base performance. Each content item in the content set is defined by one or more fields.

content type

A field attribute that determines how content is analyzed by Content Classification's natural language processing engine. Each content type is configured to achieve optimal results from text analysis. For example, the body content type is optimized for email fields that contain the main body text.

data type

A field attribute that specifies the kind of data that the field contains. Data types supported as field values by the API are string (Unicode) and numeric.

decision plan

A collection of rules built in Classification Workbench that determine how Content Classification classifies content items such as documents or emails. Each rule consists of one trigger and one or more actions. For example, you can configure rules that make automatic classification decisions based on keywords, word proximity, patterns, and so on. Decision plans can refer to one or more knowledge bases to combine rules with content-based classification.
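
The rule structure described above (one trigger, one or more actions) can be pictured with a small Python sketch. It is a conceptual model only: decision plans are built in Classification Workbench, and the names `Rule`, `run_plan`, and `set_field`, as well as the `body` and `doc_class` fields, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Item = Dict[str, str]  # a content item modeled as a collection of named fields


@dataclass
class Rule:
    name: str
    trigger: Callable[[Item], bool]        # exactly one trigger per rule
    actions: List[Callable[[Item], None]]  # one or more actions


def set_field(name: str, value: str) -> Callable[[Item], None]:
    """Build an action that sets the value of one field on the item."""
    def action(item: Item) -> None:
        item[name] = value
    return action


def run_plan(rules: List[Rule], item: Item) -> List[str]:
    """Evaluate each rule in order; return the names of the rules that fired."""
    fired = []
    for rule in rules:
        if rule.trigger(item):
            for action in rule.actions:
                action(item)
            fired.append(rule.name)
    return fired


# Hypothetical keyword rule: items that mention "invoice" get a document class.
rules = [
    Rule(
        name="flag-invoices",
        trigger=lambda item: "invoice" in item.get("body", "").lower(),
        actions=[set_field("doc_class", "Invoice")],
    ),
]

item = {"body": "Please find the attached invoice."}
fired = run_plan(rules, item)
```

After the call, `fired` lists the rules whose triggers were met and `item` carries the field values that their actions set.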

feedback

Information that is presented to Content Classification to confirm or correct its current categorization knowledge. A positive categorization example confirms that a text belongs to a specific category. Submitting accurate feedback continually improves Content Classification's ability to classify texts.

fields

Each content item is represented as a collection of fields. Fields can store the textual content of each item, metadata such as the delivery channel or author's name, and categorization information.

knowledge base

A single file encapsulating data that is required by Content Classification for accurate content-based classification. The knowledge base stores information accumulated initially through training, and later through ongoing learning, which enables the system to improve over time as it learns new categories and adapts to changes in data.

When a text is submitted to Content Classification for matching, the knowledge base is used to select the category that is most likely to match the text. During knowledge base analysis, content items are matched to categories, and categories are scored based on their relevancy. The results can be used by applications powered by Content Classification to perform automatic or semi-automatic actions.

Content Classification supports the use of multiple monolingual or multilingual knowledge bases. Knowledge bases can be added to decision plans to combine content-based classification with rules-based classification.

learning

The ability of Content Classification to continually process feedback, update the knowledge base in real-time, and improve classification accuracy over time. Just as humans constantly adapt to change, so does Content Classification. Through exposure to new data, the system accumulates new information about existing categories, improving its ability to classify over time.

Learning is an incremental process. System learning simulates real life, where recently learned information has more impact on classification. You can use Classification Workbench to simulate online learning by providing feedback to the system manually, using a specified set of data.

listener

The server component that acts as the entry point to Content Classification. Client requests are received by the listener and dispatched to the appropriate server component for processing.

Management Console

An application used for Content Classification administration. You use the Management Console to create, import, and export knowledge bases and decision plans. You can also start and stop knowledge base and decision plan instances, modify properties, add and modify fields, and view the servers on which the system is running.

matching

Matching is the process of determining the relevancy of a text to categories in the knowledge base. Matching is also referred to as "suggestion".

natural language processing

The manipulation of data by computer systems for the purpose of understanding and reacting to human language. Content Classification's natural language processing engine uses morphological rules, word association, and complex algorithms to analyze textual content and generate concept models that are required for classification.

precision and recall rates

Statistical indicators that can be represented graphically to show how well Content Classification is performing. Precision is inversely related to false positives, and recall is inversely related to false negatives:

Precision: Of the content items that Content Classification identified as relevant to a category, the fraction that are actually relevant to the category (the rest are false positives).
Recall: Of the content items that are actually relevant to a category, the fraction that Content Classification recognized as such (the rest are false negatives).

Content Classification applications use thresholds to handle the trade-off between recall and precision. A high threshold results in high precision and lower recall; a low threshold results in lower precision, but higher recall. It is typically a business decision that determines where to set the threshold.
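
The trade-off can be made concrete with a short Python sketch that computes both rates for one category at a given threshold. The function name, the sample data, and the convention of returning 1.0 when a denominator is zero are assumptions made for this illustration; they are not taken from the product.

```python
from typing import List, Tuple


def precision_recall(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Compute (precision, recall) for one category.

    scored holds (category score, item is actually relevant) pairs.
    An item counts as identified when its score is at least the threshold.
    """
    tp = sum(1 for score, relevant in scored if score >= threshold and relevant)
    fp = sum(1 for score, relevant in scored if score >= threshold and not relevant)
    fn = sum(1 for score, relevant in scored if score < threshold and relevant)
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical scores (0 to 100) for five items in a testing set.
sample = [(90.0, True), (80.0, True), (70.0, False), (60.0, True), (30.0, False)]

high = precision_recall(sample, 75.0)  # precision 1.0, recall 2/3
low = precision_recall(sample, 50.0)   # precision 0.75, recall 1.0
```

With this sample, raising the threshold from 50 to 75 removes the one false positive but also misses one relevant item, which is the trade-off described above.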

read-only instance

A server component capable of handling read-only requests such as "Suggest" on a knowledge base or “Decide” on a decision plan. Read-only requests can be processed by the read/write instance or forwarded to a read-only instance. Read-only instances are optional components, and the system can be configured to run any number of read-only instances of a specific knowledge base or decision plan on multiple computers. Read-only instances are configured by using the Management Console.

read/write instance

A server component capable of handling both read/write and read-only requests on a knowledge base or decision plan. A single read/write instance should be configured for each knowledge base and decision plan. Read/write instances are configured by using the Management Console.

Relationship Modeling Engine

The online learning engine that uses a knowledge base to classify items based on their textual content or other factors, such as rules, and enables the system to accurately analyze the language, intent, and context of texts. The Relationship Modeling Engine includes both natural language processing and semantic modeling engines.

relevancy

During the matching process, the system assigns the text a relevancy score with respect to each category. This score is a value between 0.0 and 1.0, where a value of 1.0 indicates the text is a perfect match for a category. Content Classification applications take action according to the highest scoring categories.

snapshot

A complete backup of project data (including content set, knowledge base or decision plan, and analysis data).

Using Classification Workbench's Snapshot Explorer, you can take multiple snapshots of your project, and restore previous snapshots, as required. For example, when you make changes to a knowledge base, you can compare your current project with any previous snapshot, to help you assess the impact of these changes on knowledge base performance.

summary reports and graphs

Reports and graphical diagnostics that let you assess how well Content Classification performs on a given testing set using a specific knowledge base or decision plan. For example, after you analyze a knowledge base using a testing set, you can view summary reports and perform graphical diagnostics to evaluate, troubleshoot, and improve performance.

testing set

Also known as the "analysis set", the testing set is a collection of categorized content items that is used to analyze an existing knowledge base. The content items in the testing set are processed by Content Classification, which uses the knowledge base to return suggested categories. The returned categories (or matches) are compared to the categorization information that is a part of each item in the testing set. The differences between the expected category results and the actual results are the basis for evaluating the knowledge base.

threshold

A value that specifies the minimum score that is required in order to determine that a content item belongs to a category. Thresholds can be defined globally or per category. You can set a fixed percentage value or calculate thresholds based on precision, recall, or cost ratio.

Your application can apply actions to texts that receive scores above a predefined threshold. For example, an email response application can send responses when the score of a text exceeds a specified threshold.
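
As an illustration of how an application might apply thresholds, the following Python sketch filters category scores against a global threshold with optional per-category overrides. The function name, parameter names, and sample categories are hypothetical, and treating the threshold as an inclusive minimum is an assumption.

```python
from typing import Dict, List, Optional


def categories_above_threshold(
    scores: Dict[str, float],
    global_threshold: float,
    per_category: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return the categories whose score meets their threshold, best first.

    A per-category threshold, when present, replaces the global one.
    """
    overrides = per_category or {}
    accepted = [
        category
        for category, score in scores.items()
        if score >= overrides.get(category, global_threshold)
    ]
    return sorted(accepted, key=lambda category: scores[category], reverse=True)


# Hypothetical category scores (0 to 100) for one content item.
scores = {"product info": 92.0, "send catalog": 81.0, "complaint": 12.0}

with_global = categories_above_threshold(scores, 80.0)
with_override = categories_above_threshold(scores, 80.0, {"send catalog": 90.0})
```

Here `with_global` keeps both high-scoring categories, while the stricter per-category threshold in `with_override` drops "send catalog".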

training set

A collection of categorized content items that is used to create a knowledge base. Content Classification processes the items in the training set to create the statistical models of each category that make up the knowledge base.

trigger

A constraint expression in a decision plan rule that determines the conditions that must be met to initiate the rule's actions. Each rule has exactly one trigger.