IBM® Content Classification terms and definitions.
- action
- An automatic classification process that is configured in a decision
plan rule and initiated when the rule's trigger conditions
are met. You can configure individual actions such as setting the
value of a field or running matching on a knowledge base.
You can also configure actions that are designed for use with IBM FileNet® Content Manager, IBM Content
Manager, and file system content.
For example, you can configure actions that file and unfile documents
in IBM FileNet Content Manager folders,
set document classes, and declare documents as records in IBM Enterprise
Records.
- category
- A specifically defined division in a system of classification.
A category is a label used to mark texts to indicate that they belong
to a particular class of texts. Content items (for example, emails,
documents, or other texts) are associated with categories to create
a knowledge base. Categories can represent textual content, or indicate
some other attribute of an item, such as its source. In general, each
category has a specific use within an IBM Content
Classification application. Classification Workbench creates categories in
a knowledge base based on the categories indicated in the training
set.
- category scores
- Content Classification analyzes
the textual content and structure of content items to discern the
intent of each item. For a given content item, Content Classification assigns a score to
each category, representing the confidence with which the item is
assigned to that category, and ranks the categories accordingly. The score is a
percentage value representing the likelihood that the category is
correctly identified. For example, an item from a particular Internet
source might contain a question about a product and a request to receive
a catalog. Content Classification might
assign high scores to three categories, relating to the Internet source,
product info, and send catalog.
Category scores (percentage values
from 0 to 100) correspond to the match values (from 0 to 1) used in Classification Workbench graphical analysis.
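As a rough illustration of the correspondence described above, the following Python sketch converts match values (0 to 1) to percentage scores (0 to 100) and ranks the categories. The function names and sample values are hypothetical and are not part of the Content Classification API.

```python
def to_percentage(match_value: float) -> float:
    """Convert a match value in [0, 1] to a percentage score in [0, 100]."""
    if not 0.0 <= match_value <= 1.0:
        raise ValueError(f"match value out of range: {match_value}")
    # Round to avoid floating-point noise such as 81.00000000000001.
    return round(match_value * 100, 2)


def rank_categories(match_values: dict[str, float]) -> list[tuple[str, float]]:
    """Return (category, percentage score) pairs, highest score first."""
    scored = [(name, to_percentage(value)) for name, value in match_values.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical match values for one content item.
ranked = rank_categories({"send catalog": 0.81, "product info": 0.9, "billing": 0.05})
print(ranked)
```

With these sample values, "product info" ranks first at 90.0, followed by "send catalog" at 81.0 and "billing" at 5.0.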
- Classification Center
- A web application that is used to manage the classification of IBM FileNet Content Manager, IBM Content
Manager, and file system content.
You can use the Classification Center to
determine the content to be classified, specify classification options
(such as the decision plan to use and various runtime preferences),
monitor classification activity, and view the classification results.
If necessary, you can also use the Classification Center to reclassify documents
if you determine that different knowledge base categories or decision
plan actions are more applicable.
- Classification Quick Start Tool
- An application that is used to evaluate IBM Content
Classification and create an initial
classification project. You can easily import documents from the file
system and organize them into categories. Tune classification by completing
tasks that are suggested by the Classification Quick Start Tool to
improve classification performance. Then you can view the results,
adjust the ratio between automation and accuracy, and export the project
to Classification Workbench for further
development. 
- Classification Workbench
- An application that is used to create and analyze knowledge bases
and decision plans. With Classification Workbench,
you can also evaluate system performance by importing analysis data
and viewing reports and graphical diagnostics.
- clustering
- A feature that can help users identify potential categories in
an uncategorized (or partially categorized) content set. Content items
are grouped together according to their semantic similarity. Clustering
provides initial groupings ("clusters") of similar items. This feature
can also be used to refine existing categorization, by helping you
to identify new categories in a collection of categorized content
items. The Taxonomy Proposer that
is installed with Classification Workbench provides
enhanced clustering workflows for category discovery.
- concept
- A basic unit of linguistic or quantitative information that is
derived from input text (and its context) that can influence classification.
The linguistic information can include semantic, contextual, and morphological
data. The quantitative information can include various indicators
derived from the text, such as its length. For example, a concept
can be a word (typically in its stem form), a phrase, an indication
of the input source, the length of the text, and so on.
- Content Extractor
- A command-line tool that is used to extract the content from an IBM FileNet Content Manager object store or IBM Content
Manager repository. You can import
the extracted content into Classification Workbench and
use it to train a knowledge base or provide test data for a decision
plan.
- content set
- A body of data (emails, documents, or other texts) supplied by
your organization, imported into Classification Workbench, and used to create and
analyze knowledge bases and decision plans. You use a training
set to create a knowledge base and a testing set to analyze
the knowledge base and generate results for evaluating knowledge base
performance. Each content item in the content set is defined by one
or more fields.
- content type
- A field attribute that determines how content is analyzed by Content Classification's natural language
processing engine. Each content type is configured to achieve optimal
results from text analysis. For example, the body content
type is optimized for email fields that contain the main body text.
- data type
- A field attribute that specifies the kind of data that the field
contains. Data types supported as field values by the API are string
(Unicode) and numeric.
- decision plan
- A collection of rules built in Classification Workbench that determine how Content Classification classifies content
items such as documents or emails. Each rule consists of one trigger
and one or more actions. For example, you can configure rules that
make automatic classification decisions based on keywords, word proximity,
patterns, and so on. Decision plans can refer to one or more knowledge
bases to combine rules with content-based classification.
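The rule structure described above (one trigger and one or more actions per rule) can be sketched as a small data model. This is only an illustration in Python: real decision plans are built in Classification Workbench, and the `Rule` class, the `set_field` helper, and the field names used here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Item = dict[str, str]  # a content item represented as a collection of fields


@dataclass
class Rule:
    name: str
    trigger: Callable[[Item], bool]        # exactly one trigger per rule
    actions: list[Callable[[Item], None]]  # one or more actions per rule

    def __post_init__(self) -> None:
        if not self.actions:
            raise ValueError("a rule needs at least one action")


def set_field(field: str, value: str) -> Callable[[Item], None]:
    """Build an action that sets the value of a field on the item."""
    def action(item: Item) -> None:
        item[field] = value
    return action


def run_decision_plan(rules: list[Rule], item: Item) -> list[str]:
    """Run every rule whose trigger is met; return the names of fired rules."""
    fired = []
    for rule in rules:
        if rule.trigger(item):
            for action in rule.actions:
                action(item)
            fired.append(rule.name)
    return fired


# Hypothetical keyword-based rule.
invoice_rule = Rule(
    name="invoices",
    trigger=lambda item: "invoice" in item["body"].lower(),
    actions=[set_field("doc_class", "Invoice")],
)
document = {"body": "Please find the invoice attached."}
print(run_decision_plan([invoice_rule], document), document["doc_class"])
```

In this sketch the trigger is a plain keyword test. A trigger that consults a knowledge base score would fit the same shape.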
- feedback
- Information that is presented to Content Classification to confirm or correct
its current categorization knowledge. A positive categorization example
confirms that a text belongs to a specific category. Submitting accurate
feedback continually improves Content Classification's
ability to classify texts.
- fields
- Each content item is represented as a collection of fields. Fields
can store the textual content of each item, metadata such as the delivery
channel or author's name, and categorization information.
- knowledge base
- A single file encapsulating data that is required by Content Classification for accurate content-based
classification. The knowledge base stores information accumulated
initially through training, and later through ongoing learning, which
enables the system to improve over time as it learns new categories
and adapts to changes in data.
When a text is submitted to Content Classification for matching, the knowledge
base is used to select the category that is most likely to match the
text. During knowledge base analysis, content items are matched to
categories, and categories are scored based on their relevancy. The
results can be used by applications powered by Content Classification to perform automatic
or semi-automatic actions.
Content Classification supports the use of
multiple monolingual or multilingual knowledge bases. Knowledge bases
can be added to decision plans to combine content-based classification
with rules-based classification.
- learning
- The ability of Content Classification to
continually process feedback, update the knowledge base in real time,
and improve classification accuracy over time. Just as humans constantly
adapt to change, so does Content Classification.
Through exposure to new data, the system accumulates new information
about existing categories, improving its ability to classify over
time.
Learning is an incremental process. System learning simulates
real life, where recently learned information has more impact on classification.
You can use Classification Workbench to
simulate online learning by providing feedback to the system manually,
using a specified set of data.
- listener
- The server component that acts as the entry point to Content Classification. Client requests are
received by the listener and dispatched to the appropriate server
component for processing.
- Management Console
- An application used for Content Classification administration.
You use the Management Console to
create, import, and export knowledge bases and decision plans. You
can also start and stop knowledge base and decision plan instances,
modify properties, add and modify fields, and view the servers on
which the system is running.
- matching
- Matching is the process of determining the relevancy of a text
to categories in the knowledge base. Matching is also referred to
as "suggestion".
- natural language processing
- The manipulation of data by computer systems for the purpose of
understanding and reacting to human language. Content Classification's natural language
processing engine uses morphological rules, word association, and
complex algorithms to analyze textual content and generate concept
models that are required for classification.
- precision and recall rates
- Statistical indicators that can be represented graphically to show
how well Content Classification is
performing. Precision is inversely related to false positives, and
recall is inversely related to false negatives:
- Precision
- Of the content items that Content Classification identified as relevant
to a category, the fraction that are actually relevant to the category (the rest
are false positives).
- Recall
- Of the content items that are actually relevant to a
category, the fraction that Content Classification recognizes as such (the rest are false
negatives).
Content Classification applications
use thresholds to handle the trade-off between recall and precision.
A high threshold results in high precision and lower recall; a low
threshold results in lower precision, but higher recall. It is typically
a business decision that determines where to set the threshold.
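The trade-off between precision and recall can be illustrated with a minimal Python sketch. It assumes that each content item carries a percentage score for one category and a flag that says whether the item is actually relevant to that category. The function name, the sample data, the use of `>=` for the threshold test, and the convention of returning 1.0 when a denominator is zero are all assumptions made for illustration.

```python
def precision_recall(items: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Compute (precision, recall) for one category at a given threshold.

    Each item is (score, is_relevant). An item counts as identified
    when its score is at or above the threshold.
    """
    tp = sum(1 for score, relevant in items if score >= threshold and relevant)
    fp = sum(1 for score, relevant in items if score >= threshold and not relevant)
    fn = sum(1 for score, relevant in items if score < threshold and relevant)
    # Convention for this sketch: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical scored items for a single category.
items = [(95, True), (80, True), (70, False), (60, True), (40, False), (30, True)]
print(precision_recall(items, threshold=75))  # high threshold
print(precision_recall(items, threshold=35))  # low threshold
```

With this sample data, the threshold of 75 gives a precision of 1.0 and a recall of 0.5, and the threshold of 35 gives a precision of 0.6 and a recall of 0.75.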
- read-only instance
- A server component capable of handling read-only requests such
as "Suggest" on a knowledge base or "Decide" on a decision plan. Read-only
requests can be processed by the read/write instance or forwarded
to a read-only instance. Read-only instances are optional components,
and the system can be configured to run any number of read-only instances
of a specific knowledge base or decision plan on multiple computers.
Read-only instances are configured by using the Management Console. 
- read/write instance
- A server component capable of handling both read/write and read-only
requests on a knowledge base or decision plan. A single read/write
instance should be configured for each knowledge base and decision
plan. Read/write instances are configured by using the Management Console.
- Relationship Modeling Engine
- The online learning engine that uses a knowledge base to classify
items based on their textual content or other factors, such as rules,
and enables the system to accurately analyze the language, intent,
and context of texts. The Relationship Modeling Engine includes both
natural language processing and semantic modeling engines.
- relevancy
- During the matching process, the system assigns the text a relevancy
score with respect to each category. This score is a value between
0.0 and 1.0, where a value of 1.0 indicates the text is a perfect
match for a category. Content Classification applications
take action according to the highest scoring categories.
- snapshot
- A complete backup of project data (including content set, knowledge
base or decision plan, and analysis data).
Using the Snapshot Explorer in Classification Workbench,
you can take multiple snapshots of your project, and restore previous
snapshots, as required. For example, when you make changes to a knowledge
base, you can compare your current project with any previous snapshot,
to help you assess the impact of these changes on knowledge base performance.
- summary reports and graphs
- Reports and graphical diagnostics that allow you to assess how well Content Classification performs on a given
testing set using a specific knowledge base or decision plan. For
example, after you analyze a knowledge base using a testing set, you
can view summary reports and perform graphical diagnostics to evaluate,
troubleshoot, and improve performance.
- testing set
- Also known as the "analysis set", the testing set is a collection
of categorized content items that is used to analyze an existing
knowledge base. The content items in the testing set are processed
by Content Classification, which
utilizes the knowledge base to return suggested categories. The returned
categories (or matches) are compared to the categorization information
that is a part of each item in the testing set. The differences between
the expected category results and the actual results are the basis
for evaluating the knowledge base.
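The comparison of expected and actual categories can be sketched in Python as follows. The `suggest` callable stands in for the matching step and is purely hypothetical, as are the `toy_suggest` keyword matcher, the `text` and `category` field names, and the sample items. The sketch compares only the top suggested category.

```python
from typing import Callable


def find_mismatches(
    testing_set: list[dict[str, str]],
    suggest: Callable[[str], str],
) -> list[tuple[str, str, str]]:
    """Return (text, expected category, actual category) for each mismatch."""
    mismatches = []
    for item in testing_set:
        actual = suggest(item["text"])
        if actual != item["category"]:
            mismatches.append((item["text"], item["category"], actual))
    return mismatches


def toy_suggest(text: str) -> str:
    """Stand-in for knowledge base matching: a trivial keyword test."""
    return "billing" if "invoice" in text.lower() else "support"


testing_set = [
    {"text": "Where is my invoice?", "category": "billing"},
    {"text": "The app crashes on start.", "category": "support"},
    {"text": "I was charged twice.", "category": "billing"},
]
mismatches = find_mismatches(testing_set, toy_suggest)
accuracy = 1 - len(mismatches) / len(testing_set)
print(mismatches, accuracy)
```

Here the third item is the only mismatch: it is expected in "billing" but the stand-in matcher returns "support".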
- threshold
- A value that specifies the minimum score that is required in order
to determine that a content item belongs to a category. Thresholds
can be defined globally or per category. You can set a fixed percentage
value or calculate thresholds based on precision, recall, or cost
ratio.
Your application can apply actions to texts that receive
scores above a predefined threshold. For example, an email response
application can send responses when the score of a text exceeds a
specified threshold.
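A minimal Python sketch of global and per-category thresholds follows. The function name, the category names, and the scores are hypothetical, and the sketch treats the threshold as an inclusive minimum score (`>=`), which is an assumption.

```python
from typing import Optional


def categories_meeting_threshold(
    scores: dict[str, float],
    global_threshold: float,
    per_category: Optional[dict[str, float]] = None,
) -> list[str]:
    """Return the categories whose percentage score meets its threshold.

    A per-category threshold, when present, overrides the global one.
    """
    per_category = per_category or {}
    return [
        name
        for name, score in scores.items()
        if score >= per_category.get(name, global_threshold)
    ]


# Hypothetical percentage scores for one email.
scores = {"product info": 88, "send catalog": 72, "complaint": 55}
print(categories_meeting_threshold(scores, global_threshold=70))
print(categories_meeting_threshold(scores, 70, per_category={"send catalog": 80}))
```

With a global threshold of 70, both "product info" and "send catalog" qualify. Raising the threshold for "send catalog" to 80 leaves only "product info", so an application would act on that category alone.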
- training set
- A collection of categorized content items that is used to create
a knowledge base. Content Classification processes
the items in the training set to create the statistical models of each
category that make up the knowledge base.
- trigger
- A constraint expression in a decision plan rule that determines
the conditions that must be met to initiate the rule's actions. Each
rule has exactly one trigger.