This glossary defines terms that are used in the IBM® Watson
Content Analytics product interfaces and
documentation.
For more information about linguistic terms, see the
Glossary of linguistic terms from the Summer
Institute of Linguistics. For more information about Unicode-related
terms, see the
Glossary of Unicode terms from the Unicode
Consortium.
- access control list (ACL)
- In computer security, a list associated with an object that identifies
all the subjects that can access the object and their access rights.
- administrative role
- A classification of a user that determines the access granted to that user.
- analysis engine
- See text analysis engine.
- analysis results
- The information that is produced by annotators. Analysis results
are written to a data structure called a common analysis structure.
Analysis results produced by the custom text analysis engines (annotators)
can be made available for search by inclusion in the index.
- annotation
- Information about a span of text. For example, an annotation could
indicate that a span of text represents a company name. In the Unstructured
Information Management Architecture (UIMA), an annotation is a special
feature structure.
- annotator
- A software component that performs specific linguistic analysis
tasks and produces and records annotations. An annotator is the analysis
logic component in an analysis engine.
- base annotators
- A set of standard text analysis engines used for default document
analysis processing.
- Boolean search
- A search in which one or more search terms are combined by using
operators such as AND, NOT, and OR.
- boost class
- An object that contains specifications that can influence the
relative rank of a document in the search results.
- boost word
- A word that can influence the relative rank of a document in the
search results. During query processing, the importance of a document
that contains a boost word might be raised or lowered, depending on
a score that is predefined for the word.
- category tree
- A hierarchy of categories.
- certificate
- In computer security, a digital document that binds a public key
to the identity of the certificate owner, thereby enabling the certificate
owner to be authenticated. A certificate is issued by a certificate
authority and is digitally signed by that authority.
- certificate authority
- A trusted third-party organization or company that issues the
digital certificates used to create digital signatures and public-private
key pairs. The certificate authority guarantees the identity of the
individuals who are granted the unique certificate.
- character normalization
- A process in which the variant forms of a character, such as capitalization
and diacritical marks, are reduced to a common form.
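As an illustration only, character normalization can be sketched with the Python standard library. The function name normalize_characters is hypothetical, and this is not the normalization that the product performs; it only shows case and diacritic variants being reduced to one form.

```python
import unicodedata


def normalize_characters(text: str) -> str:
    """Reduce case and diacritic variants of characters to a common form."""
    # NFKD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (the diacritics) and keep the base characters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a Unicode-aware way of removing case distinctions.
    return without_marks.casefold()


print(normalize_characters("Café"))    # cafe
print(normalize_characters("RÉSUMÉ"))  # resume
```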
- clitic
- A word that syntactically functions separately but is phonetically
connected to another word. A clitic can be written as connected to or
separate from the word it is bound to. Common examples of clitics
include the last part of a contraction in English (wouldn't or you're).
- collection
- A set of data sources and options for crawling, parsing, indexing,
and searching those data sources.
- common analysis structure (CAS)
- A structure that stores the content and metadata of a document,
and all analysis results that are produced by a text analysis engine.
All data exchange during document analysis is handled by using the
common analysis structure.
- common analysis structure consumer (CAS consumer)
- A consumer that does the final processing on the analysis results
that are stored in the common analysis structure. For example, a consumer
indexes the contents of the common analysis structure in a search
engine or it populates a relational database with specific analysis
results.
- common communication layer (CCL)
- The communication infrastructure that unites the various product
components (controller, parser, crawler, and index server).
- concept extraction
- A text analysis function that identifies significant vocabulary
items (such as people, places, or products) in text documents and
produces a list of those items. See also theme extraction.
- correlation
- An indication of how relevant a facet value is in documents that match the query
conditions. The correlation score measures the uniqueness and frequency of a facet value in
some documents as compared to other documents that match the query. A correlation value that
is higher than 1.0 represents an anomaly that might require further investigation.
- crawl space
- A set of sources that match specified patterns (such as Uniform
Resource Locators (URLs), database names, file system paths, domain
names, and IP addresses) that a crawler reads from to retrieve items
for indexing.
- crawler
- A software program that retrieves documents from data sources
and gathers information that can be used to create search indexes.
- credential
- Detailed information, acquired during authentication, that describes
the user, any group associations, and other security-related identity
attributes. Credentials can be used to perform a multitude of services,
such as authorization, auditing, and delegation. For example, the
sign-on information (user ID and password) for a user is a credential
that allows the user to access an account.
- custom text analysis engine
- A text analysis engine that is created by using the Unstructured
Information Management Architecture (UIMA) software development kit
(SDK) and can be added to the set of standard text analysis engines
(also known as base annotators). See also text analysis engine.
- data source
- Any repository of data from which documents can be retrieved,
such as the web, relational and nonrelational databases, and content
management systems.
- data source type
- A grouping of data sources according to the protocol that is used
to access the data.
- data store
- A data structure where documents are kept in their parsed form.
- dequeue
- To remove items from a queue.
- diacritic
- A mark indicating a change in the phonetic value of a character
or a combination of characters.
- discoverer
- A function of a crawler that determines which data sources are
available for the crawler to retrieve information from.
- distinguished name
- The name that uniquely identifies an entry in a directory. A distinguished
name consists of attribute:value pairs, separated by commas. Also,
a set of name-value pairs (such as CN=person's name and C=country
or region) that uniquely identifies an entity in a digital certificate.
- Document Object Model
- A system in which a structured document, such as an XML file,
is viewed as a tree of objects that can be programmatically accessed
and updated.
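The tree view that this entry describes can be sketched with the xml.dom.minidom module from the Python standard library. The sample XML is invented for illustration.

```python
from xml.dom.minidom import parseString

# Parse a small XML document into a tree of node objects.
doc = parseString("<doc><title>Old</title></doc>")

# Access a node programmatically ...
title = doc.getElementsByTagName("title")[0]
print(title.firstChild.data)  # Old

# ... and update it in place.
title.firstChild.data = "New"
print(doc.documentElement.toxml())  # <doc><title>New</title></doc>
```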
- Domino® Document Manager cabinet
- A Domino Document Manager database that is used to organize documents.
Cabinets hold Domino databases.
- Domino Document Manager library
- A Domino Document Manager database that is the entry point to
Domino Document Manager.
- Domino Internet Inter-ORB Protocol (DIIOP)
- A server task that runs on the server and works with the Domino
Object Request Broker to allow communication between Java™ applets that are created with the Notes® Java
classes and the Domino server. Browser users and Domino servers use
DIIOP to communicate and to exchange object data.
- dynamic ranking
- A type of ranking in which the terms in the query are analyzed
with respect to the documents that are being searched to determine
the rank of results. See also text-based scoring. Contrast with
static ranking.
- dynamic summarization
- A type of summarization in which the search terms are highlighted
and the search results contain phrases that best represent the concepts
of the document that the user is searching for. Contrast with static summarization.
- enqueue
- To put a message or item in a queue.
- escape character
- A character that suppresses or selects a special meaning for one
or more characters that follow.
- facet
- A clearly defined property of a subject. Facets for a given subject
are mutually exclusive and collectively exhaustive. Faceted classification
schemes differ from hierarchical categorization schemes in that more
than one facet can be used to find items of interest.
- facet value
- The combination of a facet with a specific character string,
such as a facet named City combined with the string New York.
- faceted browsing
- A process of browsing information by filtering a set of topics
by progressively selecting from only valid values of a faceted classification
system, which is a predefined collection of facets.
- feature path
- A path that is used to access the value of a feature in an Unstructured
Information Management Architecture (UIMA) feature structure.
- feature structure
- The underlying data structure that represents the result of text
analysis. A feature structure is an attribute-value structure. Each
feature structure is of a type, and every type has a specified set
of valid features or attributes, much like a Java class.
- federated search
- A search capability that enables searches across multiple search
services and returns a consolidated list of search results.
- federation
- The process of combining naming systems so that the aggregate
system can process composite names that span the naming systems.
- field
- An area into which a particular category of data or control information
is entered.
- fielded search
- A query that is restricted to a particular field.
- free-form text
- Unstructured text consisting of words or sentences.
- free text search
- A search in which the search term is expressed as free-form text.
- frequency
- An indication of how many documents in the queried document set
contain a given facet value.
- full-text index
- A data structure that references data items to enable a search
to find documents that contain the query terms.
- fuzzy search
- A search that returns words with spelling that is similar to that
of the search term.
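A minimal sketch of the idea, assuming a small in-memory vocabulary and the similarity measure of the Python difflib module. The function fuzzy_lookup, the sample terms, and the 0.8 cutoff are hypothetical; the product's own matching method is not described here.

```python
from difflib import get_close_matches


def fuzzy_lookup(term, vocabulary, cutoff=0.8):
    """Return vocabulary words whose spelling is similar to the search term."""
    # get_close_matches ranks candidates by a similarity ratio between 0 and 1
    # and keeps only those that score at least `cutoff`.
    return get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


index_terms = ["analytics", "analysis", "annotator", "crawler"]
print(fuzzy_lookup("anlytics", index_terms))
```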
- gloss
- A unit of information that is associated with an ICA Studio dictionary entry, such as
the lemma, part of speech, or synonyms.
- hybrid search
- A combined Boolean search and free text search.
- identity management
- A set of APIs that control access to secure data and enable users
to search a collection without being required to specify a user ID
and password for each repository in the collection.
- index
- See full-text index.
- index cache
- A buffer that holds data that enables the index to be rebuilt
without recrawling documents.
- index field
- A field that exists only in the index to represent data that is
common between multiple input sources. Index fields can help users
retrieve documents without needing to be knowledgeable about actual
field names.
- inflection
- A variation in the form of a word to reflect grammatical information,
such as gender, tense, number, or person. Inflections are typically
generated by adding affixes.
- information extraction
- A type of concept extraction that automatically recognizes significant
vocabulary items, such as names, terms, and expressions, in text documents.
- IP address
- A unique address for a device or logical unit on a network that
uses the IP standard.
- Java Database Connectivity (JDBC)
- An industry standard for database-independent connectivity between
the Java platform and a wide range of databases. The JDBC interface
provides a call-level API for SQL-based database access.
- JavaScript
- A web scripting language that is used in browsers and web servers.
- JavaServer Pages (JSP)
- A server scripting technology that enables Java code to be dynamically
embedded within web pages (HTML files) and executed when the page
is served, in order to return dynamic content to a client.
- Java virtual machine (JVM)
- A software implementation of a processor that runs compiled Java
code (applets and applications).
- Katakana
- A character set that consists of symbols that are used in one
of the two common Japanese phonetic alphabets, which is used primarily
to write foreign words phonetically.
- key database file
- See key ring.
- key ring
- In computer security, a file that contains public keys, private
keys, trusted roots, and certificates. See also keystore file.
- keystore file
- A key ring that contains both public keys that are stored as signer
certificates and private keys that are stored in personal certificates.
- language identification
- A search function that determines the language of a document.
- lemma
- The base form of a word plus inflected forms that share the same
part of speech.
- lemmatization
- A process that determines the lemma for each word form that occurs
in text. The lemma of a word encompasses its base form plus inflected
forms that share the same part of speech. For example, the lemma for
go encompasses go, goes, went, gone, and going. Lemmas for nouns group
singular and plural forms (such as calf and calves). Lemmas for adjectives
group comparative and superlative forms (such as good, better, and
best). Lemmas for pronouns group different grammatical cases of the
same pronoun (such as I, me, my, and mine).
- lexical affinity
- The relationship of search words in a document that are close
to each other in meaning. Lexical affinity is used to calculate the
relevancy of a result.
- lexical analysis
- The process by which a sequence of characters is grouped into
a series of lexical items, known as tokens, and all available dictionary
data is associated with the lexical items. Lexical analysis comprises
three separate steps: segmentation, normalization, and annotation.
- library
- A system object that serves as a directory to other objects. See
also Domino Document Manager library.
- ligature
- Two or more characters that are connected so they appear as one
character. For example, ff and ffi are characters that can be presented
as ligatures.
- Lightweight Directory Access Protocol (LDAP)
- An open protocol that uses TCP/IP to provide access to directories
that support an X.500 model and that does not incur the resource requirements
of the more complex X.500 Directory Access Protocol (DAP). For example,
LDAP can be used to locate people, organizations, and other resources
in an Internet or intranet directory.
- linguistic search
- A search type that browses, retrieves, and indexes a document
with terms that are reduced to their base form (for example, so that mice is
indexed as mouse) or expanded with their base form (as with
compound words).
- link analysis
- A method that is based on the analysis of hyperlinks between documents
and used to determine what pages in the collection are important to
users.
- local federator
- A client object created by the search and index APIs that enables
users to search a set of heterogeneous collections and obtain a unified
set of search results.
- Lotus Quickr place
- A web venue that is provided by Lotus® Quickr® that enables geographically
dispersed participants to collaborate on projects and communicate
online in a structured and secure workspace.
- Lotus Quickr room
- A partitioned area of a Lotus Quickr place
that is restricted to authorized members who share a common interest
and a need to work collectively.
- masking character
- A character that is used to represent optional characters at the
front, middle, and end of a search term. Masking characters are normally
used for finding variations of a term in an index. See also wildcard character.
- master administrator
- An administrative role that enables a user to administer the entire Watson Content Analytics system.
- MIME type
- An Internet standard for identifying the type of object that is
being transferred across the Internet.
- monitor
- A user who has the authority to observe collection-level processes.
- newline character
- A control character that causes the print or display position
to move down one line.
- n-gram segmentation
- A segmentation method that considers overlapping sequences of
a specific number of characters as a single word. See also segmentation. Contrast with Unicode-based white space segmentation.
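For illustration, overlapping character sequences can be produced as follows. The function name char_ngrams and the choice of n=2 are assumptions for this sketch, not product settings.

```python
def char_ngrams(text, n=2):
    """Return every overlapping sequence of n characters in the text."""
    # Slide a window of width n across the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("東京都庁"))  # ['東京', '京都', '都庁']
```

Each returned sequence is treated as one word, which is useful for text that has no white space between words.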
- no-follow directive
- A directive in a web page that instructs robots (such as the Web
crawler) to not follow links found in that page.
- no-index directive
- A directive in a web page that instructs robots (such as the Web
crawler) to not include the contents of that page in the index.
- normalization
- The process of replacing surface form representations with their
canonical form. This can include case normalization (such as replacing Run with run),
grammatical normalization (such as replacing runs with run),
and lexicographical normalization (such as replacing Unicode full
width characters with Unicode basic form, or removing white spaces
from Chinese text).
- normalized form
- A form of a word or multi-word unit after it has undergone a process
of normalization. The normalized form is also known as a lemma or
stem.
- Notes remote procedure call (NRPC)
- A communication mechanism of Lotus Notes® that is used for all Notes-to-Notes communication.
- out of vocabulary (OOV) word
- A word that is not included in the base ICA Studio dictionary that is used
for word recognition.
- opaque term
- A query term that is not parsed by the linguistic query parser.
Instead, opaque terms are identified by their syntax to be implementation-specific,
such as specific to the syntax for searching XML documents with an
XML query language. Opaque query terms begin with the @ character
and the query language identifier. For example, @xmlf2 specifies that
the query is to be handled by the XML fragment query language, and
@xmlp specifies that the query is to be handled by the XPath query
language.
- operator
- A user who has the authority to observe, start, and stop collection-level
processes.
- parametric search
- A type of search that looks for objects that contain a numeric
value or attribute, such as dates, integers, or other numeric data
types within a specified range.
- parser
- A program that interprets documents that are added to the data
store. The parser extracts information from the documents and prepares
them for indexing, search, and retrieval.
- parser driver
- A service that feeds the parser service with documents. There
is one parser driver for each collection. A collection's parser driver
service corresponds to the collection's parser in the administration
console.
- parser service
- The service that handles all document parsing and text analysis
processing across document collections. At least one parser service
is running at all times.
- place
- A virtual location that is visible in the portal where individuals
and groups meet to collaborate. In a portal, each user has a personal
place for private work, and individuals and groups have access to
a variety of shared places, which can be either public places or restricted
places. See also Lotus Quickr place.
- popular ranking
- A type of ranking that raises a document's existing ranking based
on the document's popularity.
- processing engine archive
- A .pear zip archive file that includes an Unstructured Information
Management Architecture (UIMA) analysis engine and all of the resources
required to use it for custom analysis.
- proximity search
- A text search that returns a result when two search patterns occur
within a specified distance from each other.
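A minimal sketch, assuming that distance is measured in tokens and that the search patterns are single words. The function within_distance is hypothetical and is not the product's query syntax.

```python
import re


def within_distance(text, first, second, max_distance):
    """Return True if the two words occur within max_distance tokens of each other."""
    tokens = re.findall(r"\w+", text.lower())
    # Record every position at which each word occurs.
    first_positions = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    # A match exists if any pair of positions is close enough.
    return any(abs(i - j) <= max_distance
               for i in first_positions for j in second_positions)


sentence = "the quick brown fox jumps over the lazy dog"
print(within_distance(sentence, "quick", "fox", 2))  # True
print(within_distance(sentence, "quick", "dog", 3))  # False
```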
- proxy server
- A server that acts as an intermediary for HTTP web requests that
are hosted by an application or a web server. A proxy server acts
as a surrogate for the content servers in the enterprise.
- query expansion
- Adding search terms to a user's search string. For example, the
search string phone might be expanded to include the
terms telephone, mobile phone, and cellular
phone.
- quick link
- An association between a Uniform Resource Identifier (URI) and
keywords or phrases.
- ranking
- The assignment of an integer value to each document in the search
results from a query. The order of the documents in the search results
is based on the relevance to the query. A higher rank signifies a
closer match. See also dynamic ranking and static ranking.
- raw data store
- A data structure where crawled documents are stored before they
are sent to the parser. Crawlers write to the raw data store, and
the parser reads from the raw data store. When documents have been
parsed, they are removed from the raw data store. Not to be confused
with data store.
- regular expression annotator
- A software component that detects entities or units of information
in a text document, such as product numbers, based on regular expressions
that describe the exact patterns that are searched in the document
text. If one of the regular expressions matches parts of the document
text, the regular expression annotator creates the corresponding annotations
that cover the match or part of it. These annotated expressions are
then stored, either in the index by using an index mapping file, or
in a JDBC-capable database by using a database mapping file.
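The matching step can be sketched in a few lines. The Annotation class, the ProductNumber type name, and the product number pattern (two uppercase letters, a hyphen, four digits) are all hypothetical; they stand in for the UIMA structures and the configured rules of a real regular expression annotator.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    type_name: str     # what the span represents
    begin: int         # offset of the first covered character
    end: int           # offset just past the last covered character
    covered_text: str  # the text that the annotation covers


# Assumed pattern for a product number, for example AB-1234.
PRODUCT_NUMBER = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def annotate_product_numbers(text):
    """Create one annotation for each match of the pattern in the text."""
    return [Annotation("ProductNumber", m.start(), m.end(), m.group())
            for m in PRODUCT_NUMBER.finditer(text)]


for annotation in annotate_product_numbers("Order AB-1234 and CD-5678 today."):
    print(annotation)
```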
- remote federator
- A server federator that federates a set of searchable objects.
- Robots Exclusion Protocol
- A protocol that allows website administrators to indicate to visiting
robots which parts of their site should not be visited by the robot.
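As a sketch of how a robot might honor these rules, the Python standard library includes a robots.txt parser. The rules, the crawler name ExampleCrawler, and the URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules as a site administrator might publish them in robots.txt.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved robot checks each URL before it visits the page.
print(parser.can_fetch("ExampleCrawler", "http://example.com/private/report.html"))  # False
print(parser.can_fetch("ExampleCrawler", "http://example.com/public/index.html"))    # True
```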
- room
- A program that allows users to create documents for others to
read, respond to comments from others, and review project status and
deadlines. Users can also chat with others who are in the same room.
See also Lotus Quickr room.
- rule-based category
- A category that is created by rules that specify which documents
are associated with which categories. For example, you can define
rules to associate documents that contain or exclude certain words,
or that match a Uniform Resource Identifier (URI) pattern, with specific
categories.
- search application
- A program that processes queries, searches the index, returns
the search results, and retrieves the source documents.
- search cache
- A buffer that holds the data and results of previous search requests.
- search engine
- A program that accepts a search request and returns a list of
documents to the user.
- search results
- A list of documents that match the search request.
- Secure Sockets Layer (SSL)
- A security protocol that provides communication privacy. With
SSL, client/server applications can communicate in a way that is designed
to prevent eavesdropping, tampering, and message forgery.
- security token
- Information about identity and security that is used to authorize
access to documents in a collection. Different data source types support
different types of security tokens. Examples include user roles, user
IDs, group IDs, and other information that can be used to control
access to content.
- seed list page
- In WebSphere Portal, an XML page that contains links to the pages
that are available on a portal. Crawlers use the seed list to identify
the documents to crawl. The seed list page also contains metadata
that is stored with the crawled documents in the index.
- segmentation
- The division of text into distinct lexical units such as words,
phrases, sentences, paragraphs, or lemmas. See also n-gram segmentation and Unicode-based white space segmentation.
- semantic search
- A type of keyword search that incorporates linguistic and contextual
analysis. See also text analysis.
- servlet
- A Java program that runs on a web server and extends the server's
functionality by generating dynamic content in response to web client
requests. Servlets are commonly used to connect databases to the web.
- shingle
- A string of consecutive tokens (words) that are taken from a sentence.
For example, from "This is a very short sentence.", the 3-word shingles
(or trigrams) are:
This is a
is a very
a very short
very short sentence
Shingles can be used in statistical
linguistics. For example, if two different texts have a lot of common
shingles, the texts are probably related somehow.
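The example above can be reproduced with a short sketch. The function name word_shingles is hypothetical, and splitting on word characters is a simplifying assumption about tokenization.

```python
import re


def word_shingles(sentence, size=3):
    """Return every run of `size` consecutive words in the sentence."""
    tokens = re.findall(r"\w+", sentence)
    # Slide a window of `size` tokens across the token list.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


for shingle in word_shingles("This is a very short sentence."):
    print(shingle)
```

To compare two texts, the shingles of each can be collected into sets and the size of the intersection used as a rough indication of relatedness.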
- soft error page
- A type of web page that provides information about why the requested
web page cannot be returned. For example, instead of returning a simple
status code, the HTTP server can return a page that explains the status
code in detail.
- static ranking
- A type of ranking in which factors about the documents that are
being ranked, such as date, the number of links that point to the
document, and so on, augment the rank. Contrast with dynamic ranking.
- start Uniform Resource Locator (URL)
- The starting point for a crawl.
- static summarization
- A type of summarization in which the search results contain a
specified, stored summary from the document. Contrast with dynamic summarization.
- stemming
- See word stemming.
- stop word
- A word that is commonly used, such as the, an, or and,
that is ignored by a search application.
- stop word removal
- The process of removing stop words from the query to ignore common
words and return more relevant results.
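A minimal sketch, assuming a small hand-written stop word list and white space tokenization. Real stop word lists are language specific and much longer.

```python
# Assumed stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def remove_stop_words(query):
    """Return the query terms that are not stop words."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]


print(remove_stop_words("the history of an index"))  # ['history', 'index']
```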
- surface form
- The form of a word or multi-word unit as it is found in the unprocessed
input text.
- summarization
- The process of including non-redundant sentences in search results
to briefly describe the content of a document. See also dynamic summarization and static summarization.
- synonym dictionary
- A dictionary that enables users to search for synonyms of their
query terms when they search a collection.
- taxonomy
- A classification of objects into groups based on similarities.
A taxonomy organizes data into categories and subcategories. See also category tree.
- text analysis
- The process of extracting semantics and other information from
text to enhance the retrievability of data in a collection. See also semantic search.
- text analytics
- A form of natural language processing that includes linguistic,
statistical, and machine learning techniques for analyzing text and
extracting key information for business integration.
- text analysis engine
- A software component that is responsible for finding and representing
context and semantic content in text.
- text-based scoring
- The process of assigning an integer value to a document that signifies
the relevance of the document with respect to the terms in a query.
A higher integer value signifies a closer match to the query. See
also dynamic ranking.
- text extractor
- A component that uses document filtering technology based on Oracle
Outside In Content Access to identify document formats.
- text segmentation
- See segmentation.
- theme extraction
- A type of concept extraction that automatically recognizes significant
vocabulary items in text documents to extract the theme or topic of
a document. See also concept extraction.
- token
- A span of text to be considered as a meaningful unit
for higher level processing, such as indexing. A token is typically
a word, a number, an acronym, or other entity that has syntactic or
semantic value.
- tokenization
- The process of parsing input into tokens.
- tokenizer
- A text segmentation program that scans text and determines if
and when a series of characters can be recognized as a token.
- trailing character
- A character that holds the last position in a word.
- type system
- The type system defines the types of objects (feature structures)
that may be discovered by a text analysis engine in a document. The
type system defines all possible feature structures in terms of types
and features. You can define any number of different types in a type
system. A type system is domain and application specific.
- Unicode-based white space segmentation
- A method of tokenization that uses Unicode character properties
to distinguish between token and separator characters. See also segmentation. Contrast with n-gram segmentation.
- Uniform Resource Identifier (URI)
- A compact string of characters that identifies an abstract or
physical resource.
- Uniform Resource Locator (URL)
- The unique address of an information resource that is accessible
in a network such as the Internet. The URL includes the abbreviated
name of the protocol used to access the information resource and the
information used by the protocol to locate the information resource.
- Unstructured Information Management Architecture (UIMA)
- An IBM architecture that defines a framework for
implementing systems for the analysis of unstructured data.
- user agent
- An application that browses the web and leaves information about
itself at the sites that it visits. For example, the Web crawler is
a user agent.
- Web crawler
- A type of crawler that explores the web by retrieving a web document
and following the links within that document.
- weighted term search
- A query in which certain terms are given more importance.
- wildcard character
- A character that is used to represent optional characters at the
front, middle, or end of a search term.
- word stemming
- A process of linguistic normalization in which the variant forms
of a word are reduced to a common form. For example, words like connections, connective,
and connected are reduced to connect.
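The example above can be imitated with a toy suffix stripper. This is not a production stemming algorithm and not the one that the product uses; the suffix list and the minimum stem length of three characters are assumptions chosen to fit the example.

```python
# Assumed suffix list, checked in order so that longer suffixes win.
SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


for form in ("connections", "connective", "connected"):
    print(form, "->", naive_stem(form))
```

A simple stripper like this produces poor stems for many words, which is why practical stemmers apply ordered rule sets with additional conditions.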
- XML Path Language (XPath)
- A language that is designed to uniquely identify or address parts
of source XML data, for use with XML-related technologies, such as
XSLT, XQuery, and XML parsers. XPath is a World Wide Web Consortium
standard.
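For illustration, the Python xml.etree.ElementTree module supports a limited subset of XPath, which is enough to show how an expression addresses parts of an XML document. The sample catalog is invented.

```python
import xml.etree.ElementTree as ET

xml_source = """
<catalog>
  <book lang="en"><title>Search Basics</title></book>
  <book lang="de"><title>Suchgrundlagen</title></book>
</catalog>
""".strip()

root = ET.fromstring(xml_source)

# Address the title of every book whose lang attribute is "en".
english_titles = [title.text for title in root.findall("./book[@lang='en']/title")]
print(english_titles)  # ['Search Basics']
```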