Content-Based Retrieval
The Content Engine Java™ and .NET APIs include a number of interfaces for content-based retrieval (CBR) administrative functions for IBM® FileNet Content Search Engine. By using the APIs, you can configure domains and servers, establish and configure index areas, and initiate and manage index jobs.
Indexing and Index Jobs
IBM Content Search Services indexing
aggregates data (in the form of indexes) to support full-text searches
of the content of objects and the string-valued properties of those
objects. Only objects and string-valued properties that are enabled
for CBR are included in full-text searches. CBR-enablement is controlled
by the Boolean value of the IsCBREnabled property
on ClassDefinition and PropertyDefinitionString objects. For
the content of an object to be enabled for full-text search and to
allow its string-valued properties to be CBR-enabled, you must enable
CBR for the ClassDefinition object that defines the
object's class (Document, Annotation, Folder,
or CustomObject class and subclasses only). For the
value of a string-valued property of a CBR-enabled object to be enabled
for full-text search, you must enable CBR for the PropertyDefinitionString object
that defines the property. Indexing is done automatically for all
CBR-enabled objects and properties. Because the indexing operation
is a batched, asynchronous operation, its results are not immediately
evident. For more information about enabling CBR, see Content
Searches.
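The two-level enablement rule above can be sketched as a small, self-contained model. This is not the Content Engine API: the ClassModel type, its fields, and the property names used in the usage note are hypothetical stand-ins for the IsCBREnabled flags on ClassDefinition and PropertyDefinitionString objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the CBR-enablement rule; not the Content Engine API. */
final class CbrEligibility {

    /** Stand-in for a class definition: its CBR flag plus the CBR flag of each string property. */
    static final class ClassModel {
        final boolean classCbrEnabled;
        final Map<String, Boolean> stringPropertyCbrEnabled = new LinkedHashMap<>();

        ClassModel(boolean classCbrEnabled) {
            this.classCbrEnabled = classCbrEnabled;
        }

        /** Registers a string-valued property and its own CBR flag. */
        ClassModel property(String name, boolean cbrEnabled) {
            stringPropertyCbrEnabled.put(name, cbrEnabled);
            return this;
        }
    }

    /** Content is full-text indexed only when the class itself is CBR-enabled. */
    static boolean isContentIndexed(ClassModel c) {
        return c.classCbrEnabled;
    }

    /** A string property is indexed only when both the class and the property are CBR-enabled. */
    static List<String> indexedProperties(ClassModel c) {
        List<String> result = new ArrayList<>();
        if (!c.classCbrEnabled) {
            return result; // the class-level flag gates every property
        }
        for (Map.Entry<String, Boolean> e : c.stringPropertyCbrEnabled.entrySet()) {
            if (e.getValue()) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

For example, a CBR-enabled class with an enabled "Title" property and a disabled "Notes" property yields only "Title" from indexedProperties, while a class whose own flag is off yields an empty list regardless of its property flags.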
With an IndexJob object,
you can track the status of an index job and initiate and control
the job. Usually, you initiate an index job to rebuild an index that
is corrupted, or to accommodate a configuration change. The IndexRequests
property on an IndexJob object contains a listing
of all index requests that are associated with the index job. The CmTextSearchIndexRequest classes
provide read and update operations, and status and tracking information
for index requests.
The IndexJobItem base
class is subclassed to provide for particular types of index jobs:
- Class index job (IndexJobClassItem): All instances of the specified class are full-text indexed. Class index jobs require a table scan on the database, even if the amount of data to be indexed is minimal. A significant amount of time is required to scan a large table. The database tables are scanned once for all classes to be indexed. To minimize the number of table scans required, use a single index job operation for all classes to be indexed for the same table.
- Single object index job (IndexJobSingleItem): A single object is full-text indexed.
- Reindex index job: An existing index is reindexed.
During an indexing operation, all currently indexed data is available for use in full-text searches. However, new index data might become available while a full-text query is in progress. In this case, the query returns duplicate matches because it uses both old and new indexed data. When the index job completes, old copies of indexed data are removed and duplicate matches no longer occur.
Canceling an Index Job
An index job can be canceled by setting its JobAbortRequested
property to true, or as a result of an unexpected
error. When an index job is canceled, all of its related index requests
are deleted. If the index job is canceled by an administrator, the
JobStatus property on the IndexJob object is set
to CANCELED. If the index job stops as the result
of an unexpected error, the JobStatus property has a value of TERMINATED_ABNORMALLY.
A canceled index job produces a message of the following form:
An index job was cancelled on Object Store <objectstoreName>. The following full-text indexes will need to be replaced by running full-text index reindex job(s) or base class with subclasses index job(s):
Full-text index name: <name> in Index Area: <index area name>
Full-text index name: <name> in Index Area: <index area name>
It is recommended that the administrator resolve this situation by creating index jobs to reindex the original indexes, depending on the type of index job that is canceled:
- Index job for a reindex operation: If this type of index job is canceled, the resource state of the original index remains set to CLOSED. It is recommended that the administrator create another index job on this CLOSED index.
- Index job for a root class with subclasses: If this type of index job is canceled, there might be multiple indexes that have a resource state set to CLOSED. It is recommended that the administrator either create another index job on the root class with subclasses or create an index job for each of the CLOSED indexes.
For more information, see Status of Index Areas and Indexes.
Pausing and Resuming an Index Job
If you pause an index job, by setting the JobPauseRequested
property of an IndexJob object to true,
the JobStatus property of the index job is updated to IndexJobStatus.PAUSED,
and the dispatching of new index requests by the index job is halted.
Existing index requests are not paused. If you resume an index job,
by setting the JobPauseRequested property of an IndexJob object
to false, the JobStatus property of the index job
is updated to IndexJobStatus.IN_PROGRESS, and the
dispatching of new index requests by the index job is allowed.
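The pause and resume behavior can be illustrated with a minimal in-memory model. The class below is a hypothetical sketch, not the IndexJob interface: it only mirrors the described effect of the JobPauseRequested flag on the job status and on the dispatching of new index requests, and it assumes that requests already dispatched simply stay dispatched.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of pausing and resuming an index job; not the Content Engine API. */
final class IndexJobPauseModel {

    enum JobStatus { IN_PROGRESS, PAUSED }

    private JobStatus status = JobStatus.IN_PROGRESS;
    private final Deque<String> notYetDispatched = new ArrayDeque<>();
    private final List<String> dispatched = new ArrayList<>();

    IndexJobPauseModel(List<String> objectIds) {
        notYetDispatched.addAll(objectIds);
    }

    /** Mirrors setting JobPauseRequested: true pauses the job, false resumes it. */
    void setJobPauseRequested(boolean pause) {
        status = pause ? JobStatus.PAUSED : JobStatus.IN_PROGRESS;
    }

    /** Dispatches up to max new index requests; a paused job dispatches none. */
    int dispatch(int max) {
        if (status == JobStatus.PAUSED) {
            return 0; // new requests are halted, existing ones are left alone
        }
        int count = 0;
        while (count < max && !notYetDispatched.isEmpty()) {
            dispatched.add(notYetDispatched.poll());
            count++;
        }
        return count;
    }

    JobStatus status() {
        return status;
    }

    List<String> dispatched() {
        return dispatched;
    }
}
```

In this model, pausing after one request has been dispatched leaves that request in the dispatched list, and later dispatch calls return 0 until the flag is set back to false.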
Indexing Error Handling
All indexing errors of any kind are recorded in the Content Engine log file. Optionally, the
indexing errors also can be persisted to an object store. The ObjectStore property
IndexingFailureRecordingLevel determines whether indexing errors are
persisted. The default behavior is for indexing errors to be logged
only, rather than persisted.
Indexing Failure Codes
If an indexing failure of a CBR-enabled object occurs, the CmIndexingFailureCode
property of the CBR-enabled object is set to an IndexingFailureCode failure
code constant and the failure is recorded in the Content Engine log file. If you set the
value of the IndexingFailureRecordingLevel property of an ObjectStore object
to PROPAGATE_TO_SOURCE, the error information is
propagated to the CmIndexingFailureCode property of all CBR-enabled
objects.
The indexing failure code is recorded only when the associated object is processed by the Content Search Engine and some or all of the object's content could not be full-text indexed (due, for example, to text size limits causing truncation, or use of an unsupported format type). An indexing failure code does not necessarily mean that the object was not at least partially indexed; the server attempts to index as much of the object as it can. If a CBR-enabled object is successfully indexed, its CmIndexingFailureCode property has a value of zero.
System failures do not generate an indexing failure code. In such cases, the indexing operation is retained, a description of the error is written to the Content Engine log file, the error is recorded as part of the index request last failure description, and the indexing operation is automatically tried again when the system is in a stable state.
Indexing Request Errors
The
cause of an indexing request error is recorded in the associated index
request object (CmTextSearchIndexRequest object).
Indexing request errors are recorded in the LastFailureReason property
of the index request object. When an index request associated with
an object is successfully processed, any failed index requests for
that same object are deleted. In other words, a CBR-enabled object
does not have any failed index requests (the CmIndexingFailureCode
property is zero) if it is successfully indexed. An index request
is typically retained only in the case of a system failure and the
index request needs to be tried again.
Indexing Job Errors
In some cases, an error might be related to index job processing,
rather than index request processing. For example, at the end of some
index jobs the index must be deleted, and it is possible for the deletion
to fail. However, this error does not mean the index job itself failed:
an index job that successfully submits all of the associated index
requests always completes with a JobStatus property value of TERMINATED_NORMALLY (successful
completion).
If a failure occurs that is related to an index
job, the identified reason for the failure is recorded in the LastFailureReason
property on the IndexJob object.
Domain and Server Information
Full-text indexing and searching
require that at least one IBM Content Search Services server
configuration be associated with your FileNet
P8
domain. To create such
a configuration, create a CmTextSearchServer object
by using one of the Factory.CmTextSearchServer methods
and specify the domain object that represents your FileNet
P8
domain. Each CmTextSearchServer object
that you create and associate with a domain is automatically added
to the read-only CmTextSearchServerSet collection
object returned by the TextSearchServers property of the Domain object.
After you create an IBM Content Search Services server,
you must set its TextSearchServerStatus property to ENABLED for
it to be recognized by the Content Engine server.
If the Content Engine server
cannot communicate with the IBM Content Search Services server,
it automatically sets its TextSearchServerStatus property to UNAVAILABLE.
If you do not want the IBM Content Search Services server
to be available, set its TextSearchServerStatus property to DISABLED.
The host, port number, and connection token of the IBM Content Search Services server must be set by
using the CmTextSearchServer object properties.
IBM Content Search Services servers and FileNet P8 objects have the following many-to-one relationships:
- Multiple object stores can share the same IBM Content Search Services server configuration object.
- Multiple IBM Content Search Services server configuration objects can exist for a FileNet P8 domain.
The properties of a CmTextSearchConfiguration object enable you
to control IBM Content Search Services functions
on the Content Engine server.
A CmTextSearchConfiguration object is contained in
the SubsystemConfiguration property of the Domain, Site, VirtualServer,
and ServerInstance classes. The CmTextSearchConfiguration instance
to be used is determined by these classes in this order: ServerInstance, VirtualServer, Site,
and Domain.
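One way to read the lookup order above is as a most-specific-first search that stops at the first level where a configuration is present. The sketch below models that reading with plain strings standing in for CmTextSearchConfiguration objects; the Level enum and the resolve method are hypothetical and are not part of the Content Engine API.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of the configuration lookup order; not the Content Engine API. */
final class TextSearchConfigResolver {

    /** Declared in the lookup order given in the text: most specific level first. */
    enum Level { SERVER_INSTANCE, VIRTUAL_SERVER, SITE, DOMAIN }

    /**
     * Returns the configuration from the first level, in declaration order,
     * that has one. Strings stand in for configuration objects.
     */
    static Optional<String> resolve(Map<Level, String> configured) {
        for (Level level : Level.values()) {
            String config = configured.get(level);
            if (config != null) {
                return Optional.of(config);
            }
        }
        return Optional.empty();
    }
}
```

With configurations present only at the site and domain levels, resolve returns the site-level one; adding a server-instance configuration makes that one win instead.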
Index Areas and Indexes
An IBM Content Search Services index area is a file system directory that contains the information necessary to perform full-text indexing that is updated and queried by IBM Content Search Services. A many-to-one relationship exists between an index area and an object store. Each index area is dedicated to a single object store, but you can have multiple index areas for an object store on a single file system, or you can distribute the indexing information for an object store in multiple index areas across file systems.
Each index area is represented by a CmTextSearchIndexArea object.
The file system location of an index area is stored in its RootDirectoryPath
property.
Each index area can hold multiple indexes (CmTextSearchIndex objects),
which are specified by its TextSearchIndexes property. A many-to-one
relationship exists between indexes and an index area; CmTextSearchIndex objects
are created automatically in the associated index area, as needed. When
an indexable class is instantiated, the CmTextSearchIndex object
that is associated with its base class, and any index partitioning
properties that are defined on the object store, can be used to reference
the full-text indexing information. If no CmTextSearchIndex object
is associated with the base class and index partitioning properties,
a new CmTextSearchIndex object (and the corresponding
index that is maintained by IBM Content Search Services)
is created. The index is identified by its IndexName property.
IBM Content Search Services indexing
and search servers update and query the indexes. The indexes in an
index area are only accessible to the servers that are in the same
site as the index area (Site property of the CmTextSearchServer and CmTextSearchIndexArea objects).
To improve indexing efficiency, you can specify which languages are supported in an object store by adding language codes to the string list of its TextSearchIndexingLanguages property. If the IBM Content Search Services server cannot determine the language of a document to be indexed in an index request, the first language code in the string list is used as the default language code for the index request. Ensure that the languages that you specify with this property match the languages of most of the documents in this object store; otherwise, you might experience a performance delay. If you do not set this property to at least one language code, and the deprecated TextSearchIndexingLanguage property was not previously set, an error occurs during indexing.
Depending on how many index areas are associated with an object store, not all the IBM Content Search Services index servers that are associated with the object store will perform indexing work. By default, without date or string partitioning configured, there is only one open full-text index for each index area that is associated with an object store. For all of the index servers to perform indexing work, the number of index areas that are associated with an object store must be equal to the number of index servers. For example, consider an object store that is associated with a Content Platform Engine server, a single index area, and two IBM Content Search Services servers. With a single index area, only one active full-text index is opened (assuming that no date or string partitioning is used). Because the full-text index has an affinity to only one IBM Content Search Services server for the lease duration, only one IBM Content Search Services index server performs indexing work on that full-text index at a time. The other index server is idle. For both index servers to be used, you must associate another index area with the object store.
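The arithmetic behind this example can be written down as a short sketch. It assumes the default case described above (no date or string partitioning, so one open full-text index per index area, each leased to one index server at a time); the class and method names are hypothetical, and the result is an upper bound on concurrently busy index servers rather than a measured value.

```java
/** Hypothetical sizing helper for the default, unpartitioned case; not the Content Engine API. */
final class IndexServerUtilization {

    /** With one open index per area and one server per index, busy servers cannot exceed either count. */
    static int busyIndexServers(int indexAreas, int indexServers) {
        if (indexAreas < 0 || indexServers < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return Math.min(indexAreas, indexServers);
    }

    /** Index servers left without an open index to work on. */
    static int idleIndexServers(int indexAreas, int indexServers) {
        return indexServers - busyIndexServers(indexAreas, indexServers);
    }
}
```

For the scenario in the text (one index area, two index servers), the helper reports one busy and one idle server; adding a second index area brings the idle count to zero.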
Status of Index Areas and Indexes
Index areas (CmTextSearchIndexArea objects)
and indexes (CmTextSearchIndex objects) have a ResourceStatus property, which specifies their
availability status. This property can have a value of OPEN, CLOSED,
or FULL. For CmTextSearchIndexArea objects,
ResourceStatus can also have a value of STANDBY.
For CmTextSearchIndex objects, ResourceStatus can
also have a value of UNAVAILABLE.
Index Area Status
Indexes can be created only in an index area (CmTextSearchIndexArea object)
that has a resource status (ResourceStatus property) set to OPEN.
Otherwise, if there are no index areas in the object store that have
a resource status setting of OPEN, no new indexes
can be created in the object store.
The resource status of
an index area is automatically set to FULL when the
index area reaches its full capacity. This setting indicates that
no new objects can be indexed in the index area and no new indexes
can be created in the index area. However, existing indexes can be
deleted or queried. An index area is considered to be at full capacity
when the number of its indexes is equal to the value of its MaxIndexes
property and all of its indexes have a resource status of either FULL or CLOSED.
If an index area is full, and another index area that has a resource
status of STANDBY is found, the Content Engine server automatically sets
the resource status of the standby index area to OPEN.
If there are multiple standby index areas, the Content Engine server chooses the standby
index area with the highest priority, according to the value of its CmStandbyActivationPriority property. If two or
more index areas exist with the same priority, one of these standby
index areas is chosen randomly by the server.
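The full-capacity test and the standby selection can be sketched with a small in-memory model. Everything here is hypothetical and is not the Content Engine API. In particular, the sketch assumes that a larger standbyPriority number means higher priority; the text does not say which direction the CmStandbyActivationPriority value runs, so check the property reference before relying on that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Random;

/** Hypothetical model of index area capacity and standby activation; not the Content Engine API. */
final class StandbyAreaModel {

    enum ResourceStatus { OPEN, CLOSED, FULL, STANDBY, UNAVAILABLE }

    static final class IndexArea {
        final String name;
        final int maxIndexes;
        final int standbyPriority; // ASSUMPTION: larger number = higher priority
        final List<ResourceStatus> indexStatuses;
        ResourceStatus status;

        IndexArea(String name, ResourceStatus status, int maxIndexes, int standbyPriority,
                  ResourceStatus... indexStatuses) {
            this.name = name;
            this.status = status;
            this.maxIndexes = maxIndexes;
            this.standbyPriority = standbyPriority;
            this.indexStatuses = new ArrayList<>(Arrays.asList(indexStatuses));
        }
    }

    /** Full when the index count has reached maxIndexes and every index is FULL or CLOSED. */
    static boolean isAtFullCapacity(IndexArea area) {
        if (area.indexStatuses.size() < area.maxIndexes) {
            return false;
        }
        for (ResourceStatus s : area.indexStatuses) {
            if (s != ResourceStatus.FULL && s != ResourceStatus.CLOSED) {
                return false;
            }
        }
        return true;
    }

    /** Opens the highest-priority STANDBY area; ties are broken randomly. */
    static Optional<IndexArea> activateStandby(List<IndexArea> areas, Random random) {
        List<IndexArea> best = new ArrayList<>();
        for (IndexArea a : areas) {
            if (a.status != ResourceStatus.STANDBY) {
                continue;
            }
            if (best.isEmpty() || a.standbyPriority > best.get(0).standbyPriority) {
                best.clear();
                best.add(a);
            } else if (a.standbyPriority == best.get(0).standbyPriority) {
                best.add(a);
            }
        }
        if (best.isEmpty()) {
            return Optional.empty();
        }
        IndexArea chosen = best.get(random.nextInt(best.size()));
        chosen.status = ResourceStatus.OPEN;
        return Optional.of(chosen);
    }
}
```

Passing the Random in as a parameter keeps the tie-break reproducible in tests; the real server's random choice is internal and not observable this way.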
Index Status
Create index requests can be written only to an index (CmTextSearchIndex object)
whose resource status (ResourceStatus property) is set to OPEN.
However, existing index entries can be updated, deleted, or queried.
The Content Engine server automatically sets
the resource status of an index to FULL when the
number of objects in the index is equal to the value of the MaxObjectsPerIndex
property or the size of the index reaches the value specified by
the MaxSizePerIndexKbytes property. After the resource status of an
index is set to FULL, no more create index requests
can be written to the index. However, existing index entries can be
deleted or queried. When an index is full, updated index data is automatically
written to another index that has a resource status of OPEN,
if it exists. If no index can be found that has a resource status
of OPEN, a new index is created in the same index
area if the index area's maximum index limit (as specified by its
MaxIndexes property) has not been reached. If the index area's maximum
index limit is reached and no other index areas are set to OPEN,
an index area that is set to STANDBY, if present,
is automatically set to OPEN and the new index is
created in it.
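The FULL rule and the fallback sequence for new index data can be summarized as a decision function. This is a hypothetical sketch, not the Content Engine API. It assumes positive limits and treats "reaches" as greater than or equal to. The NEW_INDEX_IN_OTHER_OPEN_AREA outcome is an inference from the rule that indexes can be created only in OPEN index areas; the text does not spell out that branch.

```java
/** Hypothetical model of the index FULL rule and target selection; not the Content Engine API. */
final class IndexPlacementModel {

    enum Placement {
        EXISTING_OPEN_INDEX,
        NEW_INDEX_IN_SAME_AREA,
        NEW_INDEX_IN_OTHER_OPEN_AREA, // inferred branch, not stated in the text
        NEW_INDEX_IN_ACTIVATED_STANDBY_AREA,
        NO_TARGET
    }

    /** An index is FULL once either the object count limit or the size limit is reached. */
    static boolean isIndexFull(long objectCount, long maxObjectsPerIndex,
                               long sizeKbytes, long maxSizePerIndexKbytes) {
        return objectCount >= maxObjectsPerIndex || sizeKbytes >= maxSizePerIndexKbytes;
    }

    /** Walks the fallback sequence described above, in order. */
    static Placement choose(boolean openIndexExists, int indexesInArea, int maxIndexes,
                            boolean otherOpenAreaExists, boolean standbyAreaExists) {
        if (openIndexExists) {
            return Placement.EXISTING_OPEN_INDEX;
        }
        if (indexesInArea < maxIndexes) {
            return Placement.NEW_INDEX_IN_SAME_AREA;
        }
        if (otherOpenAreaExists) {
            return Placement.NEW_INDEX_IN_OTHER_OPEN_AREA;
        }
        if (standbyAreaExists) {
            return Placement.NEW_INDEX_IN_ACTIVATED_STANDBY_AREA;
        }
        return Placement.NO_TARGET;
    }
}
```

The boolean inputs keep the sketch small; a fuller model would derive them from the ResourceStatus values of the indexes and index areas in the object store.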
If an index job closes an index for deletion,
the server sets its ResourceStatus property to CLOSED and
sets its IndexingStatus property to REPLACING. While
the IndexingStatus property is set to REPLACING,
you cannot change the ResourceStatus property of an index from CLOSED to OPEN by
using the API. When the index job is canceled, the server sets the
IndexingStatus property of the index to NORMAL; the
ResourceStatus property remains set to CLOSED unless
it is changed by the API.
An administrator can manually set
an index to a resource status of CLOSED. An index
that is set to CLOSED suppresses errors and allows
reindexing to complete without generating errors that can cause the
entire reindexing operation to fail. A closed index is closed to create
index requests; however, existing index entries can be updated, deleted,
or queried.
An administrator can manually set an index to
a resource status of UNAVAILABLE, typically when
the index is corrupted, or is otherwise inaccessible. An index that
is set to UNAVAILABLE suppresses errors and allows
reindexing to complete without generating errors that can cause the
entire reindexing operation to fail. Because search results might
be incomplete, it is recommended that an administrator inform users
when an index is set to UNAVAILABLE. After an index
is set to UNAVAILABLE, it cannot be set to any other
state; this state is a final state for the index. It is recommended
that an unavailable index be reindexed as soon as possible. The server
automatically creates a new index or opens a standby index to handle
any pending index requests for the unavailable index. After an unavailable
index is reindexed, the server deletes the unavailable index.
CmTextSearchIndex instance). For
more information, see Index
Requests.
Index Requests
Each
index request is associated with a CBR-enabled object, and is an instance
of the CmTextSearchIndexRequest class. Index request
objects are created by the indexing process and cannot be created
by using the API. You can perform read and update operations on CmTextSearchIndexRequest objects,
which have properties that hold status, failure and retry information. CmTextSearchIndexRequest objects
do not have individual security assignments and receive their security
from the default instance security for the class.
The SourceObject property identifies the CBR-enabled object that is the subject of the index request, the IndexJob property identifies the index job that is associated with the index request, and the IndexRequestStatus property records the status of the index request.
Information
about index request failures is provided by the CmIndexingFailureCode,
LastFailureReason, and RetryCount properties. By using these and other CmTextSearchIndexRequest properties,
you can search for index requests that meet specific criteria; for
example, all index requests that have failed and will not be tried
again, with a description of the last failure for each one, or all
index requests that are being tried again, with a description of the
last failure for each one.
During the processing
of an index job, the objects to be indexed are fetched and the index
requests for those objects are inserted into the IndexRequests table,
which acts as a queue for the pending work. For a large index job, these
operations can use a sizable amount of system resources and cause
the IndexRequests table to grow. To prevent the IndexRequests table
from growing too large, you can maintain a limit on the number of
index requests that are in progress or waiting by setting the MaxRequestQueueSize
property of an IndexJob object. Setting this limit
can prevent large index jobs from causing performance problems. The
index request limit applies only to the number of index requests for
the index job that exists in the IndexRequests table; it is not a
limit on the maximum number of index requests that an index job can create.
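The difference between a queue-size limit and a cap on total requests can be seen in a small simulation. The model below is hypothetical and is not the Content Engine implementation: it assumes that the job inserts requests whenever the queue is under the limit and that the indexing process drains a fixed number of requests per cycle (the processedPerCycle parameter is invented for the illustration).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical simulation of a MaxRequestQueueSize-style limit; not the Content Engine API. */
final class RequestQueueLimitModel {

    /** Outcome of one simulated index job. */
    static final class Result {
        final int totalCreated; // index requests created over the whole job
        final int peakQueued;   // largest number of requests queued at any time

        Result(int totalCreated, int peakQueued) {
            this.totalCreated = totalCreated;
            this.peakQueued = peakQueued;
        }
    }

    static Result run(int objectsToIndex, int maxRequestQueueSize, int processedPerCycle) {
        if (objectsToIndex < 0 || maxRequestQueueSize <= 0 || processedPerCycle <= 0) {
            throw new IllegalArgumentException("invalid simulation parameters");
        }
        Deque<Integer> queue = new ArrayDeque<>();
        int next = 0;
        int created = 0;
        int peak = 0;
        while (next < objectsToIndex || !queue.isEmpty()) {
            // The job inserts new requests only while the queue is under the limit.
            while (next < objectsToIndex && queue.size() < maxRequestQueueSize) {
                queue.add(next++);
                created++;
            }
            peak = Math.max(peak, queue.size());
            // The indexing process drains some requests, making room for more.
            for (int i = 0; i < processedPerCycle && !queue.isEmpty(); i++) {
                queue.poll();
            }
        }
        return new Result(created, peak);
    }
}
```

Running the simulation with 1000 objects and a limit of 50 creates all 1000 requests while the queue never holds more than 50 at once, which is the distinction the paragraph above draws.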
Skip Operation
You
can specify a skip operation for an index request. During a skip operation,
the CBR-enabled object that is the subject of an index request will
not be indexed. To specify a skip operation, set the IndexingOperation
property of the CmTextSearchIndexRequest object to SKIP.
As a result, the Content Engine server
sets the CmIndexingFailureCode property of the index request to MARKED_AS_SKIPPED and
sets the IndexationId property of the CBR-enabled object to null.
Index Partitions
An index partition determines which CBR-enabled objects can be indexed into a partitioned IBM Content Search Services index. Only CBR-enabled objects that satisfy the partitioning constraint are stored in the index. When an indexing partitioning property is specified in a text-search query, only indexes with the same index partition property names and values are searched. By configuring index partitioning in an object store, you can decrease the number of indexes that must be searched as a result of a query if your application uses index partitioning properties in the query.
Each IBM Content Search Services index maintains a list
of zero to two CmIndexPartitionConstraint objects
with its IndexPartitionConstraints property. This list is read-only
and is maintained by the Content Engine server.
Each CmIndexPartitionConstraint object corresponds
to an index partitioning property associated with an object store,
represented by a CmTextSearchPartitionProperty object.
Each property of a CBR-enabled object that is assigned as an index
partitioning property must be a custom string- or date-valued property
with a setting of SETTABLE_ONLY_ON_CREATE. You can
have no more than one string and one date index partitioning property
that is assigned to an object store.
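The constraints on index partitioning properties can be expressed as a validation sketch. The types below are hypothetical and are not the Content Engine API: Candidate stands in for a property considered for partitioning, and the Settability enum collapses every setting other than SETTABLE_ONLY_ON_CREATE into OTHER.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical validator for index partitioning property rules; not the Content Engine API. */
final class PartitionPropertyValidator {

    enum DataType { STRING, DATE, OTHER }

    enum Settability { SETTABLE_ONLY_ON_CREATE, OTHER }

    static final class Candidate {
        final String name;
        final DataType type;
        final Settability settability;
        final boolean custom;

        Candidate(String name, DataType type, Settability settability, boolean custom) {
            this.name = name;
            this.type = type;
            this.settability = settability;
            this.custom = custom;
        }
    }

    /** Returns one message per violated rule; an empty list means the set is acceptable. */
    static List<String> validate(List<Candidate> candidates) {
        List<String> problems = new ArrayList<>();
        int strings = 0;
        int dates = 0;
        for (Candidate c : candidates) {
            if (!c.custom) {
                problems.add(c.name + ": must be a custom property");
            }
            if (c.type == DataType.STRING) {
                strings++;
            } else if (c.type == DataType.DATE) {
                dates++;
            } else {
                problems.add(c.name + ": must be string- or date-valued");
            }
            if (c.settability != Settability.SETTABLE_ONLY_ON_CREATE) {
                problems.add(c.name + ": must be SETTABLE_ONLY_ON_CREATE");
            }
        }
        if (strings > 1) {
            problems.add("more than one string partitioning property");
        }
        if (dates > 1) {
            problems.add("more than one date partitioning property");
        }
        return problems;
    }
}
```

A set with one custom string property and one custom date property, both settable only on create, passes with no messages; adding a second string property produces a single message about the duplicate.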
Indexing and Special Characters
Special characters are indexed in the same position as the preceding term, unless there is a sequence of special characters. In this case, the special character sequence is indexed as unordered separate tokens (ordering of the characters is ignored). For more information about special characters in queries, see Special Characters.