Content-Based Retrieval

The Content Engine Java™ and .NET APIs include a number of interfaces for content-based retrieval (CBR) administrative functions for IBM® FileNet Content Search Engine. By using the APIs, you can configure domains and servers, establish and configure index areas, and initiate and manage index jobs.

Indexing and Index Jobs

IBM Content Search Services indexing aggregates data (in the form of indexes) to support full-text searches of the content of objects and the string-valued properties of those objects. Only objects and string-valued properties that are enabled for CBR are included in full-text searches. CBR-enablement is controlled by the Boolean value of the IsCBREnabled property on ClassDefinition and PropertyDefinitionString objects. For the content of an object to be enabled for full-text search and to allow its string-valued properties to be CBR-enabled, you must enable CBR for the ClassDefinition object that defines the object's class (Document, Annotation, Folder, or CustomObject class and subclasses only). For the value of a string-valued property of a CBR-enabled object to be enabled for full-text search, you must enable CBR for the PropertyDefinitionString object that defines the property. Indexing is done automatically for all CBR-enabled objects and properties. Because the indexing operation is a batched, asynchronous operation, its results are not immediately evident. For more information about enabling CBR, see Content Searches.
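
The enablement rules above can be summarized in a small sketch. The class and method names below are hypothetical stand-ins and are not Content Engine API types; the two Boolean arguments represent the IsCBREnabled values on the ClassDefinition and PropertyDefinitionString objects.

```java
// Hypothetical model of the CBR-enablement rules; these are not Content Engine API types.
final class CbrEnablementSketch {

    // Object content is full-text indexed only if the class definition is CBR-enabled.
    static boolean isContentIndexed(boolean classCbrEnabled) {
        return classCbrEnabled;
    }

    // A string-valued property is full-text indexed only if both the class
    // definition and the property definition are CBR-enabled.
    static boolean isPropertyIndexed(boolean classCbrEnabled, boolean propertyCbrEnabled) {
        return classCbrEnabled && propertyCbrEnabled;
    }

    public static void main(String[] args) {
        System.out.println(isPropertyIndexed(true, true));
        System.out.println(isPropertyIndexed(false, true));
    }
}
```

In this model, enabling CBR on a property definition has no effect until the class definition is also CBR-enabled.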

With an IndexJob object, you can track the status of an index job and initiate and control the job. Usually, you initiate an index job to rebuild an index that is corrupted, or to accommodate a configuration change. The IndexRequests property on an IndexJob object contains a listing of all index requests that are associated with the index job. The CmTextSearchIndexRequest class provides read and update operations, and status and tracking information, for index requests.

The IndexJobItem base class is subclassed to provide for particular types of index jobs:

Class index job (IndexJobClassItem): All instances of the specified class are full-text indexed. Class index jobs require a table scan on the database, even if the amount of data to be indexed is minimal. A significant amount of time is required to scan a large table. The database tables are scanned once for all classes to be indexed. To minimize the number of table scans required, use a single index job operation for all classes to be indexed for the same table.
Single object index job (IndexJobSingleItem): A single object is full-text indexed.
Reindex index job: An existing index is reindexed.
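
The advice about table scans for class index jobs can be illustrated with a planning sketch. It assumes that you already know which database table backs each class; the class names, table names, and the ClassIndexJobPlanner helper are illustrative only and are not part of the Content Engine API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical planner: one index job per database table, so each table is scanned once.
final class ClassIndexJobPlanner {

    // Each entry is {className, tableName}; both values are illustrative.
    static Map<String, List<String>> groupByTable(List<String[]> classToTable) {
        Map<String, List<String>> jobs = new LinkedHashMap<>();
        for (String[] entry : classToTable) {
            jobs.computeIfAbsent(entry[1], table -> new ArrayList<>()).add(entry[0]);
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<String[]> classes = Arrays.asList(
            new String[] {"Invoice", "DocVersion"},
            new String[] {"Contract", "DocVersion"},
            new String[] {"CaseFolder", "Container"});
        // Three classes, but only two index jobs (and two table scans) when grouped.
        Map<String, List<String>> jobs = groupByTable(classes);
        System.out.println(jobs.size() + " index jobs: " + jobs);
    }
}
```

Submitting the three classes as three separate index jobs would scan the first table twice; grouping them avoids the extra scan.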

During an indexing operation, all currently indexed data is available for use in full-text searches. However, new index data might become available while a full-text query is in progress. In this case, the query returns duplicate matches because it uses both old and new indexed data. When the index job completes, old copies of indexed data are removed and duplicate matches no longer occur.

Canceling an Index Job

An index job can be canceled by setting its JobAbortRequested property to true, or as a result of an unexpected error. When an index job is canceled, all of its related index requests are deleted. If the index job is canceled by an administrator, the JobStatus property on the IndexJob object is set to CANCELED. If the index job stops as the result of an unexpected error, the JobStatus property has a value of TERMINATED_ABNORMALLY.

If you cancel an index job for a reindex operation or cancel an index job for a root class with subclasses, there are special considerations. These types of index jobs create new indexes that replace the original indexes. While the new indexes are being populated, the original indexes maintain their original entries. Therefore, duplicate entries exist until the index job completes successfully. When all the index requests for the index job are complete, the index job deletes the original indexes. However, if this index job is canceled, the original indexes remain and contain duplicates of the entries that are created by the index job in the new indexes. The Content Platform Engine error log contains a warning message that lists the indexes that would have been replaced had the index job not been canceled. For example:

An index job was cancelled on Object Store <objectstoreName>. The following full-text indexes will need to be replaced by running 
full-text index reindex job(s) or base class with subclasses index job(s):
   Full-text index name: <name> in Index Area: <index area name>
   Full-text index name: <name> in Index Area: <index area name>

It is recommended that the administrator resolve this situation by creating index jobs to reindex the original indexes, depending on the type of index job that is canceled:

Index job for a reindex operation: If this type of index job is canceled, the resource state of the original index remains set to CLOSED. It is recommended that the administrator create another index job on this CLOSED index.
Index job for a root class with subclasses: If this type of index job is canceled, there might be multiple indexes that have a resource state set to CLOSED. It is recommended that the administrator either create another index job on the root class with subclasses or create an index job for each of the CLOSED indexes. For more information, see Status of Index Areas and Indexes.

Pausing and Resuming an Index Job

If you pause an index job, by setting the JobPauseRequested property of an IndexJob object to true, the JobStatus property of the index job is updated to IndexJobStatus.PAUSED, and the dispatching of new index requests by the index job is halted. Existing index requests are not paused. If you resume an index job, by setting the JobPauseRequested property of an IndexJob object to false, the JobStatus property of the index job is updated to IndexJobStatus.IN_PROGRESS, and the dispatching of new index requests by the index job is allowed.
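
The pause and resume behavior can be modeled as a two-state sketch. PausableIndexJobSketch is a hypothetical stand-in for an index job and does not use the real IndexJob class; only the PAUSED and IN_PROGRESS status names come from the text.

```java
// Hypothetical model of pausing and resuming; the real IndexJob class is not used here.
final class PausableIndexJobSketch {
    enum Status { IN_PROGRESS, PAUSED }

    private Status status = Status.IN_PROGRESS;
    private int dispatched = 0;

    // Mirrors setting the JobPauseRequested property to true or false.
    void setPauseRequested(boolean pause) {
        status = pause ? Status.PAUSED : Status.IN_PROGRESS;
    }

    // New index requests are dispatched only while the job is not paused.
    // Requests that were already dispatched are not affected by a pause.
    boolean dispatchNext() {
        if (status == Status.PAUSED) {
            return false;
        }
        dispatched++;
        return true;
    }

    Status status() {
        return status;
    }

    int dispatched() {
        return dispatched;
    }

    public static void main(String[] args) {
        PausableIndexJobSketch job = new PausableIndexJobSketch();
        job.dispatchNext();
        job.setPauseRequested(true);
        System.out.println(job.status() + " dispatched=" + job.dispatchNext());
        job.setPauseRequested(false);
        System.out.println(job.status() + " dispatched=" + job.dispatchNext());
    }
}
```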

Indexing Error Handling

All indexing errors are recorded in the Content Engine log file. Optionally, indexing errors can also be persisted to an object store. The ObjectStore property IndexingFailureRecordingLevel determines whether indexing errors are persisted. The default behavior is for indexing errors to be logged only, rather than persisted.

Indexing Failure Codes

If an indexing failure of a CBR-enabled object occurs, the CmIndexingFailureCode property of the CBR-enabled object is set to an IndexingFailureCode failure code constant and the failure is recorded in the Content Engine log file. If you set the value of the IndexingFailureRecordingLevel property of an ObjectStore object to PROPAGATE_TO_SOURCE, the error information is propagated to the CmIndexingFailureCode property of all CBR-enabled objects.

The indexing failure code is recorded only when the associated object is processed by the Content Search Engine and some or all of the object's content could not be full-text indexed (due, for example, to text size limits causing truncation, or use of an unsupported format type). An indexing failure code does not necessarily mean that the object was not at least partially indexed; the server attempts to index as much of the object as it can. If a CBR-enabled object is successfully indexed, its CmIndexingFailureCode property has a value of zero.

System failures do not generate an indexing failure code. In such cases, the indexing operation is retained, a description of the error is written to the Content Engine log file, the error is recorded as part of the index request last failure description, and the indexing operation is automatically tried again when the system is in a stable state.

Indexing Request Errors

The cause of an indexing request error is recorded in the associated index request object (CmTextSearchIndexRequest object). Indexing request errors are recorded in the LastFailureReason property of the index request object. When an index request associated with an object is successfully processed, any failed index requests for that same object are deleted. In other words, a CBR-enabled object does not have any failed index requests (the CmIndexingFailureCode property is zero) if it is successfully indexed. An index request is typically retained only when a system failure occurs and the index request needs to be tried again.

Indexing Job Errors

In some cases, an error might be related to index job processing, rather than index request processing. For example, at the end of some index jobs the index must be deleted, and it is possible for the deletion to fail. However, this error does not mean the index job itself failed: an index job that successfully submits all of the associated index requests always completes with a JobStatus property value of TERMINATED_NORMALLY (successful completion).

If a failure occurs that is related to an index job, the identified reason for the failure is recorded in the LastFailureReason property on the IndexJob object.

Domain and Server Information

Full-text indexing and searching requires that at least one IBM Content Search Services server configuration be associated with your FileNet P8 domain. To create such a configuration, create a CmTextSearchServer object by using one of the Factory.CmTextSearchServer methods and specify the domain object that represents your FileNet P8 domain. Each CmTextSearchServer object that you create and associate with a domain is automatically added to the read-only CmTextSearchServerSet collection object returned by the TextSearchServers property of the Domain object. After you create an IBM Content Search Services server, you must set its TextSearchServerStatus property to ENABLED for it to be recognized by the Content Engine server. If the Content Engine server cannot communicate with the IBM Content Search Services server, it automatically sets its TextSearchServerStatus property to UNAVAILABLE. If you do not want the IBM Content Search Services server to be available, set its TextSearchServerStatus property to DISABLED. The host, port number, and connection token of the IBM Content Search Services server must be set by using the CmTextSearchServer object properties.

IBM Content Search Services servers and FileNet P8 objects have the following many-to-one relationships:

Multiple object stores can share the same IBM Content Search Services server configuration object.
Multiple IBM Content Search Services server configuration objects can exist for a FileNet P8 domain.

The properties of a CmTextSearchConfiguration object enable you to control IBM Content Search Services functions on the Content Engine server. A CmTextSearchConfiguration object is contained in the SubsystemConfiguration property of the Domain, Site, VirtualServer, and ServerInstance classes. The CmTextSearchConfiguration instance to be used is determined by these classes in this order: ServerInstance, VirtualServer, Site, and Domain.
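
The lookup order can be sketched as a first-match search. This sketch assumes that a level without its own configuration is skipped and that the most specific configured level wins; that reading is an interpretation of the order given above, and the TextSearchConfigResolver class and its string arguments are hypothetical placeholders for the real configuration objects.

```java
import java.util.Arrays;

// Hypothetical resolver; strings stand in for CmTextSearchConfiguration instances.
final class TextSearchConfigResolver {

    // Returns the first configuration that is present, checking in this order:
    // ServerInstance, VirtualServer, Site, Domain. A null argument means
    // "no configuration at this level" (an assumption of this sketch).
    static String resolve(String serverInstance, String virtualServer, String site, String domain) {
        for (String candidate : Arrays.asList(serverInstance, virtualServer, site, domain)) {
            if (candidate != null) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Only the site and the domain carry a configuration, so the site one is used.
        System.out.println(resolve(null, null, "site-config", "domain-config"));
    }
}
```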

Index Areas and Indexes

An IBM Content Search Services index area is a file system directory that contains the information necessary to perform full-text indexing that is updated and queried by IBM Content Search Services. A many-to-one relationship exists between an index area and an object store. Each index area is dedicated to a single object store, but you can have multiple index areas for an object store on a single file system, or you can distribute the indexing information for an object store in multiple index areas across file systems.

Each index area is represented by a CmTextSearchIndexArea object. The file system location of an index area is stored in its RootDirectoryPath property.

Each index area can hold multiple indexes (CmTextSearchIndex objects), which are specified by its TextSearchIndexes property. A many-to-one relationship exists between indexes and an index area; CmTextSearchIndex objects are created automatically in the associated index area, as needed. When an indexable class is instantiated, the CmTextSearchIndex object that is associated with its base class, and any index partitioning properties that are defined on the object store, can be used to reference the full-text indexing information. If no CmTextSearchIndex object is associated with the base class and index partitioning properties, a new CmTextSearchIndex object (and the corresponding index that is maintained by IBM Content Search Services) is created. The index is identified by its IndexName property.

IBM Content Search Services indexing and search servers update and query the indexes. The indexes in an index area are only accessible to the servers that are in the same site as the index area (Site property of the CmTextSearchServer and CmTextSearchIndexArea objects).

To improve indexing efficiency, you can specify which languages are supported in an object store by adding language codes to the string list of its TextSearchIndexingLanguages property. If the IBM Content Search Services server cannot determine the language of a document to be indexed in an index request, the first language code in the string list is used as the default language code for the index request. Ensure that the languages that you specify with this property match the languages of most of the documents in this object store; otherwise, you might experience a performance delay. If you do not set this property to at least one language code, and the deprecated TextSearchIndexingLanguage property was not previously set, an error occurs during indexing.
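
The default-language rule can be sketched as follows. IndexingLanguageSketch is a hypothetical helper, not an API class; a null detected language stands for a document whose language the server could not determine, and the exception stands in for the indexing error that occurs when no language code is configured.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the TextSearchIndexingLanguages fallback rule.
final class IndexingLanguageSketch {

    // detected: language code found for the document, or null if undetermined.
    // configured: the string list of the TextSearchIndexingLanguages property.
    static String languageFor(String detected, List<String> configured) {
        if (configured.isEmpty()) {
            // Models the indexing error raised when no language code is set.
            throw new IllegalStateException("no indexing language configured");
        }
        // Fall back to the first configured code when detection fails.
        return detected != null ? detected : configured.get(0);
    }

    public static void main(String[] args) {
        List<String> configured = Arrays.asList("en", "fr", "de");
        System.out.println(languageFor(null, configured));
        System.out.println(languageFor("fr", configured));
    }
}
```

Because the first entry is the fallback, list the most common language of the object store first.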

Depending on how many index areas are associated with an object store, not all the IBM Content Search Services index servers that are associated with the object store will perform indexing work. By default, without date or string partitioning configured, there is only one open full-text index for each index area that is associated with an object store. For all of the index servers to perform indexing work, the number of index areas that are associated with an object store must be equal to the number of index servers. For example, consider an object store that is associated with a Content Platform Engine server, a single index area, and two IBM Content Search Services servers. With a single index area, only one active full-text index is opened (assuming that no date or string partitioning is used). Because the full-text index has an affinity to only one IBM Content Search Services server for the lease duration, only one IBM Content Search Services index server performs indexing work on that full-text index at a time. The other index server is idle. For both index servers to be used, you must associate another index area with the object store.
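
The arithmetic behind this example is small enough to sketch. It assumes the default case described above (no date or string partitioning, one open full-text index per index area, one index server per open index at a time); IndexServerUtilization is a hypothetical helper.

```java
// Hypothetical helper: how many index servers can work at the same time
// for one object store when no date or string partitioning is configured.
final class IndexServerUtilization {

    // Each index area has one open index, and each open index is leased
    // to one index server at a time, so the smaller number is the limit.
    static int busyServers(int indexAreas, int indexServers) {
        return Math.min(indexAreas, indexServers);
    }

    static int idleServers(int indexAreas, int indexServers) {
        return indexServers - busyServers(indexAreas, indexServers);
    }

    public static void main(String[] args) {
        // One index area and two index servers: one server is idle.
        System.out.println("idle=" + idleServers(1, 2));
        // Adding a second index area puts both servers to work.
        System.out.println("idle=" + idleServers(2, 2));
    }
}
```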

Status of Index Areas and Indexes

Index areas (CmTextSearchIndexArea objects) and indexes (CmTextSearchIndex objects) have a ResourceStatus property, which specifies their availability status. This property can have a value of OPEN, CLOSED, or FULL. For CmTextSearchIndexArea objects, ResourceStatus can also have a value of STANDBY. For CmTextSearchIndex objects, ResourceStatus can also have a value of UNAVAILABLE.

Index Area Status

Indexes can be created only in an index area (CmTextSearchIndexArea object) that has a resource status (ResourceStatus property) set to OPEN. If no index area in the object store has a resource status of OPEN, no new indexes can be created in the object store.

The resource status of an index area is automatically set to FULL when the index area reaches its full capacity. This setting indicates that no new objects can be indexed in the index area and no new indexes can be created in the index area. However, existing indexes can be deleted or queried. An index area is considered to be at full capacity when the number of its indexes is equal to the value of its MaxIndexes property and all of its indexes have a resource status of either FULL or CLOSED.
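
The full-capacity condition can be written as a predicate. The IndexAreaCapacity class and its enum are a hypothetical model of the rule stated above, not the API's own types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the "index area is at full capacity" rule.
final class IndexAreaCapacity {
    enum ResourceStatus { OPEN, CLOSED, FULL, UNAVAILABLE }

    // indexStatuses: the resource status of every index in the area.
    // maxIndexes: the value of the area's MaxIndexes property.
    static boolean isAreaFull(List<ResourceStatus> indexStatuses, int maxIndexes) {
        if (indexStatuses.size() < maxIndexes) {
            return false; // room for another index remains
        }
        for (ResourceStatus status : indexStatuses) {
            if (status != ResourceStatus.FULL && status != ResourceStatus.CLOSED) {
                return false; // at least one index can still accept new entries
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<ResourceStatus> statuses =
            Arrays.asList(ResourceStatus.FULL, ResourceStatus.CLOSED);
        System.out.println(isAreaFull(statuses, 2));
        System.out.println(isAreaFull(statuses, 3));
    }
}
```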

If an index area is full, and another index area that has a resource status of STANDBY is found, the Content Engine server automatically sets the resource status of the standby index area to OPEN. If there are multiple standby index areas, the Content Engine server chooses the standby index area with the highest priority, according to the value of its CmStandbyActivationPriority property. If two or more index areas exist with the same priority, one of these standby index areas is chosen randomly by the server.

Index Status

Create index requests can be written only to an index (CmTextSearchIndex object) whose resource status (ResourceStatus property) is set to OPEN. However, existing index entries can be updated, deleted, or queried.

The Content Engine server automatically sets the resource status of an index to FULL when the number of objects in the index is equal to the value of the MaxObjectsPerIndex property or the size of the index reaches the value specified by the MaxSizePerIndexKbytes property. After the resource status of an index is set to FULL, no more create index requests can be written to the index. However, existing index entries can be deleted or queried. When an index is full, updated index data is automatically written to another index that has a resource status of OPEN, if it exists. If no index can be found that has a resource status of OPEN, a new index is created in the same index area if the index area's maximum index limit (as specified by its MaxIndexes property) has not been reached. If the index area's maximum index limit is reached and no other index areas are set to OPEN, an index area that is set to STANDBY, if present, is automatically set to OPEN and the new index is created in it.
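
The fallback sequence for a full index can be sketched as a decision function. IndexRoutingSketch and its Target values are hypothetical names. The branch for another index area that is already OPEN is an assumption: the text implies it but does not state where the new index is created in that case.

```java
// Hypothetical model of where new index data goes after an index becomes full.
final class IndexRoutingSketch {
    enum Target {
        OPEN_INDEX,
        NEW_INDEX_SAME_AREA,
        NEW_INDEX_OTHER_OPEN_AREA,
        NEW_INDEX_ACTIVATED_STANDBY_AREA,
        NONE
    }

    // An index is full when either limit is reached.
    static boolean isIndexFull(long objects, long maxObjectsPerIndex,
                               long sizeKb, long maxSizePerIndexKbytes) {
        return objects >= maxObjectsPerIndex || sizeKb >= maxSizePerIndexKbytes;
    }

    static Target route(boolean openIndexExists, int indexesInArea, int maxIndexes,
                        boolean otherOpenAreaExists, boolean standbyAreaExists) {
        if (openIndexExists) {
            return Target.OPEN_INDEX;
        }
        if (indexesInArea < maxIndexes) {
            return Target.NEW_INDEX_SAME_AREA;
        }
        if (otherOpenAreaExists) {
            return Target.NEW_INDEX_OTHER_OPEN_AREA; // assumption, see the note above
        }
        if (standbyAreaExists) {
            return Target.NEW_INDEX_ACTIVATED_STANDBY_AREA;
        }
        return Target.NONE;
    }

    public static void main(String[] args) {
        System.out.println(isIndexFull(1000, 1000, 10, 500));
        System.out.println(route(false, 5, 5, false, true));
    }
}
```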

If an index job closes an index for deletion, the server sets its ResourceStatus property to CLOSED and sets its IndexingStatus property to REPLACING. While the IndexingStatus property is set to REPLACING, you cannot change the ResourceStatus property of an index from CLOSED to OPEN by using the API. When the index job is canceled, the server sets the IndexingStatus property of the index to NORMAL; the ResourceStatus property remains set to CLOSED unless it is changed by the API.

An administrator can manually set an index to a resource status of CLOSED. An index that is set to CLOSED suppresses errors and allows reindexing to complete without generating errors that can cause the entire reindexing operation to fail. A closed index is closed to create index requests; however, existing index entries can be updated, deleted, or queried.

An administrator can manually set an index to a resource status of UNAVAILABLE, typically when the index is corrupted, or is otherwise inaccessible. An index that is set to UNAVAILABLE suppresses errors and allows reindexing to complete without generating errors that can cause the entire reindexing operation to fail. Because search results might be incomplete, it is recommended that an administrator inform users when an index is set to UNAVAILABLE. After an index is set to UNAVAILABLE, it cannot be set to any other state; this state is a final state for the index. It is recommended that an unavailable index be reindexed as soon as possible. The server automatically creates a new index or opens a standby index to handle any pending index requests for the unavailable index. After an unavailable index is reindexed, the server deletes the unavailable index.

Note: Index data is deleted automatically when an index job disables full-text indexing or rebuilds an index (deleting and re-creating the CmTextSearchIndex instance). For more information, see Index Requests.

Index Requests

Each index request is associated with a CBR-enabled object, and is an instance of the CmTextSearchIndexRequest class. Index request objects are created by the indexing process and cannot be created by using the API. You can perform read and update operations on CmTextSearchIndexRequest objects, which have properties that hold status, failure and retry information. CmTextSearchIndexRequest objects do not have individual security assignments and receive their security from the default instance security for the class.

The SourceObject property identifies the CBR-enabled object that is the subject of the index request, the IndexJob property identifies the index job that is associated with the index request, and the IndexRequestStatus property records the status of the index request.

Information about index request failures is provided by the CmIndexingFailureCode, LastFailureReason, and RetryCount properties. By using these and other CmTextSearchIndexRequest properties, you can search for index requests that meet specific criteria; for example, all index requests that have failed and will not be tried again, with a description of the last failure for each one, or all index requests that are being tried again, with a description of the last failure for each one.

During the processing of an index job, the objects to be indexed are fetched and the index requests for those objects are inserted into the IndexRequests table, which acts as a queue for the pending work. For a large index job, these operations can use a sizable amount of system resources and cause the IndexRequests table to grow. To prevent the IndexRequests table from growing too large, you can maintain a limit on the number of index requests that are in progress or waiting by setting the MaxRequestQueueSize property of an IndexJob object. Setting this limit can prevent large index jobs from causing performance problems. The index request limit applies only to the number of index requests for the index job that exist in the IndexRequests table; it is not a limit on the maximum number of index requests that an index job can create.
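
The difference between the queue limit and the total number of requests can be seen in a small simulation. BoundedRequestQueueSketch is a hypothetical model, not server code: an in-memory queue stands in for the IndexRequests table, and processedPerCycle is an invented parameter that represents how many requests the indexing side completes between two fill passes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical simulation of an index job that respects a MaxRequestQueueSize limit.
final class BoundedRequestQueueSketch {

    // Returns {total requests created, peak number of queued requests}.
    static int[] run(int totalObjects, int maxQueueSize, int processedPerCycle) {
        if (maxQueueSize < 1 || processedPerCycle < 1) {
            throw new IllegalArgumentException("limits must be at least 1");
        }
        Deque<Integer> queue = new ArrayDeque<>();
        int created = 0;
        int peak = 0;
        while (created < totalObjects || !queue.isEmpty()) {
            // Job side: insert requests only while the queue is below the limit.
            while (created < totalObjects && queue.size() < maxQueueSize) {
                queue.addLast(created++);
            }
            peak = Math.max(peak, queue.size());
            // Indexing side: complete some of the queued requests.
            for (int i = 0; i < processedPerCycle && !queue.isEmpty(); i++) {
                queue.removeFirst();
            }
        }
        return new int[] {created, peak};
    }

    public static void main(String[] args) {
        int[] result = run(1000, 100, 30);
        System.out.println("created=" + result[0] + " peak=" + result[1]);
    }
}
```

In this run the job still creates all 1000 requests, but no more than 100 are queued at any moment.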

Skip Operation

You can specify a skip operation for an index request. During a skip operation, the CBR-enabled object that is the subject of an index request will not be indexed. To specify a skip operation, set the IndexingOperation property of the CmTextSearchIndexRequest object to SKIP. As a result, the Content Engine server sets the CmIndexingFailureCode property of the index request to MARKED_AS_SKIPPED and sets the IndexationId property of the CBR-enabled object to null.

Index Partitions

An index partition determines which CBR-enabled objects can be indexed into a partitioned IBM Content Search Services index. Only CBR-enabled objects that satisfy the partitioning constraint are stored in the index. When an indexing partitioning property is specified in a text-search query, only indexes with the same index partition property names and values are searched. By configuring index partitioning in an object store, you can decrease the number of indexes that must be searched as a result of a query if your application uses index partitioning properties in the query.

Each IBM Content Search Services index maintains a list of zero to two CmIndexPartitionConstraint objects with its IndexPartitionConstraints property. This list is read-only and is maintained by the Content Engine server. Each CmIndexPartitionConstraint object corresponds to an index partitioning property associated with an object store, represented by a CmTextSearchPartitionProperty object. Each property of a CBR-enabled object that is assigned as an index partitioning property must be a custom string- or date-valued property with a setting of SETTABLE_ONLY_ON_CREATE. You can have no more than one string and one date index partitioning property that is assigned to an object store.
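
The constraints on index partitioning properties can be checked with a short validation sketch. PartitionPropertyValidator and its Candidate type are hypothetical and do not correspond to API classes. The sketch covers only the data type, settability, and count rules; it does not check that a property is a custom property.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical validator for a proposed set of index partitioning properties.
final class PartitionPropertyValidator {
    enum DataType { STRING, DATE, OTHER }

    static final class Candidate {
        final String name;
        final DataType type;
        final boolean settableOnlyOnCreate;

        Candidate(String name, DataType type, boolean settableOnlyOnCreate) {
            this.name = name;
            this.type = type;
            this.settableOnlyOnCreate = settableOnlyOnCreate;
        }
    }

    // Valid if every property is string- or date-valued, is settable only on
    // create, and there is at most one string and at most one date property.
    static boolean isValid(List<Candidate> candidates) {
        int strings = 0;
        int dates = 0;
        for (Candidate c : candidates) {
            if (!c.settableOnlyOnCreate) {
                return false;
            }
            if (c.type == DataType.STRING) {
                strings++;
            } else if (c.type == DataType.DATE) {
                dates++;
            } else {
                return false;
            }
        }
        return strings <= 1 && dates <= 1;
    }

    public static void main(String[] args) {
        // Property names are illustrative.
        System.out.println(isValid(Arrays.asList(
            new Candidate("Region", DataType.STRING, true),
            new Candidate("FilingDate", DataType.DATE, true))));
    }
}
```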

Indexing and Special Characters

Special characters are indexed in the same position as the preceding term, unless there is a sequence of special characters. In this case, the special character sequence is indexed as unordered separate tokens (ordering of the characters is ignored). For more information about special characters in queries, see Special Characters.