CBR Queries
A content-based retrieval (CBR) query includes a full-text search function call as part of the WHERE clause for the SQL statement. The full-text search function permits you to search for text in object content and properties.
To specify a full-text search, use the CONTAINS function. All searches are case insensitive. The search must be expressed in IBM® Content Search Services query syntax or an XPath-based syntax.
The search results of a CONTAINS function call are affected by the configuration of IBM Content Search Services. For more information, see Search Results.
A CBR query can potentially run for an extended period. To limit the amount of time a query can run, set a timeout value for the query. The timeout value is the maximum time that is allowed to fetch a single page of search results for a continuable query. If a fetch exceeds the maximum time that is allowed, Content Engine stops running the query and returns an error. A single default timeout is probably not sufficient: when you write your applications, allow the user to set a timeout.
The following example shows a search function call and the other main syntactical elements of a CBR query:
SELECT d.This, c.Rank FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*,'lion AND tiger')
AND d.Date >= 20101201T000000Z AND d.Date < 20101231T235959Z
ORDER BY c.Rank
OPTIONS (FULLTEXTROWLIMIT 500)
CBR query syntax has the following main elements:
- Content search table join - The query must join the ContentSearch table to other query-referenced tables. As with the other tables, ContentSearch is also a class. For more information, see ContentSearch table join.
- Full-text search function - The query must include one call to the CONTAINS search function. Note that date range is more efficiently applied to the CBR index data (following the CONTAINS clause), rather than to the full database (preceding the CONTAINS clause). For more information, see CONTAINS Function.
- General content search table column usage (optional) - The query can optionally reference ContentSearch properties as columns. In particular, the Rank property permits you to order query results that are based on result relevance. For more information about ordering query results by relevance, see Query Result Ranking. For information about other ContentSearch properties, see ContentSearch Properties.
- Query result row limits (optional) - The query can optionally limit the number of rows that the CONTAINS function call returns. For more information, see FullTextRowLimit.
Search Results
The indexes for IBM Content Search Services store the text for an object as one logical entity. Consequently, a full-text search returns one row for each indexed object that matches the search criteria. The number of times that an object matches the search criteria does not affect the number of rows returned. For example, the text in the content and properties of an object might have multiple instances of the term lion. A search for lion returns only one row for that object.
Content Engine places the rows that are returned by a full-text search into the ContentSearch table. The table is then joined with other tables in the query. For more information, see ContentSearch table join.
The following factors affect the search results of a CONTAINS function call:
- CBR-enabled classes and properties
Only the text for objects that belong to a CBR-enabled class is indexed and can be searched. Make a class CBR-enabled by setting the isCBREnabled property value to true for the relevant
ClassDefinition
. For example, to search for text inDocument
object content, set the isCBREnabled property for theDocumentClassDefinition
object to true.To search for text in an object property, the property must also be CBR-enabled. Make a property CBR-enabled by setting the isCBREnabled property value to true for the relevant
PropertyDefinition
object. Making a property CBR-enabled without also making the class CBR-enabled has no functional effect: to search for text in a property, the class to which the property belongs must be CBR-enabled.Setting a class or property to be CBR-enabled has no effect unless the relevant class is reindexed. For information about using Administration Console for Content Platform Engine to reindex classes and to set the CBR-enabled status of classes and properties, see Setting object text as CBR-queryable.
- Word stems
A full-text search returns results not only for a word search term but also for the stem of the word. For example, a search for the word tests or testing also returns results for test. This behavior only applies to terms with one or more lowercase letters: to prevent a word stem search, use all capitals for the term such as in TESTS.
- Synonyms
A full-text search returns results not only for the explicitly specified terms, but also for any synonyms for the terms. For example, you might define ibm and International Business Machines as synonyms. A search for ibm also returns objects with text that includes the phrase International Business Machines. For information about defining synonyms, see Setting synonyms. This behavior applies only to terms with one or more lowercase letters: to prevent a synonym search, use all capitals for the term, such as IBM.
- Stop words
Typically, a stop word is a word such as a or the for which search results are not wanted. This behavior only applies to terms with one or more lowercase letters; to prevent a stop word from being ignored, use all capitals for the term, such as THE. A full-text search ignores any stop words that might be specified as part of the search expression, except in the following cases when the stop word is:
- Part of a phrase. For example, the cat.
- Part of a wildcard expression. For example, the* cat.
- Fully uppercase. For example, THE cat.
- Composed only of a stopword. For example, the.
The removeStopwordsFromBooleanQuery configuration tool parameter controls whether stop words are ignored in a Boolean search expression. If the removeStopwordsFromBooleanQuery parameter is set to
true
, stop words are ignored in a Boolean search expression; otherwise, stop words are used as part of the query. For example, if removeStopwordsFromBooleanQuery is set totrue
, in a search expression that contains the phrase "the AND cat", the is ignored and all documents that contain the word "cat" are returned by the query. However, if the removeStopwordsFromBooleanQuery parameter is set tofalse
, only documents that contain both the word "the" and the word "cat" are returned by the query. The default value isfalse
.For information about defining stop words, see Omitting stop words from queries. For information about the configuration tool parameters, see Configuration tool parameters.
- Excluded XML nodes
As with any other text, only indexed XML document text can be searched. Selected XML nodes for which search results are not wanted can be excluded from being indexed. For more information, see Omitting XML nodes from indexes.
ContentSearch Table Join
To run a CBR query, Content Engine runs the full-text search separately from the query that is represented by the full SQL statement. The full-text search is expressed in statements such as lion AND tiger. These search expressions are passed directly to IBM Content Search Services, and the returned search results are copied into a temporary database table. The query is then run as a standard relational query.
As with other tables referenced in a query such as Document, the temporary table represents
a class of objects. Each row in the table represents a separate instance of the class. You reference
the temporary table as the ContentSearch
class and join the table to other tables
in the query. Only one ContentSearch
join is allowed per SQL statement because only
one search function call is allowed. For information about ContentSearch
properties, see ContentSearch Properties. The properties can be referenced in the query as columns.
ContentSearch
is an internal class and, consequently, objects for the class cannot
be instantiated.
The following example demonstrates an inner join in a full-text search:
SELECT … FROM Document D INNER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CONTAINS(*,'text') AND D.Creator = 'Frank'
This query is run by first evaluating the CONTAINS(*, 'text') clause and writing that data to a temporary table. Then, the remainder of the query is run:
SELECT … FROM Document D INNER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CS.QueriedObject IS NOT NULL AND D.Creator = 'Frank'
Note that the CONTAINS clause
is replaced by CS.QueriedObject IS NOT NULL
.
In the previous example, the CONTAINS clause is used with an AND operator. Using an OR operator yields confusing results due to the presence of the inner join (which is an effective AND). For this reason, an AND operator is required with an INNER JOIN, and an OR operator is disallowed.
In this example of an outer join, the outer join first executes the CONTAINS clause and builds the temporary table in the database that holds the full-text search results, then executes the query:
SELECT … FROM Document D LEFT OUTER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CONTAINS(*,'text') OR D.Creator = 'Frank'
When you are using outer joins, only the ContentSearch
class
can be joined conditionally. You cannot use a right outer join in
the following query example, because the query results can potentially
return only ContentSearch
data, which is not allowed.
Outer joins must also use an OR operator when the CONTAINS clause
is combined with other conditions; use of an AND operator is disallowed:
SELECT … FROM Document D LEFT OUTER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CS.QueriedObject IS NOT NULL OR D.Creator = 'Frank'
CONTAINS Function
Use the CONTAINS function to search for text in object content and properties. The function must be called from the WHERE clause of the SQL statement for the query. At most, one function call can be made per SQL statement.
The CONTAINS function has the following grammar:
CONTAINS(<property_spec>, <string_literal>[,<dialect>])
The following function call examples show some of the IBM Content Search Services query syntax:
CONTAINS(d.*, 'lion NOT tiger')
CONTAINS(d.*, 'lion* tiger')
CONTAINS(d.*, 'lion tiger~0.8')
CONTAINS(d.*, '@xmlxp:''/zoo/mammal[ .contains("lion")]''')
CONTAINS(d.*,'lion AND tiger') AND d.Date >= 20101201T000000Z AND d.Date < 20101231T235959Z
CONTAINS Parameter: <property_spec>
The CONTAINS <property_spec> parameter specifies the class of the objects to be searched and the scope of the search for each object: the choice is between searching a single property or all properties plus object content. The functional effectiveness of the specified search scope depends on whether the relevant class and properties are CBR-enabled. For more information, see Search results.
To search a specific property, specify the property name as shown in the following example:
SELECT d.This FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.DocumentTitle,'lion')
To search all properties plus the object content, use an asterisk as shown in the following example:
SELECT d.This FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*,'lion')
CONTAINS Parameter: <string_literal>
The CONTAINS <string_literal> parameter accepts a full-text search expression. The search expression is sent to IBM Content Search Services for processing when the CBR query is run. The search expression syntax must conform to the syntax described in the following sections. An XPath-based syntax can also be used to search XML documents. For information about this syntax, see XML Searches. You can combine XML searches with fielded searches and text searches.
Enclose the CONTAINS search expression in single quotation marks as shown in the following example:
SELECT d.This FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*,'DocumentTitle:zoo AND lion')
Query terms and phrases
Queries can contain terms and operators. A term is a single word such as database. A phrase is a group of words that are contained in quotation marks, such as "computer software". Phrases are searched as exact expressions.
The following table shows how phrases (exact matches) are handled differently than search terms without quotation marks.
Search query | Terms without quotation marks | Exact match (phrases) |
---|---|---|
Lemmatization | Done | Not done |
Stop words | Removed | Not removed |
Synonyms | Expanded | Not expanded |
Boolean operators | Supported | Not supported (interpreted as search terms) |
Operators: "- " and "% " |
Supported | Not supported (interpreted as search terms) |
Wildcard characters: "* " and "? " |
Supported | Supported |
Boolean Searches
Boolean operators allow terms to be combined through logical operators. The following Boolean operators are supported: AND, &&, OR, | |, NOT and "-".
Boolean operators are subject to the following restrictions:
- You must include a white space character before and after the AND, &&, OR, ||, and NOT Boolean operators.
- Boolean operators must be specified in all uppercase (AND, OR, NOT) or they are treated as search terms and ignored.
- An exclamation point (!) cannot be used in place of the NOT operator.
- Because all terms are automatically included in a query, the plus sign (+) is not supported and, if used, is ignored.
Precede a term or phrase with the NOT operator or a minus sign (-) to indicate that the term or phrase must be absent from a document for a match to occur. These operators, which are equivalent, act as filters to remove documents and must be associated with a query that returns positive results. For example, the following queries return documents that include the term computer and not the term hardware:
computer NOT hardware
computer –hardware
You can use parentheses to control the Boolean logic in a query. For example, the following query finds documents that contain either WebSphere or Lotus and website:
(
WebSphere
OR Lotus) AND website
The following table shows examples of search queries with Boolean operators.
Search query | Results |
---|---|
|
Narrows the search to include documents that must contain both the term database and the term disk anywhere in the document. |
|
Widens the search to include documents that contain either the term database or the phrase log file anywhere in the document. |
|
Searches for documents that contain database and that do not contain the phrase log file anywhere in the document. |
|
Searches for documents that contain either the term database or the phrase data base and that do not contain the phrase log file anywhere in the document. |
Fielded Data
IBM Content Search Services supports indexing fields and searching string-based data in fields. When you perform a search, you can optionally specify the field in which to search.
A field name must correspond to the symbolic name of an indexed object property that is directly or indirectly specified by the <property_spec> parameter. The corresponding property must also be CBR-enabled. You cannot use fielded search within xmlxp expressions.
If you do not specify a field, IBM Content Search Services searches for the query text in the textual content of the entire document. For XML documents, IBM Content Search Services looks for the query text in elements only.
Specify the field in which to search by typing the field name followed by a colon (:). Then, specify the term or phrase that you are searching for. For example, the value of the symbolic name of the CBR-enabled property named "DocumentTitle" can be searched according to the following query:
DocumentTitle:"Budget Proposal"
Use parentheses to group multiple clauses in a field. For example, the following query searches for titles that contain both the word tutorial and the phrase "computer software":
DocumentTitle:(tutorial "computer software")
You can also use fields with searches that use Boolean operators by enclosing the search terms in parenthesis. For example, the following queries search for titles that do not contain the word apple:
DocumentTitle:(-apple)
DocumentTitle:(-apple banana)
The following query searches for titles that contain the word apple or banana:
DocumentTitle:(apple OR banana)
The following query searches for titles that contain the word apple and do not contain the word banana:
DocumentTitle:(apple AND NOT banana)
You can also use fielded data with Proximity Searches and Fuzzy Searches.
Special Characters
Special characters can serve different functions in the query syntax. For example, a question mark (?) can be used as a wildcard character. Specify special characters in the CONTAINS <string_literal> parameter, and escape them if necessary. To escape most special characters, precede the character with a backslash (\). For example:
- To search for the string "where?", use "where\?".
- To search for the string "c:\temp", use "c\:\\temp".
To escape a single quotation mark, use another single quotation mark. For example:
- To search for the string "owner's name", use "owner''s name".
The special characters in the following table must be escaped.
Special character | Behavior when not escaped |
---|---|
Asterisk (*) | Used as a wildcard character. |
At sign (@) | A syntax error is generated when an at sign is the first character of a query. In xmlxp expressions, the at sign is used to refer to an attribute. |
Brackets ([]) | Used in xmlxp expressions to search the contents of elements and attributes. |
Braces ({}) | Generates a syntax error. |
Backslash (\) | |
Caret (^) | Used for weighting (boosting) terms. |
Colon (:) | Used to search the contents of a field. |
Equal sign (=) | Generates a syntax error. |
Exclamation point (!) | A syntax error is generated when an exclamation point is the first character of a query. |
Forward slash (/) | Used in xmlxp expressions as an element path separator. |
Greater than symbol (>) | Used in xmlxp expressions to compare the value of an attribute. Otherwise, a syntax error is generated. |
Less than symbol (<) | Used in xmlxp expressions to compare the value of an attribute. Otherwise, a syntax error is generated. |
Minus sign (-) | When a minus sign is the first character of a term, only documents that do not contain the term are returned. |
Parentheses | Used for grouping. |
Percent sign (%) | Specifies that a search term is optional. |
Plus sign (+) | |
Question mark (?) | Used as a wildcard character. |
Semicolon (;) | |
Single quotation mark (') | Used to contain xmlxp expressions. To escape a single quotation mark, use another single quotation mark instead of a backslash. |
Tilde (~) | Used for proximity and fuzzy searches. |
Vertical bar (|) |
Special characters that do not have a meaning in the query syntax need not be escaped, but it is safe to do so. The following examples are of special characters that do not need to be escaped:
- Comma (,)
- Dollar sign ($)
- Period (.): Used to contain xmlxp expressions to search the content of elements.
- Pound sign (#)
- Underscore (_)
Special characters are queried according to the following criteria:
- When a special character is adjacent to a word, documents that contain the special character and word in the same order are returned. For example, a query for 30$ finds documents that contains the expression 30$, but does not find documents that contain the expression $30.
- When a special character is not adjacent to a word, it is queried as a separate term. For example, a query for 30 $ (with a separating space) finds documents that contain 30$, $30, 30 and $.
- When a word that is considered to be a stop word is adjacent to a special character in a query, it is not removed. For more information, see Stop words.
- When a word is adjacent to a special character in a query, it is still lemmatized. For example, a query for cats&dogs (in English) finds documents that contain cat&dog.
- When a special character separates two words, the sequence of tokens is searched as a sequence. For example, searching for jack_jones finds documents that contain jack_jones but not documents that contain jack_and_jones.
For information about using special characters in wildcard search expressions, see Wildcard Searches.
Indexing and Special Characters
During tokenization and language processing, IBM Content Search Services identifies and indexes special characters as punctuation. Special characters are token delimiters. For example, "jack_jones" is tokenized as three separate tokens: jack,"_, and jones. Emails, URLs, and file paths are broken down into tokens. For example:
- Jack_jones@ibm.com is tokenized as jack _ jones @ ibm . com
- http://www.ibm.com is tokenized as http :// www . ibm . com
Special characters do not occupy a token position in the file. For example, jack_jones is indexed with the underscore in the same token position as jack. Special characters also do not occupy a token position when spaces are included. For example, jack_jones is indexed in the same way as jack _ jones.
The token position is used for exact phrase search and for proximity search. For example, if a document contains the expression jack_jones, searching for the exact phrase ""jack jones"" finds this document.
When a sequence of special characters is indexed separately, they are searched in no particular order. For example, searching for "#$" also finds documents that contain "$#".
Special Characters in Chinese, Japanese, and Korean Languages
To find a sequence of characters in Chinese, Japanese, and Korean (CJK) languages that includes special characters, the query expression must include the special characters. If you omit the special characters from the query expression, the character sequence might not be found. In non-CJK languages, the character sequence is always found, even if the query expression omits the special characters. For example, if an indexed document contains john_smith, you can search for john_smith or "john smith" (exact match, without the underscore) and both queries return the document that contains john_smith.
Wildcard Searches
IBM Content Search Services supports wildcard searches in terms, fielded data, and phrases. You can place wildcard characters before, within, or after a term. Use a question mark (?) to perform a single character wildcard search. For example, the following term query finds documents that contain the terms mare, mere, mire, and more:
m?re
The following phrase query finds documents that contain "uri requirements" and "url requirements":
"ur? requirements"
You can use special characters in wildcard search expressions. For example, searching for ja*_ finds documents that contain jack_jones. However, you cannot use wildcard characters to find special characters. For example, searching for "ca*s" finds documents that contain cats, categories, or cas, but not documents that contain ca_s.
To use a single wildcard character search at the beginning of a term, you must use double quotation marks. For example:
"?more"
Use an asterisk (*) to perform a multiple character wildcard search. A multiple character wildcard search looks for 0 (zero) or more characters. For example, the following term query finds all documents with DocumentTitle property values that begin with tech:
DocumentTitle:tech*
The following phrase query finds documents with DocumentTitle property values that contain "maintenance contacts" and "maintenance contracts":
DocumentTitle:"maintenance cont*"
You can use wildcard characters as the values of XML elements and attributes. However, you cannot use wildcard characters in the names of XML elements and attributes. For more information, see XML Searches.
Note: Using a multiple character wildcard (*) at the beginning of a search term can have a negative impact on the performance of a query when many matching terms are found. There is a configurable limit on the number of terms that can be returned; an error is returned when the limit (10,000 terms by default) is exceeded. To change this limit, use the configuration tool to specify a new value for the queryExpansionLimit parameter. For more information, see Configuration tool usage .
Specifying Optional Terms
You can use a percent sign (%) to indicate that a search term is optional. For example, the following query returns all documents that contain the word cat, and optionally contain the word dog. This query provides a higher score on documents that also contain the word dog:
cat %dog
The optional operator is in effect only if there is more than one search term in an expression. If a query contains a single term that is marked as optional, the query finds documents that include the specified term. That is, if there is only one term in an expression or within a set of parenthesis, the optional operator is ignored. For example, each of the following queries is equivalent in both search results and score:
%dog
%(dog)
(%dog)
dog
If all the search terms in an expression are optional, the query finds documents that include at least one of the specified terms, which is equivalent to an OR query. For example, each of the following queries is equivalent, in both search results and score, and return documents that contain either the words cat or dog:
%cat %dog
%cat OR %dog
cat OR dog
In the following queries, the words dog and horse are marked as optional, but because they are the only terms within a set of parenthesis, at least one of them must be found in a document for it to be returned. For example, each of the following queries is equivalent. Each document that is returned must contain the word cat and must contain either the word dog or horse (or both):
cat (%dog %horse)
cat (%dog OR %horse)
cat (dog OR horse)
Proximity Searches
A proximity search finds documents that contain terms within a specified number of words of each other. To perform a proximity search, enclose the terms in quotation marks and follow the terms with a tilde symbol (~) and a positive integer. For example, the following query searches for documents that contain IBM and WebSphere within seven words of each other:
"IBM
WebSphere
"~7
- Lemmatization is not done.
- Stop words are not removed
- Synonyms are not expanded.
- Boolean operators are not supported.
- Wildcard characters are supported.
Proximity search is supported for individual terms, not for phrases. A word that follows a sentence break is not considered to be adjacent to words in the previous sentence. A word after a sentence break is considered to be 10 positions apart from the last word of the previous sentence. A sentence break is identified by a period (.), a question mark (?), or an exclamation point (!) followed by a space. For example, the period in ibm.com is not a sentence break. The position of the punctuation is typically the same as the character before it (the punctuation does not take a position by itself).
You can use a proximity search with fielded data. For example, the following query searches for DocumentTitle property values that contain the words apple and banana within two words of each other:
DocumentTitle:("apple banana"~2)
Weighted searches (boosting terms)
Follow a search term by a boost value to influence how documents that contain a specified term are ranked in the search results. Use the caret symbol (^) with a number (the boost factor) at the end of the term that you are searching for. For example, the following query finds documents that include the terms IBM and Germany and increases the relevance of these documents by a factor of five in the search results:
IBM Germany^5.0
Boosting terms with fielded data
fruit:apple^5.0
fruit:(apple banana^5.0)
Fuzzy Searches
A fuzzy search query searches for character sequences that are not only the same but similar to the query term. Use the tilde symbol (~) at the end of a term to do a fuzzy search. For example, the following query finds documents that include the terms analytics, analyze, analysis, and so on.
analytics~
You can add an optional parameter to specify the required similarity. Specify a value greater than or equal to 0 and less than 1. The value must be preceded by a 0 and decimal point, for example, 0.8. A value closer to 1 matches terms with a higher similarity. If the parameter is not specified, the default is 0.5.
analytics~0.8
Fuzzy search syntax can be specified per term, but not on a phrase. To apply fuzzy search to multiple words in a query, you must apply a fuzzy search factor for each term. For example, the following query finds documents that include terms that are similar to summer and time.
summer~0.7 time~0.7
In the case of a fuzzy search with fielded data, the parameter that is used to specify the required similarity is NOT optional. Specify a value greater than or equal to 0 and less than 1. For example, the following query searches for DocumentTitle property values that contain words that are similar to the word analyze:
CONTAINS(d.DocumentTitle,'analyze~0.3')
- Special characters are not supported in fuzzy search queries.
- Terms in fuzzy search queries do not go through language processing (lemmatization, synonym expansion, and stop word removal). Therefore, fuzzy search queries do not find terms that are similar to synonyms.
- If wildcard characters are included in the fuzzy search terms, only the wildcard search is done.
XML Searches
The CONTAINS <string_literal> parameter accepts an XPath-based expression to search for terms in elements and attribute values in XML files.
The following search features are supported by XML search:
- Boolean operators
- Exact match
- Fuzzy search
- Lemmatization
- Proximity search
- Stop words
- Synonyms
- Wildcard characters
- Text searches
- Fielded searches
The search expression is sent to IBM Content Search Services for processing when the CBR query is run. IBM Content Search Services implements only a subset of the XPath specification as described here.
You can also specify a range of field values. To define a range, specify the lower and upper bounds and indicate whether the range is inclusive (by using square brackets) or exclusive (by using curly brackets). Sorting is lexical. For example, 2 is greater than 10.
XPath-based searches are subject to the following restrictions:
- Only
Document
andAnnotation
objects with one content element can be searched. - The content type of the content element must be "text/xml" or "application/xml".
- Only one XML node can be specified as part of the path search.
- The presumed XML encoding for an XML document is UTF-8. Explicitly identify the XML encoding in the XML document for any other encoding.
As with any other string literal in a query, you enclose a search expression within single
quotation marks. Single quotation marks must also be used as part of the XPath syntax for the search
expression. These XPath-required single quotation marks are shown without escape characters in the
search expression examples in this topic. When you pass an XPath-based search expression to the
CONTAINS function, escape the XPath-required single quotation marks by using two consecutive single
quotation marks. In the following example, the search expression @xmlxp:'/zoo/mammal[.
contains("lion")]'
is passed to the CONTAINS function:
SELECT d.This FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*, '@xmlxp:''/zoo/mammal[. contains("lion")]''')
As shown in the example, use a forward slash to separate the XML elements in the path. After the path specification, use brackets with an enclosed period to search the content of XML elements.
By definition, XML element and attribute names are case-sensitive. However, element and attribute values are not case-sensitive.
Specifying element paths in XML queries
You can specify element paths in XML queries by using the following query syntax:
/tag[. contains("word")]
- Definition: Finds a word in the top-level element.
- Example:
@xmlxp:’/email[. contains("contract")]’
- Result: Returns documents that contain the word contract in the top-level <email> element.
/tag1/tag2[. contains("word")]
- Definition: Finds a word in the first child element of the top-level element.
- Example:
@xmlxp:’/email/Subject[. contains("contract")]’
- Result: Returns documents that contain the word contract in the <Subject> element directly under the top-level <email> element.
//tag[. contains("word")]
- Definition: Finds a word in an element at any level.
- Example:
@xmlxp:’//Recipient[. contains("John")]’
- Result: Returns documents that contain the word John in a <Recipient> element at any level.
/tag1//tag2[. contains("word")]
- Definition: Finds a word in a given child element of an element at any level.
- Example:
@xmlxp:’/Email//To[. contains("John")]’
- Result: Returns documents that contain the word John in a <To> element under a root <Email> element at any level.
/*/tag[. contains("word")]
- Definition: Finds a word in a given child element of an unspecified root element.
- Example:
@xmlxp:’/*/To[. contains("John")]’
- Result: Returns documents that contain the word John in a <To> element under any root element.
/tag1/*/tag2[. contains("word")]
- Definition: Finds a word in a given child element of an unspecified element directly under a specified root element.
- Example:
@xmlxp:’/Email/*/To[. contains("John")]’
- Result: Returns documents that contain the word John in a <To> element under any element that is under the root <Email> element.
Searching the content of elements and attributes
You can search the contents of elements in XML documents by using a period (.) in an xmlxp expression as follows:
/tag[.contains ("word")]
You can combine search expressions at different levels of an element path, for example:
/book[. contains ("whale")]/name[. contains("Moby")]
To search the contents of attributes in XML documents, use @attribute_name in an xmlxp expression as follows:
/tag[@attribute contains ("word")]
The following query expressions apply to searching the contents of both elements and attributes. To search attribute values, replace the period (.) with @attribute_name in the square brackets.
/tag[.contains("word")]
: Finds a single term./tag[.contains("word1 word2")]
: Finds multiple terms under the same element./tag[.contains("""word1 word2""")]
: Finds an exact phrase./tag[.contains("word1 %word2")]
: Finds multiple terms with an optional term under the same element./tag[.excludes("value")]
: Finds elements that do not include the specified words./tag[.contains("word~")]
: Fuzzy search: finds terms that are similar to the given term./tag[.contains("""word1 word2""~4")]
: Proximity search: finds two terms within a specified number of words of each other.
Examples of expressions with Boolean operators
-
/tag[.contains("word1 AND word2")]
-
/tag[.contains("word1 OR word2")]
-
/tag[.contains("word1 NOT word2")]
Examples of expressions with wildcard characters
-
/tag[.contains("word*")]
-
/tag[.contains("word?")]
-
/tag[.contains("word1*word2")]
-
/tag[.contains("word1?word2")]
-
/tag[.contains("*word")]
-
/tag[.contains("?word")]
For more information about wildcard characters, see Wildcard Searches.
Alternative syntax structure
You can include an element path in the brackets of an xmlxp expression. For example, the following queries are equivalent and return the same results:
@xmlxp:’/doc/email[recipients contains ("John")]’
@xmlxp:’/doc/email/recipients[.contains ("John")]’
You can use this syntax to express constraints on different subpaths in the same xmlxp expression, for example:
@xmlxp:’/doc/email[recipients/to contains ("John") AND sender contains ("Jack")]’
Numeric comparisons of attribute and element values
You can use the following construct to compare the numeric values of attributes:
/tag[@attribute operator numeric_value]
The following operators are supported: > < >= <= = !=
Numbers can be positive and negative integers (for example, 15 and -15), decimal fractions (for example, 15.1), and scientific notation (for example, 1.510000e+1). For example, the following queries are equivalent:
@xmlxp:’/book/name[@price > 15]’
@xmlxp:’/book/name[@price > 15.0]’
@xmlxp:’/book/name[@price > 1.500000e+1]’
Date and time comparisons
Date and DateTime element and attribute values can be searched. The following operators are supported for comparing Date and DateTime values: > < >= <= = !=
Date and DateTime values are searchable when they are provided in the following ISO 8601 formats:
- Date: yyyy-MM-dd
- DateTime: yyyy-MM-ddTHH:mm:ss.SSSSSSSS. T is an optional time delimiter that can be replaced by one or more white spaces. Milliseconds (0 - 8 digits) are optional.
For more information about supported date and time formats, see http://www.w3.org/TR/xmlschema-2/#dateTime.
The following examples are valid Date and DateTime values:
2011-07-04
2011-07-04 16:41:06
2011-07-04T16:41:06
2011-07-04T16:41:06.123
- An element or attribute must contain the Date or DateTime value only, without additional characters (other than white space).
- Elements that contain Date or DateTime values cannot have descendant elements.
The following examples show Date and DateTime values in the context of an XML document:
-
<elementName>2011-07-04 16:41:06</elementName>
-
<elementName attributeName="2011-07-04"></elementName>
During a search, the Date or DateTime value must be enclosed within an
xs:date()
or xs:dateTime()
function call to be recognized as the
correct data type.
An XML DateTime data type in an XML document can specify a timezone value
in the form ((’+’ | ’-’) hh ’:’ mm) |
’Z’
. However, when a DateTime is indexed, the IBM Content Search Services server truncates time zone values during indexing. Time zones cannot be specified in an
xs:date()
or xs:dateTime()
value in a query. Therefore, time zones
are not considered during XML searches that involve Date or DateTime data types.
A time element of 24:00:00 is treated as a value between the last instant of that day and the first instant of the next day.
The following examples show Date and DateTime comparisons in xmlxp terms:
@xmlxp:’/Book[@publishDate > xs:date(“2010-01-01?)]’
: Top-level tag is Book. Book has an attribute publishDate that is greater than the date of 2010-01-01.@xmlxp:’/Book[purchaseTime > xs:dateTime(“2009-05-20T13:00:00?)]’
: Top-level tag is Book. Book has a direct child purchaseTime that is a DateTime expression greater than 2009-05-20T13:00:00.000000.
Boolean expressions in queries
You can use the following Boolean operators to build complex query expressions with the following xmlxp constraints: AND, OR, and XOR. For example, the following query includes AND and OR Boolean operators:
@xmlxp:’//BBB[.contains("E")]’ AND @xmlxp:’/AAA/BBB[@id = "2"]’ OR @xmlxp:’/AAA/BBB[@id = "2"]’
You can also use Boolean operators in the value of an @xmlxp constraint.
The following syntax examples show how you can use Boolean operators with search terms:
@xmlxp:’//body[.contains("word1 AND word2")]’
: Returns results when both words are found in the <body> element.@xmlxp:’//body[.contains("word1 word2")]’
: Returns results when both words are found in the <body> element. This construction is the same as using AND.@xmlxp:’//body[.contains("word1 OR word2")]’
: Returns results when either word is found in the <body> element.@xmlxp:’//body[.contains("word1 NOT word2")]’
: Returns results when word1 is found, but word2 is not found, in the <body> element.@xmlxp:’/tag1[tag2 contains ("word2") AND tag3 contains ("word3")]’
: Returns results when word2 is found under the <tag2> element and word3 is found under the <tag3> element, and both are found under the /tag1 element path.
Finding elements or attributes
You can check for the existence of elements or attributes without specifying their values.
For example, the following query returns results when an <email> element is found directly under a <doc> element:
@xmlxp:’/doc/email’
The following query returns results when an <email> element with an id attribute is found directly under a <doc> element:
@xmlxp:’/doc/email/@id’
Using namespaces in a search
You can refer to XML namespaces during a search. Namespaces in XML documents are mechanisms for scoping element and attribute names to prevent naming conflicts, for example, between XML documents from different sources. For more information about namespaces, see http://www.w3.org/TR/xml-names.
For example, if you want to define an explicit namespace such
as http://example.com/ns/abc
in an XML document and refer to it in an xmlxp query,
format the XML as follows:
<?xml version=’1.0’?>
<doc xmlns:ns1="http://example.com/ns/abc">
<ns1:p>word</ns1:p>
</doc>
The corresponding xmlxp query is written as follows:
@xmlxp:’declare namespace ns1 = "http://example.com/ns/abc";
/doc/ns1:p[. contains ("word")]’
Default namespaces
To define a default namespace in an XML document such
as http://example.com/ns/abc
and refer to it an xmlxp query, specify the following
XML:
<?xml version=’1.0’?>
<doc xmlns="http://example.com/ns/abc">
<p>word</p>
</doc>
The corresponding xmlxp query is written as follows:
@xmlxp:’declare default element namespace "http://example.com/ns/abc";
/doc/p[. contains ("word")]’
Attribute namespaces
An attribute can also have a namespace. In this example, the element and attribute both have a namespace and the namespaces are different:
<dog xmlns:an="http://example.org/animals" xmlns:sz="http://example.org/sizes">
<an:breed sz:size="Medium">Mutt</an:breed>
</dog>
The following xmlxp query uses a namespace for the attribute name:
@xmlxp:’declare namespace sz = "http://example.org/sizes";
/dog/breed[@sz:size contains("Medium")]’
The following xmlxp query uses a namespace for both the attribute and element:
@xmlxp:’declare namespace an = "http://example.org/animals";
declare namespace sz = "http://example.org/sizes";
/dog/an:breed[. contains ("Mutt") and @sz:size contains("Medium")]
Combining XML search with fielded search
You can combine xmlxp syntax with searches on metadata fields. For example, the following search finds emails that were sent to a recipient named Paul and the CBR-enabled DocumentTitle property has a value of "IBM":
@xmlxp:’//recipients/to[.contains("Paul")]’ DocumentTitle: IBM
You can also include Boolean operators as shown in the following examples:
@xmlxp:’//recipients/to[.contains("Paul")]’ AND DocumentTitle: IBM
@xmlxp:’//recipients/to[.contains("Paul")]’ OR DocumentTitle: IBM
@xmlxp:’//recipients/to[.contains("Paul")]’ AND NOT DocumentTitle: IBM
Combining XML search with textual search
You can combine xmlxp syntax with regular text searches. For example, the following search finds emails that were sent to a recipient named Paul that contain the word important.
@xmlxp:’//recipients/to[.contains("Paul")]’ important
Combining XML search with proximity search
You can combine xmlxp syntax with a proximity search according to the following syntax:
SELECT d.This FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*, '@xmlxp:''/email1/body1[.contains("""body email""~2")]''')
For example, the following search finds an email subject that contains "sun" and "sand" within two words of each other:
SELECT d.This FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*, '@xmlxp:''/icc_document/icc_email/icc_subject[.contains("""sun sand""~2")]''')
CONTAINS Parameter: <dialect>
The CONTAINS <dialect> parameter is an optional parameter that identifies the syntax of the search.
Pass one of the following arguments to the parameter:
- Lucene (for IBM Content Search Services)
The following example shows a query with a search function call that specifies the IBM Content Search Services syntax:
SELECT d.This FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*, 'lion' , 'Lucene')
The <dialect> parameter is optional. By default, the object store CBRSearchType property determines the Content Search Engine that runs a search for a CBR query. The intended Content Search Engine determines the presumed search syntax for a query. The <dialect> parameter works in reverse: if you explicitly identify the search syntax, this identification overrides the CBRSearchType property value to determine the Content Search Engine that runs the search. The intended Content Search Engine must be enabled to run the CBR query successfully.
Full-Text Joins
Whenever a CONTAINS clause is used, a join is performed
on the (internal) ContentSearch
class.
ContentSearch
class cannot be retrieved.
However, values for the properties of this class can be specified, and are defined in the metadata
for this class. For more information, see ContentSearch Properties. The join is necessary
because when a query with a full-text search is run, the full-text
search is performed first, then the data from the full-text search
is copied to a temporary database table that is referenced by the ContentSearch
class name. The
remainder of the search statement is then run against the repository,
joining to this temporary table to access the full-text search data.
Only one join to the ContentSearch
class
is allowed per query statement, because only one CONTAINS clause is
allowed.
The following example demonstrates an inner join in a full-text search:
SELECT … FROM Document D INNER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CONTAINS(*,'text') AND D.Creator = 'Frank'
This query is run by first evaluating the CONTAINS(*,
'text')
clause and writing that data to a temporary table. Then
the remainder of the query is run:
SELECT … FROM Document D INNER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CS.QueriedObject IS NOT NULL AND D.Creator = 'Frank'
Note that the CONTAINS clause is replaced by CS.QueriedObject
IS NOT NULL
.
In the previous example, the CONTAINS clause is used with an AND operator. An OR operator would yield confusing results due to the presence of the inner join (which is an effective AND). For this reason, an AND operator is required with an INNER JOIN, and an OR operator is disallowed.
An example of an outer join is:
SELECT … FROM Document D LEFT OUTER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CONTAINS(*,'text') OR D.Creator = 'Frank'
The outer join first runs the CONTAINS clause and builds the temporary table in the database that holds the full-text search results, then runs the query:
SELECT … FROM Document D LEFT OUTER JOIN ContentSearch CS ON D.This=CS.QueriedObject
WHERE CS.QueriedObject IS NOT NULL OR D.Creator = 'Frank'
When you are using outer joins, only the ContentSearch
class
can be joined conditionally. The previous query above would not be
allowed to use a right outer join, because that would potentially
return only ContentSearch
data, which is not allowed. Outer
joins must also use an OR operator when the CONTAINS clause is combined
with other conditions. Use of an AND operator here is disallowed.
ContentSummary Queries
When you create a full-text (CBR) query, avoid referencing the ContentSummary column. The retrieval of this text column can slow the execution of the search.
FullTextRowLimit
FullTextRowLimit indicates the number of rows that a CBR query returns before it runs the remainder of the query. The following example shows how to specify a value for the FULLTEXTROWLIMIT option:
SELECT d.This, c.Rank FROM Document d
INNER JOIN ContentSearch c ON d.This = c.QueriedObject
WHERE CONTAINS(d.*,'lion AND tiger')
ORDER BY c.Rank
OPTIONS (FULLTEXTROWLIMIT 500)
There is one row for each object, regardless of the number of content elements. As noted in Full-Text Joins, data from the full-text index is copied to a temporary table, and that table is then joined with the remainder of the object store database to run the query.
A query might have a CONTAINS clause that matches many thousands of rows in the full-text index. However, a user might not want to pull all these rows into the temporary database table before running the remainder of the query. In this case, you can set FullTextRowLimit to a lesser value, enabling the user to see a subset of the matches more quickly.
Do not set the FullTextRowLimit too low. For example, suppose that a query has the WHERE clause "WHERE color = 'red' and CONTAINS(*, 'blue')". This query might have a thousand rows that match "CONTAINS(*, 'blue')", but only 10 rows that match "color = 'red'" and "CONTAINS(*, 'blue')". If FullTextRowLimit is set to 500, then only 500 rows are pulled from the full-text index and written to the temporary table. However, because those 500 rows might not be the same rows that also have "color = 'red'" in the database, all 10 rows might not be found as a result of the query.
If a value for FullTextRowLimit
is not supplied, the value of the ObjectStore.FullTextRowDefault
property
(stored in the GCD) is used.
If
a value for FullTextRowLimit is supplied and this value is greater
than the value of ObjectStore.FullTextRowMax
, then
the value of ObjectStore.FullTextRowMax
is used instead.
The FullTextRowMax property is present on the ObjectStore class, enabling
the system administrator to prevent queries from using too much processing
time. The ObjectStore.FullTextRowMax property defaults to 10,000 unless
it is changed by the system administrator.
FullTextRowLimit and Optimized Queries
A full-text query is optimized if the following characteristics are true:
- The query is continuable.
- The query is ordered by descending
ContentSearch.Rank
value, or the query is unordered (without an ORDER BY clause). - The query does not have a DISTINCT clause.
- The query uses an INNER JOIN to include the
ContentSearch
class. - The query is not used in a merged scope object store search.
An optimized full-text query can return results faster in some instances, and also allows the user to retrieve more rows without reaching the FullTextRowLimit value, as well as possible timeout errors or out of memory conditions.
If a full-text query is optimized, when a request for the first page is generated, the server retrieves the FullTextRowLimit number of rows from the full-text search engine and stores them in the temporary database table, then runs the relational part of the query. However, if an insufficient number of rows are found to fill the current page, the server retrieves the next set of rows from the full-text search engine, repeating the process until enough rows are found. When the server pulls the second and subsequent set of rows from the full-text search engine, the server has the percentage of retrieved rows that are actually used in the resultant data to be returned to the user, and, therefore, it no longer uses the FullTextRowLimit, and instead retrieves the number of rows based on the prior percentage used.
When the second or subsequent page is requested for an optimized full-text query, the server also retrieves the next set of rows from the full-text search engine. Therefore, this type of query can be used to browse through an unlimited number of rows that match a content search.
Optimization might not be suitable for all types of queries:
ordering by descending ContentSearch.Rank
slows the
relational part of the query, and some callers might not want the
data to be returned in descending order.
A continuable query that is not optimized will not continue past the number of rows that are specified by the value of FullTextRowLimit. When the query is not optimized, the server cannot retrieve the next set of full-text hits for subsequent pages past the FullTextRowLimit. Instead, the server retrieves the same number of rows for each page that is processed, by using the FullTextRowLimit value for the number of rows.
FullTextRowLimit and Non-Optimized Queries
Queries that are not continuable pull only one set of a FullTextRowLimit number of rows from the full-text search engine. The value of FullTextRowLimit must be chosen carefully in this case.
Queries that are not optimized might return a subset of the number of rows that match the query if more than the FullTextRowLimit number of matches exists in the full-text index. If this condition occurs, the server throws a CBR_FULLTEXTROWLIMIT_EXCEEDED exception to notify the client that not all matches have been returned. This exception will be thrown when the caller iterates through the rows returned for the query, after the last row is found.
If a value for
FullTextRowLimit is not supplied, the value that is used is the value
of the ObjectStore.FullTextRowDefault
property that
is stored in the Global Configuration Data (GCD) database.
If a value for FullTextRowLimit is
supplied and this value is greater than the value of the ObjectStore.FullTextRowMax
property,
the ObjectStore.FullTextRowMax
value is used instead.
The FullTextRowMax property enables system administrators to prevent
queries from using too much processing time. If the FullTextRowMax
property is not set, its default is 10,000.
Query Result Ranking
Use the Rank property in the ORDER BY clause of a query to order the returned objects by query relevance. The Rank property belongs to the ContentSearch class.
The following full-text search example orders the search results in descending order of relevance:
SELECT d.This, c.Rank
FROM Document d INNER JOIN ContentSearch c
ON d.This = c.QueriedObject
WHERE CONTAINS(d.DocumentTitle, 'lion tiger^5')
ORDER BY c.Rank DESCENDING
In this example, the DocumentTitle property is the searched text for each object. The search terms are lion and tiger. (^5 indicates the boost factor for tiger). The Rank property value for a returned object is calculated based on the searched text and the terms in accordance with the following factors:
- Term instance frequency - Term instance frequency is the number of term instances in the searched text for the object: objects with more term instances tend to receive a higher ranking. In the example, objects with more instances of lion and tiger in the DocumentTitle property would tend to receive a higher Rank property value.
- Term inverse matched-object frequency - Term inverse matched-object frequency is the inverse of the number of objects with a term in the searched text: objects with rarer terms tend to receive a higher ranking. In the example, suppose that the DocumentTitle property of all searched objects contains fewer instances of lion than tiger. Consequently, any object that has one or more instances of lion in the DocumentTitle tend to receive a higher Rank property value.
- Term presence ratio - Term presence ratio is the fraction of the number of terms that the searched text for an object contains: objects with more of the terms tend to receive a higher ranking. In the example, objects that have both lion and tiger in the DocumentTitle property tend to receive a higher Rank property value. Objects that have just one of the terms tend to receive a lower Rank property value.
- Term boosting - Term boosting is the importance of a term relative to other terms as specified in the search expression: objects with a boosted term in the searched text tend to receive a higher ranking. In the example, the specified boost factor is 5 for tiger and 1 for lion. Objects that have tiger in the DocumentTitle property tend to receive a higher Rank property value than objects with lion. Although the boost factor must be positive, it can be less than 1 (for example, 0.2).
- Field inverse length - Field inverse length is the inverse of the number of words in a field: objects with searched fields that have fewer words tend to receive a higher ranking. In the example, objects that have a DocumentTitle property with fewer words tend to receive a higher Rank property value. The term field refers to a an IBM Content Search Services field. Each object property is a field, and the object content is part of the default field that includes all object text.
CBR Query Optimization
This feature specifies how searches that combine both a content-based retrieval (CBR) search and a relational search on a database (DB) are run. By default, the Content Engine always performs the CBR search first and the DB search second. The CBR-first approach is most efficient when there are few full text hits. Efficiency decreases, however, when there are many full-text hits, and there are fewer database hits than full-text hits.
To provide control over how combined searches are run, the CBRQueryOptimization property can be set on the object store. As an alternative to the default CBR-first option, you can set the property to the dynamic switching option. In dynamic switching mode, the Content Engine dynamically determines whether to run the CBR search first or the DB search first, optimizing performance for these types of searches.
In dynamic switching mode, the Content Engine switches
from CBR first to DB first based on an estimated number of CBR hits. The estimate is compared to a
threshold value, set in the CBRQueryDynamicThreshold property. If the number of full-text estimated hits is less than
or equal to the CBRQueryDynamicThreshold
value, the CBR search is run first
(CBR-first search). If the number of full-text estimated hits is larger than the
CBRQueryDynamicThreshold
value, the database search is run first (DB-first
search).
Dynamic switching operation is impacted by various search options, including requests for rank ordering. The CBRQueryRankOverride property on the object store determines how the server responds to CBR search requests for rank order, and can impact server performance.
There is no CBRQueryOptimization option for the server to always process combined searches as DB first (although query users can specify an SQL override option to force DB first). Nonetheless, you can configure the server for DB first behavior by setting the following combination of object store properties as specified:
- CBRQueryOptimization:
DYNAMIC_SWITCHING
(1) - CBRQueryDynamicThreshold: 0 (zero)
- CBRQueryRankOverride:
ENABLED
(1)
However you configure the server to process combined CBR and database searches, follow the best practices to maximize performance. If you are considering the dynamic switching (optimized) mode, be aware of the limitations of running in this mode. In addition, some searches cannot be optimized in this mode; see Unsupported Searches.
SQL Query Overrides
The following SQL query options can be used to override the CBRQueryOptimization setting on the object store:
CBR_CONTENT_FIRST
- The CBR search occurs first (CBR first). This setting overrides dynamic switching if configured on the object store.CBR_DB_FIRST
- The database search occurs first (DB first). This setting overrides dynamic switching or CBR first if configured on the object store.CBR_DYNAMIC_THRESHOLD
N, where N is an integer from 0 (zero) to Integer.Max_Value.The search is run by using dynamic switching with a threshold value of N.
There are measures that you must take to ensure that DB-first searches run efficiently. See Best Practices.
CBR_DB_FIRST
or CBR_DYNAMIC_THRESHOLD
option
with an unsupported
search, an exception is thrown.In the following example, the search is run as DB first, overriding the object store configured values.
SELECT d.Id, d.DocumentTitle FROM Document d
WHERE d.DocumentTitle LIKE 'Specification FaultTree%'
AND CONTAINS(d.*, 'Tennesee AND Contract')
OPTIONS (CBR_DB_FIRST)
Best Practices
- To ensure that DB first searches run efficiently, set database criteria on indexed properties. DB first searches require database indexes on at least one property in the WHERE clause for good performance. Otherwise, queries will perform inefficiently during the database only portion of the search.
- Specify limiting conditions on database criteria to achieve relatively small hit counts. DB first searches are most effective when the database hit count is relatively small. For example, if the indexed database property criteria restricts the database hits to less than the client search pagesize of say 200, and there are expected to be more than a few tens of thousand CBR hits, then a DB first search can help the search performance.
- If the CBR hits are spread across many collections, such that
the total CBR hits divided by the number of collections is too small,
the dynamic count estimate might not be as accurate. The following
table presents the minimum threshold that is viable for a good estimate when
multiple collections are used. When the total CBR hits exceed the
threshold in this table by a sufficient amount, the dynamic count
estimate works best.
Number of Collections Recommended Minimum Threshold 1 600 10 6 K 20 12 K 50 30 K 100 60 K 200 120 K 500 300 K
Dynamic Switching: Limitations
While dynamic switching mode protects against inefficient performance due to too many full-text hits, there are limitations in running in this mode:
- For some searches, the dynamic count estimate might not be highly accurate, and, in general, is only an approximation. The dynamic count estimate works best to protect against a long running CBR-first search when there are at least 200 K items in the collection, and the CBR hit count exceeds the threshold sufficiently. For example, with a single collection and a threshold of 10,000, if there are 20,000 or more CBR hits, dynamic switching will likely choose DB first. But if there are only 15,000 CBR hits, there might be some searches that remain CBR first due to the possible inaccuracy of the estimate.
- To ensure a quick dynamic count estimation time, counting will not continue past 12 collections if the CBR hits are not occurring uniformly across all collections. The dynamic count estimate works best for CBR hits that are occurring uniformly in each collection as when searching on a common word that can occur in most documents. Contact IBM Support for possible solutions to override the default behavior.
- DB first searches do not work for searches that use the following
conditions:
- An outer join with CBR (
ContentSearch
class) - Property conditions OR'd with the CBR Contains condition
- Outer joins present in a multiple join query if the CBR inner join is not the last one on the right
- Content search property conditions (such as rank > 100)
- Distinct query with rank in the select list
- Custom search functions
Searches that use these conditions must be run as CBR first. If the CBRQueryOptimization property is set to
DYNAMIC_SWITCHING
, a search that uses any of these conditions are run as CBR first. However, if the search overrides the CBRQueryOptimization property with the SQL optionCBR_DB_FIRST
orCBR_DYNAMIC_THRESHOLD
, then the search results in an error. - An outer join with CBR (
- In dynamic switching mode, ordering by rank is not possible for
database first searches.
If you want results to always return in rank order, then set the CBRQueryRankOverride property to
DISABLED
(0). This setting forces CBR-first only searches for queries that combine a CBR search and a DB search. However, disabling rank override prevents optimized execution of these types of queries.To maintain optimized execution, set the CBRQueryRankOverride property to
ENABLED
(1) orREQUIRED
(2), and the server will dynamically switch between CBR first and DB first. When the property is set toENABLED
, CBR-first searches are returned in rank order, but DB-first searches will not. When the property is set toREQUIRED
, DB-first and CBR-first searches are not returned in rank order.A search that requests rank order, and that overrides the CBRQueryOptimization property with the SQL option of
CBR_DB_FIRST
, results in an error. - A non-merged scope search that requests rank order, and that uses
a SQL override of
CBR_DYNAMIC_THRESHOLD
, runs as follows. When the CBRQueryRankOverride property is set toDISABLED
(0) orENABLED
(1), CBR-first searches are returned in rank order, but DB-first searches will not. When the property is set toREQUIRED
(2), DB-first and CBR-first searches are not returned in rank order. - Dynamic switching is not supported for merged scope searches that
are ordered by rank. In order for the server to return results in
rank order, it must execute CBR first only.
However, you can allow dynamic switching for merged scope searches ordered by rank if you suppress the rank ordering. Set the CBRQueryRankOverride property to
REQUIRED
(2) in one or more object stores. Rank order is removed from each object store's search. (Dynamic switching can be set with the CBRQueryOptimization property or with the SQL override option ofCBR_DYNAMIC_THRESHOLD
).The tradeoff of operating in dynamic switching mode versus returning results in rank order exists because merged scope searches cannot return combined results in a requested order unless the queries run in each of the object stores use the same ordering. Merged and ordered results cannot be returned if one object store removes rank order and another object store does not. With dynamic switching enabled, each object store's CBRQueryRankOverride setting might be different. Therefore, merged scope searches that are ordered by rank cannot dynamically switch, unless the CBRQueryRankOverride property is set to
REQUIRED
in one or more object stores.
Dynamic Switching: Unsupported Searches
Some searches that combine both full text and relational query conditions cannot be optimized. The following table lists the searches that cannot be optimized, and states how the server responds to an unsupported search, with or without an SQL query override of the CBRQueryOptimization setting on the object store. As indicated by the dash ( - ), the CBRQueryOptimization setting is not applicable when an SQL query override is used.
Unsupported searches | CBRQueryOptimization setting | SQL query option override | Server behavior |
---|---|---|---|
Uses outer join with CBR (ContentSearch class) | DYNAMIC_SWITCHING | none | CBR first only |
- | CBR_DB_FIRST or CBR_DYNAMIC_THRESHOLD | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_OUTER_JOIN | |
Uses property conditions OR'd with the CBR Contains condition | DYNAMIC_SWITCHING | none | CBR first only |
- | CBR_DB_FIRST or CBR_DYNAMIC_THRESHOLD | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_OUTER_JOIN | |
Uses content property conditions | DYNAMIC_SWITCHING | none | CBR first only |
- | CBR_DB_FIRST or CBR_DYNAMIC_THRESHOLD | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_PROPERTY_CONDITIONS | |
Uses multiple join with an outer join and a content search join that is not the last join on the right | DYNAMIC_SWITCHING | none | CBR first only |
- | CBR_DB_FIRST or CBR_DYNAMIC_THRESHOLD | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_MULTIPLE_JOIN | |
Uses distinct with rank in the select list, and rank override is not set to REQUIRED (2) in the object store | DYNAMIC_SWITCHING | none | CBR first only |
- | CBR_DB_FIRST or CBR_DYNAMIC_THRESHOLD | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_SELECT_DISTINCT | |
Uses distinct with rank in the select list, and rank override is set to REQUIRED (2) in the object store | DYNAMIC_SWITCHING or NONE | none | Throws error: RETRIEVE_CANT_SELECT_DISTINCT_RANK_WITH_OVERRIDE_REQUIRED |
Uses order by rank, and rank override is set to DISABLED (0) | DYNAMIC_SWITCHING | none | CBR first only |
Uses order by rank, and rank override is set to ENABLED (1) | DYNAMIC_SWITCHING | none | CBR-first searches are returned in rank order but DB-first searches are not. |
Uses order by rank, and rank override is set to REQUIRED (2) | DYNAMIC_SWITCHING | none | CBR-first and DB-first searches are not returned in rank order. |
Uses order by rank, and rank override is set to any value | - | CBR_DB_FIRST | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_RANK_ORDER |
Uses order by rank, and rank override is set to DISABLED (0) or ENABLED (1) | - | CBR_DYNAMIC_THRESHOLD | CBR-first searches are returned in rank order but DB-first searches are not. |
Uses order by rank, and rank override is set to REQUIRED (2) | - | CBR_DYNAMIC_THRESHOLD | CBR-first and DB-first searches are not returned in rank order. |
Merged scope search: uses order by rank with rank override not set to REQUIRED (2). | DYNAMIC_SWITCHING in any object store or NONE in all object stores | none | CBR first only, in rank order |
- | CBR_DB_FIRST | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_RANK_ORDER | |
- | CBR_DYNAMIC_THRESHOLD | Throws error: RETRIEVE_OPTIMIZED_SEARCH_INVALID_MERGED_SCOPE_RANK_ORDER |
This topic is shared by ICS, Filenet 5.5.10. As of: 2023-05-10