IBM Support

Restriction on Fuzzy text search in CJK languages on IBM Text Search

Question & Answer


Question

Why does a Fuzzy text search in a CJK language return no results on IBM Text Search when a Basic text search returns results? Is there a restriction?

Cause

In CJK languages, the words in a sentence are written together without a delimiting character such as whitespace. To divide a sentence into units for text indexing, IBM Text Search applies N-gram segmentation or morphological analysis when it builds the text index and when it runs a Basic text search, but not when it runs a Fuzzy text search.

Answer

During a Fuzzy text search in a CJK language environment, the whole search term is used as a single search unit, without segmentation. As a result, Fuzzy text search results are often unexpected when the search term contains more characters than the segmentation units in the text index.

For example, with a two-character N-gram text index (the biGram setting), a document containing the Japanese phrase "$Kyo$U$Ha$Yo$I$Hi$Da" is segmented into these indexing units: ($Kyo$U), ($U$Ha), ($Ha$Yo), ($Yo$I), ($I$Hi), ($Hi$Da) and ($Da). On this index, the fuzzy text search "$Yo$I$Hi~0.5" does not match the document, because the search term "$Yo$I$Hi" is used as a single search unit that is not similar enough to any of the index units.
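The bigram case can be reproduced with a minimal Python sketch that uses the actual Japanese characters the placeholders stand for. The names `ngram_units`, `edit_distance`, `similarity` and `fuzzy_hits` are hypothetical. The similarity formula (one minus the edit distance divided by the shorter length, compared with a strictly-greater-than threshold) is an assumption chosen purely for illustration; this document does not state how IBM Text Search scores fuzzy matches.

```python
def ngram_units(text, n=2):
    """Overlapping n-character units, including the short trailing unit."""
    return [text[i:i + n] for i in range(len(text))]


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(term, unit):
    # ASSUMPTION: illustrative scoring only, not IBM Text Search's formula.
    return 1 - edit_distance(term, unit) / min(len(term), len(unit))


def fuzzy_hits(term, units, threshold):
    # The whole term is compared against each index unit, unsegmented.
    return [u for u in units if similarity(term, u) > threshold]


units = ngram_units("今日は良い日だ")
print(units)                          # the seven bigram index units
print(fuzzy_hits("良い日", units, 0.5))  # three-character term
print(fuzzy_hits("良い", units, 0.5))    # two-character term
```

Under these assumptions the three-character term scores at most 0.5 against any unit and yields no hit, while the two-character term matches its own index unit exactly.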

With a morphological text index, "$Kyo$U$Ha$Yo$I$Hi$Da" is segmented into these indexing units: ($Kyo$U-$Kyo$U), ($Ha-$Ha), ($Yo$I-$Yo$I), ($Hi-$Hi) and ($Da-$Da). Here too, the fuzzy text search "$Yo$I$Hi~0.5" does not match the document.
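The morphological case can be sketched in the same spirit. Because real morphological analysis needs a language-specific analyzer, the units below are simply hard-coded from the example in this document; `MORPH_UNITS` and `is_index_unit` are hypothetical names, and the check covers only whether a term coincides with an index unit, not how IBM Text Search scores similarity.

```python
# Index units copied from the morphological example in this document
# (surface forms only); no real analyzer is invoked here.
MORPH_UNITS = ["今日", "は", "良い", "日", "だ"]


def is_index_unit(term, units):
    """True when the unsegmented search term coincides with an index unit."""
    return term in units


for term in ("良い日", "良い", "日"):
    print(term, is_index_unit(term, MORPH_UNITS))
```

The term "良い日" spans two units ("良い" and "日") and therefore coincides with neither, whereas each of the smaller words is itself an index unit.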

Note: "$Kyo$U$Ha$Yo$I$Hi$Da" stands for the Japanese sentence "今日は良い日だ", with one placeholder per character. In Shift JIS encoding (IBM-932 code page) it is 0x8da1.0x93fa.0x82cd.0x97c7.0x82a2.0x93fa.0x82be.
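The byte values in the note can be checked with a short Python sketch. It assumes Python's built-in `shift_jis` codec, which is not IBM-932 itself but agrees with it for these seven characters; the variable names are illustrative.

```python
phrase = "今日は良い日だ"

# Every character in this phrase is a two-byte sequence in Shift JIS.
encoded = phrase.encode("shift_jis")
codes = ".".join(
    "0x" + encoded[i:i + 2].hex() for i in range(0, len(encoded), 2)
)
print(codes)
```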

To achieve an effective fuzzy search in the CJK language environment:

  • On an N-gram text index, the search term should contain no more characters than the N-gram size applied in the index. A two-character N-gram (biGram) is used by default.
  • On a morphological text index, the search term should be the smallest morphologically parsed word.
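The first guideline lends itself to an application-side pre-check before a fuzzy query is issued. The following is a minimal sketch, not part of IBM Text Search: `NGRAM_SIZE` and `fits_ngram_index` are hypothetical names, and the N-gram size must be set to whatever the index actually uses.

```python
NGRAM_SIZE = 2  # biGram, the default named in this document


def fits_ngram_index(term, ngram_size=NGRAM_SIZE):
    """True when the term is no longer than one N-gram index unit."""
    # len() counts code points, which equals the character count for
    # the CJK characters used in the examples here.
    return 0 < len(term) <= ngram_size


for term in ("良い", "良い日"):
    print(term, fits_ngram_index(term))
```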

Product: Db2 for Linux, UNIX and Windows
Component: Extenders - Text
Platform: AIX, HP-UX, Linux, Solaris, Windows
Version: 9.7, 10.1, 10.5

Document Information

Modified date:
16 June 2018

UID

swg21986910