[Db2] Text Search: How it works on Japanese multi-byte environment

Question & Answer

Question

Can you provide an example scenario or script that shows how Db2 Text Search works in a Japanese multi-byte environment?

Answer

See the following samples for creating and using an index on Chinese, Japanese, or Korean text:

Sample: Creating N-gram and morphological indexes for plain text

Sample: Creating N-gram and morphological indexes for rich text and proprietary formats

Here is a simple test case on Db2 V11.5 with Red Hat Enterprise Linux 7.7.

It can also be used on AIX with an appropriate locale such as ja_JP.UTF-8.

Note: The default indexing and search mechanism for Japanese is determined by the CJKSEGMENTATION setting in the sysibmts.tsdefaults table.
----------
#!/bin/sh

export LANG=ja_JP.utf8

db2 -v "drop db db1"
db2 -v "create db db1 using codeset utf-8 territory ja"
db2 -v "connect to db1"
db2 -v "create tablespace systoolspace in ibmcatgroup managed by automatic storage using stogroup ibmstogroup extentsize 4"
db2ts "start for text"
db2ts "enable database for text connect to db1"
db2 -v "create table t1 (c1 varchar(6) NOT NULL, doc vargraphic(512), primary key(c1))"
export DB2DBDFT=DB1

db2 -v "insert into t1 values ('1','芥川龍之介１８９２羅生門 ')"
db2 -v "insert into t1 values ('2','夏目漱石ＫＯＫＯＲＯこころ ')"
db2 -v "insert into t1 values ('3','太宰治１９０９人間失格 ')"

db2ts "create index idx_text for text on t1(doc) connect to db1"
db2ts "update index idx_text for text connect to db1"

# Case 1: each query returns 0 rows
db2 -v "select * from t1 where contains(doc,'芥川龍之介１８９')=1"
db2 -v "select * from t1 where contains(doc,'夏目漱石ＫＯＫＯ')=1"
db2 -v "select * from t1 where contains(doc,'太宰治１９０')=1"

# Case 2: each query returns 1 row
db2 -v "select * from t1 where contains(doc,'芥川龍之介１８９２')=1"
db2 -v "select * from t1 where contains(doc,'夏目漱石ＫＯＫＯＲＯ')=1"
db2 -v "select * from t1 where contains(doc,'太宰治１９０９')=1"

# Case 3: each query returns 1 row
db2 -v "select * from t1 where contains(doc,'芥川龍之介１８９*')=1"
db2 -v "select * from t1 where contains(doc,'夏目漱石ＫＯＫＯ*')=1"
db2 -v "select * from t1 where contains(doc,'太宰治１９０*')=1"
----------

The script inserts the following data into the doc column of table t1:
----------
'芥川龍之介１８９２羅生門 '
'夏目漱石ＫＯＫＯＲＯこころ '
'太宰治１９０９人間失格 '
----------

Text Search creates the index according to the following rules:
* Double-byte (full-width) digits and ASCII characters are converted to their single-byte equivalents. When digits or ASCII characters appear in the middle of a sentence, they are separated by a space. This rule applies to Chinese, Japanese, and Korean (CJK) documents.
* Japanese double-byte Katakana and single-byte Katakana are indexed as they are, but a SELECT statement treats both forms as the same characters.

Based on these rules, the following data is created in the index idx_text:
----------
'芥川龍之介1892    羅生門 '
'夏目漱石KOKORO   こころ '
'太宰治1909   人間失格 '
----------
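
The full-width to single-byte conversion described in the rules above can be imitated outside Db2 with a few sed substitutions. This is only a minimal sketch for previewing what a string might look like after the conversion. The function name to_halfwidth is hypothetical, it maps only the digits and the letters K, O, and R that occur in the sample data, and it does not reproduce the spacing that Text Search adds or any of its internal processing.

```sh
#!/bin/sh
# Illustration only: NOT the Db2 Text Search implementation.
# to_halfwidth (hypothetical name) reads text on stdin and replaces
# full-width digits and the full-width letters K, O, R with ASCII.
to_halfwidth() {
    # One substitution per character; each full-width character is matched literally.
    sed -e 's/０/0/g' -e 's/１/1/g' -e 's/２/2/g' -e 's/３/3/g' -e 's/４/4/g' \
        -e 's/５/5/g' -e 's/６/6/g' -e 's/７/7/g' -e 's/８/8/g' -e 's/９/9/g' \
        -e 's/Ｋ/K/g' -e 's/Ｏ/O/g' -e 's/Ｒ/R/g'
}

printf '%s\n' '芥川龍之介１８９２羅生門' | to_halfwidth      # prints 芥川龍之介1892羅生門
printf '%s\n' '夏目漱石ＫＯＫＯＲＯこころ' | to_halfwidth    # prints 夏目漱石KOKOROこころ
```

Extending the sketch to the full alphabet only requires more substitutions of the same form.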

The SELECT statements return the following results:
case #1: each SELECT returns 0 rows because the partial digit or ASCII string (for example, 189) does not match the indexed string (1892)
case #2: each SELECT returns 1 row because the search term matches exactly
case #3: each SELECT returns 1 row because the wildcard '*' matches the remaining characters

As these results show, pay attention to digits and ASCII characters when searching in a Japanese multi-byte environment.
The script above works as designed and expected.
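
One way to read the three cases is that the run of digits or ASCII characters is indexed as a single whole term, so a search term must either equal it or end in '*' to match its beginning. That reading is an assumption drawn from the results above, not a documented description of the Text Search engine. Under that assumption, a minimal shell sketch (term_matches is a hypothetical helper) reproduces the three outcomes:

```sh
#!/bin/sh
# Illustration only, under the ASSUMPTION that a digit/ASCII run is
# indexed as one whole term. This is not Db2 Text Search code.
# Usage: term_matches INDEXED_TERM SEARCH_TERM
term_matches() {
    case "$2" in
        *'*')
            # Trailing wildcard: compare the part before '*' as a prefix.
            prefix=${2%'*'}
            case "$1" in
                "$prefix"*) return 0 ;;
            esac
            return 1
            ;;
        *)
            # No wildcard: the whole term must be identical.
            [ "$1" = "$2" ]
            ;;
    esac
}

# The indexed term 1892 checked against the three kinds of search term.
for term in '189' '1892' '189*'; do
    if term_matches '1892' "$term"; then
        echo "$term: match"
    else
        echo "$term: no match"
    fi
done
```

The loop prints "189: no match", "1892: match", and "189*: match", which corresponds to case #1, case #2, and case #3.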

Note1:
In the script, the Text Search index is created with the following statement:
db2ts "create index idx_text for text on t1(doc) connect to db1"
If you replace that line with either of the following statements, the result is exactly the same:
db2ts "create index idx_text for text on t1(doc) language ja_JP index configuration (cjksegmentation 'ngram') connect to db1"
db2ts "create index idx_text for text on t1(doc) language AUTO index configuration (cjksegmentation 'ngram') connect to db1"

Note2:

This behavior might change without notice in the future. You can confirm whether this technote is still valid by running the steps above. To propose a design change for a future release, contact your sales representative or open a Request For Enhancement (RFE) at https://www.ibm.com/developerworks/rfe/

For guidance on opening a request, see the article "How can I submit a suggestion, Document Change Request (DCR) or Request For Enhancement (RFE) for DB2 LUW?":
http://www-01.ibm.com/support/docview.wss?uid=swg21987419

Related Information

Special characters in CJK languages

Db2 Text Search
