IBM Support

[Db2] Text Search: How it works in a Japanese multi-byte environment

Question & Answer


Question

Is it possible to provide an example scenario or script that shows how Db2 Text Search works in a Japanese multi-byte environment?

Answer

Samples for creating and using an index on Chinese, Japanese, or Korean text are available separately.
The following simple test case was run on Db2 V11.5 on Red Hat Enterprise Linux 7.7.
It can also be used on AIX with an appropriate locale such as ja_JP.UTF-8.
Note: The default indexing and search mechanism for Japanese is determined by the CJKSEGMENTATION setting in the sysibmts.tsdefaults table.
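To see which default is currently in effect, the defaults table can be listed once Text Search has been enabled for the database. The following is a minimal sketch, assuming a database such as the db1 used in this test case; show_ts_defaults is a hypothetical helper name, and select * is used so that no column names have to be assumed.

```shell
#!/bin/sh
# Hypothetical helper: list the Text Search defaults for a database.
# Assumes Text Search is already enabled for the database named in $1.
show_ts_defaults() {
    db2 -v "connect to $1" &&
    db2 -v "select * from sysibmts.tsdefaults"
}
```

Calling show_ts_defaults db1 and looking for the CJKSEGMENTATION entry in the output should show the setting that applies when an index is created without an explicit index configuration.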
----------
#!/bin/sh
# Use a Japanese UTF-8 locale for the session.
export LANG=ja_JP.utf8
# Recreate the test database as UTF-8 with the Japanese territory.
db2 -v "drop db db1"
db2 -v "create db db1 using codeset utf-8 territory ja"
db2 -v "connect to db1"
db2 -v "create tablespace systoolspace in ibmcatgroup managed by automatic storage using stogroup ibmstogroup extentsize 4"
# Start Text Search and enable it for the database.
db2ts "start for text"
db2ts "enable database for text connect to db1"
# Create and populate the test table.
db2 -v "create table t1 (c1 varchar(6) NOT NULL, doc vargraphic(512), primary key(c1))"
export DB2DBDFT=DB1
db2 -v "insert into t1 values ('1','芥川龍之介1892    羅生門 ')"
db2 -v "insert into t1 values ('2','夏目漱石KOKOROこころ ')"
db2 -v "insert into t1 values ('3','太宰治1909   人間失格 ')"
# Create the text index, then populate it.
db2ts "create index idx_text for text on t1(doc) connect to db1"
db2ts "update index idx_text for text connect to db1"
# Search terms are written without nested double quotes: inside a
# double-quoted sh string such quotes are consumed by the shell and never reach db2.
# case #1: each query returns 0 rows
db2 -v "select * from t1 where contains(doc,'芥川龍之介189')=1"
db2 -v "select * from t1 where contains(doc,'夏目漱石KOKO')=1"
db2 -v "select * from t1 where contains(doc,'太宰治190')=1"

# case #2: each query returns 1 row
db2 -v "select * from t1 where contains(doc,'芥川龍之介1892')=1"
db2 -v "select * from t1 where contains(doc,'夏目漱石KOKORO')=1"
db2 -v "select * from t1 where contains(doc,'太宰治1909')=1"

# case #3: each query returns 1 row
db2 -v "select * from t1 where contains(doc,'芥川龍之介189*')=1"
db2 -v "select * from t1 where contains(doc,'夏目漱石KOKO*')=1"
db2 -v "select * from t1 where contains(doc,'太宰治190*')=1"
----------
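A note on quoting when queries like these are run from sh: a double quote nested inside a double-quoted argument closes the shell string instead of reaching db2, so the adjacent pieces are concatenated and the quote characters are dropped. The sketch below uses a hypothetical show_arg helper to print an argument exactly as a command would receive it. Whether a search term that does reach db2 with phrase quotes returns the same rows is not covered by this technote.

```shell
#!/bin/sh
# Hypothetical helper: print the single argument a command would receive.
show_arg() { printf '%s\n' "$1"; }

# Nested double quotes are consumed by the shell.
show_arg "contains(doc,'"太宰治1909"')=1"
# prints: contains(doc,'太宰治1909')=1

# Backslash-escaped double quotes are passed through literally.
show_arg "contains(doc,'\"太宰治1909\"')=1"
# prints: contains(doc,'"太宰治1909"')=1
```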
The following data is inserted into the doc column of table t1:
  ----------
  '芥川龍之介1892    羅生門 '
  '夏目漱石KOKOROこころ '
  '太宰治1909   人間失格 '
  ----------
Text Search creates the index according to the following rules:
  * Double-byte digits and ASCII characters are converted to their single-byte equivalents.
    When digits or ASCII characters appear in the middle of a sentence, they are separated by a space.
    This rule applies to Chinese, Japanese, and Korean (CJK) documents.
  * Japanese double-byte Katakana and single-byte Katakana are indexed as they are,
    but a select statement treats both forms as the same.
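To make the first rule concrete, the following sketch folds full-width digits to their single-byte forms with sed. It is not how Db2 implements the conversion; it only illustrates the kind of width folding the rule describes, and it is limited to digits for brevity. The function name to_halfwidth_digits is hypothetical.

```shell
#!/bin/sh
# Illustration only: fold full-width digits (U+FF10 to U+FF19) to ASCII digits.
# Each sed expression is a plain literal substitution; the script file
# itself is assumed to be saved as UTF-8.
to_halfwidth_digits() {
    printf '%s\n' "$1" | sed \
        -e 's/０/0/g' -e 's/１/1/g' -e 's/２/2/g' -e 's/３/3/g' -e 's/４/4/g' \
        -e 's/５/5/g' -e 's/６/6/g' -e 's/７/7/g' -e 's/８/8/g' -e 's/９/9/g'
}

to_halfwidth_digits '太宰治１９０９'
# prints: 太宰治1909
```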
Based on these rules, the index idx_text contains the following data:
  ----------
  '芥川龍之介1892    羅生門 '
  '夏目漱石KOKORO   こころ '
  '太宰治1909   人間失格 '
  ----------
The select statements return the following results:
  case #1: each select returns 0 rows because the search term does not match
  case #2: each select returns 1 row because the search term matches exactly
  case #3: each select returns 1 row because the wildcard '*' matches the remaining characters
As these results show, search terms that contain digits or ASCII characters need particular attention in a Japanese multi-byte environment.
The behavior shown by the script is as designed and expected.
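The three outcomes are consistent with a run of digits or ASCII characters being matched as a whole unit unless a wildcard is given. That reading is an inference from the results, not a documented description of the index. Under that assumption, the behavior can be mimicked with sh pattern matching, where match_unit is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical model: $1 is an indexed digit/ASCII run, $2 is the query term.
# Leaving $2 unquoted in the case pattern lets a trailing * act as a wildcard.
match_unit() {
    case $1 in
        $2) echo "match" ;;
        *)  echo "no match" ;;
    esac
}

match_unit 1909 190      # case #1: prints "no match"
match_unit 1909 1909     # case #2: prints "match"
match_unit 1909 '190*'   # case #3: prints "match"
```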
Note 1:
In the script, the Text Search index is created with the following statement:
  db2ts "create index idx_text for text on t1(doc) connect to db1"
If this line is replaced with either of the following statements, the results are exactly the same:
  db2ts "create index idx_text for text on t1(doc) language ja_JP index configuration (cjksegmentation 'ngram') connect to db1"
  db2ts "create index idx_text for text on t1(doc) language AUTO index configuration (cjksegmentation 'ngram') connect to db1"
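To compare the variants without editing the script each time, the index can be dropped and recreated with different options. The following is a sketch under stated assumptions: recreate_index is a hypothetical helper, the table and index names are the ones used in this technote, and the drop index ... for text form of the db2ts command is assumed to be available in the installed release.

```shell
#!/bin/sh
# Hypothetical helper: rebuild idx_text with the options given in $1.
# $1 is the text placed between the column list and "connect to";
# pass an empty string to use the defaults.
recreate_index() {
    # The drop may fail if the index does not exist yet; that is ignored.
    db2ts "drop index idx_text for text connect to db1"
    db2ts "create index idx_text for text on t1(doc) $1 connect to db1" &&
    db2ts "update index idx_text for text connect to db1"
}

# Example calls (not run here):
#   recreate_index ""
#   recreate_index "language ja_JP index configuration (cjksegmentation 'ngram')"
#   recreate_index "language AUTO index configuration (cjksegmentation 'ngram')"
```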
Note 2:
This behavior might change without notice in the future. You can confirm whether this technote is still valid by running the steps above. To propose a design change for a future release, contact your Sales Rep or open a Request For Enhancement (RFE) ticket at https://www.ibm.com/developerworks/rfe/
For help with opening a ticket, see the article "How can I submit a suggestion, Document Change Request (DCR) or Request For Enhancement (RFE) for DB2 LUW?":
http://www-01.ibm.com/support/docview.wss?uid=swg21987419
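When re-running the steps to check whether the behavior still holds, a small helper can compare each query's row count with the count stated in this technote. This is a minimal sketch, not part of the original script: expect_rows is a hypothetical name, and it relies on an existing or implicit connection to db1 (the script sets DB2DBDFT=DB1) and on the t1 table and idx_text index created above.

```shell
#!/bin/sh
# Hypothetical helper: compare the row count for a search term with an
# expected value. db2 -x suppresses column headings; whitespace around
# the count is stripped before the comparison.
expect_rows() {   # $1: expected row count, $2: search term
    n=$(db2 -x "select count(*) from t1 where contains(doc,'$2')=1" | tr -d '[:space:]')
    if [ "$n" = "$1" ]; then
        echo "ok: $2"
    else
        echo "unexpected ($n rows, wanted $1): $2"
    fi
}

# Example calls matching cases #1 to #3 (not run here):
#   expect_rows 0 '太宰治190'
#   expect_rows 1 '太宰治1909'
#   expect_rows 1 '太宰治190*'
```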

Document Information

Modified date:
03 June 2020

UID

ibm16213691