Sample: Creating N-gram and morphological indexes for rich text and proprietary formats

About this task

Use the following instructions to set up and synchronize Db2® Text Search indexes for morphological and N-gram indexing in the SAMPLE database, and then compare how the two indexes handle searches for linguistically meaningful and meaningless Chinese words.

Procedure

  1. Create two tables for morphological and N-gram indexing.
    The tables contain columns k and b, where column k is the primary key, and column b will have rich text data.
    db2 "create table richtext_morpho(
    k varchar(50) not null,
    b blob (1G),
    primary key(k)
    )"
    
    db2 "create table richtext_ngram(
    k varchar(50) not null,
    b blob (1G),
    primary key(k)
    )"
    
  2. Issue the CREATE INDEX command to create a text search index on column b of table RICHTEXT_MORPHO. The name of the text search index is MORPHOINDEX.
    db2ts " CREATE INDEX db2ts.morphoindex FOR TEXT 
    ON richtext_morpho (b) LANGUAGE zh_CN FORMAT INSO 
    INDEX CONFIGURATION (CJKSEGMENTATION 'morphological') 
    CONNECT TO sample";
  3. Issue the CREATE INDEX command to create a text search index on column b of table RICHTEXT_NGRAM. The name of the text search index is NGRAMINDEX.
    db2ts " CREATE INDEX db2ts.ngramindex FOR TEXT 
    ON richtext_ngram (b) LANGUAGE zh_CN FORMAT INSO  
    INDEX CONFIGURATION (CJKSEGMENTATION 'ngram') 
    CONNECT TO sample";
  4. Load data into the two tables.
    db2 "import from ./data/cjk_richtext.del of DEL lobs from ./data/ 
    replace into richtext_morpho ";
    
    db2 "import from ./data/cjk_richtext.del of DEL lobs from ./data/ 
    replace into richtext_ngram ";
    
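    The cjk_richtext.del file has the following entries. Each entry pairs a key with a LOB location specifier, which gives the file name, the byte offset, and the length of the document to load into column b: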
    "rt_CJK.pdf","rt_CJK.pdf.0.864885/",
    "rt_CJK.pdf.doc","rt_CJK.pdf.doc.0.90112/",
    "rt_CJK.pdf.txt","rt_CJK.pdf.txt.0.37913/"
    
    The rt_CJK.pdf, rt_CJK.pdf.doc and rt_CJK.pdf.txt files all have the same content. One segment of the content in Simplified Chinese is as follows:
    "如何获得许可证密钥
    IBM Rational License Key Center 是一种许可证密钥在线提供服务,可以很方便地为您生成 Rational
     密钥。 但是必须成为您公司的 IBM Rational
     License Key Center 帐户的成员,才可以访问许可证密钥。为您下订单的人员被设置为帐户管理员,
    并会通过电子邮件向其发送用于访问 License 
    Key Center 的密码。有两种方法可以使您成为公司帐户的成员:
    方法 1 - 与为您下订单的人员联系,让其使用"帐户成员"功能将您添加为公司帐户成员。一旦成功添加,
    您将收到一封来自 License Key Center 
    的电子邮件,其中包含了您的密码和登陆说明。
    方法 2 - 除了让 License Key Center 管理员将您添加为公司 License Key Center 
    帐户的成员之外,也可以自己进行添加"
    
    Figure 1. Sample segment of content in Simplified Chinese
  5. Synchronize the text search indexes with the data from the corresponding tables by issuing the following commands:
    db2ts "UPDATE INDEX db2ts.morphoindex FOR TEXT 
    CONNECT TO sample" 
    
    db2ts "UPDATE INDEX db2ts.ngramindex FOR TEXT 
    CONNECT TO sample" 
    
  6. Search for linguistically meaningful Chinese words. The search succeeds with both morphological and N-gram segmentation.
    Figure 2. Query results for linguistically meaningful Chinese words
    The output indicates that the result from morphological segmentation is the same as the result from N-gram segmentation.
  7. Search for meaningless Chinese words to see the difference between morphological and N-gram segmentation.
    Figure 3. Query results for meaningless Chinese words
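    Only N-gram segmentation returns a book name. Morphological segmentation indexes dictionary words, so a character sequence that does not form a word finds no match, whereas N-gram segmentation indexes overlapping character sequences regardless of their meaning.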
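
The queries behind the two comparison steps are not reproduced in this sample, so the following is a minimal sketch of how they could be issued with the CONTAINS function. The helper names search_sql and compare_search are hypothetical, and both search terms are assumptions chosen from the sample content: 许可证 ("license") is a dictionary word, while 可证密 spans the boundary between the words 许可证 and 密钥 and has no meaning of its own. A prior connection to the SAMPLE database is assumed.

```shell
# Sketch only: helper names and search terms are assumptions,
# not taken from the original figures.

# Print the search statement for one table ($1) and one search term ($2).
search_sql() {
  printf "select k from %s where contains(b, '%s') = 1\n" "$1" "$2"
}

# Run the same search against both sample tables.
compare_search() {
  for t in richtext_morpho richtext_ngram; do
    sql=$(search_sql "$t" "$1")
    echo "$sql"
    # Execute only where the db2 command line processor is installed;
    # a prior "db2 connect to sample" is assumed.
    if command -v db2 >/dev/null 2>&1; then
      db2 "$sql"
    fi
  done
}

compare_search '许可证'   # dictionary word taken from the sample content
compare_search '可证密'   # characters spanning a word boundary (meaningless)
```

With the first term, both tables would be expected to return the same keys. With the second term, only the query against richtext_ngram would be expected to return a row, which matches the behavior described in step 7.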