DB2 10.5 for Linux, UNIX, and Windows

Sample: Creating N-gram and morphological indexes for plain text

About this task

Use the following instructions to set up and synchronize DB2® Text Search indexes for morphological and N-gram indexing in the SAMPLE database, and then search for Chinese words to compare the two segmentation methods.

Procedure

  1. Create two tables, one for morphological indexing and one for N-gram indexing. Each table has columns for the ISBN, book name, author, story, and the year the book was published.
    db2 "CREATE TABLE morphobooks (
    isbn VARCHAR(18) not null PRIMARY KEY, 
    bookname VARCHAR(30), 
    author VARCHAR(30), 
    story blob(1G), 
    year integer
    )" 
    
    db2 "CREATE TABLE ngrambooks (
    isbn VARCHAR(18) not null PRIMARY KEY, 
    bookname VARCHAR(30), 
    author VARCHAR(30), 
    story blob(1G), 
    year integer
    )" 
  2. Issue the CREATE INDEX command to create a text search index on the STORY column of the MORPHOBOOKS table. The name of the text search index is MORPHOINDEX.
    db2ts " CREATE INDEX db2ts.morphoindex FOR TEXT 
    ON morphobooks (story) LANGUAGE zh_TW 
    INDEX CONFIGURATION (CJKSEGMENTATION 'morphological') 
    CONNECT TO sample";
  3. Issue the CREATE INDEX command to create a text search index on the STORY column of the NGRAMBOOKS table. The name of the text search index is NGRAMINDEX.
    db2ts " CREATE INDEX db2ts.ngramindex FOR TEXT 
    ON ngrambooks (story) LANGUAGE zh_TW 
    INDEX CONFIGURATION (CJKSEGMENTATION 'ngram') 
    CONNECT TO sample";
  4. Load data into the two tables.
    db2 "import from ./data/books.del of DEL lobs from ./data/ 
    replace into morphobooks";
    
    db2 "import from ./data/books.del of DEL lobs from ./data/ 
    replace into ngrambooks";
    The books.del file contains the following entry:
    "0-13-086755-4", "book1", "Julie", "books_zh_TW1.lob.0.449/", 2004
    The Books_zh_TW1.lob large object has the following content:
    Figure 1. Content of the Books_zh_TW1.lob object
  5. Synchronize the text search indexes with data from the corresponding tables by issuing the following commands:
    db2ts "UPDATE INDEX db2ts.morphoindex FOR TEXT CONNECT TO sample"; 
    
    db2ts "UPDATE INDEX db2ts.ngramindex FOR TEXT CONNECT TO sample"; 
  6. Search for linguistically meaningful Chinese words. The search succeeds for both morphological and N-gram segmentation.
    Figure 2. Query results for meaningful Chinese words
    The output shows that morphological segmentation returns the same result as N-gram segmentation.
  7. Search for meaningless Chinese words to see the difference between morphological and N-gram segmentation.
    Figure 3. Query results for meaningless Chinese words
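    Only N-gram segmentation returns a book name. N-gram segmentation indexes overlapping character sequences regardless of word boundaries, whereas morphological segmentation indexes only the words it recognizes.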
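
The queries behind Figures 2 and 3 are not reproduced in this text. As a minimal sketch, a search against each table might be issued with the CONTAINS function, assuming a connection to the SAMPLE database is already open as in the earlier steps. The `build_query` helper, the `SEARCH_TERM` variable, and the `YOUR_SEARCH_TERM` placeholder are illustrative names, not part of the sample; replace the placeholder with the Chinese word you want to find.

```shell
#!/bin/sh
# Hypothetical placeholder: set SEARCH_TERM to the Chinese word to search for.
SEARCH_TERM="${SEARCH_TERM:-YOUR_SEARCH_TERM}"

# Illustrative helper: build a CONTAINS query on the STORY column of a table.
# $1 = table name, $2 = search term (must not contain single quotes).
build_query() {
    printf "SELECT bookname FROM %s WHERE CONTAINS(story, '%s') = 1" "$1" "$2"
}

MORPHO_QUERY=$(build_query morphobooks "$SEARCH_TERM")
NGRAM_QUERY=$(build_query ngrambooks "$SEARCH_TERM")

# Run the queries only if the db2 command line processor is on the PATH;
# otherwise print them for review.
if command -v db2 >/dev/null 2>&1; then
    db2 "$MORPHO_QUERY"
    db2 "$NGRAM_QUERY"
else
    printf '%s\n' "$MORPHO_QUERY" "$NGRAM_QUERY"
fi
```

Running the same search term against both tables makes the comparison in steps 6 and 7 direct: a term that is a recognized word should return the book name from both tables, while an arbitrary character sequence would be expected to match only in NGRAMBOOKS.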