You can configure the JDBC database crawler to join multiple
structured tables that have the same key fields.
About this task
When you configure crawler properties for a JDBC database crawler, you can specify
a plug-in for crawling multiple structured tables that are related
to each other through key fields. Without this plug-in, rows in a
database table are treated as individual documents, and the values
of the database columns are searchable as individual fields. With
this plug-in, rows from multiple tables in a relational database that
have the same key fields are joined and treated as a single document.
The crawler adds the data that it retrieves from the joined tables to
the metadata for the original row of a database table. For example,
if a CUSTOMERS table and an ORDERS table share an ID key field, each
crawled CUSTOMERS row can include the matching ORDERS values as extra
metadata fields. When a user searches the database, the joined data
appears as additional fields when the document is displayed in the
search results.
- Data types that cannot be crawled
- The crawler cannot crawl fields in the joined tables that contain
the following binary data types:
  - BLOB
  - CHARACTER FOR BIT DATA
  - VARCHAR FOR BIT DATA
  - LONG VARCHAR FOR BIT DATA
- Limitations on the scope of the crawl space
- The tables to be joined must be in the same relational database.
You cannot join tables across databases.
- If a table in a database is configured to be joined with other
tables, that configuration applies to all crawlers in the collection
that are enabled to use the plug-in. However, you can create multiple
collections and configure separate crawlers to crawl different root
tables and join different tables.
- Restrictions on the use of other plug-ins
- If you configure the crawler to use the plug-in for crawling multiple
structured tables, you cannot associate another plug-in with the crawler.
For example, you cannot specify a custom plug-in for applying business
and security rules. You cannot associate more than one plug-in with
a crawler.
- Restrictions on number of tables, rows, fields, and keys
- The maximum number of joined tables per database is five, and
the sum of the rows in those tables must be less than 1,000,000. The
maximum number of fields that can be read from a table is 10. Tables
are joined by a single key pair, which means that you cannot join
tables by using multiple keys.
- Ensuring that changes in joined tables are crawled
- If the rows in a root table do not change between crawls, and
the crawler is not configured to do a full crawl, the crawler ignores
the unchanged rows. If rows in a table that is joined to the root
table change, even though the root table does not, you need to do
one of the following actions to ensure that the changes are detected
and crawled:
- Ensure that the root table in the target database has a timestamp
field that is updated when a row in the root table changes or when
rows in any of the joined subsidiary tables change. When you set up
the JDBC database crawler, specify this timestamp field as the field
that the crawler uses to determine whether changes in the tables
occurred.
- Specify that the crawler is to do a full crawl when you configure
the crawler schedule. This option ensures that all of the tables are
crawled each time regardless of whether any changes occurred.
Procedure
To set up the JDBC database crawler
to crawl multiple structured tables:
- In the administration console, create a JDBC database crawler. In the
crawler configuration, select only the parent (root) table to be crawled.
- Log in as the default Watson Content Analytics administrator
on the crawler server.
- Copy the ES_INSTALL_ROOT/default_config/crawler_rdb_plugin.xml file
to create the ES_NODE_ROOT/master_config/crawler_rdb_plugin.xml file.
- Edit the ES_NODE_ROOT/master_config/crawler_rdb_plugin.xml file
with a text editor that supports XML files with UTF-8 character encoding,
such as Notepad. When you edit the configuration file:
- Enter changes appropriate for the database tables that the JDBC database crawler is to crawl.
- Enter all column names of the database tables in uppercase letters.
- Save all changes in UTF-8 encoding format.
- Edit the <Server DBURL="jdbc:db2://db_server_url:50000/SAMPLE"> element
and replace jdbc:db2://db_server_url:50000/SAMPLE with
the URL of the JDBC database to be crawled.
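For example, the edited element might look like either of the following lines; the host names, ports, and database names are hypothetical placeholders:
<!-- DB2 (hypothetical host, port, and database) -->
<Server DBURL="jdbc:db2://dbhost.example.com:50000/SAMPLE">
<!-- Oracle thin driver (hypothetical host, port, and SID) -->
<Server DBURL="jdbc:oracle:thin:@dbhost.example.com:1521:ORCL">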
- If the database to be crawled is not a DB2® database, edit the <JDBCDriver>com.ibm.db2.jcc.DB2Driver</JDBCDriver> element
and replace com.ibm.db2.jcc.DB2Driver with the appropriate
JDBC driver. For example, for an Oracle database, enter oracle.jdbc.driver.OracleDriver. Be sure to specify the same driver that you specified when you
created the crawler.
- Edit the <User>username</User> element
and replace username with a user ID that has authority
to access the database to be crawled.
- Edit the <Password Encryption="True">encrypted_password</Password> element
and replace encrypted_password with an encrypted
password for the specified user ID. You can copy the encrypted
password from the ES_NODE_ROOT/master_config/col_collection_name.JDBC_crawler_name/jdbccrawler.xml file
and paste it here. If the password does not need to be encrypted,
replace Encryption="True" with Encryption="False",
and replace encrypted_password with a plain text
password.
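Taken together, a minimal sketch of the connection settings after steps 4.e through 4.g, assuming that the default file nests the driver and credential elements under the <Server> element; db2admin is a hypothetical user ID:
<Server DBURL="jdbc:db2://dbhost.example.com:50000/SAMPLE">
  <JDBCDriver>com.ibm.db2.jcc.DB2Driver</JDBCDriver>
  <!-- db2admin is a hypothetical user ID with access to the database -->
  <User>db2admin</User>
  <!-- copy the encrypted password from the jdbccrawler.xml file -->
  <Password Encryption="True">encrypted_password</Password>
  ...
</Server>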
- If you leave the <Delimiters Use="True"> element
as is, multiple terms in a column are separated by comma characters
(,) defined in the <Delimiter> element. Sets of
terms per table are separated by semicolon characters (;) defined
in the <SecondDelimiter> element. If you set <Delimiters
Use="True"> to <Delimiters Use="False">,
delimiter characters are not used and multiple metadata fields with
the same field name are added as document metadata.
- If you use the <Delimiters Use="True"> element,
edit the <Delimiter>,</Delimiter> element and
the <SecondDelimiter>;</SecondDelimiter> element
to specify the characters that you want to use as value separators.
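For example, a sketch that keeps delimiters enabled but separates values with a vertical bar and sets of values per table with a number sign, assuming that the default file nests the delimiter elements inside the <Delimiters> element:
<Delimiters Use="True">
  <!-- separates multiple terms in a column -->
  <Delimiter>|</Delimiter>
  <!-- separates sets of terms per table -->
  <SecondDelimiter>#</SecondDelimiter>
</Delimiters>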
- Edit the <RelationMap Root="DB2INST1.TABLE_0"> element
and replace DB2INST1.TABLE_0 with the name of a root
table that is to be crawled.
- Edit the <Relation Parent="DB2INST1.TABLE_0"
ParentAlias="T0" ParentKey="ID" Child="DB2INST1.TABLE_1" ChildAlias="T1"
ChildKey="ID"/> element.
- Replace Parent="DB2INST1.TABLE_0" with the name
of a table that is a parent in the relation.
- Replace ParentAlias="T0" with an alias of the
parent table. This alias must be unique and not duplicated in the crawler_rdb_plugin.xml file.
- Replace ParentKey="ID" with the name of a column
that is used as a key field in the relation.
- Replace Child="DB2INST1.TABLE_1" ChildAlias="T1" ChildKey="ID" with
information about a child table to be crawled.
This structure defines how the tables are to be joined. For example,
the following relationship map specifies that a root table named DB2INST1.TABLE_A is
to be crawled. Tables DB2INST1.TABLE_B and DB2INST1.TABLE_C are
joined under the condition DB2INST1.TABLE_A.ID=DB2INST1.TABLE_B.ID
AND DB2INST1.TABLE_B.ID=DB2INST1.TABLE_C.ID.
<RelationMap Root="DB2INST1.TABLE_A">
  <Relation Parent="DB2INST1.TABLE_A" ParentAlias="TA" ParentKey="ID"
    Child="DB2INST1.TABLE_B" ChildAlias="TB" ChildKey="ID"/>
  <Relation Parent="DB2INST1.TABLE_B" ParentAlias="TB" ParentKey="ID"
    Child="DB2INST1.TABLE_C" ChildAlias="TC" ChildKey="ID"/>
</RelationMap>
Restriction: All references to table names or field
names in the database are case-sensitive. Specify them in the same
case as defined in the database.
- Repeat step 4.h to
create <Relation> elements for all relations that
join tables from a root table.
- Edit the <Target TableAlias="T1"> element
and replace TableAlias="T1" with a ChildAlias value
that you defined in step 4.h.
- Edit the <Field Name="ID" FieldName="ID_1"
Enabling="True" Searchable="True" FieldSearchable="True" IsContent="True"/> element.
- Replace Name="ID" with the name of a column in
the documents to be crawled.
- Replace FieldName="ID_1" with the name of a metadata
field in the documents to be crawled. This value is used as the display
name for the column in the administration console and the search results.
- Replace Enabling="True" with "False" if
this column is not to be included in the document metadata.
- Replace Searchable="True" with "False" to
prevent users from searching this column with a free text query.
- Replace FieldSearchable="True" with "False" to
prevent users from searching this column by the column name.
- Replace IsContent="True" with "False" to
indicate that the column does not contain searchable content. If you
specify Searchable="True" and IsContent="True",
then the value of the column is used to detect duplicate documents.
The value also becomes part of the dynamic document summary in the
search results.
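For example, a sketch of a <Target> element for the child table with alias TB; the column names NAME and PRICE and the metadata field names are hypothetical placeholders:
<Target TableAlias="TB">
  <!-- NAME and PRICE are hypothetical columns; enter column names in uppercase -->
  <Field Name="NAME" FieldName="PRODUCT_NAME" Enabling="True"
    Searchable="True" FieldSearchable="True" IsContent="True"/>
  <Field Name="PRICE" FieldName="PRODUCT_PRICE" Enabling="True"
    Searchable="False" FieldSearchable="True" IsContent="False"/>
</Target>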
- Repeat step 4.k to
create <Field> elements for all of the columns
that are to be crawled.
- Repeat steps 4.j and 4.k to create <Target> and <Field> elements
for all of the child tables that are referenced in the relationship map
(<RelationMap>).
- Repeat steps 4.g through 4.m to create multiple
relationship maps for multiple root tables.
- Repeat steps 4.a through 4.n to configure a relationship
map for another database.
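Taken together, a sketch of the overall structure that the edited file might have; the nesting shown in your default crawler_rdb_plugin.xml file is the authority, and all table names, aliases, and columns here are hypothetical:
<Server DBURL="jdbc:db2://dbhost.example.com:50000/SAMPLE">
  <JDBCDriver>com.ibm.db2.jcc.DB2Driver</JDBCDriver>
  <User>db2admin</User>
  <Password Encryption="True">encrypted_password</Password>
  <Delimiters Use="True">
    <Delimiter>,</Delimiter>
    <SecondDelimiter>;</SecondDelimiter>
  </Delimiters>
  <RelationMap Root="DB2INST1.TABLE_A">
    <Relation Parent="DB2INST1.TABLE_A" ParentAlias="TA" ParentKey="ID"
      Child="DB2INST1.TABLE_B" ChildAlias="TB" ChildKey="ID"/>
  </RelationMap>
  <Target TableAlias="TB">
    <Field Name="NAME" FieldName="PRODUCT_NAME" Enabling="True"
      Searchable="True" FieldSearchable="True" IsContent="True"/>
  </Target>
</Server>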
- Configure the crawler to use the plug-in:
- Open the administration console, edit a collection,
and select the Crawl page.
- In the administration console, select the crawler that
you created and click .
- In the Plug-in class name field,
type the name of the plug-in for crawling multiple structured tables:
com.ibm.es.plugin.rdb.RDBPlugin
- In the Plug-in class path field,
type the fully qualified paths for the plug-in and the JDBC drivers
that are used by the plug-in. For example, the path for the JDBC
driver for a DB2 database on a Windows system might be:
C:\Program Files\IBM\es\lib\plugin_rdb.jar;C:\Program Files\IBM\SQLLIB\java\db2jcc.jar;C:\Program Files\IBM\SQLLIB\java\db2jcc_license_cu.jar
On an AIX® or Linux system, the path might be:
/opt/IBM/es/lib/plugin_rdb.jar:/opt/IBM/SQLLIB/java/db2jcc.jar:/opt/IBM/SQLLIB/java/db2jcc_license_cu.jar
- Create index fields for the child table:
- In the Parse and Index pane, click .
- Click Create Index Field and
create an index field that you defined in the crawler_rdb_plugin.xml file
as part of step 4.k.
Repeat this step to create other index fields.
- To deploy the crawler_rdb_plugin.xml file
to the system configuration, restart the Watson Content Analytics system:
esadmin system stop
esadmin system start