You can import data into Classification Workbench from XML files that conform to the Content Classification XML schema, such as XML files that contain analysis data.
An advantage of using XML format is that there are many XML tools available that can help you easily create and manipulate XML content. In addition, many enterprise applications have XML export capabilities.
Before you import, all XML files must be in the same format and stored in a single directory, not in subdirectories.
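As a pre-import check, the single-directory requirement can be verified with a short script. The following is a minimal sketch and not part of Classification Workbench; the function name `check_import_dir` is hypothetical.

```python
from pathlib import Path


def check_import_dir(directory):
    """Split the XML files under `directory` into importable and nested ones.

    Returns (top_level, nested): files directly in the directory, and files
    found in subdirectories, which would need to be moved before the import.
    """
    root = Path(directory)
    # XML files stored directly in the import directory.
    top_level = sorted(p for p in root.glob("*.xml") if p.is_file())
    # XML files anywhere below the directory, excluding the top level.
    nested = sorted(
        p for p in root.rglob("*.xml") if p.is_file() and p.parent != root
    )
    return top_level, nested
```

If the second list is not empty, move those files into the top-level directory before you start the import.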
If you are using the Import wizard to import XML files that contain analysis data (for example, matches and scores), the data must be stored in fields that are recognized by Classification Workbench. If your XML files contain analysis data in different fields, you can apply the correct field names when you import the XML. No changes to analysis data fields are required if you are importing XML from IBM® Content Classification applications, for example, XML that was generated by the Content Extractor from an IBM FileNet® Content Manager repository.
Alternatively, you can store scores and matches in a single ICM_Match field as follows:
<ICM_Match kbname="knowledge base name" score="0.n">category name</ICM_Match>
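To illustrate how the three pieces of information fit into one ICM_Match field, the following sketch reads a single well-formed element with the Python standard library. The function name `parse_match` and the returned dictionary layout are assumptions for illustration only; they are not part of the product.

```python
import xml.etree.ElementTree as ET


def parse_match(xml_text):
    """Extract knowledge base name, score, and category from one ICM_Match element."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "ICM_Match":
        raise ValueError(f"expected ICM_Match, got {elem.tag}")
    return {
        "kbname": elem.get("kbname"),
        # The score is stored as a decimal string such as "0.87".
        "score": float(elem.get("score")),
        # The element text is the category name.
        "category": (elem.text or "").strip(),
    }
```

For example, `parse_match('<ICM_Match kbname="kb1" score="0.87">Inventory</ICM_Match>')` yields the knowledge base name `kb1`, the score `0.87`, and the category `Inventory`.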
Another field called ICM_DP_All_Changed_Names contains only the names of fields that were changed or deleted by the decision plan.
If your content set includes analysis data, see this topic for annotated examples of XML code in the required format: Sample XML output from saved analysis data. The <ICM_Event> tag that is included in the sample code must not be included in your custom XML code.
Follow these guidelines to create a content set in XML format that you can import into Classification Workbench. Content items can be saved in individual XML files, or multiple content items can be contained in one or more XML files.
<?xml version='1.0' encoding='UTF-8' ?>
<ICM_NVP key="Body">sample body text</ICM_NVP>
<ICM_NVP key="Title">sample title</ICM_NVP>
<ICM_NVP key="_Categories">CategoryA</ICM_NVP>
<ICM_NVP key="_Categories">CategoryB</ICM_NVP>
The following example of an XML file contains three content items:
<?xml version="1.0" encoding="UTF-8" ?>
<All_Messages>
  <ICMSchemaVersion>2.0</ICMSchemaVersion>
  <Corpus_Item>
    <ICM_NVP key="message_body">Do you have my item in stock?</ICM_NVP>
    <ICM_NVP key="_Categories">Inventory</ICM_NVP>
  </Corpus_Item>
  <Corpus_Item>
    <ICM_NVP key="message_body">Does my purchase include batteries?</ICM_NVP>
    <ICM_NVP key="_Categories">Batteries</ICM_NVP>
  </Corpus_Item>
  <Corpus_Item>
    <ICM_NVP key="message_body">How much does it cost to wrap a gift?</ICM_NVP>
    <ICM_NVP key="_Categories">Gift Orders</ICM_NVP>
  </Corpus_Item>
</All_Messages>
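A file with this layout can also be generated by a script, for example when the content comes from another application. The following is a minimal sketch that uses the Python standard library and reuses the element names from the example above; the function names and the `message_body` field are illustrative assumptions, and the output has not been validated against the Content Classification XML schema.

```python
import io
import xml.etree.ElementTree as ET


def build_content_set(items, schema_version="2.0"):
    """Build an All_Messages tree with one Corpus_Item per (body, category) pair."""
    root = ET.Element("All_Messages")
    ET.SubElement(root, "ICMSchemaVersion").text = schema_version
    for body, category in items:
        item = ET.SubElement(root, "Corpus_Item")
        # ElementTree escapes characters such as & and < in the text.
        ET.SubElement(item, "ICM_NVP", key="message_body").text = body
        ET.SubElement(item, "ICM_NVP", key="_Categories").text = category
    return ET.ElementTree(root)


def content_set_to_bytes(items):
    """Serialize the content set as UTF-8 bytes with an XML declaration."""
    buf = io.BytesIO()
    build_content_set(items).write(buf, encoding="UTF-8", xml_declaration=True)
    return buf.getvalue()
```

Write the returned bytes to a file with an .xml extension in the directory that you plan to import.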
When you import XML files, Classification Workbench scans each file to determine which fields are contained in the content set. You can increase the speed of the import by manually creating a field definition file (called catalog.xml) before you start the import.
Classification Workbench automatically generates a catalog.xml file when you export a content set in XML format. The fields in this file correspond to the fields that are displayed in the Field Definitions panel.
The catalog.xml file must be stored in the folder that contains the XML files that you plan to import, and it must have the following structure:
<?xml version='1.0' encoding='UTF-8' ?>
<_Catalog entry_count="4">
The following table describes field attributes:
| Attribute | Description | Value |
|---|---|---|
| display_name | The name of the field. | Any string |
| type | The data type of the field. For descriptions of data types, see Field properties. | string, number, or classification |
| nlp_usage | Defines how the field will be analyzed by the natural language processing engine. See More about natural language processing. | Body, DocTitle, PlainText, Sender, or Subject |
| is_viewed | Determines whether the field is displayed in the content set page. | true or false |
| is_categories | Indicates that the field is used to store categories. Set this option to true only for fields of type classification. | true or false |
| is_link | Indicates that the field is a link (that is, to start an external application). For more information, see More about natural language processing. | true or false |
| is_matches | Indicates that the field is used to store matches (that is, category names). | true or false |
| is_scores | Indicates that the field is used to store category scores. | true or false |
| is_firedRules | Indicates that the field is used to store the names of rules in the decision plan that were triggered. | true or false |
| is_changedNVPs | Indicates that the field is used to store fields that were changed or added by the decision plan and their values. | true or false |
For example, a sample field called "Message" is defined as follows:
<Entry display_name="Message" type="string" nlp_usage="Body" is_viewed="true"
is_categories="false" is_link="false">
<![CDATA[ Message ]]>
</Entry>
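Field definitions like the one above can be written by a script instead of by hand. The following sketch assembles Entry elements as strings (the standard ElementTree module does not write CDATA sections) and wraps them in a _Catalog root. It rests on several assumptions that the text above does not confirm: that entry_count equals the number of Entry elements, that the root is closed with `</_Catalog>`, and that only the six attributes shown in the sample entry are written. The function names are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def entry_xml(name, type_="string", nlp_usage="Body",
              is_viewed=True, is_categories=False, is_link=False):
    """Return one Entry element, following the sample "Message" entry."""
    if "]]>" in name:
        raise ValueError("field name cannot contain the CDATA terminator ]]>")
    attrs = [
        ("display_name", name),
        ("type", type_),
        ("nlp_usage", nlp_usage),
        # Booleans are written as lowercase true/false, as in the table.
        ("is_viewed", str(is_viewed).lower()),
        ("is_categories", str(is_categories).lower()),
        ("is_link", str(is_link).lower()),
    ]
    attr_text = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs)
    return f"<Entry {attr_text}>\n<![CDATA[ {name} ]]>\n</Entry>"


def catalog_xml(entries):
    """Wrap Entry strings in a _Catalog root (assumed structure, see above)."""
    body = "\n".join(entries)
    return (
        "<?xml version='1.0' encoding='UTF-8' ?>\n"
        f'<_Catalog entry_count="{len(entries)}">\n{body}\n</_Catalog>\n'
    )
```

Compare the output with a catalog.xml file that Classification Workbench generated during an export before you rely on it.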
See the following illustration of a sample catalog.xml file:
