Importing XML files that conform to the Content Classification schema

You can import data into Classification Workbench from XML files that conform to the Content Classification XML schema, such as XML files that contain analysis data.

About this task

An advantage of using XML format is that there are many XML tools available that can help you easily create and manipulate XML content. In addition, many enterprise applications have XML export capabilities.

Before you import, all XML files must be in the same format and stored in a single directory, not in subdirectories.
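
The single-directory requirement can be checked before you start the import. The following is a minimal sketch in Python, not part of Content Classification; the function name `find_import_problems` is hypothetical.

```python
from pathlib import Path


def find_import_problems(import_dir):
    """Return a list of problems that would violate the single-directory rule."""
    root = Path(import_dir)
    problems = []

    # XML files directly inside the directory are the ones to import.
    top_level = [p for p in root.iterdir()
                 if p.is_file() and p.suffix.lower() == ".xml"]
    if not top_level:
        problems.append("no XML files found directly in " + str(root))

    # Any XML file deeper in the tree is stored in a subdirectory.
    for path in sorted(root.rglob("*.xml")):
        if path.parent != root:
            problems.append("XML file in subdirectory: "
                            + str(path.relative_to(root)))
    return problems
```

An empty list means that every XML file sits directly in the import directory. The sketch does not check that the files share the same format.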

Importing XML files with analysis data

About this task

If you are using the Import wizard to import XML files that contain analysis data (for example, matches and scores), the data must be stored in fields that are recognized by Classification Workbench. If your XML files contain analysis data in different fields, you can apply the correct field names when you import the XML. No changes to analysis data fields are required if you are importing XML from IBM® Content Classification applications, for example, XML that was generated by the Content Extractor from an IBM FileNet® Content Manager repository.

Knowledge base analysis data must be contained in the following fields:

ICM_Match: Contains the match result (that is, the category name) that was returned by Content Classification for the content item.
ICM_Score: Contains the score (a value between 0 and 1) that was returned by Content Classification for the category in ICM_Match.

Alternatively, you can store scores and matches in a single ICM_Match field as follows:
<ICM_Match kbname="knowledge base name" score="0.n">category name</ICM_Match>

Restriction: If your XML files refer to more than one knowledge base, you must specify the knowledge base name.
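
To illustrate the combined form, the following Python sketch reads the knowledge base name, category, and score from each ICM_Match element of a content item. It is an illustration under stated assumptions: the sample data, the knowledge base name `SupportKB`, the function name `read_matches`, and the placement of ICM_Match directly inside the content item element are hypothetical, so check the schema file for the authoritative structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element placement is an assumption, not taken from the schema.
SAMPLE = """<Corpus_Item>
  <ICM_NVP key="Body">Do you have my item in stock?</ICM_NVP>
  <ICM_Match kbname="SupportKB" score="0.82">Inventory</ICM_Match>
  <ICM_Match kbname="SupportKB" score="0.11">Batteries</ICM_Match>
</Corpus_Item>"""


def read_matches(item):
    """Return (kbname, category, score) tuples for combined-form ICM_Match elements."""
    matches = []
    for element in item.iter("ICM_Match"):
        score_text = element.get("score")
        if score_text is None:
            # Not the combined form (the score is stored elsewhere); skip it.
            continue
        score = float(score_text)
        if not 0.0 <= score <= 1.0:
            raise ValueError("score out of range: " + score_text)
        category = (element.text or "").strip()
        matches.append((element.get("kbname"), category, score))
    return matches


matches = read_matches(ET.fromstring(SAMPLE))
```

Each tuple keeps the knowledge base name next to the match, which is useful when the files refer to more than one knowledge base.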

Decision plan analysis data must be contained in the following fields:

ICM_DP_Fired: Contains the names of rules in the decision plan that were triggered. Rule names are in the following format:
group name ^^ rule name
ICM_DP_Changed: Contains the names of content fields that were changed or added by the decision plan and their values.

Another field called ICM_DP_All_Changed_Names contains only the names of fields that were changed or deleted by the decision plan.
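
The rule name format in ICM_DP_Fired can be split into its two parts as in the following Python sketch. The function name `split_fired_rule` and the sample value are hypothetical, and the sketch assumes that whitespace around the `^^` separator is not significant.

```python
def split_fired_rule(value):
    """Split 'group name ^^ rule name' into a (group, rule) tuple."""
    group, separator, rule = value.partition("^^")
    if not separator:
        raise ValueError("missing '^^' separator in: " + value)
    # Assumption: spaces around the separator are padding, not part of the names.
    return group.strip(), rule.strip()


group, rule = split_fired_rule("Routing ^^ Send to billing")
```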

Tip: For descriptions of XML tags that are used by Content Classification, you can view the annotated XML schema file in Classification_Home\Classification Workbench\Program Files\icmContentSet.xsd.

Guidelines for formatting XML files

About this task

If your content set includes analysis data, see this topic for annotated examples of XML code in the required format: Sample XML output from saved analysis data. The <ICM_Event> tag that is included in the sample code must not be included in your custom XML code.

Follow these guidelines to create a content set in XML format that you can import into Classification Workbench. Content items can be saved in individual XML files, or multiple content items can be contained in one or more XML files.

XML files must have a legal XML header, for example:
```
  <?xml version='1.0' encoding='UTF-8' ?>
```
Each XML file must contain a root element beneath the header: <All_Messages>. A closing tag </All_Messages> must be at the end of the file.
The XML schema version element must be added before the first item: <ICMSchemaVersion>2.0</ICMSchemaVersion>.
Individual content items are identified by using a content item element and the element requires opening and closing tags, for example, <Corpus_Item> and </Corpus_Item>.
Each content item must include elements that identify various components of the item. These elements correspond to fields when the data is imported. For example, an XML file might have elements that identify the body text (required), the subject, sender, and corresponding category.

Important: In Classification Workbench, field names are case-sensitive. Be sure that the case of element names in XML files is consistent.

Fields are identified by the <ICM_NVP> element and field names are identified by the key parameter. For example, a content item with a Body field and a Title field is described as follows:
```
  <ICM_NVP key="Body">sample body text</ICM_NVP>
  <ICM_NVP key="Title">sample title</ICM_NVP>
```

You can repeat the same element for fields with multiple values, for example:

```
  <ICM_NVP key="_Categories">CategoryA</ICM_NVP>
  <ICM_NVP key="_Categories">CategoryB</ICM_NVP>
```

Elements that are used for fields that contain links (for example, to start external applications) can have single values only.
Field names cannot begin with an underscore ("_") or with "ICM_". These prefixes are reserved for internal fields such as _Categories.
The following example of an XML file contains three content items:

```
<?xml version="1.0" encoding="UTF-8" ?>
<All_Messages>
  <ICMSchemaVersion>2.0</ICMSchemaVersion>
  <Corpus_Item>
    <ICM_NVP key="message_body">Do you have my item in stock?</ICM_NVP>
    <ICM_NVP key="_Categories">Inventory</ICM_NVP>
  </Corpus_Item>
  <Corpus_Item>
    <ICM_NVP key="message_body">Does my purchase include batteries?</ICM_NVP>
    <ICM_NVP key="_Categories">Batteries</ICM_NVP>
  </Corpus_Item>
  <Corpus_Item>
    <ICM_NVP key="message_body">How much does it cost to wrap a gift?</ICM_NVP>
    <ICM_NVP key="_Categories">Gift Orders</ICM_NVP>
  </Corpus_Item>
</All_Messages>
```
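
If your content comes from another system, a file with this layout can be generated by a script. The following Python sketch uses only the standard library; the function names `build_content_set` and `write_content_set` and the field name `message_body` are illustrative and not part of Content Classification.

```python
import xml.etree.ElementTree as ET


def build_content_set(items):
    """Build an All_Messages tree from a list of {field name: value(s)} dicts."""
    root = ET.Element("All_Messages")
    # The schema version element goes before the first content item.
    ET.SubElement(root, "ICMSchemaVersion").text = "2.0"
    for fields in items:
        item = ET.SubElement(root, "Corpus_Item")
        for key, value in fields.items():
            # A list value becomes one repeated ICM_NVP element per entry.
            values = value if isinstance(value, list) else [value]
            for single in values:
                ET.SubElement(item, "ICM_NVP", key=key).text = single
    return root


def write_content_set(root, path):
    """Write the tree to path with an XML header and UTF-8 encoding."""
    ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)


root = build_content_set([
    {"message_body": "Do you have my item in stock?", "_Categories": "Inventory"},
    {"message_body": "Is gift wrap free?", "_Categories": ["Gift Orders", "Pricing"]},
])
```

Building the tree with an XML library instead of string concatenation ensures that characters such as `&` and `<` in the body text are escaped correctly.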

Tip: Validate your custom XML files by using the XML schema (Classification_Home\Classification Workbench\Program Files\icmContentSet.xsd) before you import them into Content Classification.
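
Some of the guidelines above can also be checked with a short script before you run a full schema validation. The following Python sketch is a rough pre-check under stated assumptions, not a replacement for validating against icmContentSet.xsd; the function name `check_content_set` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def check_content_set(xml_text):
    """Return a list of guideline violations found in one content set file."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "All_Messages":
        problems.append("root element is <%s>, expected <All_Messages>" % root.tag)

    # The schema version element must appear before the first content item.
    tags = [child.tag for child in root]
    if "ICMSchemaVersion" not in tags:
        problems.append("missing <ICMSchemaVersion>")
    elif "Corpus_Item" in tags and tags.index("ICMSchemaVersion") > tags.index("Corpus_Item"):
        problems.append("<ICMSchemaVersion> must come before the first <Corpus_Item>")

    # Field names are case-sensitive, so collect spellings that differ only by case.
    spellings = defaultdict(set)
    for nvp in root.iter("ICM_NVP"):
        key = nvp.get("key")
        if key is None:
            problems.append("<ICM_NVP> without a key attribute")
        else:
            spellings[key.lower()].add(key)
    for variants in spellings.values():
        if len(variants) > 1:
            problems.append("field name case differs: " + ", ".join(sorted(variants)))
    return problems
```

An empty list means that none of these particular checks failed; the file can still be rejected by the schema for other reasons.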

Creating a catalog file

About this task

When you import XML files, Classification Workbench scans each file to determine which fields are contained in the content set. You can increase the speed of the import by manually creating a field definition file (called catalog.xml) before you start the import.

Classification Workbench automatically generates a catalog.xml file when you export a content set in XML format. The fields in this file correspond to the fields that are displayed in the Field Definitions panel.

This file must be stored in the folder that contains the XML files that you plan to import, and it must have the following structure:

It must begin with a legal XML header; for example:
```
<?xml version='1.0' encoding='UTF-8' ?>
```
The next line defines the number of fields (in this example, four):
```
<_Catalog entry_count="4">
```

Subsequent lines represent fields, with attributes that correspond to field properties. Each field is enclosed in an opening <Entry ...> tag and a closing </Entry> tag.

The following table describes field attributes:

Attribute	Description	Value
display_name	The name of the field.	Any string
type	The data type of the field. For descriptions of data types, see Field properties.	string, number, or classification
nlp_usage	Defines how the field will be analyzed by the natural language processing engine. See More about natural language processing.	Body, DocTitle, PlainText, Sender, or Subject
is_viewed	Determines whether the field is displayed in the content set page.	true or false
is_categories	Indicates that the field is used to store categories. This option should only be true for fields of type classification.	true or false
is_link	Indicates that the field is a link (that is, to start an external application). For more information, see More about natural language processing.	true or false
is_matches	Indicates that the field is used to store matches (that is, category names).	true or false
is_scores	Indicates that the field is used to store category scores.	true or false
is_firedRules	Indicates that the field is used to store the names of rules in the decision plan that were triggered.	true or false
is_changedNVPs	Indicates that the field is used to store fields that were changed or added by the decision plan and their values.	true or false

For example, a sample field called "Message" is defined as follows:

```
  <Entry display_name="Message" type="string" nlp_usage="Body" is_viewed="true"
  is_categories="false" is_link="false">

  <![CDATA[ Message ]]>

  </Entry>
```

See the following illustration of a sample catalog.xml file:

Figure 1. Sample Catalog.xml File

The XML elements include _Catalog, with the entry_count attribute set to 4. Four Entry elements are nested within the _Catalog element, with display_name attributes set to From, Message, Subject, and Categories.
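
A catalog file with this structure can also be generated by a script. The following Python sketch uses the standard library `xml.dom.minidom` module because it can write CDATA sections. It assumes that the CDATA content of each entry repeats the display name, as in the "Message" example above; the function name `build_catalog` and the sample field definitions are hypothetical.

```python
from xml.dom.minidom import Document


def build_catalog(fields):
    """Return catalog XML (UTF-8 bytes) for a list of field attribute dicts."""
    doc = Document()
    catalog = doc.createElement("_Catalog")
    # entry_count must match the number of Entry elements that follow.
    catalog.setAttribute("entry_count", str(len(fields)))
    doc.appendChild(catalog)
    for field in fields:
        entry = doc.createElement("Entry")
        for name, value in field.items():
            entry.setAttribute(name, value)
        # Assumption: the CDATA content is the field's display name.
        entry.appendChild(doc.createCDATASection(field["display_name"]))
        catalog.appendChild(entry)
    return doc.toprettyxml(indent="  ", encoding="UTF-8")


catalog_xml = build_catalog([
    {"display_name": "Message", "type": "string", "nlp_usage": "Body",
     "is_viewed": "true", "is_categories": "false", "is_link": "false"},
    {"display_name": "Categories", "type": "classification",
     "is_viewed": "true", "is_categories": "true", "is_link": "false"},
])
```

Write the returned bytes to a file named catalog.xml in the folder that contains the XML files to import. Deriving entry_count from the list length keeps the count and the entries in sync.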