You create and test your custom text analysis algorithms
by using the UIMA Software Development Kit, and then deploy and run
them on document collections in Watson Explorer Content Analytics.
About this task
As an alternative to manually developing annotators
with the UIMA SDK,
you can use Content Analytics Studio
to develop and deploy custom text analytics for Watson Explorer Content Analytics applications. Content Analytics Studio is a complete development
environment for building, customizing, and testing dictionaries,
rules, and UIMA annotators. This environment eliminates the need
for specialist knowledge of the underlying technologies of natural
language processing or UIMA. With Content Analytics Studio,
you can develop text analysis engines without writing any
code. Content Analytics Studio is a separately
installable component of Watson Explorer Content Analytics.
Procedure
To develop analysis algorithms by using the UIMA Software
Development Kit and integrate the algorithms with collections:
-
Plan and design:
-
Determine what information you want to search for. What
are the documents that you want to retrieve?
Which concepts
and relationships are needed for each particular search task? For
example, product and employee names might be needed to enhance general
purpose searches on a pharmaceutical company's internal website, while
research and development staff need to search on variants
of drug names and see drug-cause-cure relationships.
-
Specify the kind of text analysis that you need to retrieve
the information in the documents that you want to search.
-
If your collection contains XML documents, decide whether
you want to exploit the XML markup in your solution. In Watson Explorer Content Analytics, you can use XML markup
in one of two ways:
- If you can use the XML markup in your custom analysis
(for example, your documents contain <summary> or <topic> elements
that can be useful in a summarization or categorization annotator),
create an XML elements to common analysis structure mapping file.
- If you want to use the XML markup in your queries as it appears
in the document, you must enable native XML mapping.
-
Determine which text analysis results that are stored in the
common analysis structure you want to make available to
semantic search. Create a common analysis structure to
index mapping file.
-
Determine whether you want to store analysis results
in a relational database, for example, to discover trends and associations
by using reporting or data mining applications. Create a common analysis
structure to database mapping file.
-
Design the semantic search application. Determine how
search users will use the additional capabilities of semantic search.
Design the user interface.
-
Develop (UIMA activities):
-
Define the individual analysis steps.
-
Describe the type system for your mappings and analysis
algorithms.
-
Develop the analysis algorithms (annotators) for each
analysis step and embed the annotators in analysis engines by using
the UIMA Software
Development Kit.
-
After testing the analysis algorithms in UIMA, package the analysis
engines as a PEAR (Processing Engine Archive) file. The archive must
contain only your analysis algorithms, not the basic Watson Explorer Content Analytics linguistic functionality.
A text analysis solution might include
several analysis modules that are provided in more than one PEAR file. UIMA provides a means of
merging two or more PEAR files into a single PEAR file that you can
upload and run in Watson Explorer Content Analytics.
The facility for merging PEAR files ensures that there are no naming
collisions, that the input and output capabilities are correctly merged,
and that parameters are not overridden when merged
annotator descriptors contain parameters with the same names. See the UIMA documentation for instructions
on how to merge PEAR files.
-
Deploy (Watson Explorer Content Analytics activities):
-
Use the administration console to upload the processing
engine archive file (.pear) to the Watson Explorer Content Analytics server. Provide a name
that you can use to refer to the text analysis component.
-
Associate one or more collections with the uploaded
text analysis engine.
After you associate a text analysis
engine with a collection, you must restart the parser for the collection.
- Optional:
If you defined a common analysis structure to database mapping
for your custom analysis, upload and select that
mapping.
You must associate
the mapping with individual collections, as applicable.
-
For each collection, upload and select the common analysis
structure to index mapping that you defined for semantic search.
-
Do one of these actions:
- Rebuild the index. If you use a document cache, you can rebuild
the index from the cache.
- If you change the category tree or dictionary resources, start
or restart the resource deployment task, and then rebuild the index.
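
To make the develop stage of this procedure more concrete, the following is a minimal, standalone sketch of one analysis step: finding variants of drug names in text and recording each match as a span with begin and end offsets plus a normalized name. It deliberately uses only the Java standard library and not the UIMA API. In an actual UIMA annotator, comparable logic would typically go in the `process(JCas)` method of a class that extends `JCasAnnotator_ImplBase`, and each span would become an annotation of a type that you describe in your type system. The class name `DrugNameSpanFinder`, the `Span` class, and the small variant dictionary are hypothetical and serve only as an illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone sketch of one analysis step: mark drug-name variants in text. */
public class DrugNameSpanFinder {

    /** A text span with character offsets, similar in spirit to an annotation. */
    public static final class Span {
        public final int begin;        // offset of the first character
        public final int end;          // offset just past the last character
        public final String canonical; // normalized drug name for this variant

        Span(int begin, int end, String canonical) {
            this.begin = begin;
            this.end = end;
            this.canonical = canonical;
        }
    }

    // Hypothetical variant dictionary: lowercase surface form -> canonical name.
    private static final Map<String, String> VARIANTS = new LinkedHashMap<>();
    static {
        VARIANTS.put("acetylsalicylic acid", "aspirin");
        VARIANTS.put("aspirin", "aspirin");
        VARIANTS.put("asa", "aspirin");
        VARIANTS.put("paracetamol", "acetaminophen");
        VARIANTS.put("acetaminophen", "acetaminophen");
    }

    // One case-insensitive alternation over all variants, matched on word boundaries.
    private static final Pattern PATTERN = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder alternation = new StringBuilder();
        for (String variant : VARIANTS.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(variant));
        }
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    /** Returns one span per drug-name variant found in the text, in document order. */
    public static List<Span> find(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(text);
        while (matcher.find()) {
            String surface = matcher.group().toLowerCase(Locale.ROOT);
            spans.add(new Span(matcher.start(), matcher.end(), VARIANTS.get(surface)));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Patients switched from ASA to paracetamol.";
        for (Span span : find(text)) {
            System.out.println(span.begin + "-" + span.end + " "
                    + text.substring(span.begin, span.end) + " -> " + span.canonical);
        }
    }
}
```

Keeping the matching logic separate from any framework code, as in this sketch, makes it possible to test an analysis step on plain strings before you embed it in an analysis engine and package it as a PEAR file.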