Workflow for custom analysis integration

You create and test your custom text analysis algorithms by using the UIMA Software Development Kit, and then deploy and run them on document collections in Watson Explorer Content Analytics.

About this task

As an alternative to manually developing annotators with the UIMA SDK, you can use Content Analytics Studio to develop and deploy custom text analytics for Watson Explorer Content Analytics applications. Content Analytics Studio is a complete development environment for building, customizing, and testing dictionaries, rules, and UIMA annotators. Because this environment does not require specialist knowledge of the underlying natural language processing or UIMA technologies, you can develop text analysis engines without writing any code. Content Analytics Studio is a separately installable component of Watson Explorer Content Analytics.

Procedure

To develop analysis algorithms by using the UIMA Software Development Kit and integrate the algorithms with collections:

  1. Plan and design:
    1. Determine what information you want to search for. What are the documents that you want to retrieve?
      Which concepts and relationships are needed for each particular search task? For example, product and employee names might be needed to enhance general purpose searches on a pharmaceutical company's internal website, while people in the area of research and development need to use variants of drug names and see drug-cause-cure relationships.
    2. Specify the kind of text analysis that you need to retrieve the information in the documents that you want to search.
    3. If your collection contains XML documents, decide whether you want to exploit the XML markup in your solution. In Watson Explorer Content Analytics, you can use XML markup in one of two ways:
      • If you can use the XML markup in your custom analysis (for example, your documents contain <summary> or <topic> elements that can be useful in a summarization or categorization annotator), create an XML elements to common analysis structure mapping file.
      • If you want to use the XML markup in your queries as it appears in the document, you must enable native XML mapping.
    4. Determine which of the text analysis results that are stored in the common analysis structure you want to access through semantic search. Create a common analysis structure to index mapping file.
    5. Determine whether you want to store analysis results in a relational database, for example, to discover trends and associations by using reporting or data mining applications. Create a common analysis structure to database mapping file.
    6. Design the semantic search application. Determine how search users will use the additional capabilities of semantic search, and design the user interface accordingly.
  2. Develop (UIMA activities):
    1. Define the individual analysis steps.
    2. Describe the type system for your mappings and analysis algorithms.
    3. Develop the analysis algorithms (annotators) for each analysis step and embed the annotators in analysis engines by using the UIMA Software Development Kit.
    4. After you test the analysis algorithms in UIMA, package the analysis engines as a PEAR (Processing Engine Archive) file. The archive must contain only your analysis algorithms, and not the basic Watson Explorer Content Analytics linguistic functionality.

      A text analysis solution might include several analysis modules that are provided in more than one PEAR file. UIMA provides a means of merging two or more PEAR files into a single PEAR file that you can upload and run in Watson Explorer Content Analytics. The facility for merging PEAR files ensures that there are no naming collisions, that the input and output capabilities are correctly merged, and that parameters are not overridden when merged parameters in annotator descriptors have the same names. See the UIMA documentation for instructions on how to merge PEAR files.

  3. Deploy (Watson Explorer Content Analytics activities):
    1. Use the administration console to upload the processing engine archive file (.pear) to the Watson Explorer Content Analytics server. Provide a name by which you can refer to the text analysis component.
    2. Associate one or more collections with the uploaded text analysis engine.
      After you associate a text analysis engine with a collection, you must restart the parser for the collection.
    3. Optional: If you store analysis results in a relational database, upload and select the common analysis structure to database mapping file that you defined for your custom analysis.
      You must associate the mapping with individual collections, as applicable.
    4. For each collection, upload and select the common analysis structure to index mapping that you defined for semantic search.
    5. Do one of these actions:
      • Rebuild the index. If you use a document cache, you can rebuild the index from the cache.
      • If you change the category tree or dictionary resources, start or restart the resource deployment task, and then rebuild the index.
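
Example

To illustrate the kind of analysis logic that is developed in step 2, the following plain-Java sketch detects variants of drug names in text and records the character offsets of each match together with a canonical name. It is a minimal sketch under stated assumptions and does not use the UIMA API: the class name DrugNameSketch, the Span type, and the small variant dictionary are hypothetical. In an actual annotator, logic of this kind would run inside the annotator's process method and add annotations to the common analysis structure instead of returning a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java stand-in for one analysis step. Hypothetical; not the UIMA API. */
final class DrugNameSketch {

    /** A matched span of text, analogous to an annotation with begin and end offsets. */
    static final class Span {
        final int begin;            // offset of the first matched character
        final int end;              // offset just past the last matched character
        final String canonicalName; // normalized name shared by all variants

        Span(int begin, int end, String canonicalName) {
            this.begin = begin;
            this.end = end;
            this.canonicalName = canonicalName;
        }
    }

    // Hypothetical dictionary: lowercase surface form -> canonical drug name.
    private static final Map<String, String> VARIANTS = new LinkedHashMap<>();
    static {
        VARIANTS.put("aspirin", "acetylsalicylic acid");
        VARIANTS.put("acetylsalicylic acid", "acetylsalicylic acid");
        VARIANTS.put("acetaminophen", "paracetamol");
        VARIANTS.put("paracetamol", "paracetamol");
    }

    // Declared after the static block so that VARIANTS is populated first.
    private static final Pattern PATTERN = buildPattern();

    /** Builds one case-insensitive, whole-word pattern from all dictionary entries. */
    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder("\\b(");
        boolean first = true;
        for (String variant : VARIANTS.keySet()) {
            if (!first) {
                sb.append('|');
            }
            sb.append(Pattern.quote(variant));
            first = false;
        }
        sb.append(")\\b");
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns one Span for each dictionary variant found in the text. */
    static List<Span> annotate(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            String key = m.group(1).toLowerCase(Locale.ROOT);
            spans.add(new Span(m.start(), m.end(), VARIANTS.get(key)));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Aspirin and acetaminophen were compared.";
        for (Span s : annotate(text)) {
            System.out.println(text.substring(s.begin, s.end)
                    + " [" + s.begin + "," + s.end + ") -> " + s.canonicalName);
        }
    }
}
```

Keeping the detection logic in a method that takes text and returns offsets makes it possible to test the algorithm on its own before you embed it in an analysis engine and package it as a PEAR file.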