Data mining — UIMA concepts

InfoSphere™ Warehouse provides text-analysis functions that are based on the Unstructured Information Management Architecture (UIMA). In InfoSphere Warehouse, you can use these functions, or you can use Apache UIMA-compliant text-analysis components from third parties, for example, IBM® business partners or academia.

You do not need to understand the UIMA architecture to use the text-analysis functions in InfoSphere Warehouse, so you can skip this section.
This section provides a brief introduction to UIMA concepts to help you understand how UIMA supports third-party components. For more information, see the Unstructured Information Management Architecture (UIMA) documentation at the following website:
```
http://incubator.apache.org/uima/downloads/releaseDocs/2.2.1-incubating/docs/html/index.html
```

Unstructured information is the largest, most current, and fastest-growing source of information that is available to businesses and governments. Enterprises hold large amounts of this information across different media, for example, text, voice, or video. With an unstructured information management (UIM) application, you can analyze large volumes of unstructured information to discover, organize, and deliver relevant knowledge to decision makers.

Unstructured data must be analyzed to interpret, detect, and locate concepts of interest that are not explicitly tagged or annotated in the original document. For example, documents might include the following domain-specific information:

Named entities: Named entities can be persons, organizations, locations, facilities, or products.
Opinions: Opinions can be complaints, threats, or facts.
Relations: Relations between entities can be, for example, located in, finances, supports, purchases, or repairs.

The results of the analysis must be put into a structured form so that data-mining techniques and search technologies, such as search engines, database engines, On-Line Analytical Processing (OLAP) tools, or data-mining engines, can efficiently find the concepts you need, when you need them.
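As a rough illustration of what "structured form" can mean, the following sketch turns free text into typed annotations with character offsets. It is plain Java and does not use the UIMA API. The class names, the `Annotation` fields, and the regular-expression rule for organizations are all assumptions made for this example; a real analysis component would use far more robust techniques.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: plain Java, not the UIMA API. All names are hypothetical.
final class AnnotationSketch {

    /** One structured result: a typed span over the original text. */
    static final class Annotation {
        final String type;
        final int begin;          // inclusive character offset
        final int end;            // exclusive character offset
        final String coveredText; // the text between begin and end

        Annotation(String type, int begin, int end, String coveredText) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.coveredText = coveredText;
        }
    }

    // Assumed toy rule: one or more capitalized words followed by "Inc." or "Corp."
    private static final Pattern ORGANIZATION =
            Pattern.compile("\\b(?:[A-Z][a-z]+ )+(?:Inc|Corp)\\.");

    /** Returns one "Organization" annotation per match of the toy rule. */
    static List<Annotation> annotate(String text) {
        List<Annotation> result = new ArrayList<>();
        Matcher m = ORGANIZATION.matcher(text);
        while (m.find()) {
            result.add(new Annotation("Organization", m.start(), m.end(), m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The printer from Acme Corp. failed; Example Widgets Inc. repaired it.";
        for (Annotation a : annotate(text)) {
            System.out.printf("%s [%d,%d) %s%n", a.type, a.begin, a.end, a.coveredText);
        }
    }
}
```

Because each result carries a type and offsets, the annotations can be stored as rows in a table or fed to a search index, which is the point of moving from unstructured to structured information.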

To analyze unstructured content, UIM applications use a variety of analysis technologies, for example:

Statistical and rule-based natural language processing (NLP)
Information retrieval (IR)
Machine learning
Ontologies
Automated reasoning
Knowledge sources, for example, Cyc, WordNet, or FrameNet

These technologies are developed independently by highly specialized scientists and engineers who use different techniques, interfaces, and platforms.

The bridge from the unstructured world to the structured world is built through the composition and deployment of these analysis capabilities. The Unstructured Information Management Architecture (UIMA) is an architecture and software framework that helps you build that bridge. It supports creating, discovering, composing, and deploying a broad range of analysis capabilities and linking them to structured information services.

Figure 1. UIMA helps you to build the bridge between the unstructured world and the structured world

UIMA specifies component interfaces, data representations, design patterns, and development roles for creating, describing, discovering, composing, and deploying analysis capabilities.

The UIMA framework provides a run-time environment in which developers can plug in their UIMA component implementations and with which they can build and deploy UIM applications. The framework is not specific to any IDE or platform.
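The plug-in idea can be sketched in a few lines: every analysis step implements one shared contract, and a small runner composes the steps over a common document object. The sketch below is plain Java and does not use the UIMA API. The `Analyzer` interface, the `Document` class, and both example analyzers are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: plain Java, not the UIMA API. All names are hypothetical.
final class PipelineSketch {

    /** Shared document state: the text plus the results added so far. */
    static final class Document {
        final String text;
        final List<String> results = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }
    }

    /** Assumed component contract that every analysis step implements. */
    interface Analyzer {
        void process(Document doc);
    }

    /** Counts whitespace-separated tokens. */
    static final class TokenCounter implements Analyzer {
        @Override
        public void process(Document doc) {
            String trimmed = doc.text.trim();
            int count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            doc.results.add("tokens=" + count);
        }
    }

    /** Flags text that contains the word "broken" (a deliberately naive rule). */
    static final class ComplaintDetector implements Analyzer {
        @Override
        public void process(Document doc) {
            boolean complaint = doc.text.toLowerCase(Locale.ROOT).contains("broken");
            doc.results.add("complaint=" + complaint);
        }
    }

    /** Runs the plugged-in analyzers in order over one document. */
    static Document run(String text, List<Analyzer> analyzers) {
        Document doc = new Document(text);
        for (Analyzer analyzer : analyzers) {
            analyzer.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = run("The new printer arrived broken",
                Arrays.<Analyzer>asList(new TokenCounter(), new ComplaintDetector()));
        System.out.println(doc.results); // prints [tokens=5, complaint=true]
    }
}
```

Because the runner depends only on the shared contract, an analyzer written by a third party can be added to the list without changing the other components.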

InfoSphere Warehouse uses the UIMA Software Development Kit (SDK) that is available at the following website:

http://incubator.apache.org/uima/

The UIMA SDK is a Java implementation of the UIMA framework. You can load your own UIMA-compliant text-analysis modules and run them inside InfoSphere Warehouse.