Data mining — Information extraction

In computer science, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured machine-readable documents. Structured information is, for example, categorized and contextually and semantically well-defined data on a particular domain.

An example of information extraction is the extraction of instances of corporate mergers. For example, an online-news sentence such as "Yesterday, New York-based Foo Inc. announced their acquisition of Bar Corp." might result in the following structure:

MergerBetween(company1,company2,date)
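
As a rough illustration, the following Python sketch fills such a structure from the sample sentence with a single hand-written pattern. The pattern, the extract_merger function, and the published parameter (the publication date that is used to resolve "Yesterday") are assumptions made for this example; real extractors need far more robust rules.

```python
import datetime
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    company1: str
    company2: str
    date: datetime.date


# Hypothetical rule: a company is one word followed by "Inc." or "Corp."
COMPANY = r"\w+ (?:Inc|Corp)\."
PATTERN = re.compile(
    rf"(?P<when>Yesterday|Today), .*?(?P<c1>{COMPANY}) announced "
    rf"(?:their|its) acquisition of (?P<c2>{COMPANY})"
)


def extract_merger(sentence: str, published: datetime.date) -> Optional[MergerBetween]:
    """Return a MergerBetween tuple, or None if the rule does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    # Resolve the relative date against the (assumed) publication date.
    days_back = 1 if match.group("when") == "Yesterday" else 0
    merger_date = published - datetime.timedelta(days=days_back)
    return MergerBetween(match.group("c1"), match.group("c2"), merger_date)


sentence = "Yesterday, New York-based Foo Inc. announced their acquisition of Bar Corp."
result = extract_merger(sentence, published=datetime.date(2024, 5, 2))
print(result)
```

A sentence that phrases the merger differently is simply not matched and yields None, which shows how narrow a single rule is.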

The significance of IE is determined by the growing amount of information that is available in unstructured form, that is, without metadata, for example, on the Internet. You can access unstructured information more easily after you transform it into relational form.

A typical application of IE is to scan a set of documents that is written in a natural language and populate a database with the extracted information.

The following subtasks are typical for IE:

Named entity recognition: Recognition of entity names, for example, the names of people or organizations, product names, location names, temporal expressions, and certain types of numerical expressions
Coreference resolution: Identification of chains of noun phrases that refer to the same object
Terminology extraction: Finding the relevant terms for a given corpus
Opinion extraction or sentiment extraction: Determining the positive or the negative tonality of the text when it describes a product, a service, or a person

There are many different algorithms to implement subtasks of information extraction. Each algorithm is suitable for a specific set of business problems:

Rule-based algorithms use patterns to extract concepts like phone numbers or email addresses.
List-based algorithms use an enumeration of words to extract concepts like person names, product names, or location names.
More advanced algorithms use natural language processing, machine learning, statistical approaches, or a combination of these to extract complex concepts like sentiment or tonality.
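
The first two approaches can be sketched in a few lines of Python. This is only an illustration and not how InfoSphere Warehouse implements them: the email pattern is deliberately simplistic, and the LOCATION_NAMES list and both function names are hypothetical.

```python
import re

# Rule-based: a simple, illustrative pattern for email addresses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# List-based: a hypothetical enumeration of location names.
LOCATION_NAMES = ["Germany", "France", "Congo"]


def extract_by_rule(text):
    """Return (begin, end, covered_text) for every match of the pattern."""
    return [(m.start(), m.end(), m.group()) for m in EMAIL_PATTERN.finditer(text)]


def extract_by_list(text, names=LOCATION_NAMES):
    """Return (begin, end, covered_text) for every whole-word list entry found."""
    hits = []
    for name in names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append((m.start(), m.end(), m.group()))
    return sorted(hits)  # order the hits by their position in the text


sample = "Write to press@example.com about the visit to Germany and France."
rule_hits = extract_by_rule(sample)
list_hits = extract_by_list(sample)
print(rule_hits)
print(list_hits)
```

The list-based function finds only names that appear in its list, so a location such as Italy would be missed until the list is extended.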

InfoSphere™ Warehouse provides rule-based algorithms and list-based algorithms for information extraction. You can load and use additional UIMA-compliant algorithms from third-party providers (IBM® business partners, academia, or custom-developed) in InfoSphere Warehouse transformation flows. The software components that implement the information-extraction algorithms are called analysis engines or annotators. Analysis engines or annotators create annotations. An annotation describes the type of concept that is found in the text, the span of text that it covers (the covered text), and the begin and end offsets of that span in the text. Optionally, it describes additional features of the annotation.

For example, for the sentence "President Brown visited Germany", information extraction might produce the following annotations:

Annotation 1:
Type: Person
coveredText: President Brown
begin: 0
end: 15
string-valued feature title: President

Annotation 2:
Type: Location
coveredText: Germany
begin: 24
end: 31
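
The structure of an annotation can be sketched in Python as follows. This is not the UIMA or InfoSphere Warehouse API: the Annotation class and the annotate helper are hypothetical names. The sketch assumes that begin is the offset of the first covered character and that end is the offset just past the last covered character, which is the convention that the offsets in the PERSONS and LOCATIONS tables of this section follow.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int  # offset of the first covered character
    end: int    # offset just past the last covered character (assumed convention)
    features: dict = field(default_factory=dict)

    def covered_text(self, document):
        """Return the span of the document that this annotation covers."""
        return document[self.begin:self.end]


def annotate(document, phrase, type_name, **features):
    """Create an annotation for the first occurrence of phrase, or None."""
    begin = document.find(phrase)
    if begin == -1:
        return None
    return Annotation(type_name, begin, begin + len(phrase), features)


text = "President Brown visited Germany"
person = annotate(text, "President Brown", "Person", title="President")
location = annotate(text, "Germany", "Location")
print(person)
print(location)
```

With an exclusive end offset, the covered text is always the slice from begin to end, and its length is end minus begin.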

With InfoSphere Warehouse, you can analyze text that is stored in columns of a character data type, such as CHAR, VARCHAR, or CLOB, in DB2® relational tables. The analysis results are also stored in relational tables, with a separate table for each annotation type. For example, if you analyze the column TEXT in the following table, the resulting annotations are stored in the tables PERSONS and LOCATIONS.

Table 1. STATEVISITS
Integer: DOCID	TIMESTAMP: DATE	VARCHAR: TEXT
1	2006-06-28	President Brown visited Germany
2	1998-06-15	Thomas Black and Harry Gold visited France.
3	2004-07-25	Carl White visited Congo

The text analysis of the column TEXT in the table above results in the following tables:

Table 2. PERSONS
Integer: DOCID	VARCHAR: coveredText	Integer: Begin	Integer: End	VARCHAR: TITLE
1	President Brown	0	15	President
2	Thomas Black	0	12	Chancellor
2	Harry Gold	17	27	NULL

Table 3. LOCATIONS
Integer: DOCID	VARCHAR: coveredText	Integer: Begin	Integer: End
1	Germany	24	31
2	France	36	42

Table PERSONS shows that two person annotations were found in the document with DOCID 2, which is correct. For the document with DOCID 3, however, no annotations were found. The annotator might not have recognized the person or the place.

Because human language is very complex, some uncertainty always remains. Annotators can create incorrect annotations, or they can miss an expected annotation.

To analyze the annotations that are found in the resulting tables PERSONS and LOCATIONS together with the structured field DATE in the original table STATEVISITS, you can join these tables by using the common key field DOCID.
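
The join can be tried out with the sample data from this section. The following sketch uses Python with an in-memory SQLite database as a stand-in for DB2, so the SQL dialect may differ in detail. The column names BEGIN_POS and END_POS are assumptions that avoid the SQL keywords BEGIN and END, and the date of document 3 is written as 2004-07-25.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE STATEVISITS (DOCID INTEGER, "DATE" TEXT, "TEXT" TEXT);
    CREATE TABLE PERSONS (DOCID INTEGER, COVEREDTEXT TEXT,
                          BEGIN_POS INTEGER, END_POS INTEGER, TITLE TEXT);
    CREATE TABLE LOCATIONS (DOCID INTEGER, COVEREDTEXT TEXT,
                            BEGIN_POS INTEGER, END_POS INTEGER);
    """
)
conn.executemany(
    "INSERT INTO STATEVISITS VALUES (?, ?, ?)",
    [
        (1, "2006-06-28", "President Brown visited Germany"),
        (2, "1998-06-15", "Thomas Black and Harry Gold visited France."),
        (3, "2004-07-25", "Carl White visited Congo"),
    ],
)
conn.executemany(
    "INSERT INTO PERSONS VALUES (?, ?, ?, ?, ?)",
    [
        (1, "President Brown", 0, 15, "President"),
        (2, "Thomas Black", 0, 12, "Chancellor"),
        (2, "Harry Gold", 17, 27, None),
    ],
)
conn.executemany(
    "INSERT INTO LOCATIONS VALUES (?, ?, ?, ?)",
    [(1, "Germany", 24, 31), (2, "France", 36, 42)],
)

# Join the annotation tables with the source table on the common key DOCID.
query = """
    SELECT s.DOCID, s."DATE", p.COVEREDTEXT, l.COVEREDTEXT
    FROM STATEVISITS s
    JOIN PERSONS p ON p.DOCID = s.DOCID
    JOIN LOCATIONS l ON l.DOCID = s.DOCID
    ORDER BY s.DOCID, p.BEGIN_POS
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because these are inner joins, document 3 does not appear in the result: it has no rows in PERSONS or LOCATIONS. A left outer join from STATEVISITS would keep it and make the missing annotations visible.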