Technical overview of Big Match

The Big Match capability integrates directly with Apache Hadoop. See the System requirements for the currently supported Hadoop distributions and versions.

Note: For information about specific platforms and supported versions, see the Installing InfoSphere Big Match for Hadoop topic listed in the related links.

Perhaps the most important aspect of Big Match is the mechanism for efficiently resolving members into entities. An "entity" is a grouping of records that are thought to represent the same person, organization, household, and so on. Resolving entities is the process of associating two or more member records that refer to the same individual or organization. After resolving entities, the Big Match applications write the entity linking data into a table within the same HBase instance as the source table you provide. The entity linking data allows users to run probabilistic searches of the members and entities.
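Conceptually, the entity linking data is a mapping from member records to entity identifiers. The following sketch is illustrative only (the names and shapes are not the Big Match data model) and shows how three member records resolved into one entity can then be queried as a group:

```python
# Illustrative sketch (not the Big Match data model): three member records
# that probabilistic matching has resolved into a single entity.
members = [
    {"member_id": "M1", "name": "Jon Smith",  "phone": "555-0100"},
    {"member_id": "M2", "name": "John Smith", "phone": "555-0100"},
    {"member_id": "M3", "name": "J. Smith",   "phone": "555-0100"},
]

# Entity linking data, conceptually: member ID -> entity ID.
entity_links = {"M1": "E1", "M2": "E1", "M3": "E1"}

def members_of(entity_id, links):
    """Return the member IDs linked into the given entity."""
    return sorted(m for m, e in links.items() if e == entity_id)

print(members_of("E1", entity_links))
```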

Apache Hadoop includes the MapReduce processing framework and HBase. HBase is the Hadoop-based database that stores data in tables that are non-relational and distributed across nodes.

The Big Match capability does not connect with or rely on the operational server that supports a typical InfoSphere® MDM installation. As the instructions make clear, you install the capability directly onto your Hadoop cluster.

With the web-based Big Match Console, you can create and configure the algorithms that you want to use to process the data in your HBase tables. In the realm of IBM® InfoSphere MDM, an algorithm is a step-by-step procedure that compares and scores the similarities and differences of member attributes. As part of a process called derivation, the algorithm standardizes and buckets the data. The algorithm then defines a comparison process, which yields a numerical score. That score indicates the likelihood that two records refer to the same member. As a final step, the process specifies whether to create linkages between records that the algorithm considers to be the same person, organization, and so on. A set of linked records is known as an entity, so this last step in the process is called entity linking.

Users familiar with IBM InfoSphere MDM might know that the matching process can be used to generate review tasks for potential linkages that do not surpass a certain threshold of certainty. The Big Match capability does not generate tasks; potential linkages that do not meet the threshold are simply not linked together as entities.
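The derive, compare, and link steps can be sketched as follows. This is a minimal illustration of the pipeline shape only: the standardization rules, bucketing strategy, scoring weights, and threshold value are all hypothetical stand-ins for what you actually configure in the console.

```python
import re

def derive(record):
    """Standardize a name and bucket it (illustrative rules only; the real
    derivation rules come from the algorithm configured in the console)."""
    name = re.sub(r"[^A-Z ]", "", record["name"].upper())
    tokens = name.split()
    # Bucket on (first initial, last name) so only plausible pairs compare.
    bucket = (tokens[0][0], tokens[-1]) if tokens else ("", "")
    return {"std_name": " ".join(tokens), "bucket": bucket}

def compare(a, b):
    """Score two derived records; a higher score means the records are
    more likely to refer to the same member."""
    score = 0.0
    if a["bucket"] == b["bucket"]:
        score += 1.0
    if a["std_name"] == b["std_name"]:
        score += 2.0
    return score

LINK_THRESHOLD = 2.5  # hypothetical value, not a product default

# Entity linking: link only when the score meets the threshold.
r1 = derive({"name": "John Smith"})
r2 = derive({"name": "john SMITH!"})
linked = compare(r1, r2) >= LINK_THRESHOLD
```

Because derivation standardizes both records to the same form, the pair scores above the threshold and would be linked into one entity.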

The algorithms that you create differ depending on the data you need to process. Creating and configuring the algorithm can be a complex procedure. Consider following the tutorial for using the Big Match Console, which includes steps for working with your algorithm.

After you create your algorithm, you deploy it to your HBase cluster. The Big Match run time relies on the configured metadata and algorithm to derive, compare, and link the data.

Before you run the applications, you configure the HBase tables for the data you want to manage. Configuring the tables requires you to run a set of commands in the HBase console to enable Big Match for each table.

Note that Big Match creates an .xml configuration file for each table. Among other settings, the configuration files contain settings that define a one-to-one mapping from the HBase column family and column name to an attribute and field combination in the configuration you created. The configuration file also specifies which algorithms to run when you run Big Match.
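The exact schema of these configuration files is product-specific; the following fragment is purely hypothetical, with invented element and attribute names, and serves only to illustrate the idea of a one-to-one mapping from an HBase column family and column to an attribute and field:

```xml
<!-- Hypothetical fragment; element and attribute names are illustrative only -->
<tableConfig algorithm="person-algorithm">
  <mapping hbaseFamily="demog" hbaseColumn="full_name"
           attribute="Person" field="LegalName"/>
  <mapping hbaseFamily="demog" hbaseColumn="phone"
           attribute="Person" field="HomePhone"/>
</tableConfig>
```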

When you install Big Match within your Hadoop cluster, the installer creates a component within the HBase master that is notified whenever a new table is enabled for Big Match. When a table is enabled, the component creates a corresponding table in which to store the derivation, comparison, and linking data that is generated when you run the Big Match applications. The component also loads the algorithm configuration into the new matching processes on the HBase Region Server and into the JVMs for MapReduce.
The Big Match capability installs on the HBase master node, and within the JVMs and HBase region servers on the remaining nodes of the cluster.
After the components are installed and configured, you can run the applications in one of two ways:
  • As automatic background processes that run as you load data into your configured HBase tables. As you write data into your HBase table, the Big Match HBase coprocessors intercept the data to run the derive, compare, and link processes.
  • As manual batch processes that you run after you have loaded the data into the HBase tables. Each step in the process (derive, compare, and link) can be run as a MapReduce application from the Big Match Console.
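The contrast between the two run modes can be sketched as follows. The classes and the `process` function are illustrative stand-ins, not the Big Match API: the first table processes each record at write time, the way the coprocessors intercept puts, while the second stores raw records and processes them only when the batch run is triggered.

```python
def process(record):
    """Stand-in for the derive, compare, and link steps."""
    record["processed"] = True
    return record

class InterceptingTable:
    """Background mode: every write is processed as it arrives."""
    def __init__(self):
        self.rows = {}
    def put(self, row_key, record):
        self.rows[row_key] = process(record)   # processed at write time

class BatchTable:
    """Batch mode: load first, then run the processing over all rows."""
    def __init__(self):
        self.rows = {}
    def put(self, row_key, record):
        self.rows[row_key] = record            # raw write, no processing
    def run_batch(self):
        for key in self.rows:
            self.rows[key] = process(self.rows[key])
```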

Note that the fields that participate in matching must be written into HBase in uncompressed, unencrypted form. Fields that are not used in matching can be stored in any format you want. To maximize storage capacity, consider using Snappy compression at the HBase level; the data then does not need to be pre-compressed.

By default, the derive, compare, and link applications run as an automatic background process. Depending on your hardware configuration, you might choose to run the applications in batch mode instead. For example, if your Hadoop cluster has a high spindle count, batch mode is likely to be more efficient. For the linking application, ample memory is required. IBM internal testing suggests that the HBase cluster needs to be able to allocate approximately 1 gigabyte of RAM for every million members you need to process. Sufficient RAM is a priority because the entity linking application must load the entire entity graph into memory.
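The sizing guideline above translates into simple arithmetic. The helper below is a rough illustration of that rule of thumb, not a sizing tool; the function name and default are invented here:

```python
def linking_memory_estimate_gb(member_count, gb_per_million=1.0):
    """Rough RAM estimate for entity linking, based on the guideline of
    about 1 GB per million members (from IBM internal testing)."""
    return member_count / 1_000_000 * gb_per_million

# For example, a 50-million-member data set suggests on the order of
# 50 GB of RAM available to the HBase cluster for linking.
print(linking_memory_estimate_gb(50_000_000))
```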

For those users familiar with InfoSphere MDM, note that the derive and compare applications that run automatically do not differ from the corresponding derive and compare functions available with the MDM operational server. By contrast, the entity linking application proceeds as a two-step process that might feel unfamiliar to experienced MDM users. As a first step, the entity linking application unlinks any members that were previously linked into an entity but no longer have a connection to other entity members. It then links members into entities from scratch based on their most current weights and based on the most current version of the algorithm. Unlinking before linking ensures that only the appropriate members are part of an entity. It also ensures that any new members are processed and linked to the appropriate entity.
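The "unlink, then relink from scratch" idea can be sketched with a union-find structure: every member starts unlinked, and entities are rebuilt using only the pairwise scores that currently meet the threshold. The structure and names here are illustrative, not the product's internals.

```python
def relink(members, scored_pairs, threshold):
    """Rebuild entities from scratch: start with every member unlinked,
    then group members using only pairs at or above the threshold."""
    parent = {m: m for m in members}       # step 1: everyone unlinked

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for a, b, score in scored_pairs:       # step 2: link qualifying pairs
        if score >= threshold:
            parent[find(a)] = find(b)

    entities = {}
    for m in members:
        entities.setdefault(find(m), []).append(m)
    return [sorted(group) for group in entities.values()]

# M3-M4 scores below the threshold, so M4 ends up in its own entity even
# if it had previously been linked under an older algorithm version.
pairs = [("M1", "M2", 9.4), ("M2", "M3", 8.7), ("M3", "M4", 2.1)]
print(relink(["M1", "M2", "M3", "M4"], pairs, threshold=5.0))
```

Because linking starts from an unlinked state, stale linkages from earlier runs cannot survive; only the current weights and the current algorithm determine entity membership.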

If you have experience running entity linking with the InfoSphere MDM operational server, you might notice small differences in the results that the operational server returns as compared to the results that are returned by the entity linking application in Big Match. In particular, you might notice that on average a greater number of members are assigned to an entity.

The Big Match installation package also includes APIs that extend the public HBase API so that you can run probabilistic searches of the data.

A combination of REST API and Java™ API manages communication among the components that are used by Big Match.

The Big Match offering includes a sample data set and a default algorithm that you can use to follow the tutorial. The sample algorithm allows you to explore Big Match without needing to first generate an algorithm of your own.

The Big Match Search interface sample is a lightweight, web-based interface that allows users to experiment with member searches and entity searches. The interface is pre-configured to work with the sample data set. If you want the Search interface sample to work with other data sets, make a copy of the ui_config.xml.template template file that is included with the installation, edit the copy, and save it as ui_config.xml.

Where applicable, Big Match takes advantage of the security features available within a Hadoop distribution. The capability does not include security features independent of Hadoop.

System Requirements: http://www-01.ibm.com/support/docview.wss?uid=swg27035486