Clustering Search Engine Results

A typical search engine result has a URL, a title, and a snippet of text (a short summary or description of the web page). In this scenario, each document corresponds to one search engine result and includes two contents. We will also use a different value for the action attribute, one that is more appropriate for short contents. Finally, this example illustrates how to assign different levels of importance to different contents.

For each title and snippet retrieved from the Sources, we will create a document. This document will contain a content for the title, which has a high weight, and a content for the snippet, which has the default weight. Both contents are short, so we do not want to apply summarization to them. The XML for this example is therefore:

<document url="url">
<content name="title" type="html" action="cluster-bold" weight="3">
<![CDATA[

       ... the document title would appear here ....


]]></content>
<content name="snippet" type="html" action="cluster" output-action="bold">
<![CDATA[

       ... the document summary would appear here ....


]]></content>
</document>

The name attribute on the content nodes is also important in this example: it allows you to identify the two different contents in the output. Notice also the weight attribute on the title content. Setting this value to 3 specifies that the title content should be three times as important as the snippet content when clustering is performed. Finally, because the action attribute has the value cluster-bold, the title content will be displayed in bold to highlight the appearance of cluster labels in its text.
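
When results come from a program rather than a hand-written file, the document structure described above can be generated instead of typed. The following is a minimal sketch in Python, assuming the standard xml.dom.minidom module is available; the function name build_result_document and its parameters are hypothetical and are not part of the Watson Explorer Engine software. It copies the attributes from the template above and wraps each text in a CDATA section.

```python
from xml.dom.minidom import Document, parseString


def build_result_document(url, title, snippet, title_weight=3):
    """Build a <document> element with a weighted title and a snippet.

    Hypothetical helper for illustration only.
    """
    dom = Document()
    document = dom.createElement("document")
    document.setAttribute("url", url)

    # Attribute sets copied from the template above. The snippet carries no
    # weight attribute, so it keeps the default weight.
    contents = [
        ("title", title, {"action": "cluster-bold", "weight": str(title_weight)}),
        ("snippet", snippet, {"action": "cluster", "output-action": "bold"}),
    ]
    for name, text, extra in contents:
        if "]]>" in text:
            # A CDATA section cannot contain its own terminator.
            raise ValueError("text for %r contains ']]>'" % name)
        content = dom.createElement("content")
        content.setAttribute("name", name)
        content.setAttribute("type", "html")
        for key, value in extra.items():
            content.setAttribute(key, value)
        content.appendChild(dom.createCDATASection(text))
        document.appendChild(content)
    return document


element = build_result_document(
    "http://www.example.com/", "Example title", "Example snippet text."
)
xml_text = element.toxml()
# Parsing the output back confirms that it is well-formed XML.
parseString(xml_text)
print(xml_text)
```

Rejecting text that contains the CDATA terminator is a design choice made for this sketch; an alternative would be to split such text across two CDATA sections.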

Search engine results are usually produced in response to a query, and it is very helpful to supply that query to the Watson Explorer Engine software. The query is specified in XML by including the following element:

  <meta query="(query here)"/>

Here is a simple but complete example of this scenario:

<vce>
<meta query="companies" />
<document url="http://dee.example.com/">
<content name="title" type="html" action="cluster-bold" weight="3">
<![CDATA[

         Watson Explorer Engine


]]></content>
<content name="snippet" type="html" action="cluster-bold">
<![CDATA[

         Groups the results by topic via document clustering technology.
         Options include Web or news search, selection of sources, language
         restriction, and filtering.


]]></content>
</document>
<document url="http://sportsillustrated.cnn.com/hockey/">
<content name="title" type="html" action="cluster-bold" weight="3">
<![CDATA[

        CNN/SI: Hockey


]]></content>
<content name="snippet" type="html" action="cluster-bold">
<![CDATA[

        Daily news, scores, feature stories, statistics, standings, player
        profiles, polls, and chat.


]]></content>
</document>
</vce>
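
A complete input like the one above could also be assembled from a list of search results. The sketch below is one possible way to do this in Python with the standard xml.dom.minidom module. The function build_vce_input and the (url, title, snippet) tuple layout are assumptions made for illustration, not part of the product. The attributes mirror the complete example: both contents use action="cluster-bold", and only the title carries weight="3".

```python
from xml.dom.minidom import Document, parseString


def build_vce_input(query, results):
    """Return <vce> XML text for a query and (url, title, snippet) tuples.

    Hypothetical helper for illustration only.
    """
    dom = Document()
    vce = dom.createElement("vce")

    # Supply the query that produced the results.
    meta = dom.createElement("meta")
    meta.setAttribute("query", query)
    vce.appendChild(meta)

    for url, title, snippet in results:
        document = dom.createElement("document")
        document.setAttribute("url", url)
        # Only the title gets an explicit weight; the snippet keeps the default.
        for name, text, weight in (("title", title, "3"), ("snippet", snippet, None)):
            if "]]>" in text:
                # A CDATA section cannot contain its own terminator.
                raise ValueError("text for %r contains ']]>'" % name)
            content = dom.createElement("content")
            content.setAttribute("name", name)
            content.setAttribute("type", "html")
            content.setAttribute("action", "cluster-bold")
            if weight is not None:
                content.setAttribute("weight", weight)
            content.appendChild(dom.createCDATASection(text))
            document.appendChild(content)
        vce.appendChild(document)
    return vce.toxml()


results = [
    (
        "http://dee.example.com/",
        "Watson Explorer Engine",
        "Groups the results by topic via document clustering technology.",
    ),
    (
        "http://sportsillustrated.cnn.com/hockey/",
        "CNN/SI: Hockey",
        "Daily news, scores, feature stories, statistics, standings.",
    ),
]
vce_xml = build_vce_input("companies", results)
# Parsing the output back confirms that it is well-formed XML.
parseString(vce_xml)
print(vce_xml)
```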