Data ingest

The scalability and speed that IBM® Watson® Discovery Service delivers start in the data ingest phase. Discovery lets you ingest your structured and unstructured data, whether that data is private, licensed or public.

And it includes three ways to bring that data into the system. The first way is by simply using the raw API calls. That means if you already have the plumbing in place to ingest data, and just want the normalization, enrichment and query benefits of Discovery, you can use existing API calls.

The raw API calls ingest JSON formatted data, so you can structure this pipeline however you want. And while the API calls also ingest binary files, such as Microsoft Word documents, PDF and HTML formats, Watson includes two additional ways to ingest them that are easier for most users.
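As a rough illustration of structuring your own pipeline around JSON, the sketch below serializes records into JSON documents and hands each one to an upload step. The `upload` callable is a hypothetical stand-in for whatever API call your pipeline makes, and the record fields are invented for the example. Neither reflects the actual Discovery API.

```python
import json
from typing import Callable, Dict, Iterable, List


def to_json_documents(records: Iterable[Dict]) -> List[bytes]:
    """Serialize each non-empty record as one UTF-8 encoded JSON document."""
    return [
        json.dumps(record, ensure_ascii=False, sort_keys=True).encode("utf-8")
        for record in records
        if record  # skip empty records instead of uploading "{}"
    ]


def ingest(records: Iterable[Dict], upload: Callable[[bytes], None]) -> int:
    """Pass each JSON document to `upload` and return how many were sent.

    `upload` is an assumption: a placeholder for the real ingest call.
    """
    count = 0
    for document in to_json_documents(records):
        upload(document)
        count += 1
    return count
```

Keeping the upload step as a parameter means the same pipeline can be exercised against a list in a test and against the real service in production.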

One way is to use the web-based tooling built into Discovery. Web-based tooling is a browser-based user interface (UI) that allows you to ingest files much as you might upload photos to Flickr or files to Dropbox. And it allows you to quickly experiment with new or different data sets.

But the web UI is not efficient for ingesting large volumes of bring-your-own-data (BYOD) content. For BYOD use cases, Discovery includes a data crawler. The data crawler is a preconfigured API call where IBM has already built the plumbing for you.

The data crawler functions like a standalone program that uploads all of the files in a directory, or set of directories—from a local machine or a network file system. Further, Discovery includes connectors that talk to the different data repositories.

These connectors allow you to fetch all of the data, or only the data that has changed since the last ingest. This incremental upload capability automates what would otherwise be a manual tracking and update process with the web UI.

Naturally, this increases efficiency because the system doesn’t have to crawl the entire content again. Just as important, it slashes your data oversight and janitorial work—without requiring any programming on your part.
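To make the idea of incremental upload concrete, here is a minimal sketch of one common way to detect changed files: compare a content hash of each file with a manifest saved from the previous run. This is a generic illustration, not how the Discovery data crawler is implemented. The function names and the manifest format are assumptions made for the example.

```python
import hashlib
import os
from typing import Dict, List, Tuple


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(root: str, manifest: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Find files under `root` that are new or modified since `manifest`.

    `manifest` maps path -> digest from the previous run (hypothetical format).
    Returns the sorted list of changed paths and the updated manifest.
    """
    current: Dict[str, str] = {}
    changed: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            current[path] = file_digest(path)
            if manifest.get(path) != current[path]:
                changed.append(path)
    changed.sort()
    return changed, current
```

On the first run, pass an empty manifest so every file counts as new. On later runs, pass the manifest returned by the previous call so that only new or modified files are returned for upload.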

Connectors are available for data repositories such as:

Existing file system
Databases via a Java Database Connectivity (JDBC) driver
Content Management Interoperability Services (CMIS)
Server Message Block (SMB)
Common Internet File System (CIFS) for Samba file shares
SharePoint and SharePoint Online
Box

Once the data has been ingested, Discovery automatically converts formatted files such as Microsoft Word documents, PDF and HTML into JSON. The built-in public interface of Discovery operates just as if document conversion were called directly through IBM Bluemix®.
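For a sense of what converting a document into JSON involves, the sketch below turns an HTML page into a small JSON object using only the Python standard library. It is an illustration of the general idea, not the conversion Discovery performs. The output fields `title` and `text` are assumptions chosen for the example.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the page title and visible text, ignoring scripts and styles."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def html_to_json(html: str) -> str:
    """Return a JSON string with hypothetical `title` and `text` fields."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps({"title": parser.title, "text": " ".join(parser.chunks)})
```

A product page with a title, a heading and a description paragraph would come out as one JSON object holding the title and the concatenated visible text, ready to sit alongside structured records.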

To illustrate how this might be useful, imagine an online retailer. Blue Snail Style sells unique clothing discovered on global sailing adventure trips. The webpage product listings include rich descriptions of textile prints, fabrics and styles—along with stories about the discovery of each piece of clothing and locals who create and manufacture the products.

Like many analytics tools, Discovery can ingest the sales, inventory and other structured data for Blue Snail Style. But unlike many commercial off-the-shelf (COTS) analytics packages, it can also ingest and convert the Blue Snail Style online catalog. This allows Blue Snail Style to use those rich descriptions and customer comments as a data source for its analytics, alongside its pricing, sales and inventory data.
