Data harvesting

Harvesting, or indexing, is the process by which IBM® StoredIQ® examines and classifies data in your network.

Running a Harvest every volume job indexes all data objects on all volumes.
  • A full harvest can be run on every volume or on individual volumes.
  • An incremental harvest indexes only the changes on the requested volumes.

You select these options when you create the harvest job. A harvest must be run before you can search for data objects or textual content. An Administrator initiates a harvest by including a harvest step in a job.

Most harvesting parameters are selected from the Configuration subtab. You can specify the number of processes to use during a harvest, whether an interrupted harvest resumes where it left off, and many other parameters. Several standard harvesting-related jobs are provided in the system.

Harvesting with and without post-processing

You can separate harvesting activities into two steps: the initial harvest and harvest post-processing. This separation gives Administrators the flexibility to schedule the harvest or the post-process loading to run at times that do not affect system performance for system users, who might, for example, be running queries. Examples of post-harvest activities are as follows (see the sketch after this list):
  • Loading all metadata for a volume.
  • Computing all tags that are registered to a particular volume.
  • Generating all reports for that volume.
  • If configured, updating tags and creating explorers in the harvest job.
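
The following Python sketch illustrates the idea of scheduling the two steps independently; the function names, job entries, and times are hypothetical and do not reflect StoredIQ internals or its job system.

def harvest(volume):
    """Step 1: walk the volume and index its data objects."""
    print(f"harvesting {volume} ...")

def post_process(volume):
    """Step 2: load metadata, compute registered tags, generate reports."""
    print(f"loading all metadata for {volume}")
    print(f"computing tags registered to {volume}")
    print(f"generating reports for {volume}")

# Scheduling the steps at different times keeps the load-heavy
# post-processing away from hours when users are running queries.
schedule = [
    ("22:00", harvest, "vol01"),       # initial harvest overnight
    ("04:00", post_process, "vol01"),  # post-process before business hours
]

for _time, step, volume in schedule:
    step(volume)   # a real scheduler would wait until _time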

Incremental harvests

Harvesting volumes takes time and taxes your organization's resources. Incremental harvests let you maintain the accuracy of the metadata repository quickly and easily, ensuring that the vocabulary for all volumes stays consistent and up to date. When you harvest a volume, you can speed up subsequent harvests by harvesting only data objects that are new or were changed. An incremental harvest indexes new, modified, and removed data objects on your volumes or file servers. Because the harvests are incremental, it takes less time to update the metadata repository, with the additional advantage of putting a lighter load on your systems than the original harvests.
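
As a rough illustration of the work an incremental harvest avoids, the following Python sketch compares the current volume listing against the state recorded by the last harvest and touches only the differences. The dictionary-based index and the modification-time comparison are assumptions for illustration, not StoredIQ internals.

def incremental_harvest(index, volume_listing):
    """index: {path: mtime} recorded by the last harvest.
    volume_listing: {path: mtime} as seen on the volume now."""
    new      = [p for p in volume_listing if p not in index]
    modified = [p for p in volume_listing
                if p in index and volume_listing[p] != index[p]]
    removed  = [p for p in index if p not in volume_listing]

    for path in new + modified:
        index[path] = volume_listing[path]   # (re)index only what changed
    for path in removed:
        del index[path]                      # drop deleted objects

    return new, modified, removed

# A full harvest would reindex every entry in volume_listing instead.
last_harvest = {"/a.txt": 100, "/b.txt": 200}
current      = {"/a.txt": 100, "/b.txt": 250, "/c.txt": 300}
print(incremental_harvest(last_harvest, current))
# -> (['/c.txt'], ['/b.txt'], [])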

Note: Harvesting NewsGator volumes: Because NewsGator objects are events in a stream, an incremental harvest of a NewsGator volume fetches only the events that were added since the last harvest. To cover gaps that are caused by exceptions, or to pick up deleted events, a full harvest might be required.

Reharvesting

Reharvesting behavior is the same for both types of data server:

On a reharvest, the metadata for a document is updated because only the latest version of the document is considered. Therefore, the document might no longer match previously applied filter criteria although it is still part of the infoset.

On a reharvest, the full-text index is also updated. Any previously applied cartridges are automatically reapplied to the latest document version to ensure that the results of any Step-up Analytics action are still available in the full-text index. Step-up Analytics or Step-up Full-Text actions that run after a reharvest analyze and annotate the latest document version on the data source.
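
The following sketch illustrates, under assumed data structures, the reharvest behavior described above: the latest document version replaces the earlier one, and previously applied analytics are rerun so that their results remain in the full-text index. The cartridges list here is a stand-in, not StoredIQ code.

full_text_index = {}       # doc_id -> {"text": ..., "annotations": ...}
cartridges = [str.lower]   # stand-in for previously applied cartridges

def reharvest(doc_id, latest_text):
    # Rerun every previously applied cartridge against the latest version.
    annotations = [cartridge(latest_text) for cartridge in cartridges]
    # Overwrite the earlier entry: only the latest version is considered,
    # so metadata and annotations now reflect that version alone.
    full_text_index[doc_id] = {"text": latest_text,
                               "annotations": annotations}

reharvest("doc-1", "Original Text")
reharvest("doc-1", "Updated Text")   # replaces the earlier version
print(full_text_index["doc-1"]["annotations"])   # -> ['updated text']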