Configuring hash settings

Hashes are used to identify unique content. Configure the type of hash to compute when harvesting.

About this task

By default, IBM® StoredIQ® computes a SHA-1 hash for each object encountered during harvesting. If the SHA-1 hash is based on the content of the files, it can be used to identify unique files (and duplicates).
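
The way a content based hash identifies duplicates can be sketched in Python. This is a minimal illustration, not the IBM StoredIQ implementation: the function names `content_sha1` and `group_duplicates` and the chunk size are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict


def content_sha1(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file's entire content."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so that large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Map each digest shared by two or more files to the list of those files."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[content_sha1(path)].append(path)
    return {digest: files for digest, files in by_digest.items() if len(files) > 1}
```

Files whose content is byte-for-byte identical produce the same digest regardless of their names or locations, so each entry returned by `group_duplicates` is a set of duplicates.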

To compute such a hash, document content must be fetched over the network, even for harvests where only file system metadata is collected. To avoid this, you can disable content based hashing when you index file system metadata only. This provides the fastest indexing rate at the expense of the ability to identify unique content. In this case, the hash is computed from volume and object metadata.

If you change the hash settings between harvests, the next harvest uses the updated settings for any new or modified documents. For example, you might not have content based hashes created initially, but some time after the harvest completed you decide to enable content based hashing. In this case, a full-text harvest (if the volume allows for that) generates regular content based hashes for all documents that are indexed during the harvest.

The hash setting does not impact data object preview.

Procedure

  1. Go to Administration > Configuration > Application > Hash settings.
  2. Determine whether you want to generate a content based hash.
    • For content based hashes, leave the Compute data object hash option selected.

      With this setting, content based hashes are generated as selected for full-text and metadata harvests (see step 4).

    • For metadata based hashes, clear the Compute data object hash checkbox.
  3. To create a hash for email, select which email attributes are used to compute the hash.
    Email has characteristics that present a challenge when attempting to identify unique messages based on a hash. With a pure content based hash, emails with identical user-visible content are unlikely to share the same SHA-1 hash. Therefore, you can select which attributes contribute to the hash. By using specific fields to compute the email hash, an email located in a local PST archive in a file system, for example, can be identified as a duplicate of a message in an Exchange mailbox even though they are stored in completely different binary formats.
    By default, the following information contributes to the hash:
    • The information in the To, From, CC, and BCC attributes
    • The email subject
    • The content of the email body
    • The content of any email attachments

    The email hash selections operate independently from the data object hash settings; that is, a data object can have a binary hash or an email hash, but not both.

  4. For content based hashes, select whether you want to generate a full or a partial hash. This option is not available if you cleared the Compute data object hash checkbox.

    IBM StoredIQ offers two strategies for computing a content based hash. The default option is to read the entire contents of each file as input to computing a SHA-1 hash for the file (full hash). If the content of a file must be read to satisfy other content based index options (container processing or full-text indexing), a full content based hash is always computed.

    If you want only a file system metadata index with the ability to identify unique files, you have the option to create a hash from parts of the file content (partial hash). With a partial hash, a maximum of 128 KB of a file's content is read to compute the hash. This minimizes the amount of data read, which reduces the workload on the data source and network and effectively increases the indexing rate.

    For a partial hash, up to four 32 KB blocks from each file are read to compute the hash. If a file is smaller than 128 KB, the entire file content is evaluated. For files larger than 128 KB, the content used to compute the hash is read as follows:

    • 1 x 32 KB block taken from the beginning of the file
    • 2 x 32 KB blocks equally spaced between the beginning and end of the file
    • 1 x 32 KB block taken from the end of the file

    The resulting four 32 KB blocks are used as input to compute the hash. The partial hash might not be appropriate for all use cases but might be sufficient for use cases such as storage management.

    • For a full hash, leave Entire data object content (required for data object typing) selected.

      IBM StoredIQ uses Oracle Outside In Technology filters to determine the object type based on content and to extract additional metadata and text.

      IBM StoredIQ implements its own support for text files, web archives (MHT), IBM Notes® email, and EMC EmailXtender and SourceOne archives.

      If a particular data object cannot be handled with the available text extraction methods, IBM StoredIQ can selectively use binary processing to extract strings from a file. Files processed in this way have a binary processing attribute associated with them so that content can be filtered based on this attribute. It can be useful to segregate these files because binary processing can yield a high rate of false positives relative to other content extraction techniques.

      You can configure binary processing in the harvester settings.

    • For a partial hash, select Partial data object content.
  5. Click OK.
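
The email hash from step 3 and the partial hash from step 4 can be sketched in Python as follows. This is an illustrative sketch, not the IBM StoredIQ implementation. The function names, the dictionary layout of the message, the address normalization, and the exact offsets of the two middle blocks are assumptions: the procedure states only which attributes contribute and that the middle blocks are equally spaced.

```python
import hashlib
import os

BLOCK_SIZE = 32 * 1024            # 32 KB per sampled block
FULL_READ_LIMIT = 4 * BLOCK_SIZE  # 128 KB
ADDRESS_FIELDS = ("to", "from", "cc", "bcc")


def email_hash(message, attributes):
    """Hash only the selected attributes of a message (hypothetical dict layout)."""
    digest = hashlib.sha1()
    for name in sorted(attributes):
        value = message.get(name)
        if name in ADDRESS_FIELDS:
            # Assumed normalization: ignore case and order of addresses.
            parts = sorted(a.strip().lower().encode("utf-8") for a in value or [])
        elif name == "attachments":
            parts = list(value or [])  # raw attachment bytes
        else:
            parts = [(value or "").encode("utf-8")]
        digest.update(name.encode("utf-8"))
        for part in parts:
            # Length prefix so that adjacent parts cannot run together.
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
    return digest.hexdigest()


def partial_hash(path):
    """Return a SHA-1 hex digest computed from at most 128 KB of the file."""
    size = os.path.getsize(path)
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        if size <= FULL_READ_LIMIT:
            # Small file: the entire content is evaluated.
            digest.update(f.read())
        else:
            # Assumed spacing: four block offsets spread evenly from the
            # first block (offset 0) to the last block (size - BLOCK_SIZE).
            last = size - BLOCK_SIZE
            for i in range(4):
                f.seek(i * last // 3)
                digest.update(f.read(BLOCK_SIZE))
    return digest.hexdigest()
```

Because `email_hash` depends only on the selected attributes, the same message stored in two different binary formats yields the same value once those attributes are extracted. With `partial_hash`, two large files that differ only in a region that is not sampled produce the same value, which is why the partial hash is not appropriate for every use case.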