Data deduplication
Data deduplication is a method of reducing storage needs by eliminating redundant data.
Overview
Two types of data deduplication are available on IBM Spectrum Protect™: client-side data deduplication and server-side data deduplication.
Client-side data deduplication is a data deduplication technique that is used on the backup-archive client to remove redundant data during backup and archive processing before the data is transferred to the IBM Spectrum Protect server. Using client-side data deduplication can reduce the amount of data that is sent over a local area network.
Server-side data deduplication is a data deduplication technique that is done by the server. The IBM Spectrum Protect administrator can specify the data deduplication location (client or server) to use with the DEDUP parameter on the REGISTER NODE or UPDATE NODE server command.
Enhancements
The following enhancements apply to client-side data deduplication:
- Exclude specific files on a client from data deduplication.
- Enable a data deduplication cache that reduces network traffic between the client and the server. The cache contains extents that were sent to the server in previous incremental backup operations. Instead of querying the server for the existence of an extent, the client queries its cache. You can specify a size and location for the client cache. If an inconsistency between the server and the local cache is detected, the local cache is removed and repopulated.
Note: For applications that use the IBM Spectrum Protect API, do not use the data deduplication cache, because backups can fail if the cache is out of sync with the IBM Spectrum Protect server. If multiple, concurrent backup-archive client sessions are configured, a separate cache must be configured for each session.
- Enable both client-side data deduplication and compression to reduce the amount of data that is stored by the server. Each extent is compressed before it is sent to the server. The trade-off is between storage savings and the processing power that is required to compress client data. In general, compressing and deduplicating data on the client system uses approximately twice as much processing power as data deduplication alone.
The server can work with deduplicated, compressed data. In addition, backup-archive clients earlier than V6.2 can restore deduplicated, compressed data.
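The combination of per-extent deduplication and compression described above can be sketched in a few lines. The following is an illustrative model only, not IBM Spectrum Protect code; the `server_store` dictionary stands in for the server's deduplication-enabled storage pool:

```python
import hashlib
import zlib

def backup_extents(extents, server_store):
    """Toy model of client-side dedup plus compression (illustrative only).

    Each extent is identified by its SHA-256 digest; only extents the
    server does not already hold are compressed and transferred.
    """
    bytes_sent = 0
    for extent in extents:
        digest = hashlib.sha256(extent).hexdigest()
        if digest in server_store:          # duplicate extent: skip the transfer
            continue
        compressed = zlib.compress(extent)  # compress before sending
        server_store[digest] = compressed
        bytes_sent += len(compressed)
    return bytes_sent

store = {}
data = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # third extent duplicates the first
sent = backup_extents(data, store)
print(len(store), sent)  # 2 unique extents stored; far fewer than 8192 bytes sent
```

The sketch shows why the trade-off is about client processing power: both the hashing and the compression run on the client, while the server only stores what arrives.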
Client-side data deduplication uses the following process:
- The client creates extents. Extents are parts of files that are compared with other file extents to identify duplicates.
- The client and server work together to identify duplicate extents. The client sends non-duplicate extents to the server.
- Subsequent client data-deduplication operations create new extents. Some or all of those extents might match the extents that were created in previous data-deduplication operations and sent to the server. Matching extents are not sent to the server again.
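The three steps above, together with the local cache, amount to a simple client/server exchange. The sketch below is a hypothetical model (the class and set names are invented for illustration, not part of the product): the client answers repeat extents from its cache and queries the server only for extents it has not seen before.

```python
import hashlib

class DedupClient:
    """Toy sketch of the client/server extent exchange (illustrative only)."""

    def __init__(self, server_hashes):
        self.server_hashes = server_hashes  # extents the server already holds
        self.cache = set()                  # extents sent in previous operations

    def backup(self, extents):
        sent = []
        for extent in extents:
            digest = hashlib.sha256(extent).hexdigest()
            if digest in self.cache:        # answered locally; no server query
                continue
            if digest not in self.server_hashes:  # server query: non-duplicate
                self.server_hashes.add(digest)
                sent.append(digest)
            self.cache.add(digest)
        return sent

server = set()
client = DedupClient(server)
first = client.backup([b"alpha", b"beta"])    # both extents are new
second = client.backup([b"alpha", b"gamma"])  # "alpha" resolved from the cache
print(len(first), len(second))  # prints "2 1"
```

Matching extents from earlier operations never reach the network again, which is the behavior the third step describes.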
Benefits
Client-side data deduplication provides several advantages:
- It can reduce the amount of data that is sent over the local area network (LAN).
- The processing power that is required to identify duplicate data is offloaded from the server to client nodes. Server-side data deduplication is always enabled for deduplication-enabled storage pools. However, files that are in deduplication-enabled storage pools and that were already deduplicated by the client do not require additional processing.
- The processing power that is required to remove duplicate data on the server is eliminated, allowing space savings on the server to occur immediately.
Client-side data deduplication has a possible disadvantage. The server does not have whole copies of client files until you back up the primary storage pools that contain client extents to a non-deduplicated copy storage pool. (Extents are parts of a file that are created during the data-deduplication process.) During storage pool backup to a non-deduplicated storage pool, client extents are reassembled into contiguous files.
By default, primary sequential-access storage pools that are set up for data deduplication must be backed up to non-deduplicated copy storage pools before they can be reclaimed and before duplicate data can be removed. The default ensures that the server has copies of whole files at all times, in either a primary storage pool or a copy storage pool.
The following options pertain to data deduplication:
- Deduplication
- Dedupcachepath
- Dedupcachesize
- Enablededupcache
- Exclude.dedup
- Include.dedup
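As an example, a client options file stanza that uses these options might look like the following. The values shown (cache path, cache size, and the include/exclude patterns) are illustrative placeholders only; consult the client options reference for the defaults and pattern syntax that apply to your platform.

```
DEDUPLICATION      YES
ENABLEDEDUPCACHE   YES
DEDUPCACHEPATH     /opt/tivoli/tsm/client/ba/bin/cache
DEDUPCACHESIZE     256
EXCLUDE.DEDUP      /home/*/temp/.../*
INCLUDE.DEDUP      /home/.../*
```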