IBM Support

Data Deduplication FAQ


(Q1) What is IBM Spectrum Protect deduplication?

  1. It is an optional IBM Spectrum Protect feature that removes redundant data from a disk-based IBM Spectrum Protect storage pool. Reducing the amount of backup* data can reduce the cost of storage associated with backup and can allow more data to be stored on disk for faster access.
  2. It is important to consider that deduplication is just one method for data reduction. IBM Spectrum Protect also uses a progressive incremental backup methodology, which only backs up changed data, and supports client-side compression. IBM Spectrum Protect also allows exclusion of individual files from backup operations, which further reduces the data involved in these operations. IBM Spectrum Protect for Virtual Environments uses a progressive incremental model at the block level, which only backs up new and changed blocks.

    NOTE: References to backup and backup data also apply to archive and space-managed data (space-managed UNIX data can be used with server-side deduplication only).

(Q2) How effective is IBM Spectrum Protect deduplication?

  1. Deduplication effectiveness is usually measured in terms of the ratio of the amount of data before deduplication to the amount of data after deduplication, called the “deduplication ratio”. It can also be expressed as a percentage of data reduction.  However, the most important factor to consider is the overall reduction of backup data, rather than just the deduplication ratio.  Total data reduction with IBM Spectrum Protect includes progressive incremental, deduplication, and optionally, compression.
  2. IBM Spectrum Protect deduplication is as effective as any deduplication technology that is available on the market. Deduplication effectiveness is mostly determined by the type of data that is being backed up, and whether the data is unique or repeated.  For example, repeated full backups of the same data result in high deduplication ratios, but backing up only changed data (such as with the progressive incremental methodology) results in a lower deduplication ratio.  However, with progressive incremental backups the overall data reduction ratio remains high.  Data that is unique and not backed up repeatedly typically does not benefit from deduplication.
  3. IBM Spectrum Protect deduplication ratios typically range from 2:1 (50% reduction) to 15:1 (93% reduction) and are data-dependent.  Lower ratios are associated with backups of unique data, and higher ratios are associated with backups that are repeated, such as repeated full backups of databases or virtual machine images.  Mixtures of unique and repeated data result in ratios within that range. If you are not sure what type of data you have and how well it reduces, use 4:1 for planning purposes when you compare with non-deduplicated IBM Spectrum Protect storage pool occupancy.  This ratio corresponds to an overall data reduction ratio of 15:1 or greater when factoring in the data reduction benefits of progressive incremental backups.
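
The relationship between a deduplication ratio and the equivalent percentage of data reduction can be sketched as follows. This is an illustrative calculation only; the function names are hypothetical and are not part of any IBM Spectrum Protect interface.

```python
def dedup_ratio(bytes_before, bytes_after):
    """Ratio of data before deduplication to data after deduplication."""
    if bytes_after <= 0:
        raise ValueError("bytes_after must be positive")
    return bytes_before / bytes_after


def reduction_percent(ratio):
    """Convert a deduplication ratio of N:1 into a percentage of data reduction."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1 (1:1 means no reduction)")
    return (1 - 1 / ratio) * 100


# The ratios quoted above, plus the 4:1 planning value.
for ratio in (2, 4, 15):
    print(f"{ratio}:1 -> {reduction_percent(ratio):.0f}% reduction")
```

Note that the percentage grows slowly at higher ratios: moving from 2:1 to 4:1 saves an additional 25% of the original data, while moving from 4:1 to 15:1 saves roughly a further 18%.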

(Q3) Is IBM Spectrum Protect deduplication free?

  1. There is no additional software license cost.  However, IBM Spectrum Protect deduplication requires more resources (memory, sufficient space and disk performance for the IBM Spectrum Protect database, and CPU).
  2. IBM Spectrum Protect deduplication requires more processing, either on the client or the server.  This additional processing leverages IBM Spectrum Protect's internal database. Therefore, it is important to ensure that there is sufficient disk capacity for the IBM Spectrum Protect database and that the database is installed on a disk device that can support the I/O performance requirements.

(Q4) When should I consider using IBM Spectrum Protect deduplication?

  1. You might use IBM Spectrum Protect deduplication when the following conditions apply:
    • You plan to use a disk-only backup solution (your primary backup storage pool remains on disk).
    • Your priority is to reduce the amount of disk storage required for backup data.
    • You have a limited bandwidth connection from clients to the IBM Spectrum Protect server.  In this case, client-side deduplication is an appropriate solution.
    • You are considering using IBM Spectrum Protect node replication (available since Tivoli Storage Manager V6.3).
    • Your IBM Spectrum Protect database is properly sized for deduplication and resides on a high performing disk array (see more FAQs for examples). Although not required, SSD (Solid-State Disk) is recommended for the IBM Spectrum Protect database.

(Q5) What is the largest amount of data that can be backed up to an IBM Spectrum Protect deduplicated storage pool?

  1. The practical limits of deduplication for each IBM Spectrum Protect server instance and a given hardware configuration are based on two main factors:  (1) the maximum amount of “source” data to be backed up, and (2) the maximum amount of data that is backed up each day.  “Source” data refers to the original data that is backed up along with all versions and copies of that data.
  2. Deduplication scalability limits are established for each IBM Spectrum Protect server instance. The limits apply regardless of how many deduplicated storage pools are configured on a single IBM Spectrum Protect server. Although there are no theoretical limits, there are practical limits determined by the maximum database size and amount of daily backup data.
  3. The practical maximum values depend upon many factors including the resources available to the IBM Spectrum Protect server (including CPU, memory, and I/O performance).  High-performing systems that include solid-state disks for the IBM Spectrum Protect database can manage greater capacity and workloads than systems with fewer resources. See Q7 for an example configuration.  Greater capacity can be achieved with extra hardware resources.  With container storage pools, the maximum amount of data backed up to all of the storage pools within a single IBM Spectrum Protect server instance should be kept under 4PB, since this roughly corresponds to the maximum recommended database size of 6TB.  The daily maximum backup data is determined by the ability of the hardware resources to contain daily processing of backup data.  Lab tests demonstrate IBM Spectrum Protect's capability to back up 80TB of data per day (and up to 100TB per day when you use client-side deduplication) while leaving sufficient time for completing data maintenance processing such as database backup and node replication.
  4. Backup capacity with deduplication can be scaled out by adding IBM Spectrum Protect server instances. However, deduplicated data is shared across IBM Spectrum Protect servers only when IBM Spectrum Protect node replication or a shared deduplicating appliance is used.
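
As a rough planning aid, the practical limits quoted above for a single server instance can be expressed as a simple check. This is a minimal sketch, assuming decimal units (1PB = 1000TB); the helper name and its parameters are hypothetical, and actual limits depend on the hardware resources available.

```python
# Practical per-instance limits quoted in the answer above.
MAX_TOTAL_TB = 4000             # keep total backed-up data under about 4PB
MAX_DAILY_TB = 80               # lab-tested daily backup volume
MAX_DAILY_TB_CLIENT_SIDE = 100  # lab-tested daily volume with client-side deduplication


def check_instance_plan(total_tb, daily_tb, client_side_dedup=False):
    """Return a list of warnings for a planned single-instance workload."""
    warnings = []
    if total_tb >= MAX_TOTAL_TB:
        warnings.append(
            f"total data {total_tb}TB is not under {MAX_TOTAL_TB}TB; "
            "consider adding server instances"
        )
    daily_limit = MAX_DAILY_TB_CLIENT_SIDE if client_side_dedup else MAX_DAILY_TB
    if daily_tb > daily_limit:
        warnings.append(
            f"daily backup {daily_tb}TB exceeds the tested {daily_limit}TB per day"
        )
    return warnings


print(check_instance_plan(total_tb=1000, daily_tb=90))
print(check_instance_plan(total_tb=1000, daily_tb=90, client_side_dedup=True))
```

An empty list means that the plan falls within the quoted limits; it does not guarantee that a particular hardware configuration can sustain the workload.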

(Q6) How does IBM Spectrum Protect deduplication affect backup and restore performance?

  1. The use of deduplication and compression can result in longer client backup elapsed times compared to backups to a disk storage pool that is not deduplicated.  However, when the backup network is constrained, backup elapsed times can be faster when you use client-side deduplication.
  2. The impacts to backup elapsed time can often be mitigated by increasing the number of parallel backup sessions.
  3. Restore throughput from a deduplicated storage pool is generally slower when compared to restore from a disk-based storage pool that is not deduplicated.  However, when compared to restore performance from physical tape, restore from a disk-based deduplicated storage pool can be much faster.

(Q7) What are the hardware prerequisites for using IBM Spectrum Protect deduplication?  

See the IBM Spectrum Protect Blueprints for reference configurations and hardware that are appropriate for implementing deduplication.

(Q8) How do I decide between using IBM Spectrum Protect's server-side or client-side deduplication?

  1. Use client-side deduplication in the following circumstances:
    • You want to distribute the workload across client systems rather than perform deduplication processing in the IBM Spectrum Protect server.
    • Bandwidth between the client and server is constrained.
  2. Use server-side deduplication in the following circumstances:
    • CPU resources on the client host system are inadequate to support the additional processing required by client-side deduplication during scheduled backup processing.

(Q9) How do I decide between IBM Spectrum Protect deduplication and a deduplication appliance?

  1. Use IBM Spectrum Protect deduplication in the following circumstances: 
    • You plan to use disk-based storage pools.
    • Based on your total backup data and daily backup amount, it is more cost effective to invest in IBM Spectrum Protect server resources than in a deduplicating appliance.
  2. Use a deduplication appliance in the following circumstances:
    • You want to take advantage of deduplication across multiple IBM Spectrum Protect servers that use the same deduplication appliance.
    • Your backup data consists mostly of large files (greater than 500GB in V6 and greater than 2TB in V7).
    • Your daily backup data regularly exceeds 80TB (or 100TB with client-side deduplication) and you choose not to deploy additional instances of IBM Spectrum Protect servers.

(Q10) Which IBM Spectrum Protect features and options are incompatible or not supported with deduplication?

  1. Client-side encryption is incompatible with IBM Spectrum Protect deduplication. However, IBM Spectrum Protect deduplication can be used together with SSL (encryption of data in flight) or encryption by the storage device.
  2. LAN-free backup
  3. Simultaneous write
  4. Subfile backup
  5. Client-side compression should not be used with server-side deduplication (since compressed objects do not deduplicate well).  However, client-side compression used with client-side deduplication can provide an effective means to further reduce storage pool data.

(Q11) How do I estimate the IBM Spectrum Protect database size when I use deduplication?

  1. A detailed explanation of how to estimate the IBM Spectrum Protect database size when you use deduplication is available in the following technote: https://www.ibm.com/support/pages/node/476911
  2. Refer to the following table for a rough estimate of the IBM Spectrum Protect database capacity required when you use deduplication. The estimate uses the criterion of 100GB of database capacity for every 10TB of backup data.  This is intended to be a conservative estimate for planning purposes, and you might find that the actual database requirements are lower.

Total amount of backup data (TB)    Additional database size required for deduplication (TB)
20                                  0.2
50                                  0.5
100                                 1.0
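
The planning rule of 100GB of database capacity for every 10TB of backup data can be applied to other amounts with a short calculation. This is a minimal sketch that assumes decimal units (1TB = 1000GB), which matches the table values; the function name is hypothetical.

```python
GB_PER_TB = 1000          # decimal units assumed, consistent with the table
DB_GB_PER_10TB = 100      # planning rule: 100GB of database per 10TB of backup data


def dedup_db_size_tb(backup_tb):
    """Estimate the additional database size (TB) needed for deduplication."""
    if backup_tb < 0:
        raise ValueError("backup_tb must not be negative")
    db_gb = backup_tb / 10 * DB_GB_PER_10TB
    return db_gb / GB_PER_TB


for backup_tb in (20, 50, 100):
    print(f"{backup_tb}TB of backup data -> {dedup_db_size_tb(backup_tb):.1f}TB of database")
```

Because the rule is conservative, treat the result as an upper planning bound rather than a prediction of actual database usage.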

(Q12) How do I determine how much storage is saved by using IBM Spectrum Protect deduplication?

  1. The easiest way to determine deduplication storage savings is to use the administrator command “query stgpool f=d”.  The value of the “Duplicate data not stored” field shows the number of bytes saved and the percentage of savings. For server-side deduplication, this value is not updated until after reclamation processing occurs.
  2. The best way to determine deduplication results is to run the query script available on the IBM Spectrum Protect support site: https://www.ibm.com/support/pages/node/476911.  This script provides a summary of deduplication results as well as pending operations.

(Q13) Isn’t my backup at risk when my data references “chunks” of data from other files or hosts, and does not store all of the original data?    

  1. The additional risk that IBM Spectrum Protect deduplication presents to data integrity is infinitesimally small.  Best practices in data protection, such as making copies of backup data, are standard for mitigation of loss of backup data for any reason. 
  2. IBM Spectrum Protect provides a number of checks to ensure data integrity for all backup data, including deduplicated data. For chunks to be considered as duplicates, the chunks must have the exact same 160-bit SHA-1 digest and chunk size. IBM Spectrum Protect also computes and stores a 128-bit MD5 hash value for the entire file (or object) that is being backed up. The MD5 value is used to ensure that the data has been backed up properly, and upon restore this value is used to verify the integrity of the restored data.
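
The integrity checks described above can be illustrated with a small sketch. It is not the product's implementation: the class and method names are hypothetical, and the tiny fixed chunk size is an assumption chosen only to keep the example readable. The sketch treats two chunks as duplicates only when both the SHA-1 digest and the chunk size match, and it verifies a whole-object MD5 value on restore.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, for demonstration only


class ChunkStore:
    """Toy deduplicating store keyed by (SHA-1 digest, chunk size)."""

    def __init__(self):
        self.chunks = {}  # (sha1_hex, size) -> chunk bytes

    def backup(self, data):
        """Store an object; return its chunk references and whole-object MD5."""
        references = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            key = (hashlib.sha1(chunk).hexdigest(), len(chunk))
            # A chunk is stored only if no chunk with the same digest and size exists.
            self.chunks.setdefault(key, chunk)
            references.append(key)
        return references, hashlib.md5(data).hexdigest()

    def restore(self, references, expected_md5):
        """Reassemble an object and verify it against the stored MD5 value."""
        data = b"".join(self.chunks[key] for key in references)
        if hashlib.md5(data).hexdigest() != expected_md5:
            raise ValueError("restored object failed MD5 verification")
        return data


store = ChunkStore()
payload = b"ABCDABCDABCDEFGH"
references, digest = store.backup(payload)
print(len(references), "chunk references,", len(store.chunks), "unique chunks stored")
print(store.restore(references, digest) == payload)
```

In this example the object is made of four chunks, three of which are identical, so only two unique chunks are stored while the restored object still matches the original.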

(Q14) What IBM Spectrum Protect server code levels are recommended when you use IBM Spectrum Protect deduplication?

  1. You should install the latest IBM Spectrum Protect server maintenance for your point release when you use IBM Spectrum Protect deduplication (ftp://service.boulder.ibm.com/storage/tivoli-storage-management/maintenance/server/).
  2. The container storage pool introduced in IBM Spectrum Protect 7.1.3 provides inline server-side deduplication and significant improvements in performance and scalability.  The container storage pool was further enhanced in 7.1.5 to provide inline storage pool compression, which further improves data reduction.


Document Information

Modified date: 23 March 2020

UID: ibm13216873