Checklist for data deduplication
Data deduplication requires more processing resources on the server or client. Use this checklist to verify that your hardware and your IBM Spectrum Protect configuration have the characteristics that are key to good performance.
| Question | Tasks, characteristics, options, or settings | More information |
| --- | --- | --- |
| Are you using fast disk storage for the IBM Spectrum Protect database, as measured in input/output operations per second (IOPS)? | Use high-performance disks for the IBM Spectrum Protect database. At a minimum, use 10,000 rpm drives for smaller databases of 200 GB or less. For databases over 500 GB, use 15,000 rpm drives or solid-state drives. Ensure that the IBM Spectrum Protect database has a minimum capability of 3000 IOPS. For each TB of data that is backed up daily (before data deduplication), add an extra 1000 IOPS to this minimum. For example, an IBM Spectrum Protect server that ingests 3 TB of data per day needs 6000 IOPS for the database disks. | Checklist for server database disks. For more information about IOPS, see the IBM Spectrum Protect Blueprint. |
| Do you have enough memory for the size of your database? | Use a minimum of 64 GB of system memory for IBM Spectrum Protect servers that are deduplicating data. If the retained capacity of backup data grows, the memory requirement might be higher. Monitor memory usage regularly to determine whether more memory is required. Use more system memory to improve caching of database pages. Memory size guidelines are based on the daily amount of new data that you back up. | Memory requirements |
| Have you properly sized the storage capacity for the database active log and archive log? | The suggested starting size for the active log is 16 GB. Configure the server to have a maximum active log size of 128 GB by setting the ACTIVELOGSIZE server option to a value of 131072. The suggested starting size for the archive log is 48 GB. The size of the archive log is limited by the size of the file system on which it is located, not by a server option. Make the archive log at least as large as the active log. Use a directory for the database archive logs with an initial free capacity of at least 500 GB. Specify the directory by using the ARCHLOGDIRECTORY server option. Define space for the archive failover log by using the ARCHFAILOVERLOGDIRECTORY server option. | |
| Are the IBM Spectrum Protect database and logs on separate disk volumes (LUNs)? Is the disk that is used for the database configured according to best practices for a transactional database? | The database must not share disk volumes with IBM Spectrum Protect database logs or storage pools, or with any other application or file system. | See Server database and recovery log configuration and tuning. |
| Are you using a minimum of eight (2.2 GHz or equivalent) processor cores for each IBM Spectrum Protect server that you plan to use with data deduplication? | If you plan to use client-side data deduplication, verify that client systems have adequate resources available during a backup operation to complete data deduplication processing. Use a processor that is at least the minimum equivalent of one 2.2 GHz processor core per backup process with client-side data deduplication. | |
| Have you properly sized disk space for storage pools? | For a rough estimate, plan for 100 GB of database storage for every 10 TB of data that is to be protected in deduplicated storage pools. Protected data is the amount of data before deduplication, including all versions of stored objects. As a best practice, define a new container storage pool exclusively for data deduplication. Data deduplication occurs at the storage-pool level, and all data within a storage pool, except encrypted data, is deduplicated. | Checklist for container storage pools |
| Have you estimated storage pool capacity to configure enough space for the size of your environment? | You can estimate capacity requirements for a deduplicated storage pool by using the following technique. | |
| Have you distributed disk I/O over many disk devices and controllers? | Use arrays that consist of as many disks as possible, which is sometimes referred to as wide striping. When I/O bandwidth is available and the files are large, for example 1 MB, the process of finding duplicates can occupy the resources of an entire processor during a session or process. When files are smaller, other bottlenecks can occur. Specify eight or more file systems for the deduplicated storage pool device class so that I/O is distributed across as many LUNs and physical devices as possible. | See Checklist for storage pools on DISK or FILE. |
| Have you scheduled data deduplication processing based on your backup strategy? | If you are not creating a secondary copy of backup data, or if you are using node replication for the second copy, client backup and duplicate identification can overlap. Overlapping can reduce the total elapsed time for these operations, but might increase the time that is required for client backup. If you are using storage pool backup, do not overlap client backup and duplicate identification. The best-practice sequence of operations is client backup, storage pool backup, and then duplicate identification. For data that is not stored with client-side data deduplication, schedule storage pool backup operations to complete before you start data deduplication processing. This sequence avoids reconstructing deduplicated objects to make a non-deduplicated copy in a different storage pool. Consider doubling the time that you allow for backups when you use client-side data deduplication in an environment that is not limited by the network. Ensure that you schedule data deduplication before you schedule compression. | See Scheduling data deduplication and node replication processes. |
| Are the processes for identifying duplicates able to handle all new data that is backed up each day? | If the process completes, or goes into an idle state before the next scheduled operation begins, then all new data is being processed. The duplicate identification (IDENTIFY) processes can increase the workload on the processor and system memory. If you use a container storage pool for data deduplication, duplicate identification processing is not required. If you update an existing storage pool, you can specify 0 - 20 duplicate identification processes to start automatically. If you do not specify any duplicate identification processes, you must start and stop processes manually. | |
| Is reclamation able to run to a sufficiently low threshold? | If a low threshold cannot be reached, consider the following actions. | |
| Do you have enough storage to manage the Db2® lock list? | If you deduplicate data that includes large files, or large numbers of files concurrently, the process can exhaust lock list storage. When lock list storage is insufficient, backup failures, data management process failures, or server outages can occur. Files larger than 500 GB that are processed by data deduplication are most likely to deplete storage space. However, if many backup operations use client-side data deduplication, this problem can also occur with smaller files. | For information about tuning the Db2 LOCKLIST parameter, see Tuning server-side data deduplication. |
| Is deduplication cleanup processing able to clean out the dereferenced extents to free disk space before the start of the next backup cycle? | Run the SHOW DEDUPDELETE command. The output shows that all threads are idle when the workload is complete. If cleanup processing cannot complete, consider the following actions. | |
| Is sufficient bandwidth available to transfer data to an IBM Spectrum Protect server? | Use client-side data deduplication and compression to reduce the bandwidth that is required to transfer data to an IBM Spectrum Protect server. | For more information, see the enablededupcache client option. |
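The log-sizing row maps to a few entries in the server options file. The fragment below is a sketch assuming that ACTIVELOGSIZE is specified in MB (the checklist's mapping of 128 GB to 131072 implies this) and that the directory paths are placeholders for your own file systems, not recommendations:

```
ACTIVELOGSIZE 131072
ARCHLOGDIRECTORY /tsminst1/archlog
ARCHFAILOVERLOGDIRECTORY /tsminst1/archfailover
```

Remember that the archive log directory should start with at least 500 GB of free capacity and that the archive log file system, not a server option, limits the archive log size.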
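For the duplicate-identification row, when no processes are configured to start automatically, processes are started and stopped manually from the administrative command line. The following sketch assumes a deduplicated FILE storage pool named FILEPOOL (a placeholder); verify the exact parameters against your server's command reference:

```
identify duplicates FILEPOOL numprocess=4 duration=480
identify duplicates FILEPOOL numprocess=0
```

The first command starts four duplicate-identification processes for up to 480 minutes; setting the process count to zero stops the processes for that pool. Container storage pools deduplicate inline and do not need these commands.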
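The database IOPS guidance in the first row of the checklist (a 3000 IOPS baseline, plus 1000 IOPS for each TB backed up daily) can be sketched as a small sizing helper. This is an illustration of the arithmetic only; the function name is not part of any IBM Spectrum Protect tooling.

```python
def required_database_iops(daily_ingest_tb):
    """Estimate the minimum IOPS for the IBM Spectrum Protect database.

    Checklist guidance: a 3000 IOPS baseline, plus 1000 IOPS for each
    TB of data backed up daily (measured before data deduplication).
    """
    BASELINE_IOPS = 3000
    IOPS_PER_TB_DAILY = 1000
    return BASELINE_IOPS + IOPS_PER_TB_DAILY * daily_ingest_tb

# The checklist's example: a server ingesting 3 TB of data per day.
print(required_database_iops(3))  # 6000
```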
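Similarly, the rough rule of thumb for database space behind deduplicated storage pools (100 GB of database storage per 10 TB of protected data) reduces to simple arithmetic. The helper name here is illustrative only.

```python
def estimated_database_storage_gb(protected_data_tb):
    """Rough database storage estimate for deduplicated storage pools.

    Checklist guidance: 100 GB of database storage for every 10 TB of
    protected data, where protected data is measured before
    deduplication and includes all stored object versions.
    """
    GB_PER_TB_PROTECTED = 100 / 10  # 100 GB per 10 TB of protected data
    return GB_PER_TB_PROTECTED * protected_data_tb

# For example, 30 TB of protected data suggests about 300 GB.
print(estimated_database_storage_gb(30))  # 300.0
```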