Determining the impact of deduplication on a Tivoli Storage Manager server database and storage pools

Question

How can I tell how much Tivoli Storage Manager server database space is being consumed to manage my deduplicated storage pools? How much data am I saving in my deduplicated storage pools? If my server continues to manage more and more data, how much growth can I expect within my database? This document addresses these questions from a cost-benefit angle.

Answer

NOTE: This technote only applies to file type storage pools with deduplication enabled.

The Tivoli Storage Manager deduplication feature requires more Tivoli Storage Manager server database space, but reduces the storage pool space that is needed. If you are accustomed to estimating server size based on algorithms that predate deduplication, you should adjust your estimation method.

Overview

The deduplication feature can be an efficient way to reduce the amount of space that data consumes in storage pools. However, this benefit comes at the cost of some additional database space, as the Tivoli Storage Manager server database must store and track all metadata required to manage the deduplication.

This document describes the potential impact to the database and includes a Perl script that can be downloaded and run to provide useful information, including the following data:

Deduplication ratio for a given storage pool
Deduplication ratio based on client platforms for a given storage pool
Average deduplicated chunk size for a given storage pool
Total chunks used for a given storage pool
Estimated database cost of using deduplication

NOTE: The formulas and script included in this document are to be used as estimators; the actual database size may be higher or lower than the result given. The database cost varies with the kind of data being stored, the frequency and type of database reorganizations, and the overall database layout and structure. The database sizing numbers provided here are averages and should be treated as an ESTIMATE of future database growth.

Estimating the Database Size Impact of Using Deduplication

If you know how much data will be stored in the Tivoli Storage Manager storage hierarchy, you can estimate the additional cost for database usage.

The estimated cost for a deduplicated file is as follows:

Database usage is approximately 490 bytes per chunk that is assigned to a deduplicated object in a deduplicated pool.
Another 190 bytes is added for each copy storage pool that the deduplicated file is copied to.
The average chunk size for most deduplicated files is about 100 KB.
To get a better estimate of the total database size requirement, add in the following formula to account for general database overhead not associated with deduplication:

(# of files) x (# of copies stored) x 200 bytes

The following select command can be used to calculate the total number of files stored for an existing server:

select sum(cast(num_files as bigint)) from occupancy where node_name is not null and
filespace_id is not null

Example: An estimated 500 million files will be stored on the Tivoli Storage Manager
server without deduplication. Each file is kept in the primary storage pool and in
two copy storage pools, for three copies in total. The formula would look like this:

500,000,000 x 3 x 200 bytes = 300 GB
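
The overhead formula can be sketched in Python as follows. This is only an illustration of the arithmetic: the function name is hypothetical, the 200-byte figure comes from the formula above, and decimal gigabytes (10^9 bytes) are assumed so that the result matches the example.

```python
# General database overhead per stored file copy, in bytes (from the formula above).
BYTES_PER_FILE_COPY = 200


def general_db_overhead_bytes(num_files: int, num_copies: int) -> int:
    """Return (# of files) x (# of copies stored) x 200, in bytes.

    Hypothetical helper for illustration; it is not part of tsm_dedup_stats.pl.
    """
    return num_files * num_copies * BYTES_PER_FILE_COPY


# 500 million files, each kept in the primary pool and in two copy storage pools.
overhead = general_db_overhead_bytes(500_000_000, 3)
print(f"{overhead / 1e9:.0f} GB")  # prints "300 GB"
```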

Note: If this is a database that already has stored files in deduplicated form, you can run the script included in this document to get an exact chunk size average.

The following examples show how to calculate the database size impact of using deduplication.

Example 1: 15 GB of managed file data is stored in a deduplicated primary storage pool and
copied to a non-deduplicated copy storage pool.
Estimated # of chunks created = 150,000 (15,000,000,000 / 100,000)
Database cost for storing the deduplicated data in a primary pool = 73,500,000 bytes (150,000 x 490)
Database cost for storing the deduplicated data in a copy pool = 28,500,000 bytes (150,000 x 190)
Total database cost for deduplication = 102,000,000 bytes (about 102 MB)

Example 2: 20 TB of managed file data is stored in a deduplicated primary storage pool and
copied to a non-deduplicated copy storage pool.
Estimated # of chunks created = 200,000,000 (20,000,000,000,000 / 100,000)
Database cost for storing chunks in primary pool = 98 GB (200,000,000 x 490 bytes)
Database cost for storing chunks in copy pool = 38 GB (200,000,000 x 190 bytes)
Estimated database cost for deduplication = 136 GB

Note: The above examples include a copy storage pool that is not deduplicated. If a copy
storage pool is deduplicated, the same cost would apply as if it were stored
in an additional primary pool.
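
The two examples can be reproduced with a short Python sketch. The function and parameter names are hypothetical and are not taken from the script; the 490-byte, 190-byte, and 100,000-byte figures are the estimates given in this document. A deduplicated copy pool is counted like an additional deduplicated pool, as the note above describes.

```python
BYTES_PER_CHUNK_DEDUP_POOL = 490  # per chunk, per deduplicated pool
BYTES_PER_CHUNK_COPY_POOL = 190   # per chunk, per non-deduplicated copy pool
AVG_CHUNK_SIZE_BYTES = 100_000    # average chunk size assumed in the examples


def dedup_db_cost_bytes(managed_bytes, dedup_pools=1, nondedup_copy_pools=1,
                        avg_chunk_size=AVG_CHUNK_SIZE_BYTES):
    """Estimate the database cost of deduplication, in bytes.

    dedup_pools counts the primary pool plus any deduplicated copy pools;
    nondedup_copy_pools counts copy pools that are not deduplicated.
    """
    chunks = managed_bytes // avg_chunk_size
    per_chunk = (dedup_pools * BYTES_PER_CHUNK_DEDUP_POOL
                 + nondedup_copy_pools * BYTES_PER_CHUNK_COPY_POOL)
    return chunks * per_chunk


print(dedup_db_cost_bytes(15_000_000_000))      # Example 1: prints 102000000
print(dedup_db_cost_bytes(20_000_000_000_000))  # Example 2: prints 136000000000
```

If the script has already reported a measured average chunk size for a pool, that value can be passed as avg_chunk_size in place of the 100,000-byte default.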

The examples above are estimates and do not fully reflect the database size implications of
deduplication, because base chunks might need to remain after a file has expired.
Over time, the number of chunks will be greater than the estimates above. To
account for this as well as possible, double the number of chunks that are estimated in
the formula above. With doubling, the two examples look like this:

Example 1 -> 102 MB + 102 MB = 204 MB
Example 2 -> 136 GB + 136 GB = 272 GB

The reason for doubling is that there will be base deduplication chunks (actual data
segments) that must remain even after a file is expired/deleted from Tivoli Storage Manager.
The following is an example of this:

Example: File A is backed up to the Tivoli Storage Manager server and broken into 20 chunks (2
MB). These chunks are all base chunks, because there were no matches available in
the deduplication catalog. File B is then backed up, and it is a copy of File A, so it is
broken into the same 20 chunks. These chunks match File A completely, so all of the
chunks are set as "links" to File A chunks. File A is then expired, but none of the 20
base data chunks can be removed because the "linked" chunks from File B are
dependent on them. This is an unusual case, and it would be rare for a deduplication
system to mirror this scenario across the board, but it is a worst-case scenario and
should be considered for sizing purposes.
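
The File A and File B scenario can be illustrated with a toy reference-tracking model in Python. This is a simplified sketch, not the server's actual data structure: the class, its methods, and the chunk names are all hypothetical, and the real server distinguishes base chunks from links and cleans up dereferenced chunks in the background.

```python
class ChunkCatalog:
    """Toy model: each chunk is kept as long as at least one file references it."""

    def __init__(self):
        # chunk fingerprint -> set of file names that reference the chunk
        self.owners = {}

    def backup(self, file_name, fingerprints):
        """Record that file_name references every chunk in fingerprints."""
        for fp in fingerprints:
            self.owners.setdefault(fp, set()).add(file_name)

    def expire(self, file_name):
        """Drop file_name's references; remove chunks that nothing references."""
        for fp in list(self.owners):
            self.owners[fp].discard(file_name)
            if not self.owners[fp]:
                del self.owners[fp]

    def chunk_count(self):
        return len(self.owners)


catalog = ChunkCatalog()
chunks = [f"chunk-{i}" for i in range(20)]  # 20 chunks, as in the example

catalog.backup("File A", chunks)  # all 20 chunks are new
catalog.backup("File B", chunks)  # identical content, so the same 20 chunks
catalog.expire("File A")
print(catalog.chunk_count())      # prints 20: File B still depends on every chunk
catalog.expire("File B")
print(catalog.chunk_count())      # prints 0: nothing references the chunks now
```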

After a database has matured from a deduplication perspective and most initial seeding has taken place, the provided script can be run to get more details on database content and deduplication layout. With this information, you can get much closer to calculating actual cost and forecasting future growth.

Preparing to Use the Script

The attached script, tsm_dedup_stats.pl, can be used to capture detailed information about any Tivoli Storage Manager server at the V6 or higher level. The script requires basic environmental setup:

Perl must be installed on the system where the Tivoli Storage Manager server resides. Because the Perl script interrogates the DB2 database directly, it must be run from this system. To obtain code for the Perl installation, go to http://www.perl.org
The script must be run by using the instance ID configured for accessing the DB2 database used by the Tivoli Storage Manager server. This is typically the ID that is used to start the Tivoli Storage Manager server.
Preferably, a Tivoli Storage Manager client of V6.1 or higher should be used. The client must be installed on the system and be accessible (via environment) from the DB2 instance ID.
The script runs some potentially long-running select commands directly against the DB2 database. These selects are run in a way that minimizes the impact on the Tivoli Storage Manager environment. However, to minimize overall production impact, run the script when the Tivoli Storage Manager server is least active.

The script has variables that can be set to govern how Tivoli Storage Manager and DB2 are accessed and also what kind of information will be captured. The following are examples of these variables and their standard default settings:

$adminID = "admin";
$adminPW = "admin";
$server = "SERVERNAME";
$tcps = "0.0.0.0";
$tcpp = "0000";
$dbName = "TSMDB1";
$aliasName = "TSMDB1";
$dbInfo = "YES";
$detailedDedupInfo = "YES";
$numDedupCopies = 1;

adminID is the administrator ID that is used to capture information by using dsmadmc.
adminPW is the administrator password that is used to capture information by using dsmadmc.
server, tcps, and tcpp are variables that can be set if the server is not set up for access in local dsm.opt/sys files. The tcps/tcpp parameters and the server parameter are mutually exclusive. If these are left at their default settings, the script will attempt to access the Tivoli Storage Manager server based on local dsm.opt/sys settings.
dbName is the name of the actual Tivoli Storage Manager-managed database in DB2. This is typically TSMDB1, so this should not need to be changed.
aliasName is the schema name, which is typically TSMDB1 as well.
dbInfo is the flag that determines whether detailed DB2 database information is captured. The default value, YES, is appropriate in most cases.
detailedDedupInfo is the flag that determines whether detailed deduplication information is captured and reported on. In most cases, the most appropriate value is YES. However, since the YES value can cause the script to run for an extended time on large systems, the value can be set to NO if the information was captured previously.
numDedupCopies is the variable that tells the script how many copies of deduplicated files are kept on the server. This is not the number of copy pools, as different copy pools can be used in an environment where only one copy is kept of each deduplicated file. In addition, this should NOT INCLUDE copy pools that are deduplicated. Only non-deduplicated copy pools should be included in the count.

Example 1: One deduplicated primary storage pool is defined. That storage pool is fully
copied to two non-deduplicated copy storage pools, one onsite and one offsite. The
numDedupCopies variable should be changed to 2 (from the default of 1).

Example 2: One deduplicated primary storage pool is defined. That storage pool is fully
copied to an onsite deduplicated copy storage pool. The numDedupCopies variable should be
changed to 0 (from the default of 1). This is because the copy is in a deduplicated
storage pool.
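
The counting rule in these examples can be sketched in Python. The function, the input structure, and the pool names are hypothetical; the sketch assumes that the list contains only the copy storage pools that every deduplicated file is copied to, so that copy pools that merely split the data between them are not counted twice.

```python
def num_dedup_copies(copy_pools):
    """Count the copies of each deduplicated file held in non-deduplicated copy pools.

    copy_pools: list of (pool_name, is_deduplicated) pairs, one entry per copy
    storage pool that receives a full copy of the deduplicated data.
    """
    return sum(1 for _name, is_deduplicated in copy_pools if not is_deduplicated)


# One onsite and one offsite copy pool, neither deduplicated.
print(num_dedup_copies([("ONSITECOPY", False), ("OFFSITECOPY", False)]))  # prints 2

# A single onsite copy pool that is itself deduplicated.
print(num_dedup_copies([("ONSITEDEDUPCOPY", True)]))  # prints 0
```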

Note: The script assumes an English locale: it looks for the "Local database alias" string in the output of the "db2 connect to tsmdb1" command. If the script is run in a non-English environment, run the "db2 connect to tsmdb1" command outside of the script and modify line NNN, changing "Local database alias" to the equivalent string shown in the machine's locale. For example, in a Danish locale the command returns the following output:

C:\>db2 connect to tsmdb1

Oplysninger om databaseforbindelser

Databaseserver = DB2/NT64 9.7.5

SQL-autorisations-id = DB2USER1

Lokalt databasealias = TSMDB1

This line in the script:

($dbConnect) = grep {m/Local database alias/} @out;

would be changed to:

($dbConnect) = grep {m/Lokalt databasealias/} @out;
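
The effect of the locale-specific label can be shown with a small Python sketch. This is not the script's code (the script is written in Perl); the function name is hypothetical, and a plain substring test stands in for the regular expression match.

```python
def find_alias_line(connect_output, label="Local database alias"):
    """Return the first output line that contains label, or None if no line matches."""
    for line in connect_output:
        if label in line:
            return line
    return None


# Output of "db2 connect to tsmdb1" in a Danish locale, as shown above.
danish_output = [
    "Oplysninger om databaseforbindelser",
    "Databaseserver = DB2/NT64 9.7.5",
    "SQL-autorisations-id = DB2USER1",
    "Lokalt databasealias = TSMDB1",
]

print(find_alias_line(danish_output))                          # prints None
print(find_alias_line(danish_output, "Lokalt databasealias"))  # prints the alias line
```

With the English label the match fails, which is why the string must be changed before the script is run in a non-English environment.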

After all of the variables are set and the environment is ready, the script can be invoked as follows:
perl tsm_dedup_stats.pl

The output can be redirected for easier analysis and for comparing baselines throughout the server life cycle.

Sample Script Output

***************************************************************************************************************************
##############################################################################
## TSM server deduplication and database report
##
## Storage Management Server for AIX - Version 6, Release 3, Level 2.0
## Server Name: SCORPIO2
## Fri Jun 1 17:24:57 2012
##

Dedup related options
---------------------
ClientDedupTxnLimit 1024
DedupRequiresBackup No
DedupTier2FileSize 100
DedupTier3FileSize 400
EnableNasDedup No
MaxSessions 700
NumOpenVolsAllowed 20
ServerDedupTxnLimit 2048

TSM database information
------------------------
DATABASE:
Total Size of File System (MB): 4,546,560
Space Used by Database(MB): 3,857,486
Free Space Available (MB): 672,719
Total Pages: 241,384,612
Usable Pages: 241,382,900
Used Pages: 170,551,928
Free Pages: 70,830,972
ACTLOG:
Total Space(MB): 122,880
Used Space(MB): 53,964
Free Space(MB): 68,316
Archive Log Directory: /tsmarchlog

ARCHLOG:
File system: /tsmarchlog
df output(512): /dev/tsmlg38 1163919360 841252384 28% 317 1% /tsmarchlog

NAME ROWS_IN_TABLE TABLE_USED_MB TABLE_ALLOC_MB INDEX_USED_MB INDEX_ALLOC_MB
---------------------------- -------------------- -------------------- -------------------- -------------------- --------------------
BACKUP_OBJECTS 2797994157 697902 700616 478217 481959
BF_AGGREGATED_BITFILES 4143822136 223032 223989 438147 502385
BF_BITFILE_EXTENTS 779366304 82156 85535 259395 332503
BF_DEREFERENCED_CHUNKS 0 0 9293 0 15528
GROUP_LEADERS 207433875 6996 7018 9335 9595
BF_QUEUED_CHUNKS 120973 3 3046 6 5390
AF_BITFILES 10889393 607 2802 336 339
ARCHIVE_OBJECTS 2754594 1219 2169 565 591
AS_SEGMENTS 10889907 1972 1973 297 299
ACTIVITY_LOG 5924012 602 859 111 155
AF_SEGMENTS 10967940 385 386 692 700
REPLICATED_OBJECTS 3391613 108 266 441 452
BF_AGGREGATE_ATTRIBUTES 6892855 242 243 194 196
EXPORT_OBJECTS 1774030 81 82 37 37
RESTORE_SRVOBJ 0 0 58 0 0
SPACEMAN_OBJECTS 207749 41 45 11 13
DF_BITFILES 213565 11 26 5 6
DS_OVERFLOW 115119 3 20 2 3
DF_SEGMENTS 213565 7 18 9 12
DF_MIGRBITFILES 213566 6 16 4 5
UNRESOLVED_OBJECTS 0 0 16 0 0
DS_SEGMENTS 213458 6 15 4 5
SPACEMANEXT_OBJECTS 207749 9 11 5 6
ACTIVITY_SUMMARY 48804 3 10 1 1
SEQ_VOLUME_HISTORY 59171 4 5 2 3
REPLICATING_OBJECTS 0 0 1 90 129

26 record(s) selected.

Client node information
-----------------------
Node Count: 1107

Dedup server only: 550
Dedup client or server: 557
Node Count by type:
AIX: 89
Stats for Storage Pool: FILEPOOL
Dedup Pct: 33.72%
HPUX: 1
Stats for Storage Pool: FILEPOOL
Dedup Pct: 39.42%
Linux86: 139
Stats for Storage Pool: FILEPOOL
Dedup Pct: 63.80%
LinuxPPC: 2
Stats for Storage Pool: FILEPOOL
Dedup Pct: 48.91%
Linux390: 2
Mac: 8
Stats for Storage Pool: FILEPOOL
Dedup Pct: 18.69%
No Storage Pool Dedup: VMPOOL
Dedup Pct: 0
NetWare: 4
Stats for Storage Pool: FILEPOOL
Dedup Pct: 10.45%
SUN: 3
Stats for Storage Pool: FILEPOOL
Dedup Pct: 77.34%
WinNT: 619
Stats for Storage Pool: FILEPOOL
Dedup Pct: 82.47%
Stats for Storage Pool: VMPOOL
Dedup Pct: 86.44%
TDP VMware: 81
Stats for Storage Pool: FILEPOOL
Dedup Pct: 57.07%
Stats for Storage Pool: VMPOOL
Dedup Pct: 80.15%
TSM4VE: 3
TDP MSSQL: 13
Stats for Storage Pool: FILEPOOL
Dedup Pct: 7.63%
Stats for Storage Pool: VMPOOL
Dedup Pct: 90.66%
TDP Dom: 5
Stats for Storage Pool: VMPOOL
Dedup Pct: 1.15%
DP Oracle: 8
Stats for Storage Pool: FILEPOOL
Dedup Pct: 0.02%
TDPO: 2
Stats for Storage Pool: VMPOOL
Dedup Pct: 0.60%
DB2: 2
Invalid occupancy for FILEPOOL
Repair Occupancy should be run for this pool.

Deduplicated Storage pool information
-------------------------------------

Pool: FILEPOOL
Type: PRIMARY Est. Cap. (MB): 57295402.5 Pct Util: 37.4
Reclaim Thresh: 100 Reclaim Procs: 6 Next Pool:
Identify Procs: 0 Dedup Saved(MB):46589162

Logical stored (MB): 21479323.52
Dedup Not Stored (MB): 46589162.17
Total Managed (MB): 68068485.69

Volume count: 1981
AVG volume size(MB): 20207
Number of chunks: 488436849
Avg chunk size: 139816

Pool: FILEPOOL2
Type: PRIMARY Est. Cap. (MB): 0.0 Pct Util: 0.0
Reclaim Thresh: 100 Reclaim Procs: 6 Next Pool: FILEPOOL
Identify Procs: 0 Dedup Saved(MB):0

Logical stored (MB): 0
Dedup Not Stored (MB): 0
Total Managed (MB): 0
Volume count: 0
AVG volume size(MB): 0
Number of chunks: 248
Avg chunk size: 190860

Pool: VMPOOL
Type: PRIMARY Est. Cap. (MB): 26533846.1 Pct Util: 14.0
Reclaim Thresh: 100 Reclaim Procs: 4 Next Pool: FILEPOOL
Identify Procs: 0 Dedup Saved(MB):14854955

Logical stored (MB): 2189574.78
Dedup Not Stored (MB): 14854956.08
Total Managed (MB): 17044530.86

Volume count: 758
AVG volume size(MB): 33302
Number of chunks: 223175838
Avg chunk size: 92455

Data Ingestion and Expiration Stats
-----------------------------------
Files ingested(last 24): 79125204
Files expired(last 24) : 26488724331
Total data ingested(last 24): 26574.93 GB

Deduplication Deletion Statistics
------------------------------------
Total Queued Deref Chunks : 0.
Total In-flight Deref Chunks: 0.
Total Dedup Deref Chunks : 0

----------------------------------------
Final Dedup and Database Impact Report
----------------------------------------

Deduplication Database Totals
-----------------------------
Total Dedup Chunks in DB : 711612935
Average Dedup Chunk Size : 141657.5

Deduplication Impact to Database and Storage Pools
---------------------------------------------------
Estimated DB Cost of Deduplication: 483.90 GB
Total Storage Pool Savings: 61444.12 GB
*******************************************************************************************************************************

Script Output Highlights

Database table size information can be found in the TSM Database Information section. This shows the large tables and how many rows are contained within each table and associated indexes.
The Client Node Information section displays how many clients are registered to the server. The section also provides deduplication statistics for storage pools in which the clients are storing data.
The Deduplicated Storage Pool Information section displays general information for deduplicated storage pools, such as Pct Util and Reclaim Procs, as well as detailed information about the deduplication layout of each storage pool. The key numbers here are the number of chunks and the average chunk size. Both of these numbers can be used to determine current database cost and can help in forecasting future database growth.
The Deduplication Deletion Statistics section displays the current status of background dereferenced chunk cleanup. The dereferenced chunk queues are used to defer the cleanup from mainline processing, such as expiration and data movement. The background cleanup should be running continuously to prevent a backlog that can influence database and storage pool growth. The script will alert the user if there is an excessive backlog and give recommendations.
The Final Dedup and Database Impact Report section displays the total current cost of deduplication in relation to database size and growth. It also provides the total deduplication savings across the entire server, along with the total number of deduplication chunks and their average size. These numbers can be used to assess the current impact of deduplication and to forecast future usage.
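
The final report figures in the sample output are consistent with the estimation formulas given earlier in this document, as the following Python sketch shows. The script's actual calculation is not reproduced here, so this is an assumption about how the numbers relate; the variable names are hypothetical, and decimal units (10^9 bytes per GB, 1,000 MB per GB) are assumed.

```python
# Values copied from the sample script output.
total_chunks = 711_612_935  # "Total Dedup Chunks in DB"
num_dedup_copies = 1        # default value of the numDedupCopies variable
dedup_not_stored_mb = {     # "Dedup Not Stored (MB)" per storage pool
    "FILEPOOL": 46_589_162.17,
    "FILEPOOL2": 0.0,
    "VMPOOL": 14_854_956.08,
}
filepool_total_managed_mb = 68_068_485.69  # "Total Managed (MB)" for FILEPOOL

# 490 bytes per chunk, plus 190 bytes per chunk for each non-deduplicated copy.
db_cost_bytes = total_chunks * (490 + 190 * num_dedup_copies)
print(f"Estimated DB cost: {db_cost_bytes / 1e9:.2f} GB")  # 483.90 GB

savings_mb = sum(dedup_not_stored_mb.values())
print(f"Total pool savings: {savings_mb / 1000:.2f} GB")   # 61444.12 GB

# Fraction of the data managed in FILEPOOL that deduplication avoided storing.
filepool_saved = dedup_not_stored_mb["FILEPOOL"] / filepool_total_managed_mb
print(f"FILEPOOL saved fraction: {filepool_saved:.2%}")
```

Comparing these two totals from successive runs of the script gives a simple way to track the database cost of deduplication against the storage pool space that it saves.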

Additional Reference Information for Database Growth Considerations
Database size, database reorganization, and performance considerations for Tivoli Storage Manager Version 6 servers:
http://www-304.ibm.com/support/docview.wss?uid=swg21452146

tsm_dedup_stats.pl
