IBM Support

IBM Tivoli Storage Manager Disaster Recovery Testing

Question & Answer


Answer

Why do recoveries fail?

Recoveries fail for two main reasons: the backups didn't happen, or the backups can't be used. A few miscellaneous issues crop up as well.

Different technologies (backups, disk mirroring, virtualization) reduce one or another of these causes, but there's no magic technological bullet that eliminates all of them.

  • Backups didn't happen.
    • They weren't made at all.
      That's why people have backup-monitoring software such as Servergraph or Bocada. However, we know of one case where operators "learned" that the missed-backup alerts were "spurious" and should be ignored. This misconception didn't come to light until a recovery failed.
    • Node was never registered.
      If Tivoli Storage Manager doesn't even know a node exists, you can't recover it from Tivoli Storage Manager. This problem is becoming more prevalent now that VMware machines are popping up everywhere.
    • DOMAIN statement ignores drive E:
      A user doesn't want to bother backing up C: because it's a standard corporate image, so they change DOMAIN from ALL-LOCAL to D:. Then, six months later, they add an E: drive. Tivoli Storage Manager quietly ignores E: forever.
    • Includes/excludes skipped an important file.
      The inclexcl file uses wildcard patterns. Include and exclude patterns both apply throughout the file, and the combined result is difficult to predict. It's fairly common for users (or even experienced Tivoli Storage Manager administrators) to have an exclude list that doesn't do what they think it does.
    • Database backup breaks log sequence.
      If someone (the DBA) does a full backup in mid-week, database backups that use a full-weekend plus incremental-weekdays (Grandfather-Father-Son) scheme aren't restorable past that point. (They are restorable from the full backup. However, if you are the storage administrator, the DBA might not tell you that they completed the restore.)
  • Backups cannot be used
    • Tape volumes are marked "unavailable"
      Tivoli Storage Manager usually marks a tape volume as unavailable when the tape drive has a problem. For this and other reasons, some of your backup tapes might not be usable unless you set them back to read/write. Unfortunately, genuinely bad tape volumes are marked the same way, so it's hard to tell a bad drive from a bad tape.
    • Copy pool volumes get damaged during off-site Disaster Recovery tests
      Off-site Disaster Recovery tests that make you move your copy pool tapes to Sterling Forest put them at risk of being dropped, overheated, or otherwise damaged in a hostile environment. So your Disaster Recovery test can succeed and, at the same time, cause a future problem.
    • Disaster Recovery Plan is missing
      Without a list of the required tape volumes and device configurations, you cannot restore the Tivoli Storage Manager server. The "prepare" command generates a good Disaster Recovery plan. However, the Disaster Recovery Manager (DRM) might not be licensed, or the "prepare" command might not be scheduled, or the drplan file might not be moved off-site.
    • Tivoli Storage Manager database backup is missing
      In one case, someone pruned the volhist.txt file, so Tivoli Storage Manager could not figure out what volume held the latest Tivoli Storage Manager database backup.
    • User lost their encryption key
      If the dsmc client is set up to use encryption, the user is responsible for knowing the encryption key. If they lose it, they can't restore any encrypted files.
  • Miscellaneous Issues
    • Inadequate performance
      Flood-and-fire disasters leave the recovered business with a large catch-up load. Performance of the Disaster Recovery site needs to be significantly higher than that of the normal production site, at least until the catch-up period is over.
    • Rolling disaster
      Though remote mirrors can help you to achieve aggressive (low) RPOs and RTOs, they are not a substitute for backups. A database administrator whose maintenance script runs amok might need to restore from last week's backup, because the mirror replicates the corruption to the remote volume at the same time.
    • Interdependent systems
      The Tier 1 system that you can restore might depend on a system you consider Tier 3. You might have valid backups for the first, but not for the second.
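
The DOMAIN pitfall in the list above lends itself to an automated check. The following Python sketch is illustrative only and is not part of Tivoli Storage Manager. The function name uncovered_drives and its inputs are hypothetical. It assumes the client options file lists drives as plain DOMAIN values, treats a missing DOMAIN statement as equivalent to ALL-LOCAL (an assumption about the client default), and does not model drive exclusions or other DOMAIN keywords.

```python
def uncovered_drives(opt_text, local_drives):
    """Return the local drives that no DOMAIN statement covers.

    opt_text     -- contents of a client options file (hypothetical input)
    local_drives -- drive names present on the machine, e.g. ["C:", "D:"]
    """
    covered = set()
    all_local = False
    saw_domain = False

    for raw_line in opt_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines (assumed to start with "*").
        if not line or line.startswith("*"):
            continue
        parts = line.split()
        if parts[0].upper() != "DOMAIN":
            continue
        saw_domain = True
        for value in parts[1:]:
            name = value.upper()
            if name == "ALL-LOCAL":
                all_local = True
            else:
                covered.add(name.rstrip("\\"))

    # Assumption: with no DOMAIN statement, all local drives are backed up.
    if all_local or not saw_domain:
        return []
    return sorted(d.upper() for d in local_drives if d.upper() not in covered)
```

With an options file that contains only "DOMAIN D:" and a machine that has drives C:, D:, and E:, the function reports C: and E:. Because C: was left out on purpose in that story, a real check would also need a list of intentionally skipped drives so that only the surprise (E:) raises an alert.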

What problems are there with off-site Disaster Recovery tests?

It has long been the norm that off-site Disaster Recovery tests involve an annual trip to Sungard or IBM with a bunch of copy pool tapes and a Disaster Recovery plan in hand. This can contribute to the following problems:

  • Expense
    Leasing computers and space, organizational disruption, personnel time, and other expenses can quickly make this a six-figure exercise.
  • Poor test coverage
    Because of the expense, one typically only tests a few of the "critical", Tier-1 servers. The other 95% are untested. And it's not uncommon for a Tier-3 server to be upgraded to Tier-1, when a failed Disaster Recovery test shows that a critical application depended on that Tier-3 server.
  • Timeliness
    A Disaster Recovery test that succeeds today doesn't guarantee success next week. Things change constantly in a large storage environment.
  • Retry is difficult
    If the first Disaster Recovery test fails, it's usually too expensive to make a second or third attempt. Instead, at a painful post-mortem meeting, people promise to address the cause of the failure. But a secondary issue, hidden behind the first failure, might also have caused the test to fail.

What can I do to boost my chances of success?

  • Monitor backup success
    • Servergraph, ART, TSMmanager, and other products show which backups have not been made recently.
  • Test as widely as possible, as frequently as possible.
    • Test-recover every computer, every day? Far too demanding. But try an 80-20 solution:
      • Do a random file restore from every computer every day. Augment that with:
      • Use VMware machines to fully recover your Tier 1 systems every week.
      • You can automate random file restores and recovery by using VMware machines.
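
The daily random file restore suggested above could be automated along the following lines. This is a minimal Python sketch under stated assumptions, not a supported tool: random_restore_check and its restore parameter are hypothetical names, and restore stands in for whatever wraps your backup client's restore operation. The sketch restores one randomly chosen file into a scratch directory and compares its SHA-256 digest with the live copy.

```python
import hashlib
import random
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def random_restore_check(candidates, restore, rng=random):
    """Restore one randomly chosen file and compare it with the live copy.

    candidates -- paths of files that should be in the backup
    restore    -- hypothetical callable(source_path, target_path) that
                  restores source_path from backup into target_path
    rng        -- object with a choice() method (defaults to random)

    Returns (chosen_path, ok) where ok is True when the restored file
    exists and matches the live file byte for byte.
    """
    source = Path(rng.choice(list(candidates)))
    with tempfile.TemporaryDirectory() as scratch:
        target = Path(scratch) / source.name
        restore(source, target)
        ok = target.is_file() and sha256_of(target) == sha256_of(source)
    return source, ok
```

For a dry run, shutil.copyfile can be passed as the restore callable. A mismatch is not always a backup fault, because the live file might have changed since the last backup; picking candidates that have not been modified since the previous backup window keeps false alarms down.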

How can I reduce the cost of Disaster Recovery testing?

  • Bring it in-house
    • Asensus has software that does a complete Disaster Recovery test of a Windows application server, including recovery and validation of the TDP backup (Exchange, Oracle). The cost is about $995 per node.
    • Storix has software that assists with complete Disaster Recovery testing for Unix® application servers.
  • Outsource it
    • IBM BCRS and other providers take full responsibility for your backups.
  • Outsource your whole data center
    • Rent space at a remote data center, and use it as your primary.
      When bandwidth to your users is adequate, you might not need the servers at your site. And large consolidated sites can afford hurricane and earthquake-proofing.


Document Information

Modified date:
19 March 2020

UID

ibm13123327