Shared Storage Pools - just got even more interesting!
SSP Clusters with disk pools and super fast disk allocation
VIOS Shared Storage Pools allow 16 or more VIOS on different machines to operate as a VIOS cluster with a set of SAN LUNs in the pool. Think in terms of the number of TBs of disk space in the pool. The VIOS systems administrator can then allocate disk space to a new or existing virtual machine (VM or LPAR) in around a second. These virtual disks can be thin or thick provisioned regardless of the underlying disks. They are also fast, as the I/O is spread across all the LUNs. The SSP feature drastically reduces the time to implement a new virtual machine. If we operate dual VIO Servers (which is normal), then the SSP cluster spans up to eight physical servers, and with MPIO we can use the dual VIO Servers for redundancy.
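To give a feel for that one-second allocation: creating and mapping a new thin-provisioned virtual disk is a single VIOS command. This is a sketch only - the cluster name galaxy and pool name atlantic come from the example later in this article, while vhost10 and vdisk_newlpar1a are made-up illustration names:

```shell
# Sketch: create a 16 GB thin-provisioned LU in the shared pool and map it
# to the client LPAR's vhost adapter in one step.
# galaxy/atlantic are the example cluster/pool names used in this article;
# vhost10 and vdisk_newlpar1a are hypothetical.
# Add -thick if you want thick provisioning instead.
mkbdsp -clustername galaxy -sp atlantic 16G -bd vdisk_newlpar1a -vadapter vhost10
```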
Live Partition Mobility (LPM) ready by default
Assuming we use virtual networks, we are 100% ready for Live Partition Mobility (LPM): our Shared Storage Pool based virtual machine disks are available across the SSP cluster, with no mucking about with LUNs and SAN zones, as the disks are already online on every VIOS. This method does assume the same VLANs are available on the target VIOS, but that is fairly normal for servers backing each other up.
But what about disaster recovery with SSP?
For example, one of the servers in the SSP cluster fails - say, a sudden power down. Do we have the technology to rebuild the virtual machine?
We need to know a few things
- LPAR CPU: dedicated CPU count or shared: entitlement, virtual processor count, weight factor
- LPAR memory: size in GB
- LPAR network or networks
- LPAR SSP virtual disks - the names of the SSP LU resources
- LPAR boot disk
Items 1, 2, and 3 are easy to work out in the following ways:
- From regularly saved VIOS configuration backups
- From the HMC, if it is still running - which might not be the case for a complete site failure
- From the HMCscanner tool, which generates an Excel spreadsheet of HMC data and runs on AIX or your workstation - highly recommended. The HMCscanner tool or a saved HMC System Plan contains the needed LPAR CPU, memory, and network details. IMHO: if you have not got this marvellous, freely downloadable tool, then you need to ensure you have another tool that can match its functionality and automate documentation generation.
Item 4 is known by the surviving VIO Servers of the same SSP cluster.
Go to a possible target VIOS for resurrecting the failed virtual machine and run the lssp command. Here is an example from my SSP VIOS:
$ lssp -clustername galaxy -sp atlantic -bd
Lu Name Size(mb) ProvisionType %Used Unused(mb) Lu Udid
vdisk_diamond3a 16384 THIN 51% 7971 0d0f6526326f906077b3b2c9c6c42343
vdisk_diamond4a 16384 THIN 40% 13309 8ff7f4e74244ced56d6353247c3f8ca1
Snapshot
diamond4a_SP11.snap
diamond4a_with_wp22.snap
diamond4a_ISD_WPAR_ready.snap
vdisk_diamond6a 131072 THIN 17% 108400 5a47d7a731bf85ef59fbbe6c19e43768
vdisk_diamond7a 16384 THIN 18% 13292 1b93e1c46e0cfdec310087b4180fc3d2
vdisk_diamond7b 131072 THIN 31% 89410 a082bbbb69b069ae3931f171341342f6
vdisk_diamond8a 16384 THIN 25% 12196 dbb870c0ed55791fa75ea2352c237966
vdisk_diamond9a 16384 THIN 20% 13105 378030fb9f0f6b3f15b6aab74fe617da
vdisk_gold2a 16384 THIN 18% 13383 3335cb9729e6f139301b871ab5d2ae72
vdisk_gold3a 16384 THIN 19% 13193 b3da2b4256f897a3a2c048504bd3d80f
vdisk_gold4a 16384 THIN 19% 13218 921839bd8566b55da1744c32d347c43e
vdisk_gold5a 16384 THIN 18% 13342 f9e542e36b5cff11c854f461fbf61361
vdisk_gold6a 16384 THIN 18% 13364 2e1b6657e85846f06716a1bf4eaf6057
vdisk_gold6b 16384 THIN 0% 16385 01ab722c4b41b4f61f88dff3cba96779
vdisk_red2 16384 THIN 18% 13336 e826fbe6b0b97e905ef4a8ba10bf1cda
vdisk_red3 16384 THIN 30% 11454 f389b8b8c42ce02dc06353e622645b84
vdisk_red4 16384 THIN 17% 13533 9605f8227d6846b46fc315d992131de6
. . .
My failed server is called gold and the virtual machine I need to recover is called gold6.
The names of the LUs (virtual disks) are determined by the user - I use a simple vdisk_<LPAR-name><letter> naming convention. The letter is used when we have multiple LUs.
In light of these experiments, I think my naming convention could be improved:
- "vdisk" is largely pointless
- Our virtual machine (LPAR) names include the name of the machine (like gold), which in an LPM environment is dumb, as the machine can change every day!
- If it is not clear which disk is the boot disk, then the first disk is a good guess. I could highlight the boot disk in the name, or we need to record this fact.
So obviously the SSP LUs I need to recover are vdisk_gold6a and vdisk_gold6b. As the second disk (b) is 0% used, it is clear that vdisk_gold6a is the boot disk.
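Picking the right LUs out of a long lssp listing can be scripted by filtering on the LPAR name embedded in the LU names. A small sketch: on a real VIOS you would pipe the live lssp command through grep, but here I filter a few captured lines of the sample output above so the idea is self-contained:

```shell
# Filter SSP LU names for one LPAR, relying on the vdisk_<LPAR-name><letter>
# naming convention. On the VIOS itself you would instead run:
#   lssp -clustername galaxy -sp atlantic -bd | grep gold6
# lssp_sample is a captured fragment of the output shown above.
lssp_sample='vdisk_gold5a    16384   THIN  18%  13342  f9e542e36b5cff11c854f461fbf61361
vdisk_gold6a    16384   THIN  18%  13364  2e1b6657e85846f06716a1bf4eaf6057
vdisk_gold6b    16384   THIN   0%  16385  01ab722c4b41b4f61f88dff3cba96779
vdisk_red2      16384   THIN  18%  13336  e826fbe6b0b97e905ef4a8ba10bf1cda'
echo "$lssp_sample" | grep 'gold6'
```

This prints just the two gold6 lines, and the %Used column then tells you which of them is likely the (used) boot disk.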
So what do I need to do to recover my gold6 virtual machine?
Here are the steps:
- Select a target machine, go to its HMC, and create a new AIX LPAR with similar CPU and memory ~2 minutes
- With the same virtual network or networks.
- With a virtual SCSI connection to the VIOS. You might need to add this vSCSI device to the VIOS too.
- Go to the VIOS. Assuming the new virtual machine's vSCSI adapter on the VIOS end is vhost42, my cluster name is galaxy, and the SSP is atlantic, run the two commands that map the two LUs to that adapter ~2 minutes
- Use the HMC to boot to SMS and select the first disk as the boot disk ~1 minute
- Use the HMC to start the virtual machine
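The two VIOS commands in step 4 are not spelled out above. Assuming mkbdsp with the -bd and -vadapter flags is used to map an existing LU (rather than create a new one), and using the example names from this article (cluster galaxy, pool atlantic, adapter vhost42), they would look something like this:

```shell
# Sketch of step 4: map the two surviving SSP LUs of the failed gold6 LPAR
# to the replacement LPAR's vhost adapter on the target VIOS.
# With -bd naming an existing LU and no size given, mkbdsp maps the LU
# rather than creating a new one. vhost42 is the example adapter name.
mkbdsp -clustername galaxy -sp atlantic -bd vdisk_gold6a -vadapter vhost42
mkbdsp -clustername galaxy -sp atlantic -bd vdisk_gold6b -vadapter vhost42
```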
This boot up might take about five minutes!
Not bad for recovering an important service. Once AIX starts it has to replay any JFS2 logs (usually in seconds) and you need to start the application or RDBMS. Any database might need to recover incomplete transactions, so the service might take a little longer to be fully ready.
If that is not fast enough for you, then have a recovery virtual machine set up in advance on a target machine. In that case, you run steps 1 to 3 in advance.
Then, start the virtual machine, which takes about 30 to 60 seconds.
One word of warning: don't have the original AND the recovery virtual machines running at the same time - I think that would totally corrupt the file systems within a couple of milliseconds.
I did check this concept and idea with the SSP developers and they said, "Of course, Nigel, that works fine."