Shared Storage Pools - just got even more interesting!
SSP Clusters with disk pools and super fast disk allocation
VIOS Shared Storage Pools allow 16 or more VIOS on different machines to operate as a VIOS cluster with a set of SAN LUNs in the pool. Think in terms of the number of TBs of disk space in the pool. The VIOS systems administrator can then allocate disk space to a new or existing virtual machine (VM or LPAR) in around a second. These virtual disks can be thin or thick provisioned regardless of the underlying disks. They are also fast, as the I/O is spread across all the LUNs. The SSP feature drastically reduces the time to implement a new virtual machine. If we operate dual VIO Servers (which is normal), then the SSP cluster spans up to eight physical servers, and with MPIO we can use the dual VIO Servers for redundancy.
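To give a feel for that one-second allocation: creating and mapping a new thin-provisioned virtual disk is a single VIOS command. This is a sketch only - the cluster name galaxy and pool name atlantic come from the example later in this article, while vhost10 and vdisk_newlpar1a are made-up illustration names:

```shell
# Sketch: create a 16 GB thin-provisioned LU in the shared pool and map it
# to the client LPAR's vhost adapter in one step.
# galaxy/atlantic are the example cluster/pool names used in this article;
# vhost10 and vdisk_newlpar1a are hypothetical.
# Add -thick if you want thick provisioning instead.
mkbdsp -clustername galaxy -sp atlantic 16G -bd vdisk_newlpar1a -vadapter vhost10
```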
Live Partition Mobility (LPM) ready by default
Assuming we use virtual networks, we are 100% ready for Live Partition Mobility (LPM): our Shared Storage Pool based virtual machine disks are available across the SSP cluster, with no mucking about with LUNs and SAN zones, as the disks are already online on every VIOS. This method does assume the same VLANs are available on the target VIOS, but that is fairly normal for servers backing each other up.
But what about disaster recovery with SSP?
For example, one of the servers in the SSP cluster fails - say, a sudden power down. Do we have the technology to rebuild the virtual machine?
We need to know a few things
- LPAR CPU: dedicated CPU count or shared: entitlement, virtual processor count, weight factor
- LPAR memory: size in GB
- LPAR network or networks
- LPAR SSP virtual disks - the names of the SSP LU resources
- LPAR boot disk
Items 1, 2, and 3 are easy to work out in the following ways:
- From regularly saved VIOS configuration backups
- From the HMC, if it is still running - which might not be the case for a complete site failure
- From the HMCscanner tool, which generates an Excel spreadsheet of HMC data and runs on AIX or your workstation - highly recommended. The HMCscanner tool or a saved HMC System Plan contains the needed LPAR CPU, memory, and network details. IMHO: if you have not got this marvellous, freely downloadable tool, then you need to ensure you have another tool that can match its functionality and automate documentation generation.
Item 4 is known by the surviving VIO Servers of the same SSP cluster.
Go to a possible target VIOS for resurrecting the failed virtual machine and run the lssp command. Here is an example from my SSP VIOS:
$ lssp -clustername galaxy -sp atlantic -bd
Lu Name Size(mb) ProvisionType %Used Unused(mb) Lu Udid
vdisk_diamond3a 16384 THIN 51% 7971 0d0f6526326f906077b3b2c9c6c42343
vdisk_diamond4a 16384 THIN 40% 13309 8ff7f4e74244ced56d6353247c3f8ca1
Snapshot
diamond4a_SP11.snap
diamond4a_with_wp22.snap
diamond4a_ISD_WPAR_ready.snap
vdisk_diamond6a 131072 THIN 17% 108400 5a47d7a731bf85ef59fbbe6c19e43768
vdisk_diamond7a 16384 THIN 18% 13292 1b93e1c46e0cfdec310087b4180fc3d2
vdisk_diamond7b 131072 THIN 31% 89410 a082bbbb69b069ae3931f171341342f6
vdisk_diamond8a 16384 THIN 25% 12196 dbb870c0ed55791fa75ea2352c237966
vdisk_diamond9a 16384 THIN 20% 13105 378030fb9f0f6b3f15b6aab74fe617da
vdisk_gold2a 16384 THIN 18% 13383 3335cb9729e6f139301b871ab5d2ae72
vdisk_gold3a 16384 THIN 19% 13193 b3da2b4256f897a3a2c048504bd3d80f
vdisk_gold4a 16384 THIN 19% 13218 921839bd8566b55da1744c32d347c43e
vdisk_gold5a 16384 THIN 18% 13342 f9e542e36b5cff11c854f461fbf61361
vdisk_gold6a 16384 THIN 18% 13364 2e1b6657e85846f06716a1bf4eaf6057
vdisk_gold6b 16384 THIN 0% 16385 01ab722c4b41b4f61f88dff3cba96779
vdisk_red2 16384 THIN 18% 13336 e826fbe6b0b97e905ef4a8ba10bf1cda
vdisk_red3 16384 THIN 30% 11454 f389b8b8c42ce02dc06353e622645b84
vdisk_red4 16384 THIN 17% 13533 9605f8227d6846b46fc315d992131de6
. . .
My failed server is called gold and the virtual machine I need to recover is called gold6.
The names of the LUs (virtual disks) are determined by the user - I use a simple vdisk_<LPAR-name><letter> naming convention. The letter is used when we have multiple LUs.
In light of these experiments, I think my naming convention could be improved:
- "vdisk" is largely pointless
- Our virtual machine (LPAR) names include the name of the machine (like gold), which in an LPM environment is dumb, as the machine can change every day!
- If it is not clear which disk is the boot disk, then the first disk is a good guess. I could highlight the boot disk in the name, or we need to record this fact.
So obviously the SSP LUs I need to recover are vdisk_gold6a and vdisk_gold6b. As the second disk (b) is 0% used, it is clear that vdisk_gold6a is the boot disk.
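Picking the right LUs out of a long lssp listing can be scripted by filtering on the LPAR name embedded in the LU names. A small sketch: on a real VIOS you would pipe the live lssp command through grep, but here I filter a few captured lines of the sample output above so the idea is self-contained:

```shell
# Filter SSP LU names for one LPAR, relying on the vdisk_<LPAR-name><letter>
# naming convention. On the VIOS itself you would instead run:
#   lssp -clustername galaxy -sp atlantic -bd | grep gold6
# lssp_sample is a captured fragment of the output shown above.
lssp_sample='vdisk_gold5a    16384   THIN  18%  13342  f9e542e36b5cff11c854f461fbf61361
vdisk_gold6a    16384   THIN  18%  13364  2e1b6657e85846f06716a1bf4eaf6057
vdisk_gold6b    16384   THIN   0%  16385  01ab722c4b41b4f61f88dff3cba96779
vdisk_red2      16384   THIN  18%  13336  e826fbe6b0b97e905ef4a8ba10bf1cda'
echo "$lssp_sample" | grep 'gold6'
```

This prints just the two gold6 lines, and the %Used column then tells you which of them is likely the (used) boot disk.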
So what do I need to do to recover my gold6 virtual machine?
Here are the steps:
- Select a target machine, go to its HMC, and create a new AIX LPAR with similar CPU and memory ~2 minutes
- With the same virtual network or networks.
- With a virtual SCSI connection to the VIOS. You might need to add this vSCSI device to the VIOS too.
- Go to the VIOS. Assuming the new virtual machine's vSCSI adapter on the VIOS end is vhost42, my cluster name is galaxy, and the SSP is atlantic, run the two commands that map the two LUs to that adapter ~2 minutes
- Use the HMC to boot to SMS and select the first disk as the boot disk ~1 minute
- Use the HMC to start the virtual machine
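The two VIOS commands in step 4 are not spelled out above. Assuming mkbdsp with the -bd and -vadapter flags is used to map an existing LU (rather than create a new one), and using the example names from this article (cluster galaxy, pool atlantic, adapter vhost42), they would look something like this:

```shell
# Sketch of step 4: map the two surviving SSP LUs of the failed gold6 LPAR
# to the replacement LPAR's vhost adapter on the target VIOS.
# With -bd naming an existing LU and no size given, mkbdsp maps the LU
# rather than creating a new one. vhost42 is the example adapter name.
mkbdsp -clustername galaxy -sp atlantic -bd vdisk_gold6a -vadapter vhost42
mkbdsp -clustername galaxy -sp atlantic -bd vdisk_gold6b -vadapter vhost42
```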
This boot up might take about five minutes!
Not bad for recovering an important service. Once AIX starts it has to replay any JFS2 logs (usually in seconds) and you need to start the application or RDBMS. Any database might need to recover incomplete transactions, so the service might take a little longer to be fully ready.
If that is not fast enough for you, then have a recovery virtual machine set up in advance on a target machine. In that case, you run steps 1 to 3 in advance.
Then, start the virtual machine, which takes about 30 to 60 seconds.
One word of warning: don't have the original AND the recovery virtual machines running at the same time - I think that would totally corrupt the file systems within a couple of milliseconds.
I did check this concept and idea with the SSP developers and they said, "Of course, Nigel, that works fine."