IBM Support

VIOS Shared Storage Pool Single Repository Disk = Not a Problem

How To


Summary

Some administrators worry about having just one Repository disk, but having more of them would cause problems rather than fix them.

Objective


Steps

From SSP4 onwards, the pool can have mirrored failure-group FC disks to handle the failure of an adapter, FC cable, FC switch, entire FC disk subsystem, site, or VIOS.

But there is still a single SSP Repository disk.  Isn't this a single point of failure?

The answer is "No, because you can quickly rebuild the contents of the Repository Disk after it has failed."

You are not meant to know this, as it is internal to the VIOS cluster-aware AIX Shared Storage Pool software, but there is a saved copy of the Repository Disk data on every node, and even those copies are not needed most of the time. In fact, having more than one Repository disk would create greater problems. If the disks differed, how would you work out which one is best or which is right? Even if you found a way to handle that, there is a further problem: if the network broke, each half of the cluster could carry on running with a different Repository disk, each thinking its copy was the primary.

To prove the point I took my "crash and burn" demo SSP cluster and tried to destroy it via the Repository disk.

Attempt 1: Pretend I was moving the SSP to new super fast IBM disks :-)

  • Moved all pool LUNs to new disks using: pv -replace ...
  • But how to move the Repository?
  • chrepos -n globular -r +hdisk16
  • Just a few seconds and the job is done - it even clears the old LUN header so it can be reused
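The migration above can be sketched as a short padmin session. This is a hedged sketch, not the exact procedure: the cluster name globular and disk hdisk16 come from the example above, and lspv -free is just one way to confirm the target LUN is unused - substitute your own names.

```shell
# Sketch of Attempt 1, run as padmin on one VIOS node.
# Assumptions: cluster "globular", new LUN hdisk16 (names from the text above).

# Confirm the new LUN is visible and not in use by anything else
lspv -free

# Move the Repository to the new LUN - this takes only a few seconds
# and clears the header of the old Repository LUN so it can be reused
chrepos -n globular -r +hdisk16
```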

Attempt 2: Oops, silly me, I had an accident with the dd command and dd-ed zeros all over my SSP Repository disk

  • But wait: the SSP and virtual machines are still running!
  • Cool: a damaged Repository disk does not take down the SSP
  • If you check the error logs on all the Virtual I/O Servers, there will be events recording the issue
  • chrepos rebuilds it on a different disk :-)
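The recovery can be sketched as a two-step session; again a hedged sketch, with globular as the cluster name from Attempt 1 and hdisk17 standing in for any free LUN.

```shell
# Sketch of recovering from a zeroed Repository disk, run as padmin.
# Assumptions: cluster "globular", spare LUN hdisk17 (placeholder names).

# The SSP and client virtual machines are still running; check what was logged
errlog

# Rebuild the Repository on a different disk
chrepos -n globular -r +hdisk17
```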

Attempt 3: Replaced the Repository disk while a node was shut down = surely that node can never start and rejoin my SSP.

  • Start the missing VIOS - clearly it can't use the old Repository disk, but it knows about its "friends"
  • Give it five minutes
  • Wow, it's working!  Very cool - just be patient and "Don't Panic!"

Attempt 4: Nuts, I unmapped the Repository LUN on my V7000 - how silly of me!!

  • Loads of errlog messages on the VIOS = HELP!
  • Remapped the LUN from the V7000 so it is accessible again
  • All is well again :-)
  • The same would happen if you unplugged the wrong FC cable or powered off a switch etc. Get it back online reasonably quickly and all is OK.

Attempt 5: Completely deleted Repository LUN on the V7000

  • There is no way back, as the deleted LUN's content can't be recovered.
  • The SSP retries for 1 hour, then admits defeat.
  • I recommend going for lunch, so you don't sit there worrying!
  • It records in the errlog that it now has no Repository disk.
  • chrepos to a new disk works = OK
  • Note: if you don't wait until it gives up on the old LUN, the chrepos will fail, as the cluster is still retrying to get that LUN working.
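Putting the note above into a sketch: wait for the cluster to give up on the deleted LUN before running chrepos. The one-hour figure is from the text, but the grep pattern is an assumption about how the "no Repository disk" event is worded - check your own errlog output for the real label.

```shell
# Sketch of Attempt 5 recovery, run as padmin.
# Assumptions: cluster "globular", replacement LUN hdisk18, and that the
# "no Repository disk" event mentions "repos" - verify your real errlog labels.

# Poll until the SSP has admitted defeat (it retries for about an hour)
while ! errlog | grep -qi repos; do
    sleep 300    # go for lunch rather than sitting there worrying
done

# Only now will rebuilding the Repository on a new LUN succeed
chrepos -n globular -r +hdisk18
```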

Attempt 6: A real intermittent FC laser failure on the VIOS FC adapter. I would like to claim we somehow injected this issue, but it was a genuine GBIC failure.

  • We got ~50 errlog messages in the VIOS errlog command output.
  • This VIOS only has one FC adapter and cable
  • SSP4 carried on doing I/O for its client's virtual machines
  • We smelled a rat! How could that possibly carry on working?
  • We asked the developers what happens: the SSP routes the I/O over the network to another friendly VIOS and uses its FC cable
  • I was impressed - clearly this is slower, but it keeps the client virtual machines running.

Repository Disk / LUN Conclusions:

  • Rock solid technology
  • chrepos fixes the issues
  • You may need patience while it tries to recover
  • I can’t get it to fail
  • Actually better off for not having multiple copies

But do track the VIOS errlogs for warnings.
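A minimal sketch of that tracking, as a function you could run regularly on each VIOS: it scans saved errlog output for pool- and Repository-related entries. The keyword list is an assumption, not the official error labels - check what your own VIOS actually logs and adjust the pattern.

```shell
#!/bin/sh
# Hedged sketch: flag SSP/Repository-related lines in captured errlog output.
# The keyword list below is an assumption, not the official error labels -
# inspect "errlog" output on your own VIOS and adjust the pattern.
scan_errlog() {
    # $1 = file containing errlog output (e.g. from: errlog > /tmp/errlog.txt)
    grep -iE 'repos|pool|cluster' "$1" || echo "no SSP warnings found"
}
```

On a real VIOS you might capture the errlog output to a file first and then run scan_errlog over it from a scheduled job.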

Additional Information


Other places to find Nigel Griffiths IBM (retired)

Document Location

Worldwide


Document Information

Modified date:
03 July 2023

UID

ibm11116225