IBM Support

Shared Storage Pool Stuck or Down? Don't Panic - Raise a PMR

How To


Summary

If you have an SSP problem: DO NOT PANIC or FIDDLE ABOUT.

Steps

I have had a few customers and IBMers tell me that their Shared Storage Pool (SSP) failed to come up after some major disaster, like:

  • Total network outage
  • Total SAN outage
  • Unexpected total site electrical power outage

They then fiddle about and eventually, hours or even days later, send me an email.

All I can say is that I am sorry to hear about their issue, and that I don't have those problems, even though I have had my share of electricity cuts (ironically, while testing the uninterruptible power supply!). But I can offer some advice . . .

The VIOS SSP feature is built to be easy to operate, like a car. You have deliberately NOT been handed a large set of tools to diagnose issues or take it apart. As with cars, it was found that users with powerful tools and little knowledge do more harm than good. A lease-hire car agreement clearly states that any damage you do to the car trying to "fix it" will be paid for by YOU! What they have effectively done, in the above case, is broken down in their high-tech car on the motorway and spent the whole weekend fiddling about and looking around the engine compartment, probably scratching their heads and wondering what it all does, and tinkering with various parts and settings. The smart money would have called the country's car breakdown service immediately and got an expert on the case to "work the problem".

Now let me confess: I have had a few problems with SSP over the years, but the SSP was not to blame. It was me. In testing I have made some ghastly mistakes, some of them deliberately "to see what happens" to my "crash and burn" SSP. Whenever I get problems with my "sort of production" SSP at start-up, with reporting commands (like cluster -list) returning "unable to connect to database", it is 100% a user-created problem: DNS is down, or I messed up the local /etc/hosts file on a VIOS by misspelling one VIOS hostname (a 2 should have been a 1) or by duplicating an IP address (two VIOS with the same address due to bad editing). Note: /etc/hosts should list all the VIOS on every VIOS in the SSP, and /etc/netsvc.conf should be set to resolve locally first. Basically, these are self-inflicted wounds. In fact, it often amazes me that the SSP was working on any nodes after some of my screw-ups!
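The two /etc/hosts mistakes described above (a missing or misspelled VIOS hostname, and two VIOS sharing one IP address) are easy to check for mechanically. Here is a minimal sketch in Python. It assumes you have Python 3 on some admin workstation and can get at a copy of the hosts file text. The hostnames, addresses and function names are made up for illustration and are not part of any VIOS tooling.

```python
from collections import defaultdict

# Hypothetical sample: vios3 reuses the address of vios2, and vios4 is
# misspelled as vios44. All names and addresses here are invented.
SAMPLE_HOSTS = """\
127.0.0.1   loopback localhost
10.0.0.11   vios1.example.com vios1
10.0.0.12   vios2.example.com vios2
10.0.0.12   vios3.example.com vios3
10.0.0.14   vios44.example.com vios44
"""

def parse_hosts(text):
    """Return a list of (ip, [names]) tuples from /etc/hosts-style text."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1:]))
    return entries

def check_hosts(text, expected_nodes):
    """Return (missing_names, duplicate_ips) for the given hosts text.

    missing_names: expected node names that appear nowhere in the file.
    duplicate_ips: addresses listed on more than one line under
                   different first (canonical) names.
    """
    known_names = set()
    ip_to_canonical = defaultdict(set)
    for ip, names in parse_hosts(text):
        known_names.update(names)
        ip_to_canonical[ip].add(names[0])
    missing = sorted(n for n in expected_nodes if n not in known_names)
    duplicates = {ip: sorted(names)
                  for ip, names in ip_to_canonical.items()
                  if len(names) > 1}
    return missing, duplicates

if __name__ == "__main__":
    missing, duplicates = check_hosts(
        SAMPLE_HOSTS, ["vios1", "vios2", "vios3", "vios4"])
    print("Missing or misspelled nodes:", missing)
    print("Addresses used more than once:", duplicates)
```

Run the same check against the hosts file from every VIOS in the SSP, as the file should list all the nodes on each of them.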

So perhaps I should highlight to all SSP users that, at the first sign of an SSP issue, you are not alone.

Here is What to Do Next

  1. Don't panic!  DO NOT FIDDLE.
  2. Ensure all the SSP data LUNs and the Repository LUN are online on every VIOS SSP node. The VIOS command lspv -size is good here and, if available, lscluster -d.
  3. Ensure your network actually works between all the VIOSes.   Try more than just ping. Sometimes ping works but nothing else does!  So also try ssh between nodes.
  4. Wait an hour with all SSP VIOS running. SSP does attempt to self-heal but does so carefully: lock timeouts mean it can take a while for locks to be cleared and retries attempted. This careful approach, rather than "madly rushing about to correct issues like a headless chicken", is typical of clusters, as rushing can cause yet more problems and deadlocks.
  5. DO NOT FIDDLE with the Repository disk unless everything else is working OK and you have HMC Repository disk warnings. The command to switch Repo LUN is chrepos.
  6. DO NOT FIDDLE and run SSP commands (other than cluster -list and lu -list which are simple and a good quick check of SSP health).
    Some complex cluster wide SSP commands like cluster -status take local and global locks - this blocks the SSP from fixing itself!
  7. Raise a PMR and get VIOS Support working on it for you.
  8. Prepare to send IBM Support diagnostic data about the SSP and even create an SSP snap in advance with clffdc (for example: clffdc -c FULL -p 2), then be prepared to send them the large file it creates.
  9. Pro-actively do your part.  Be prepared to answer questions quickly and follow instructions to minimise the down time.
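
For step 3, "more than just ping" can be as simple as checking that a TCP connection to the ssh port opens on every node. The following is a minimal sketch in Python, assuming Python 3 is available on the machine you test from and that ssh listens on the default port 22. The function names are my own for illustration. It only opens and closes a TCP connection, runs no SSP commands and takes no SSP locks.

```python
import socket
import sys

def tcp_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False

def check_nodes(nodes, port=22, timeout=5.0):
    """Return a dict mapping each node name to True or False."""
    return {node: tcp_reachable(node, port, timeout) for node in nodes}

if __name__ == "__main__":
    # Pass the VIOS hostnames on the command line.
    for node, ok in check_nodes(sys.argv[1:]).items():
        print(node, "reachable on port 22" if ok else "NOT reachable on port 22")
```

Run it from each VIOS in turn, or from wherever Python is available, with the full list of VIOS hostnames as arguments. A node that answers ping but fails here is exactly the "ping works but nothing else does" case.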

From working with the VIOS Shared Storage Pool support and development team, I am always impressed with their diagnostic skills and ability to work out a tricky problem.

Additional Information


Other places to find Nigel Griffiths IBM (retired)

Document Location

Worldwide

Document Information

Modified date:
13 June 2023

UID

ibm11115637