VIOS SSP Best Practices

Best practices for the administration and maintenance of VIOS Shared Storage Pools (SSP)


Storage

  • Only add RAID-protected devices to the storage pool
  • Configure mirroring with SSP
    • Mirror the system tier at a minimum (if mirroring is not done at the storage level)
    • Monitor for failed storage devices (pool or repository) and replace them immediately
    • Keep failure groups balanced (similar sizes)
  • Isolate meta-data from user data with storage tiers
    • Use at least one data tier
    • Size the system tier at 1% or more of the overall pool storage
      • potentially down to a more aggressive 0.3% for larger pools
  • SSP disk striping benefits from more LUNs (primarily with user tiers)
    • Do not use only one or two enormous LUNs in a failure group
    • Ideally 16-64 LUNs per failure group and even more LUNs as the pool size increases
  • Monitor space utilization and overcommitment regularly (see the example after this list)
    • Add storage to the pool as soon as the utilization threshold is reached
    • Determine your comfort zone for overcommitment
    • Review free space after every new VM/LPAR is deployed and fully operational
  • Configure all client lpars with multipathing
    • One path through each VIOS
  • Configure all VIOS with multipathing
    • Monitor path failures and take corrective actions
    • For non-IBM storage, we recommend using the AIX native PCM (Path Control Module) and the ODM filesets provided by the vendor.
  • Keep a spare LUN on hand to replace the CAA repository disk if necessary
    • The cluster is in degraded mode if this disk fails, so it should be replaced as soon as possible
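
Example: the monitoring and growth tasks above can be run from the padmin shell. A minimal sketch, assuming a cluster named ssp_cl and a pool named ssp_pool (placeholder names; option syntax varies slightly between VIOS levels, so verify against the man pages for your release):

      # review pool size, free space and overcommit level
      $ lssp -clustername ssp_cl
      # raise an alert when free space drops below the threshold percentage
      $ alert -set -clustername ssp_cl -spname ssp_pool -type threshold -value 75
      # confirm mirror failure groups remain balanced
      $ failgrp -list
      # grow the pool with additional RAID-protected LUNs (add to every failure group when mirrored)
      $ chsp -add -clustername ssp_cl -sp ssp_pool hdisk10 hdisk11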

Networking

  • The storage pool is highly dependent on the network for control and meta-data communication, so network reliability is important
  • Synchronize clocks of all VIOS nodes in the cluster using ntpd or xntpd
  • Keep cluster communication on a network adapter that is not congested
    • Do not share cluster communication adapter with client partitions (SEA)
    • Do not share cluster communication adapter with storage traffic (FCoE)
  • Build redundancy into the network configuration
  • Perform network administration only during a maintenance window
    • If necessary, bring storage pool and VMs down during network reconfiguration
  • Use either short or long hostnames consistently (do not use both in a cluster)
  • DNS problems can impact storage pool hostname lookup (see the example after this list). To avoid this:
    • Use /etc/hosts
    • Set /etc/netsvc.conf with "hosts=local,bind"
  • Changing hostname or IP address requires removing and adding a node
    • Do not make this change "on the fly"; doing so will confuse the cluster.
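
Example: a sketch of the name-resolution and time-sync settings above, applied as root on every VIOS node (IP addresses, hostnames and the NTP server are placeholders):

      # /etc/netsvc.conf -- resolve from /etc/hosts before DNS
      hosts=local,bind

      # /etc/hosts -- list every cluster node, using one consistent hostname form
      10.10.1.11   vios1
      10.10.1.12   vios2

      # keep clocks synchronized: point xntpd at an NTP server and start it,
      # then uncomment the xntpd entry in /etc/rc.tcpip so it starts at boot
      echo "server ntp.example.com" >> /etc/ntp.conf
      startsrc -s xntpd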

Upgrades and Maintenance

  • Stop cluster services on the node while performing any updates/maintenance on the VIOS or third-party software
  • Update the pool's server node (or 'MFS' node) last to avoid unnecessary turnover of this role.
    • Identify this role by running the following command, as root, on any active node in the pool:
      # pooladm pool lsmfs `pooladm pool list | tail -1`
  • Perform updates/maintenance on one VIOS node in a pair at a time (a sample sequence follows this list)
    • While updating one VIOS on the frame, leave the other redundant VIOS available
    • Only bring down the redundant VIOS once the updated one is fully back online
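
Example: a sample sequence for one node of a pair, from the padmin shell (the cluster name ssp_cl and node name vios1 are placeholders; the updateios source directory depends on how the update images are staged):

      # stop cluster services on the node being maintained
      $ clstartstop -stop -n ssp_cl -m vios1
      # apply the VIOS update, then reboot if the update requires it
      $ updateios -dev /home/padmin/update -install -accept
      $ shutdown -restart
      # rejoin the cluster and confirm the node is healthy before touching its partner
      $ clstartstop -start -n ssp_cl -m vios1
      $ cluster -status -clustername ssp_cl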

Miscellaneous

  • Consult the VIOS release notes (README) for current minimum requirements and limitations
  • Use viosbr either with the automatic backup option at a specified frequency, or manually after configuration changes such as adding/removing/changing nodes, disks, the repository, client LPARs, etc. (an example follows this list)
    • Archive the backup images off of the cluster
  • Build a disaster recovery policy around VIOS backup
    • For example, use storage replication and recover the cluster at a remote site on the replicated storage, using a viosbr backup
    • Test the recovery procedures periodically
  • Always collect VIOS snaps across the cluster shortly after any issue
    • The clffdc command allows collecting a cluster-wide snap (as root).
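
Example: a sketch of scheduling and taking viosbr cluster backups from the padmin shell (file names, frequency and retention count are illustrative):

      # schedule a daily cluster configuration backup, keeping the last 10 images
      $ viosbr -backup -clustername ssp_cl -file ssp_cl_cfg -frequency daily -numfiles 10
      # take an immediate backup after any manual configuration change
      $ viosbr -backup -clustername ssp_cl -file ssp_cl_manual
      # archive the generated files (typically under /home/padmin/cfgbackups) off the cluster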