VIOS SSP Best Practices
Best practices for the administration and maintenance of VIOS Shared Storage Pools (SSP)
Storage
- Only add RAID-protected devices to the storage pool.
- Configure mirroring with SSP (see the failgrp example after this group of items)
- Mirror system tier at a minimum (if not performed at storage level)
- Monitor failure of storage devices (for pool or repository) and replace immediately
- Failure groups should be kept balanced (similar size)
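A minimal sketch of creating a second failure group to mirror the pool, with hypothetical failure-group and hdisk names (run from the padmin shell; check 'help failgrp' for the exact operands on your VIOS level):

  List the existing failure groups and their sizes:
  $ failgrp -list
  Create a second, similarly sized failure group from new pool LUNs:
  $ failgrp -create -fg FG2: hdisk7 hdisk8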
- Isolate meta-data from user data with storage tiers (see the tier example below)
- Use at least one data tier
- Size the system tier at 1% or more of the overall pool storage
- For larger pools this can be reduced, potentially down to about 0.3%
- SSP disk striping benefits from more LUNs (primarily with user tiers)
- Do not use only one or two enormous LUNs in a failure group
- Ideally 16-64 LUNs per failure group and even more LUNs as the pool size increases
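A sketch of separating user data into its own tier, on VIOS levels that support multiple tiers; the tier and hdisk names are hypothetical, and the operands should be checked against 'help tier' on your level (run from the padmin shell):

  List the tiers in the pool and their sizes:
  $ tier -list
  Create a separate data tier from additional pool LUNs:
  $ tier -create -tier datatier1: hdisk10 hdisk11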
- Monitor space utilization and overcommitment regularly (see the lssp/alert example below)
- Add storage to the pool as soon as the utilization threshold is reached
- Determine your comfort zone for overcommitment
- Review free space after every new VM/LPAR is deployed and fully operational
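A sketch of this monitoring loop; the cluster, pool, and hdisk names are hypothetical, and the meaning of the alert threshold value should be checked against the alert man page for your level (run from the padmin shell):

  Show pool size, free space, and overcommit figures:
  $ lssp -clustername CL1
  Raise an alert when the pool crosses the chosen utilization threshold:
  $ alert -set -clustername CL1 -spname pool1 -type threshold -value 75
  Add storage once the threshold is reached (for a mirrored pool, specify the failure group; see 'help chsp'):
  $ chsp -add -clustername CL1 -sp pool1 hdisk12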
- Configure all client lpars with multipathing
- One path through each VIOS
- Configure all VIOS with multipathing
- Monitor path failures and take corrective action (see the lspath example below)
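For example, path health can be checked on an AIX client LPAR as follows (the device name is hypothetical); expect one Enabled path through each VIOS:

  List all paths and their states:
  # lspath
  With the AIX native PCM, lsmpio gives per-path detail for a given disk:
  # lsmpio -l hdisk0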
- For non-IBM storage, we recommend using the AIX native PCM (Path Control Module), and the ODM filesets provided by the vendor.
- Keep a spare LUN on hand to replace the CAA repository disk if necessary (see the lscluster example below)
- The cluster runs in degraded mode while this disk is failed, so replace it as soon as possible
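The repository disk state can be checked from any node; if it fails, the spare LUN is swapped in with the VIOS chrepos command (check its man page for the exact operands on your level):

  Show the cluster disks, including the repository disk, and their state (as root):
  # lscluster -d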
Networking
- The storage pool is highly dependent on the network for control and meta-data communication, so network reliability is important
- Synchronize clocks of all VIOS nodes in the cluster using ntpd or xntpd
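For example, on recent VIOS levels the padmin NTP service reads /home/padmin/config/ntp.conf; a minimal setup (the server name is a placeholder) looks like:

  Add the time server to /home/padmin/config/ntp.conf:
  server ntp1.example.com
  Start the NTP service from the padmin shell:
  $ startnetsvc xntpd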
- Keep cluster communication on a network adapter that is not congested (see the lscluster -i example below)
- Do not share cluster communication adapter with client partitions (SEA)
- Do not share cluster communication adapter with storage traffic (FCoE)
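Which interfaces the cluster is actually using for communication can be confirmed from any node:

  List the cluster network interfaces and their state on every node (as root):
  # lscluster -i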
- Redundancy in network configuration
- Configure multiple network interfaces with 'cluster -addips' (requires cluster services to be stopped on that node; see the example below)
- Ensure these networks are on separate isolated subnets
- Also consider configuring disk communication with 'cluster -addcompvs' in case all networks go down
- More information about these options is available in Knowledge Center:
https://www.ibm.com/support/knowledgecenter/8284-22A/p8hb1/p8hb1_networkingforssp.htm
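A sketch of the sequence for adding a second cluster network, with hypothetical cluster and node names; the exact 'cluster -addips' operands differ by VIOS level, so check 'help cluster' and the Knowledge Center page above:

  Stop cluster services on the node being changed (padmin shell):
  $ clstartstop -stop -n CL1 -m vios1
  Add the additional cluster IP address(es) with 'cluster -addips', then restart cluster services:
  $ clstartstop -start -n CL1 -m vios1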
- Perform network administration only during a maintenance window
- If necessary, bring storage pool and VMs down during network reconfiguration
- Use either short or long hostnames consistently (do not use both in a cluster)
- DNS problems can impact storage pool host name lookup (see the example below). To avoid this:
- Use /etc/hosts
- Set /etc/netsvc.conf with "hosts=local,bind"
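For example, with hypothetical addresses and short hostnames used consistently on every node:

  /etc/netsvc.conf:
  hosts=local,bind
  /etc/hosts (same entries on every VIOS in the cluster):
  10.1.2.11   vios1
  10.1.2.12   vios2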
- Changing a node's hostname or IP address requires removing the node from the cluster and adding it back (see the example below)
- Do not attempt this change "on the fly"; the cluster's view of the node will become inconsistent
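A sketch of the supported approach, with hypothetical cluster and node names (run from the padmin shell on another active node):

  $ cluster -rmnode -clustername CL1 -hostname vios2
  Change the hostname or IP address on vios2, then re-add it to the cluster:
  $ cluster -addnode -clustername CL1 -hostname vios2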
Upgrades and Maintenance
- Stop cluster services on the node while performing any updates/maintenance on the VIOS or third-party software
- Update the pool's server node (or 'MFS' node) last to avoid unnecessary extra turnover of this role.
- Identify this role by running the following command, as root, on any active node in the pool:
# pooladm pool lsmfs `pooladm pool list | tail -1`
- Perform updates/maintenance on one VIOS node of a pair at a time (see the example sequence below)
- While updating one VIOS on the frame, leave the other redundant VIOS available
- Only bring down the redundant VIOS once the updated one is back fully online
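A sketch of a per-node update sequence with hypothetical cluster, node, and directory names; repeat on the second VIOS of the pair only after the first is fully back online:

  Stop cluster services on the node being updated (padmin shell):
  $ clstartstop -stop -n CL1 -m vios1
  Apply the update from a directory of installation images, rebooting if required:
  $ updateios -dev /home/padmin/update -accept -install
  Restart cluster services and confirm the node and pool are healthy:
  $ clstartstop -start -n CL1 -m vios1
  $ cluster -status -clustername CL1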
Miscellaneous
- Consult the VIOS release notes (README) for the current minimum requirements and limitations
- Use viosbr either with the automatic backup option at a specified frequency, or manually after configuration changes such as adding/removing/changing nodes, disks, the repository, client LPARs, etc. (see the example below)
- Archive the backup images off of the cluster
- Build a disaster recovery policy around VIOS backup
- For example, use storage replication and recover the cluster at a remote site on the replicated storage, using the viosbr backup
- Test the recovery procedures occasionally
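For example, a scheduled daily cluster backup keeping the last 10 images (cluster and file names are placeholders); archive the resulting files off the cluster as part of the DR procedure:

  $ viosbr -backup -clustername CL1 -file sspbackup -frequency daily -numfiles 10
  View the available backup files:
  $ viosbr -view -list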
- Always collect VIOS snaps across the cluster shortly after any issue
- The clffdc command allows collecting a cluster-wide snap (as root).
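For example, a standard snap can be collected on each VIOS from the padmin shell, or a single cluster-wide collection can be triggered with clffdc as root (check its man page for the options at your level):

  Collect a snap on one VIOS (padmin shell):
  $ snap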
Links and Social Media
- IBM PowerVM LinkedIn Group - http://www.linkedin.com/groups/8403988
- IBM PowerVM Developerworks Wiki - http://tinyurl.com/z6s29k2
- VIOS SSP YouTube videos - http://www.youtube.com/user/nigleargriffiths