VIOS SSP Best Practices

Best practices for the administration and maintenance of VIOS Shared Storage Pools (SSP)


Storage

  • Only add RAID-protected devices to the storage pool
  • Configure mirroring with SSP
    • Mirror the system tier at a minimum (if mirroring is not done at the storage level)
    • Monitor for failed storage devices (pool or repository) and replace them immediately
    • Keep failure groups balanced (similar sizes)
  • Isolate meta-data from user data with storage tiers
    • Use at least one data tier
    • Size the system tier at 1% or more of the overall pool storage
      • potentially down to a more aggressive 0.3% for larger pools
  • SSP disk striping benefits from more LUNs (primarily with user tiers)
    • Do not use only one or two enormous LUNs in a failure group
    • Ideally 16-64 LUNs per failure group and even more LUNs as the pool size increases
  • Monitor space utilization and overcommitment regularly (see the example after this list)
    • Add storage to the pool as soon as the utilization threshold is reached
    • Determine your comfort zone for overcommitment
    • Review free space after every new VM/LPAR is deployed and fully operational
  • Configure all client lpars with multipathing
    • One path through each VIOS
  • Configure all VIOS with multipathing
    • Monitor path failures and take corrective actions
    • For non-IBM storage, we recommend using the AIX native PCM (Path Control Module) and the ODM filesets provided by the vendor.
  • Keep a spare LUN on hand to replace the CAA repository disk if necessary
    • The cluster is in degraded mode if this disk fails, so it should be replaced as soon as possible
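
Example: the monitoring and growth tasks above can be run from the padmin shell. A minimal sketch, assuming a cluster named ssp_cl and a pool named ssp_pool (placeholder names; option syntax varies slightly between VIOS levels, so verify against the man pages for your release):

      # review pool size, free space and overcommit level
      $ lssp -clustername ssp_cl
      # raise an alert when free space drops below the threshold percentage
      $ alert -set -clustername ssp_cl -spname ssp_pool -type threshold -value 75
      # confirm mirror failure groups remain balanced
      $ failgrp -list
      # grow the pool with additional RAID-protected LUNs (add to every failure group when mirrored)
      $ chsp -add -clustername ssp_cl -sp ssp_pool hdisk10 hdisk11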

Networking

  • The storage pool is highly dependent on the network for control and meta-data communication, so network reliability is important
  • Synchronize clocks of all VIOS nodes in the cluster using ntpd or xntpd
  • Keep cluster communication on a network adapter that is not congested
    • Do not share cluster communication adapter with client partitions (SEA)
    • Do not share cluster communication adapter with storage traffic (FCoE)
  • Build redundancy into the network configuration
  • Perform network administration only during a maintenance window
    • If necessary, bring storage pool and VMs down during network reconfiguration
  • Use either short or long hostnames consistently (do not use both in a cluster)
  • DNS problems can impact storage pool hostname lookup (see the example after this list). To avoid this:
    • Use /etc/hosts
    • Set /etc/netsvc.conf with "hosts=local,bind"
  • Changing hostname or IP address requires removing and adding a node
    • Do not make this change "on the fly"; doing so will confuse the cluster.
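
Example: a sketch of the name-resolution and time-sync settings above, applied as root on every VIOS node (IP addresses, hostnames and the NTP server are placeholders):

      # /etc/netsvc.conf -- resolve from /etc/hosts before DNS
      hosts=local,bind

      # /etc/hosts -- list every cluster node, using one consistent hostname form
      10.10.1.11   vios1
      10.10.1.12   vios2

      # keep clocks synchronized: point xntpd at an NTP server and start it,
      # then uncomment the xntpd entry in /etc/rc.tcpip so it starts at boot
      echo "server ntp.example.com" >> /etc/ntp.conf
      startsrc -s xntpd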

Upgrades and Maintenance

  • Stop cluster services on the node while performing any updates/maintenance on the VIOS or third-party software
  • Update the pool's server node (or 'MFS' node) last to avoid unnecessary turnover of this role.
    • Identify this role by running the following command, as root, on any active node in the pool:
      # pooladm pool lsmfs `pooladm pool list | tail -1`
  • Perform updates/maintenance on one VIOS node in a pair at a time (a sample sequence follows this list)
    • While updating one VIOS on the frame, leave the other redundant VIOS available
    • Only bring down the redundant VIOS once the updated one is fully back online
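
Example: a sample sequence for one node of a pair, from the padmin shell (the cluster name ssp_cl and node name vios1 are placeholders; the updateios source directory depends on how the update images are staged):

      # stop cluster services on the node being maintained
      $ clstartstop -stop -n ssp_cl -m vios1
      # apply the VIOS update, then reboot if the update requires it
      $ updateios -dev /home/padmin/update -install -accept
      $ shutdown -restart
      # rejoin the cluster and confirm the node is healthy before touching its partner
      $ clstartstop -start -n ssp_cl -m vios1
      $ cluster -status -clustername ssp_cl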

Miscellaneous

  • Consult the VIOS release notes (README) for current minimum requirements and limitations
  • Use viosbr either with the automatic backup option at a specified frequency, or manually after configuration changes such as adding/removing/changing nodes, disks, the repository, client LPARs, etc. (an example follows this list)
    • Archive the backup images off of the cluster
  • Build a disaster recovery policy around VIOS backup
    • For example, use storage replication and recover the cluster at a remote site on the replicated storage, using a viosbr backup
    • Test the recovery procedures periodically
  • Always collect VIOS snaps across the cluster shortly after any issue
    • The clffdc command allows collecting a cluster-wide snap (as root).
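
Example: a sketch of scheduling and taking viosbr cluster backups from the padmin shell (file names, frequency and retention count are illustrative):

      # schedule a daily cluster configuration backup, keeping the last 10 images
      $ viosbr -backup -clustername ssp_cl -file ssp_cl_cfg -frequency daily -numfiles 10
      # take an immediate backup after any manual configuration change
      $ viosbr -backup -clustername ssp_cl -file ssp_cl_manual
      # archive the generated files (typically under /home/padmin/cfgbackups) off the cluster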