High availability setup
High availability setups for applications on Linux® on IBM Z® or LinuxONE typically use path redundancy for network connections.
Identifying suitable PCI functions
To avoid outages during PCI network adapter maintenance, use redundant paths through PCI functions. Such PCI functions have different values in /sys/bus/pci/devices/<pci_id>/pfip/segment*, where <pci_id> is the function address. The pfip/segment* attributes provide an abstract indication of the path that is used to access the PCI function, with segment0 having the highest significance. Comparing these values for two or more PCI functions indicates the degree of isolation between their paths. If possible, choose PCI functions with a high degree of isolation. Read more information about segments and other PCI device information.
The following example reads segment0 for the PCI functions 000d:00:00.0 and 00b5:00:00.0:
# cat /sys/bus/pci/devices/000d:00:00.0/pfip/segment0
0x01
# cat /sys/bus/pci/devices/00b5:00:00.0/pfip/segment0
0x03
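The comparison of segment values can also be scripted. The following is a minimal sketch, not an IBM-provided tool. It assumes that each pfip/segment* attribute holds a single hexadecimal value, as in the example output. The function names read_segments and first_differing_segment, and the root parameter that allows pointing the code at a test directory, are hypothetical.

```python
from pathlib import Path

# Default sysfs location of PCI functions, as used in the text above.
SYSFS_PCI = Path("/sys/bus/pci/devices")


def read_segments(pci_id, root=SYSFS_PCI):
    """Return the pfip segment values of a PCI function as integers,
    ordered segment0, segment1, ... (segment0 is most significant)."""
    pfip = Path(root) / pci_id / "pfip"
    files = sorted(
        pfip.glob("segment*"),
        key=lambda p: int(p.name[len("segment"):]),
    )
    # Assumption: each file contains one hex value such as "0x01".
    return [int(p.read_text().strip(), 16) for p in files]


def first_differing_segment(pci_a, pci_b, root=SYSFS_PCI):
    """Return the index of the most significant segment in which the two
    PCI functions differ, or None if all compared segments match."""
    segments_a = read_segments(pci_a, root)
    segments_b = read_segments(pci_b, root)
    for index, (a, b) in enumerate(zip(segments_a, segments_b)):
        if a != b:
            return index
    return None
```

For example, first_differing_segment("000d:00:00.0", "00b5:00:00.0") would return 0 for the two functions shown above, because they already differ in segment0. Because segment0 has the highest significance, a lower index suggests a higher degree of isolation. A result of None means that the compared segments are identical.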
Bonding for PCI network interface redundancy
On Linux, you can use the bonding device driver to create bonded interfaces. For more information about bonded interfaces, see Linux Channel Bonding Best Practices and Recommendations. This publication describes bonding of OSA-Express based interfaces, but the descriptions of the bonding device driver also apply to PCI based interfaces. In particular, as for OSA-Express, the BONDING_MODULE_OPTS specification must include the fail_over_mac option. The exact name of the configuration variable can vary by distribution.
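To verify that the fail_over_mac option took effect on a running bond, you can read the setting back from the bonding driver's sysfs attributes. The following is a minimal sketch. It assumes that the bonding driver exposes /sys/class/net/<bond>/bonding/fail_over_mac and reports the policy name followed by its numeric value (for example "active 1"). The function name fail_over_mac_policy and the root parameter are hypothetical.

```python
from pathlib import Path


def fail_over_mac_policy(bond="bond0", root="/sys/class/net"):
    """Return the fail_over_mac policy name of a bonded interface,
    for example "none", "active", or "follow"."""
    attribute = Path(root) / bond / "bonding" / "fail_over_mac"
    # Assumption: the attribute reads as "<name> <number>", e.g. "active 1".
    fields = attribute.read_text().split()
    return fields[0] if fields else None
```

A return value of "none" indicates that the option is not in effect for that bond, which suggests that the distribution-specific bonding configuration should be reviewed.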
Path redundancy for SMC-R connections
Use SMC-R link groups to guard against link failure in SMC-R connections. SMC-R automatically creates link groups for PCI functions with matching PNET IDs.
To safeguard against failure of the TCP/IP connection, use a bonded interface that combines paths through two different OSA-Express adapters.
PCI functions with a high level of isolation (see Bonding for PCI network interface redundancy) are selected. In Linux, these paths result in network interfaces eno181 and eno13.
The paths through the two OSA-Express adapters result in network interfaces eth0 and eth1. These two interfaces are bonded into an interface bond0.
In its IOCDS, the hardware configuration assigns the same PNET ID, PNET1, to eno181, eno13, eth0, and eth1. This common PNET ID associates the four interfaces, and by extension also the bonded interface bond0. The two PCI based interfaces form an SMC link group.
A connection that is initiated through bond0 has a redundant TCP/IP connection. The SMC link group provides failover for RDMA traffic.
Situations to consider
- External network issues
- Even if the PCI network interface is fully operational, communication with the target system can fail due to external network problems, such as faulty switches or routers. Use standard mechanisms like Spanning Tree Protocol (STP) and dynamic routing to mitigate single points of failure in the network. These mechanisms are outside the scope of this document.
- Carrier loss
- If a physical link of a PCI network adapter is lost or inactive (for example, due to cable disconnection or the loss of an optical signal), all interfaces on that adapter report NO-CARRIER status. In a bonding configuration, the bond interface automatically switches to an alternate interface until the carrier is restored. You can simulate this condition for testing by disconnecting the network cable or disabling the corresponding port on the hardware switch.
- PCI device recovery
- Device recovery can be initiated by firmware, by the device driver, or manually by using the zpcictl command. For more details about zpcictl, see Device Drivers, Features, and Commands: Chapter 39 - PCI Express support.
- PCI function in standby state
- A PCI function can be configured offline either from the SE or HMC or by the owning Linux system. Firmware-initiated code updates can include setting PCI functions offline and online again. When a PCI function is configured offline, the corresponding network interface is destroyed. Configuring the PCI function online again re-creates the network interface. For a successful recovery, ensure that configuration settings such as IP addresses, bonding, and VLANs are set up persistently in your network management tools.
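
When testing the carrier-loss and standby situations above, it can help to observe which interface the bond currently uses and which member interfaces report a carrier. The following is a minimal sketch under stated assumptions: the bond runs in a mode that reports an active interface (such as active-backup), and the standard sysfs attributes /sys/class/net/<interface>/carrier and /sys/class/net/<bond>/bonding/slaves and active_slave are available. The function names carrier_up and bond_status, and the root parameter, are hypothetical.

```python
from pathlib import Path


def carrier_up(ifname, root="/sys/class/net"):
    """Return True if the interface reports a carrier, otherwise False."""
    try:
        return (Path(root) / ifname / "carrier").read_text().strip() == "1"
    except OSError:
        # The read fails if the interface is administratively down or
        # no longer exists (for example, PCI function configured offline).
        return False


def bond_status(bond="bond0", root="/sys/class/net"):
    """Return the active member interface of a bond and the carrier
    state of each member interface."""
    bonding = Path(root) / bond / "bonding"
    members = (bonding / "slaves").read_text().split()
    # The attribute is empty if no member is currently active.
    active = (bonding / "active_slave").read_text().strip() or None
    return {
        "active": active,
        "carrier": {name: carrier_up(name, root) for name in members},
    }
```

Calling bond_status("bond0") before and after disconnecting a cable should show the active interface changing and the affected member reporting no carrier. An interface that disappears from the member list altogether suggests that its PCI function was configured offline.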
