Configuring Multi-Rail over TCP (MROT)

IBM Spectrum Scale 5.1.5 introduces the Multi-Rail over TCP (MROT) feature. This feature enables the concurrent use of multiple subnets to communicate with a specified destination, and now allows the concurrent use of multiple physical network interfaces without requiring bonding to be configured.

With MROT, the subnets attribute of the mmchconfig command can be used to provide fault tolerance and automatic failover. All IP addresses that belong to the subnets listed in the subnets attribute are used to establish connections with the nodes in the cluster. If some of the interfaces that correspond to these IP addresses are down, GPFS uses the remaining interfaces in the listed subnets for communication. The interfaces that correspond to the daemon IP addresses must remain operational even when the subnets attribute is configured. This requirement also applies to releases where MROT is not implemented.

If you want to use the daemon IP address for communication in addition to the IP addresses in the subnets attribute, you must also list the subnet of the daemon IP address in the subnets attribute. In releases where MROT is not implemented, only one IP address is used for daemon communication: either the daemon IP address or another IP address that is taken from the subnets attribute.

With MROT, when the subnets attribute lists multiple subnets, if any of these multiple subnets are defined on both the local node and the peer node, then all the common subnets are used for communication. In versions where MROT is not implemented, only the first subnet that is common in the list is used for communication between the local and peer nodes.

Rules for configuring the subnets attribute

Use the mmchconfig command to modify the subnets attribute, to define the IP addresses and network interfaces used for daemon communication. The following rules define how the subnets attribute configuration is processed for any local node, and how communication is done with the peer nodes:

If the subnets attribute is not configured, daemon IP addresses are used for communication.
Multiple subnets can be configured. A subnet is used for communication if both the local node and the peer node have network interfaces configured on it.
All IP addresses that are defined in the common subnets are used for communication.
TCP connections are only established between those IP addresses that are within the same subnet.
If IP addresses are defined in the subnets attribute for both the local and the peer node in the same subnet, then the daemon IP address is not used for communication, by default. To use the daemon IP address as well, it must be configured through the subnets attribute and the daemon IP addresses of a pair of nodes must be in the same subnet.
If there are no IP addresses in the same subnet, then daemon IP addresses are used for communication.
Load balancing and failover are supported.
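
As a minimal sketch of setting the attribute that these rules describe, the following helper builds the mmchconfig invocation for a list of subnets. The function name set_subnets and the MROT_APPLY switch are hypothetical and exist only for this illustration. By default the helper prints the command instead of running it.

```shell
# Build the mmchconfig invocation for a list of subnets.
# Prints the command by default; runs it only when MROT_APPLY=1
# (MROT_APPLY is a hypothetical switch used only in this sketch).
set_subnets() (
    [ "$#" -gt 0 ] || { echo "usage: set_subnets SUBNET..." >&2; exit 1; }
    subnets="$*"    # space-separated list, as in subnets='192.168.1.0 10.0.0.0'
    if [ "${MROT_APPLY:-0}" = "1" ]; then
        mmchconfig subnets="$subnets"
    else
        printf 'mmchconfig subnets="%s"\n' "$subnets"
    fi
)

set_subnets 192.168.1.0 10.0.0.0
```

After the attribute is changed, the configured value can be displayed with mmlsconfig subnets.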

Failure detection and recovery

When a TCP connection is marked as broken, ICMP is used to detect the connectivity of the source IP address and the destination IP address. If the ICMP echo reply is not received, the IP pair is marked as down. A background recovery thread continuously detects the connectivity of this IP pair. When the ICMP echo reply is received, this IP pair is marked as up, and can be used again to establish TCP connections. The ICMP echo command (network ping) must be unblocked in the firewall for IBM Spectrum Scale to function properly.

For more information, see Firewall recommendations for internal communication among nodes.
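
The ICMP check described above can be reproduced by hand when you troubleshoot an IP pair that is marked as down. The following sketch assumes the Linux iputils ping command. The function names probe and ip_pair_status are hypothetical, and the sketch only mimics the check from the command line; it is not how the daemon itself performs detection.

```shell
# Probe one IP pair with a single ICMP echo request.
# -I selects the source address, -c 1 sends one request, -W 1 waits 1 second.
probe() {
    ping -c 1 -W 1 -I "$1" "$2" >/dev/null 2>&1
}

# Print "up" or "down" for a source/destination IP pair.
ip_pair_status() {
    if probe "$1" "$2"; then echo up; else echo down; fi
}

# Example: ip_pair_status 192.168.1.1 192.168.1.2
```

If this check reports down while the interfaces are operational, a firewall rule that blocks ICMP echo is a likely cause.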

IP Pair table

With MROT, the IP pair table contains all the IP pairs, each consisting of a source IP address and a destination IP address, that are used to establish TCP connections between the local node and the peer node. When the subnets attribute lists multiple subnets, and any of these subnets are defined for both the local node and the peer node, then all the common subnets are used for communication. All the IP addresses on the local node and the peer node that are defined in the common subnets are used for communication and are listed in the IP pair table. You can use the mmdiag --network command to show the IP pair table. For example, consider the following subnets:

subnets='192.168.1.0 10.0.0.0'

For the IP pair table between Node A and Node B, if Node A has network interfaces with IP addresses 192.168.1.1 and 10.0.0.1, and Node B has network interfaces with IP addresses 192.168.1.2 and 10.0.0.2, then the IP pair table on Node A will be as shown.

Table 1. IP Pair Table on Node A
idx	iface	Status	Source	Destination	Subnet
0	eth0	up	10.0.0.1	10.0.0.2	10.0.0.0/24
1	eth1	up	192.168.1.1	192.168.1.2	192.168.1.0/24

The TCP connections between Node A and Node B are established based on the IP pair table, by using a round-robin algorithm.
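
The round-robin assignment can be pictured with the following sketch. It is an illustration only: the function name assign_round_robin is hypothetical, and wrapping back to the first pair when there are more connections than pairs is an assumption about what round-robin means here.

```shell
# Assign NCONNS connections to the given IP pairs in round-robin order.
# Usage: assign_round_robin NCONNS PAIR...
assign_round_robin() (
    nconns=$1; shift
    npairs=$#
    [ "$npairs" -gt 0 ] || exit 1
    i=0
    while [ "$i" -lt "$nconns" ]; do
        idx=$(( i % npairs ))
        eval "pair=\${$(( idx + 1 ))}"    # positional parameter number idx+1
        echo "conn $i uses pair $idx: $pair"
        i=$(( i + 1 ))
    done
)

# The two IP pairs from Table 1, written as source,destination.
assign_round_robin 4 "10.0.0.1,10.0.0.2" "192.168.1.1,192.168.1.2"
```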

Configuring N:N connection model

Figure 1. Configuring the N:N connection model

For a specific subnet in the subnets attribute list, when the number of IP addresses in this subnet on Node A is the same as the number of IP addresses in this subnet on Node B, the connection model is referred to as the N:N model. There are N IP pairs in this subnet. Figure 1 shows an example of an N:N model with the following subnets specified:

subnets='192.168.1.0 192.168.2.0'

For subnet 192.168.1.0, the NSD server has the network interface with IP address 192.168.1.1, and the NSD client has the network interface with IP address 192.168.1.10. The number of IP addresses on the NSD server is the same as the number of IP addresses on the NSD client. Therefore, this is an N:N connection model. The same applies to subnet 192.168.2.0. The IP pair table on the NSD server is shown in the following table.

Table 2. IP Pair Table on NSD server for an N:N connection model
idx	iface	Status	Source	Destination	Subnet
0	eth0	up	192.168.1.1	192.168.1.10	192.168.1.0/24
1	eth1	up	192.168.2.1	192.168.2.10	192.168.2.0/24

Configuring M*N connection model

Figure 2. Configuring the M*N connection model

For a specific subnet in the subnets attribute list, when the number of IP addresses in this subnet on Node A is not the same as the number of IP addresses in this subnet on Node B, the connection model is referred to as the M*N model. There are M*N IP pairs in this subnet. Figure 2 shows an example with the following subnets specified:

subnets='192.168.1.0'

For subnet 192.168.1.0, the NSD server has network interfaces with the IP addresses 192.168.1.1, 192.168.1.2, 192.168.1.3, and 192.168.1.4, and the NSD client has one network interface with the IP address 192.168.1.10. The number of IP addresses on the NSD server is not the same as the number of IP addresses on the NSD client. Therefore, this is an M*N connection model with 4*1 = 4 IP pairs. The IP pair table on the NSD server is shown in the following table.

Table 3. IP Pair Table on NSD server for an M*N connection model
idx	iface	Status	Source	Destination	Subnet
0	eth0	up	192.168.1.1	192.168.1.10	192.168.1.0/24
1	eth1	up	192.168.1.2	192.168.1.10	192.168.1.0/24
2	eth2	up	192.168.1.3	192.168.1.10	192.168.1.0/24
3	eth3	up	192.168.1.4	192.168.1.10	192.168.1.0/24
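
The pair counts of the two models can be sketched as follows for a single subnet. The function name list_ip_pairs is hypothetical. Pairing addresses by position when both nodes have the same number of addresses is an assumption of this sketch, not a documented algorithm.

```shell
# List the IP pairs for one subnet.
# Usage: list_ip_pairs "LOCAL_IP ..." "PEER_IP ..."
list_ip_pairs() (
    set -f                          # no globbing while word-splitting the lists
    locals=$1; peers=$2
    m=0; for l in $locals; do m=$(( m + 1 )); done
    n=0; for p in $peers; do n=$(( n + 1 )); done
    if [ "$m" -eq "$n" ]; then
        # N:N model: pair addresses by position, giving N pairs.
        set -- $peers
        for l in $locals; do
            echo "$l $1"
            shift
        done
    else
        # M*N model: every local address with every peer address.
        for l in $locals; do
            for p in $peers; do
                echo "$l $p"
            done
        done
    fi
)

# Reproduces the four Source/Destination rows of Table 3.
list_ip_pairs "192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4" "192.168.1.10"
```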

Configuring multiple IP addresses in the same subnet

Figure 3. Configuring multiple IP addresses

If multiple IP addresses are configured in the same subnet, then specific OS-dependent settings are required. Figure 3 shows an example with the following subnets specified:

subnets='192.168.1.0'

For subnet 192.168.1.0, the NSD server has network interfaces with the IP addresses 192.168.1.1 and 192.168.1.2, and the NSD client has the network interfaces with the IP addresses 192.168.1.10 and 192.168.1.11. The IP pair table on the NSD server is shown:

Table 4. IP Pair Table on NSD server for multiple IP addresses
idx	iface	Status	Source	Destination	Subnet
0	eth0	up	192.168.1.1	192.168.1.10	192.168.1.0/24
1	eth1	up	192.168.1.2	192.168.1.11	192.168.1.0/24

The configuration shown in the preceding table functions properly only if the required OS-dependent settings are in place.

Linux configurations

To have multiple network interfaces on the same subnet, both arp_filter and source-based policy routing need to be configured. The Linux kernel documentation describes the following values for the arp_filter variable:

variable: net.ipv4.conf.interface.arp_filter

1 - Allows you to have multiple network interfaces on the same subnet, and have the ARPs for each interface be answered based on whether or not the kernel would route a packet from the ARP’d IP out that interface. Therefore, you must use source-based routing for this to work. In other words, it allows control over which cards (usually one) respond to an ARP request.
0 - This is the default value. The kernel can respond to ARP requests with addresses from other interfaces. This may seem wrong but it usually makes sense, because it increases the chance of successful communication. IP addresses are owned by the complete host on Linux, not by particular interfaces. This behavior might cause problems only for more complex setups like load-balancing.

For more information, see https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt.

The arp_filter is enabled for the interface if at least one of the following attributes conf/{all,interface}/arp_filter is set to TRUE; otherwise it is disabled.

Note: This configuration is defined per interface setting, where “interface” is the name of your network interface; “all” is a special interface. It changes the settings for all interfaces.
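
The "at least one of all or interface" rule can be checked with a sketch like the following. The function name arp_filter_enabled is hypothetical, and the optional second argument exists only so that the function can be pointed at a different directory for testing. The sketch reads the values from /proc/sys/net/ipv4/conf, where Linux exposes these settings.

```shell
# Report whether arp_filter is effectively enabled for an interface:
# it is enabled when conf/all or conf/<interface> holds 1.
# Usage: arp_filter_enabled IFACE [CONF_ROOT]
arp_filter_enabled() (
    iface=$1
    root=${2:-/proc/sys/net/ipv4/conf}
    for scope in all "$iface"; do
        f="$root/$scope/arp_filter"
        if [ -r "$f" ] && [ "$(cat "$f")" = "1" ]; then
            exit 0
        fi
    done
    exit 1
)

# Example: arp_filter_enabled eth0 && echo "arp_filter is enabled on eth0"
```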

Follow the procedure shown to complete the Linux configurations.

On the NSD Server

Issue the following commands to set the arp_filter.

sysctl -w net.ipv4.conf.default.arp_filter=1
sysctl -w net.ipv4.conf.all.arp_filter=1

Issue the following commands to set the source-based policy routing.

ip addr add 192.168.1.1/24 dev eth0
ip addr add 192.168.1.2/24 dev eth1
echo 200 subnet_192.168.1.0_eth0 >> /etc/iproute2/rt_tables
echo 201 subnet_192.168.1.0_eth1 >> /etc/iproute2/rt_tables
ip rule add from 192.168.1.1 lookup subnet_192.168.1.0_eth0
ip rule add from 192.168.1.2 lookup subnet_192.168.1.0_eth1
ip route add 192.168.1.0/24 dev eth0 table subnet_192.168.1.0_eth0
ip route add 192.168.1.0/24 dev eth1 table subnet_192.168.1.0_eth1

On the NSD Client

Issue the following commands to set the arp_filter.

sysctl -w net.ipv4.conf.default.arp_filter=1
sysctl -w net.ipv4.conf.all.arp_filter=1

Issue the following commands to set the source-based policy routing.

ip addr add 192.168.1.10/24 dev eth0
ip addr add 192.168.1.11/24 dev eth1
echo 200 subnet_192.168.1.0_eth0 >> /etc/iproute2/rt_tables
echo 201 subnet_192.168.1.0_eth1 >> /etc/iproute2/rt_tables
ip rule add from 192.168.1.10 lookup subnet_192.168.1.0_eth0
ip rule add from 192.168.1.11 lookup subnet_192.168.1.0_eth1
ip route add 192.168.1.0/24 dev eth0 table subnet_192.168.1.0_eth0
ip route add 192.168.1.0/24 dev eth1 table subnet_192.168.1.0_eth1
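
Because the server and client procedures differ only in the IP addresses, the command list can be generated rather than typed. The following sketch prints the commands for review and does not run them. The function name gen_policy_routing and its IFACE=IP argument format are hypothetical. The table names and starting table ID follow the convention used in the procedures above.

```shell
# Print the source-based policy routing commands for interfaces
# that share one subnet.
# Usage: gen_policy_routing SUBNET/PREFIX FIRST_TABLE_ID IFACE=IP...
gen_policy_routing() (
    cidr=$1; id=$2; shift 2
    net=${cidr%/*}; prefix=${cidr#*/}
    for spec in "$@"; do
        iface=${spec%%=*}; ip=${spec#*=}
        table="subnet_${net}_${iface}"
        echo "ip addr add $ip/$prefix dev $iface"
        echo "echo $id $table >> /etc/iproute2/rt_tables"
        echo "ip rule add from $ip lookup $table"
        echo "ip route add $cidr dev $iface table $table"
        id=$(( id + 1 ))    # one routing table per interface
    done
)

gen_policy_routing 192.168.1.0/24 200 eth0=192.168.1.10 eth1=192.168.1.11
```

After the commands are applied, ip rule show and ip route show table subnet_192.168.1.0_eth0 display the resulting rules and routes.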

Configuring the subnets attribute for multi-cluster

Configuring the subnets attribute for multi-cluster is similar to configuring it for the local cluster. However, in this case, both the home cluster and the remote cluster must be configured. You must also specify a cluster name or a cluster name pattern for each subnet. This is needed when a private network is shared across clusters. If the use of a private network is limited to only the local cluster, then no cluster name is required in the subnets attribute.
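
A subnets entry that carries cluster names might be composed as in the following sketch. It assumes that an entry has the form Subnet/ClusterName, with multiple cluster names separated by semicolons. Verify this format against the mmchconfig command reference for your release. The function name subnet_entry and the example cluster names are hypothetical.

```shell
# Format one subnets entry, optionally scoped to cluster names or patterns.
# Assumed entry format: Subnet[/ClusterName[;ClusterName...]]
# Usage: subnet_entry SUBNET [CLUSTER...]
subnet_entry() (
    subnet=$1; shift
    if [ "$#" -eq 0 ]; then
        echo "$subnet"          # private network limited to the local cluster
    else
        IFS=';'
        echo "$subnet/$*"       # cluster names joined with ';'
    fi
)

subnet_entry 192.168.1.0 home.example.com remote.example.com
# prints: 192.168.1.0/home.example.com;remote.example.com
```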

Configuring maxTcpConnsPerNodeConn

The total number of TCP connections between two nodes is controlled by the maxTcpConnsPerNodeConn attribute. Valid values are 1-16, and the default is 2. If maxTcpConnsPerNodeConn is less than the number of IP pairs between a pair of nodes, only as many IP pairs as the value of maxTcpConnsPerNodeConn are used for communication. For example, if maxTcpConnsPerNodeConn is 2, but there are 4 IP pairs in the IP pair table between a pair of nodes, then only 2 IP pairs are used for communication.

For more information, see Recommendations for tuning maxTcpConnsPerNodeConn parameter.
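
The relationship between maxTcpConnsPerNodeConn and the number of IP pairs in use can be sketched as follows. The function name pairs_in_use is hypothetical. The sketch only restates the arithmetic from the description above and does not query a running cluster.

```shell
# Number of IP pairs that carry connections for a given setting.
# Usage: pairs_in_use MAX_TCP_CONNS NUM_IP_PAIRS
pairs_in_use() (
    max=$1; npairs=$2
    # Valid values for maxTcpConnsPerNodeConn are 1-16.
    if [ "$max" -lt 1 ] || [ "$max" -gt 16 ]; then
        echo "maxTcpConnsPerNodeConn must be in the range 1-16" >&2
        exit 1
    fi
    if [ "$max" -lt "$npairs" ]; then echo "$max"; else echo "$npairs"; fi
)

pairs_in_use 2 4    # prints 2: only 2 of the 4 IP pairs are used
```

To use all four IP pairs in this example, the attribute would be raised with mmchconfig maxTcpConnsPerNodeConn=4.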