Tuning your Linux system for more efficient parallel job performance
The Linux® default network and network device settings might not produce optimum throughput (bandwidth) and latency for large parallel jobs. This topic describes how to tune the Linux network and certain network devices for better parallel job performance.
This information is aimed at private networks with high-performance network devices such as Gigabit Ethernet, and might not produce similar results for 10/100 public Ethernet networks.
The following table provides examples for tuning your Linux system for better job performance. By following these examples, it is possible to improve the performance of a parallel job that runs over an IP network.
Network Tuning Factors | Tuning for the current boot session | Modifying the system permanently |
---|---|---|
arp_ignore - With arp_ignore set to 1, a device replies to an ARP request only if the target IP address is a local address configured on the interface that received the request. | echo '1' > /proc/sys/net/ipv4/conf/all/arp_ignore | Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_ignore = 1 |
arp_filter - With arp_filter set to 1, the kernel answers an ARP request on an interface only if it would route a packet from the requested IP address out of that interface. This is useful when several interfaces are on the same subnet. | echo '1' > /proc/sys/net/ipv4/conf/all/arp_filter | Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_filter = 1 |
rmem_default - Defines the default socket receive buffer size, in bytes. | echo '1048576' > /proc/sys/net/core/rmem_default | Add this line to the /etc/sysctl.conf file: net.core.rmem_default = 1048576 |
rmem_max - Defines the maximum socket receive buffer size, in bytes. | echo '2097152' > /proc/sys/net/core/rmem_max | Add this line to the /etc/sysctl.conf file: net.core.rmem_max = 2097152 |
wmem_default - Defines the default socket send buffer size, in bytes. | echo '1048576' > /proc/sys/net/core/wmem_default | Add this line to the /etc/sysctl.conf file: net.core.wmem_default = 1048576 |
wmem_max - Defines the maximum socket send buffer size, in bytes. | echo '2097152' > /proc/sys/net/core/wmem_max | Add this line to the /etc/sysctl.conf file: net.core.wmem_max = 2097152 |
Set device txqueuelen - Sets the transmit queue length for each network device (for example, eth0, eth1, and so on). | /sbin/ifconfig device_interface_name txqueuelen 4096 | Not applicable |
Turn off device interrupt coalescing - To improve latency. | See sample script. This script must be run after each reboot. | Not applicable |
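As a minimal sketch, the permanent settings from the table can be collected into one fragment for review before they are added to /etc/sysctl.conf. The helper name print_tuning_conf is hypothetical and used only for illustration. The values are the ones listed in the table.

```shell
#!/bin/sh
# print_tuning_conf is a hypothetical helper (not part of any tool).
# It prints the permanent settings from the table as sysctl.conf lines.
print_tuning_conf() {
    cat <<'EOF'
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_filter = 1
net.core.rmem_default = 1048576
net.core.rmem_max = 2097152
net.core.wmem_default = 1048576
net.core.wmem_max = 2097152
EOF
}

# Print the fragment to standard output; nothing is written to the system.
print_tuning_conf
```

After the lines are added to /etc/sysctl.conf as root, running sysctl -p loads them without a reboot.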
The following sample script unloads the e1000 Gigabit Ethernet device driver and reloads it with interrupt coalescing disabled:
#!/bin/ksh
Interface=eth0
Device=e1000
Kernel_Version=`uname -r`
ifdown ${Interface}
rmmod ${Device}
insmod /lib/modules/${Kernel_Version}/kernel/drivers/net/${Device}/${Device}.ko InterruptThrottleRate=0,0,0
ifconfig ${Interface} up
exit $?
MPI jobs use shared memory to handle intranode communication. You might need to modify the system default for allowable maximum shared memory size to allow a large MPI job to successfully enable shared memory usage. It is recommended that you set the system allowable maximum shared memory size to 256 MB or larger for supporting large MPI jobs.
To modify this limit for the current boot session, run the echo "268435456" > /proc/sys/kernel/shmmax command as root.
To modify this limit permanently, add kernel.shmmax = 268435456 to the /etc/sysctl.conf file and reboot the system.
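The value 268435456 is 256 MB expressed in bytes (256 × 1024 × 1024). The following sketch shows that arithmetic and, where /proc/sys/kernel/shmmax is readable, compares it with the current limit. The helper name mb_to_bytes is hypothetical. The script only reads the limit and does not change it.

```shell
#!/bin/sh
# mb_to_bytes is a hypothetical helper: converts megabytes to bytes
# (1 MB is taken as 1024 * 1024 bytes).
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

required=$(mb_to_bytes 256)
echo "kernel.shmmax = $required"

shmmax_file=/proc/sys/kernel/shmmax
if [ -r "$shmmax_file" ]; then
    current=$(cat "$shmmax_file")
    # awk compares the values as floating point, so a very large
    # current limit does not overflow shell integer arithmetic.
    if awk -v cur="$current" -v req="$required" 'BEGIN { exit !(cur + 0 >= req + 0) }'; then
        echo "current limit $current is already sufficient"
    else
        echo "current limit $current is below $required"
    fi
fi
```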
DNS caching should be enabled to minimize runtime host name resolution, especially if LDAP is also enabled in the cluster.
Network tuning factors | Tuning for the current boot session and updating it into the boot image |
---|---|
gc_thresh3 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh3 should be the maximum number of compute operating system nodes, plus 300. | echo "5300" > /proc/sys/net/ipv4/neigh/default/gc_thresh3 |
gc_thresh2 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh2 should be 100 less than gc_thresh3. | echo "5200" > /proc/sys/net/ipv4/neigh/default/gc_thresh2 |
gc_thresh1 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh1 should be 100 less than gc_thresh2. | echo "5100" > /proc/sys/net/ipv4/neigh/default/gc_thresh1 |
gc_interval - The ARP garbage collection interval on the compute nodes should be high so that it does not process ARP cleanup. | echo "1000000000" > /proc/sys/net/ipv4/neigh/default/gc_interval |
gc_stale_time - The ARP stale time should be set high so that it does not get discarded. | echo "2147483647" > /proc/sys/net/ipv4/neigh/default/gc_stale_time |
base_reachable_time_ms - The ARP valid entry time (in milliseconds) should be set high so that it does not get discarded. | echo "2147483647" > /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms |
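The three gc_thresh values follow from the node count by simple arithmetic. The sketch below derives them for an assumed cluster of 5000 nodes, which is the size implied by the example values in the table. The helper name gc_thresholds is hypothetical. The script is a dry run that prints the commands instead of writing to /proc, which requires root.

```shell
#!/bin/sh
# gc_thresholds is a hypothetical helper: given a node count, it prints
# gc_thresh3, gc_thresh2, and gc_thresh1 on one line, in that order.
gc_thresholds() {
    t3=$(( $1 + 300 ))   # node count plus 300
    t2=$(( t3 - 100 ))   # 100 less than gc_thresh3
    t1=$(( t2 - 100 ))   # 100 less than gc_thresh2
    echo "$t3 $t2 $t1"
}

neigh=/proc/sys/net/ipv4/neigh/default

# Assumed node count of 5000; adjust for the actual cluster.
gc_thresholds 5000 | while read -r t3 t2 t1; do
    # Dry run: show the commands that root would run.
    echo "echo \"$t3\" > $neigh/gc_thresh3"
    echo "echo \"$t2\" > $neigh/gc_thresh2"
    echo "echo \"$t1\" > $neigh/gc_thresh1"
done
```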