Unnecessary retransmits

Edit online

Related to the hard-versus-soft mount question is the question of the appropriate timeout duration for a given network configuration.

If the server is heavily loaded, is separated from the client by one or more bridges or gateways, or is connected to the client by a WAN, the default timeout criterion may be unrealistic. If so, both server and client are burdened with unnecessary retransmits. For example, if the following command:

# nfsstat -c

reports a significant number, like greater than five percent of the total, of both timeouts and badxids, you could increase the timeo parameter with the mount command.

Identify the directory you want to change, and enter a new value, in tenths of a second, on the NFS TIMEOUT line.

The default time is 0.7 second, timeo=7, but this value is manipulated in the NFS kernel extension depending on the type of call. For read calls, for instance, the value is doubled to 1.4 seconds.

To achieve control over the timeo value for operating system version 4 clients, you must set the nfs_dynamic_retrans option of the nfso command to 0. There are two directions in which you can change the timeo value, and in any given case, there is only one right way to change it. The correct way, making the timeouts longer or shorter, depends on why the packets are not arriving in the allotted time.

If the packet is only late and does finally arrive, then you may want to make the timeo variable longer to give the reply a chance to return before the request is retransmitted.

However, if the packet has been dropped and will never arrive at the client, then any time spent waiting for the reply is wasted time, and you want to make the timeo shorter.

One way to estimate which option to take is to look at a client's nfsstat -cr output and see if the client is reporting lots of badxid counts. A badxid value means that an RPC client received an RPC call reply that was for a different call than the one it was expecting. Generally, this means that the client received a duplicate reply for a previously retransmitted call. Packets are thus arriving late and the timeo should be lengthened.

Also, if you have a network analyzer available, you can apply it to determine which of the two situations is occurring. Lacking that, you can try setting the timeo option higher and lower and see what gives better overall performance. In some cases, there is no consistent behavior. Your best option then is to track down the actual cause of the packet delays/drops and fix the real problem; that is, server or network/network device.

For LAN-to-LAN traffic through a bridge, try a value of 50, which is in tenths of seconds. For WAN connections, try a value of 200. Check the NFS statistics again after waiting at least one day. If the statistics still indicate excessive retransmits, increase the timeo value by 50 percent and try again. You also want to examine the server workload and the loads on the intervening bridges and gateways to see if any element is being saturated by other traffic.