Troubleshooting
Problem
DCE RPC Timeouts
Resolving The Problem
Introduction
Suppose I make a DCE RPC to a server that's unavailable. How long will it take for my RPC to time out and return a failure code? Sounds like a simple question, but unfortunately the answer is pretty complicated. The shortest possible correct answer is: it depends. The purpose of this note is to list the things that it depends upon, and to describe how you can control the timeout period if the default behavior doesn't meet your needs.
Variables that Control Timeout
The following variables control how long an RPC timeout will take:
1. Protocol sequence: UDP timeouts can be controlled much more easily from within DCE applications, and are more predictable. TCP timeouts are controlled by OS parameters and are not as predictable. Specifically...
2. Nature of the dead server: this doesn't matter at all if you're making RPCs via UDP, but if you use TCP then it can make a dramatic difference. With TCP, if the target server machine is up but the specific target server process is down, then the timeout will be almost instantaneous. If however the entire target server machine is down or otherwise inaccessible (i.e., if the box cannot be pinged), then a TCP RPC may take several minutes to time out. In contrast, a UDP RPC will take about 35 seconds to time out, no matter what the nature of the dead server process or machine. Furthermore...
3. User-adjustable timeout settings: that 35-second UDP timeout can be adjusted fairly easily, on a per-binding-handle basis, by a DCE application. Depending on the platform, you might even be able to adjust it via an environment variable. With TCP on the other hand, the multi-minute timeout for accesses to a dead server machine can be adjusted only on a per-client-machine basis -- that is, you have to adjust the machine's global TCP timeout in order to change it. Also...
4. State of the connection: there will probably be different TCP timeout parameters to control initial connection timeouts and timeouts within an existing connection.
Protocol Sequences: TCP and UDP
Protocol sequence (protseq) is the most important factor when analyzing DCE RPC timeouts. By default, a DCE server will be accessible via both TCP and UDP, and a client will choose randomly between the two protocols when transmitting an RPC. If the target server is up, then the two protocols are almost indistinguishable: both protseqs will exhibit the same behavior, even when faced with a lossy network connection that causes intermittent comm failures (rather than permanent ones).
However, if the target server is truly down or is completely inaccessible, then the differences between RPCs made over TCP and those made over UDP can lead to puzzling behavior. If you do nothing special, then you'll see one of four different failure modes:
1. If the RPC is made via UDP, then it will fail with a "communications failure" error after about 35 seconds.
2. If the RPC is made via TCP and the target machine is down and the client does not have an existing connection to the server, then it will fail with a "connection request timed out" error after a minute or more (the exact time depends on the client machine's OS, as you'll see, but is typically 1 to 3 minutes).
3. If the RPC is made via TCP and the target machine is down and the client has an existing connection to the server, then it will fail with a "connection closed" error after several minutes (this timeout will probably be significantly longer than the one in the previous item, more like 8 or 9 minutes).
4. If the RPC is made via TCP and the target machine is up but the target server is down, then it will fail with a "connection request rejected" error, usually in under 1 second.
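The four failure modes can be restated as a small lookup. This is only an illustrative sketch in Python: the function name and arguments are made up for this note, and the delays are the rough default figures quoted in the list, not values read from any DCE runtime.

```python
def expected_failure(protseq, machine_up, has_connection=False):
    """Return (error text, rough delay) for an RPC aimed at a dead server.

    Hypothetical helper: it only restates the four default failure modes
    listed above; it does not talk to DCE at all.
    """
    if protseq == "ncadg_ip_udp":
        # UDP behaves the same whether the machine or only the process is down.
        return ("communications failure", "about 35 seconds")
    if protseq != "ncacn_ip_tcp":
        raise ValueError("unknown protocol sequence: " + protseq)
    if machine_up:
        # The remote TCP module refuses the connection right away.
        return ("connection request rejected", "under 1 second")
    if has_connection:
        return ("connection closed", "several minutes, perhaps 8 or 9")
    return ("connection request timed out", "typically 1 to 3 minutes")


print(expected_failure("ncacn_ip_tcp", machine_up=False))
```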
In general, we recommend using UDP for RPCs if you have to control timeouts. Nevertheless, we'll discuss specifics for timeout control in both UDP and TCP in the remainder of this note.
Controlling UDP RPC Timeouts
Every DCE RPC is associated with a data structure called a "binding handle". The binding handle contains a set of control information about the RPC's destination, including the protseq to use (TCP or UDP), the IP address and port number of the target server, the object UUID for the RPC (in cases where object RPCs are used -- object UUIDs are an optional feature of DCE RPCs), and a communications timeout setting. We are mostly interested in the communications timeout setting in this section.
The comm timeout in the binding handle is a value from 0 to 10. Each DCE implementation is free to assign its own "real-world" meaning to these settings, subject to the following constraints:
* A lower setting means a shorter timeout
* Zero is the shortest "reasonable" timeout setting
* Nine is the highest "reasonable" finite timeout setting
* Ten means infinite timeout
* The setting is considered to be advisory only, and the RPC runtime doesn't strictly have to pay any attention to it
That last one, the one about "advisory only", is the kicker with respect to TCP -- with TCP, the only thing the RPC facility cares about is: did you set it to ten? If so, then you get an infinite timeout. If not, then you get whatever the OS gives you -- so zero through nine is all the same to TCP.
But with UDP, the DCE RPC facility actually uses the comm timeout setting to decide how long to wait when making an RPC. The actual timeout values (in seconds) that correspond to each number from 0 to 9 are left up to the implementation, but the common rule seems to be that a timeout setting of N will be interpreted to mean 2^N + 3 seconds. This rule is approximately correct on both AIX and Solaris, for whatever it's worth.
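To make the rule of thumb concrete, here is a minimal sketch in Python that tabulates it. The function name is hypothetical, and the 2^N + 3 formula is only the approximation described above; a given DCE implementation may use different values.

```python
import math


def udp_timeout_seconds(setting):
    """Approximate UDP RPC timeout, in seconds, for a comm timeout setting.

    Assumes the 2^N + 3 rule of thumb; setting 10 means "wait forever".
    """
    if not 0 <= setting <= 10:
        raise ValueError("comm timeout setting must be between 0 and 10")
    if setting == 10:
        return math.inf
    return 2 ** setting + 3


for n in range(11):
    print(n, udp_timeout_seconds(n))
```

Under this rule a setting of 5 works out to 35 seconds and a setting of 3 to 11 seconds.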
A programmer can use the rpc_mgmt_set_com_timeout function to set the comm timeout on a particular binding handle. If this function is not called, then the default setting is 5, which usually corresponds to about 35 seconds. On Solaris, you can set the RPC_BINDING_TIMEOUT environment variable to change the default from 5 to something else. In any case, rpc_mgmt_set_com_timeout, if called, has the final say. (When a binding handle is created, it is initialized with a timeout of 5, unless RPC_BINDING_TIMEOUT is set and you're on Solaris, in which case it's initialized with RPC_BINDING_TIMEOUT. After that, a programmer can set it to something else via rpc_mgmt_set_com_timeout; you can even call rpc_mgmt_set_com_timeout more than once to change the timeout from one RPC to the next. The current timeout setting in the binding handle controls the timeout absolutely for UDP RPCs.)
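The order of precedence described above can be modeled in a few lines. This Python sketch is a model only and does not call DCE: the function and its arguments are invented for illustration, and it assumes RPC_BINDING_TIMEOUT holds a plain integer.

```python
DEFAULT_COMM_TIMEOUT = 5  # setting given to a newly created binding handle


def effective_comm_timeout(explicit=None, environ=None, platform="aix"):
    """Model which comm timeout setting ends up governing a UDP RPC.

    explicit: value passed to rpc_mgmt_set_com_timeout, or None if never called
    environ:  dict standing in for the process environment
    platform: "solaris" is the only platform assumed to honor the variable
    """
    setting = DEFAULT_COMM_TIMEOUT
    if platform == "solaris" and environ and "RPC_BINDING_TIMEOUT" in environ:
        # Assumption: the variable contains a bare integer from 0 to 10.
        setting = int(environ["RPC_BINDING_TIMEOUT"])
    if explicit is not None:
        # An explicit call always has the final say.
        setting = explicit
    return setting


print(effective_comm_timeout(environ={"RPC_BINDING_TIMEOUT": "7"}, platform="solaris"))
```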
You can use our ping_string_binding sample program to investigate timeouts (available for download from the DCE Downloads link "DCE / DFS Debugging Tools"). This program takes a string binding as its command-line argument, and attempts to send a DCE "ping" RPC, rpc_mgmt_is_server_listening, to the server specified in the binding. The program uses rpc_mgmt_set_com_timeout to set a timeout of 3, but you can adjust that easily if you want to play.
For example, using the value of 3, let's try to ping the DCE endpoint mapper on three machines: one whose endpoint mapper is up, one whose EP mapper is down, and one machine that doesn't even exist. The endpoint mapper process, dced, always listens on port 135, so we'll construct string bindings aimed at port 135 on the three machines:
% ping_string_binding 'ncadg_ip_udp:9.38.205.15[135]'
Mon Jun 5 10:43:48 EDT 2000
Trying to ping 'ncadg_ip_udp:9.38.205.15[135]'...
Server responded to ping in 68.2 milliseconds
...
% date; ping_string_binding 'ncadg_ip_udp:9.38.205.16[135]' ; date
Mon Jun 5 10:44:02 EDT 2000
Trying to ping 'ncadg_ip_udp:9.38.205.16[135]'...
Error: rpc_mgmt_is_server_listening returned 382312470
Communications failure (dce / rpc)
Done.
Mon Jun 5 10:44:16 EDT 2000
% ping 9.38.205.16
9.38.205.16 is alive
% date; ping_string_binding 'ncadg_ip_udp:9.38.205.99[135]' ; date
Mon Jun 5 10:44:33 EDT 2000
Trying to ping 'ncadg_ip_udp:9.38.205.99[135]'...
Error: rpc_mgmt_is_server_listening returned 382312470
Communications failure (dce / rpc)
Done.
Mon Jun 5 10:44:46 EDT 2000
% ping 9.38.205.99
no answer from 9.38.205.99
%
In the above example, you first see a successful ping to the dced that's up. Then we have unsuccessful RPCs to the other two machines; by the date stamps they take 14 and 13 seconds to time out, which is close to the 11 seconds (2^3 + 3) that the rule of thumb predicts for a timeout setting of 3. There's some extra slop in the timings -- the UNIX date command isn't the most precise way to time a single function call -- but you can modify the program to get a more accurate time if you want to. Note that the UNIX ping command verifies that one of the unsuccessful DCE RPCs was directed to a machine that was up, and the other was directed to a machine that was not accessible, because I made up the IP address -- we have no machine here with an address of 9.38.205.99.
Controlling TCP RPC Timeouts (if you insist...)
TCP RPC timeouts are primarily governed by OS parameters. If you set the binding handle timeout to 10, then you do get an infinite timeout with TCP, but that's the only case in which the binding handle timeout matters with TCP RPCs. We specifically discuss Solaris and AIX parameters in this note; presumably other operating systems will have similar mechanisms.
When the OS tries to establish a TCP connection, it sends a message (specifically, a SYN packet) to the TCP module on the target machine, asking for a connection to a particular port. If there is no server listening on the specified port, then the remote TCP module responds that it can't make a connection, and the client process knows right away that it's doomed. This is why a TCP RPC that's aimed at a dead server on a live machine will fail almost instantly. For example, using the live machine from the above example:
% date ; ping_string_binding 'ncacn_ip_tcp:9.38.205.16[135]' ; date
Mon Jun 5 11:18:31 EDT 2000
Trying to ping 'ncacn_ip_tcp:9.38.205.16[135]'...
Error: rpc_mgmt_is_server_listening returned 382312514
Connection request rejected (dce / rpc)
Done.
Mon Jun 5 11:18:33 EDT 2000
%
...and you can see that the DCE RPC failure was not delayed. Since the remote TCP module sends an immediate failure indicator (in the form of a RESET packet), the RPC facility knows that it's hopeless, so failure is almost immediate. (Astute observers will note that the same sort of thing can happen with UDP, sort of -- the remote machine might send an ICMP "destination unreachable" packet if a UDP RPC is aimed at a non-existent server on an available machine. DCE's UDP RPC module does not look for these ICMP messages though, so these UDP RPCs will not time out instantly -- rather, the DCE client will keep sending UDP packets to the server machine until the timeout period elapses.)
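You can watch the same TCP behavior without DCE at all. The following Python sketch (probe_tcp is a hypothetical helper, not part of any DCE tool) tries a plain TCP connection and reports how it ended and how long that took. Against a live machine with nothing listening on the port, the refusal should come back almost at once.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Attempt a TCP connection; return (outcome, elapsed seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        # The remote TCP module answered the SYN with a reset.
        outcome = "refused"
    except socket.timeout:
        # Nothing answered before our own deadline ran out.
        outcome = "timed out"
    except OSError as exc:
        outcome = "error: %s" % exc
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Port 135 is the endpoint mapper port used in the examples above.
    print(probe_tcp("127.0.0.1", 135, timeout=3.0))
```

Note that the timeout argument here is a client-side deadline imposed by the script; it is separate from the OS connection timeouts discussed in this note.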
Anyway, back to TCP... suppose now that the target machine itself is inaccessible. In this case, TCP can wait a very long time before giving up:
% date ; ping_string_binding 'ncacn_ip_tcp:9.38.205.99[135]' ; date
Mon Jun 5 11:23:56 EDT 2000
Trying to ping 'ncacn_ip_tcp:9.38.205.99[135]'...
Error: rpc_mgmt_is_server_listening returned 382312513
Connection request timed out (dce / rpc)
Done.
Mon Jun 5 11:25:11 EDT 2000
The above was done on an AIX machine; on Solaris we have this:
% date ; ping_string_binding 'ncacn_ip_tcp:9.38.205.99[135]' ; date
Mon Jun 5 11:24:40 EDT 2000
Trying to ping 'ncacn_ip_tcp:9.38.205.99[135]'...
Error: rpc_mgmt_is_server_listening returned 382312513
Connection request timed out (dce / rpc)
Done.
Mon Jun 5 11:27:50 EDT 2000
So we waited 1 minute 15 seconds on AIX, and 3 minutes 10 seconds on Solaris. How come? Read on...
...TCP Timeouts on AIX
On AIX, there's a parameter named tcp_keepinit, settable via the AIX no command. (The command name really is "no", meaning "network options", I suppose.) The default is 75 seconds; the settable units are half-seconds, so the default setting of tcp_keepinit is actually 150:
# /usr/sbin/no -o tcp_keepinit
tcp_keepinit = 150
The default 75-second timeout is what we saw in the AIX example above.
...TCP Timeouts on Solaris
On Solaris, there's a parameter named tcp_ip_abort_cinterval, settable via the Solaris ndd command. The default is 3 minutes; the settable units are milliseconds, so the default setting of tcp_ip_abort_cinterval is actually 180000:
# ndd /dev/tcp tcp_ip_abort_cinterval
180000
Now, this should lead to a timeout of 3 minutes; but we saw 3 minutes and 10 seconds in the example above... The discrepancy occurs on Solaris because of the way that the timeout interacts with the SYN packets sent by TCP. A SYN packet is sent in order to initiate a TCP connection to a remote machine. If no response is received to the first SYN packet, then another is sent 3 seconds later. If still no response, then another SYN is sent 6 seconds later, then 12 seconds after that, then again 24 seconds after that, and so on -- the delay time starts at 3 seconds and keeps doubling until tcp_ip_abort_cinterval has been exceeded. So, the SYN packets are sent at time 0, 3, 9, 21, 45, 93, 189, and so on. In our case, after 189 seconds (3 minutes 9 seconds), we had exceeded the tcp_ip_abort_cinterval timeout of 180 seconds, so TCP gave up. In contrast, we saw above that AIX doesn't care about the SYN packet timing -- it just cuts you off when tcp_keepinit says to cut you off. But on Solaris, the tcp_ip_abort_cinterval timeout is checked only after each SYN packet times out.
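The arithmetic behind the two connection timeouts can be worked through in a short Python sketch. Both functions are hypothetical models of the behavior described above, built from the default values quoted in this note; they are not taken from either operating system's source.

```python
def solaris_connect_giveup(abort_cinterval_ms=180000, first_delay_s=3):
    """Model the Solaris behavior: return (SYN send times, give-up time).

    The retransmit delay starts at first_delay_s and doubles each time;
    the abort interval is checked only when a SYN times out.
    """
    limit_s = abort_cinterval_ms / 1000
    now = 0
    delay = first_delay_s
    syn_times = [0]
    while True:
        now += delay              # the most recent SYN has just timed out
        if now > limit_s:         # interval exceeded, so TCP gives up here
            return syn_times, now
        syn_times.append(now)     # otherwise send another SYN
        delay *= 2


def aix_connect_giveup(tcp_keepinit=150):
    """Model the AIX behavior: tcp_keepinit is in half-seconds."""
    return tcp_keepinit / 2


print(solaris_connect_giveup())   # ([0, 3, 9, 21, 45, 93], 189)
print(aix_connect_giveup())       # 75.0
```

With the defaults, the model gives up at 189 seconds on Solaris and 75 seconds on AIX, in line with the transcripts above.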
The information above describes timeout characteristics that apply when a DCE client attempts to start a new conversation with a server. If on the other hand you're using TCP RPCs and a server process or machine goes down in the midst of an existing conversation, or if the server suddenly becomes inaccessible because of some network problem, then you will see different timeout characteristics. For example, on Solaris, the tcp_ip_abort_interval parameter comes into play, and that guy has a default setting of 480000, which means that an RPC timeout for an existing connection can take 8 minutes. The timing on AIX seems to be similar, maybe a touch longer, but I don't know if it's adjustable there. (Stevens' "TCP/IP Illustrated Volume 1" book indicates that 9 minutes is a standard value for this timeout, so perhaps AIX uses that value and does not allow it to be adjusted?)
Summary
Our advice: use UDP, and consider use of rpc_mgmt_set_com_timeout if you need fine control.
The timeout control that you get with TCP RPCs is pretty awful. It's just so much simpler with UDP... the only disadvantage is that in the case where the target machine is up but the target server process is down, the TCP timeout will be much quicker than the default UDP timeout. Still, the controllability and predictability of UDP timeouts usually far outweigh this one TCP advantage in our opinion. The only way to get comparable control with TCP RPCs would be to implement it yourself in your application client code, using a separate watchdog thread or setting up an alarm signal to monitor any threads that are making RPCs and to time out the RPC thread(s) via thread cancels if necessary.
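As a rough illustration of the watchdog idea, here is a Python sketch that runs a blocking call in a worker thread and stops waiting after a deadline. The names are hypothetical and the blocking call is a stand-in for an RPC. Unlike the DCE threads approach mentioned above, Python cannot cancel a thread, so this sketch simply abandons the worker; real client code would need true cancellation and cleanup.

```python
import queue
import threading
import time


def call_with_deadline(func, deadline_s):
    """Run func() in a worker thread; raise TimeoutError if it takes too long."""
    results = queue.Queue(maxsize=1)

    def worker():
        try:
            results.put(("ok", func()))
        except Exception as exc:      # hand any failure back to the caller
            results.put(("error", exc))

    # Daemon thread: an abandoned worker will not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()
    try:
        kind, value = results.get(timeout=deadline_s)
    except queue.Empty:
        raise TimeoutError(
            "call did not finish within %s seconds" % deadline_s) from None
    if kind == "error":
        raise value
    return value


if __name__ == "__main__":
    try:
        call_with_deadline(lambda: time.sleep(2), 0.5)   # stand-in for a slow RPC
    except TimeoutError as exc:
        print("gave up:", exc)
```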
Document Information
Modified date:
23 August 2018
UID
swg21112112