IJ41619: A TEMS BECOMES UNRESPONSIVE AND AGENTS GO OFF-LINE

APAR status

Closed as program error.

Error description

Problem Description:
A TEMS becomes unresponsive and agents go off-line.
Restarting the TEMS resolves the problem for a while, but the
symptom eventually recurs.

Problem Summary:
When the symptom occurs, a netstat report shows that the Recv-Q
(receive queue) value of multiple connections to the TEMS keeps
increasing.
This indicates that the KDE (socket layer) component of the TEMS
is not processing inbound traffic.
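
The Recv-Q check described above can be scripted. The following is
a minimal sketch only: it assumes Linux-style `netstat -an` column
layout, assumes the TEMS listens on ports 1918 (ip.pipe) and 3660
(ip.spipe), and uses made-up sample rows. The function name, the
port set, and the addresses are illustrative and do not come from
this APAR.

```python
# Sketch: list TCP connections on assumed TEMS ports with a non-zero Recv-Q.
# Assumed column layout (Linux `netstat -an`):
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State
TEMS_PORTS = {"1918", "3660"}  # assumption: ip.pipe / ip.spipe ports in use


def backlogged_connections(netstat_text, ports=TEMS_PORTS, threshold=0):
    """Return (recv_q, local, remote) tuples, largest backlog first."""
    hits = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        local, remote = fields[3], fields[4]
        local_port = local.rsplit(":", 1)[-1]
        if local_port in ports and recv_q > threshold:
            hits.append((recv_q, local, remote))
    return sorted(hits, reverse=True)


# Made-up sample data, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:1918           10.0.0.21:45012         ESTABLISHED
tcp    18432      0 10.0.0.5:3660           10.0.0.22:51877         ESTABLISHED
tcp     4096      0 10.0.0.5:3660           10.0.0.23:40110         ESTABLISHED
tcp      512      0 10.0.0.5:22             10.0.0.9:60000          ESTABLISHED
"""

for recv_q, local, remote in backlogged_connections(SAMPLE):
    print(f"{recv_q:>8}  {local} <- {remote}")
```

A single snapshot is not conclusive. Running the check several
times and seeing the same connections with growing values is what
matches the symptom described here. Other platforms format netstat
output differently, so the parsing would need adjusting.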

Such stalls can occur when the KDE layer accepts a connection
request but subsequent attempts to receive data from the
endpoint fail.
If the TLS handshake is involved, KDE will wait for a response
for 2 minutes before timing out.
If data retrieval fails while KDE is exchanging connection
details with the endpoint, KDE will wait for 3 minutes before
timing out the request.

If multiple connection requests time out, KDE can exit one
timeout only to enter another.
In this event, KDE appears stalled because, while it waits for a
timeout to expire, it cannot service other sockets that are
attempting to communicate with the TEMS.
This is why the Recv-Q value of existing connections is seen to
increase.
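
To show how quickly back-to-back timeouts add up, here is a rough
arithmetic sketch. It assumes the failing requests are handled
strictly one after another, which is a simplification of the
behavior described above. The function name and the example counts
are hypothetical.

```python
# Sketch: total stall time when failing connection requests are served
# one after another. The two timeouts are the default values quoted in
# the text; serial handling is a simplifying assumption.
TLS_HANDSHAKE_TIMEOUT_S = 120        # 2 minutes
CONNECTION_DETAILS_TIMEOUT_S = 180   # 3 minutes


def serial_stall_seconds(tls_failures, detail_failures):
    """Seconds spent waiting if every failing request runs to its timeout."""
    return (tls_failures * TLS_HANDSHAKE_TIMEOUT_S
            + detail_failures * CONNECTION_DETAILS_TIMEOUT_S)


# Hypothetical burst: 3 stalled TLS handshakes and 2 stalled exchanges.
stall = serial_stall_seconds(tls_failures=3, detail_failures=2)
print(f"{stall} seconds ({stall / 60:.0f} minutes) without servicing other sockets")
```

Under these assumptions, five failing requests are enough to keep
KDE occupied for 12 minutes.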

Solution:
The time that KDE allows before a connection request is
abandoned is arguably too long, especially because it can impair
the ability of a TEMS to service requests on existing
connections, which can ultimately destabilize the entire ITM
environment.
Therefore, three new environment variables have been introduced
that allow the timeouts to be controlled.

KDEB_MAX_SOCKET_RECEIVE_WAITS controls how many attempts KDE
makes to receive data from an endpoint; each time an attempt
fails, KDE waits for 5 seconds before retrying.
The default number of attempts is 36, meaning that KDE waits for
3 minutes (36 x 5 seconds) before the receive is abandoned.
The limits of this new variable are:
1 <= KDEB_MAX_SOCKET_RECEIVE_WAITS <= 36.
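
The relationship between the retry count and the total wait can be
worked through as follows. This is a small sketch of the arithmetic
only. The function name is hypothetical, and the sample values
other than the default of 36 are illustrative, not recommendations.

```python
# Sketch: total time before a receive is abandoned for a given value of
# KDEB_MAX_SOCKET_RECEIVE_WAITS, using the 5-second interval from the text.
RETRY_INTERVAL_S = 5
DEFAULT_WAITS = 36


def receive_wait_seconds(waits=DEFAULT_WAITS):
    """Retry count multiplied by the fixed 5-second wait per attempt."""
    return waits * RETRY_INTERVAL_S


# 36 is the documented default; the other values are arbitrary examples.
for waits in (36, 12, 6, 1):
    total = receive_wait_seconds(waits)
    print(f"KDEB_MAX_SOCKET_RECEIVE_WAITS={waits:>2} -> {total:>3} s")
```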

KDEBE_GSK_READ_TIMEOUT controls how long KDE waits for a GSKit
read to complete and, therefore, affects connections that use
TLS (ip.spipe).
The default timeout is 120 seconds.
The limits of this new variable are:
1 <= KDEBE_GSK_READ_TIMEOUT <= 120.

KDEBE_GSK_WRITE_TIMEOUT controls how long KDE waits for a GSKit
write to complete and, therefore, affects connections that use
TLS (ip.spipe).
The default timeout is 30 seconds.
The limits of this new variable are:
1 <= KDEBE_GSK_WRITE_TIMEOUT <= 30.
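
Before proposing values for review, it can help to check them
against the documented limits. The sketch below does only that. The
`check_settings` helper and the sample values are hypothetical, and
nothing here reflects how KDE itself treats an out-of-range value,
which this APAR does not describe.

```python
# Sketch: compare proposed values with the limits documented in this APAR.
# name -> (minimum, maximum); each maximum is also the documented default.
LIMITS = {
    "KDEB_MAX_SOCKET_RECEIVE_WAITS": (1, 36),
    "KDEBE_GSK_READ_TIMEOUT": (1, 120),
    "KDEBE_GSK_WRITE_TIMEOUT": (1, 30),
}


def check_settings(env):
    """Return a list of problems found in a name -> value mapping."""
    problems = []
    for name, (low, high) in LIMITS.items():
        raw = env.get(name)
        if raw is None:
            continue  # variable not set, so the default applies
        try:
            value = int(raw)
        except ValueError:
            problems.append(f"{name}={raw!r} is not an integer")
            continue
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    return problems


# Hypothetical proposal, for illustration only.
proposed = {
    "KDEB_MAX_SOCKET_RECEIVE_WAITS": "6",
    "KDEBE_GSK_READ_TIMEOUT": "300",
}
print(check_settings(proposed))
```

To check the variables of the current process instead, pass
`os.environ` (after `import os`) as the mapping.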

Warning: KDE is used by every ITM component, including agents,
tacmd, TEPS, RTEMS, hubs, etc.
If these variables are incorrectly configured, the overall
stability of an ITM environment can be adversely affected.
They should be used only after IBM has had an opportunity to
review logs and has confirmed that the associated timeouts are
the cause of a malfunction.

Local fix

```
NA
```

Problem summary

A TEMS becomes unresponsive and agents go off-line.
Restarting the TEMS resolves the problem for a while, but the
symptom eventually recurs.

When the symptom occurs, a netstat report shows that the Recv-Q
(receive queue) value of multiple connections to the TEMS keeps
increasing.
This indicates that the KDE (socket layer) component of the TEMS
is not processing inbound traffic.

Such stalls can occur when the KDE layer accepts a connection
request but subsequent attempts to receive data from the
endpoint fail.
If the TLS handshake is involved, KDE will wait for a response
for 2 minutes before timing out.
If data retrieval fails while KDE is exchanging connection
details with the endpoint, KDE will wait for 3 minutes before
timing out the request.

If multiple connection requests time out, KDE can exit one
timeout only to enter another.
In this event, KDE appears stalled because, while it waits for a
timeout to expire, it cannot service other sockets that are
attempting to communicate with the TEMS.
This is why the Recv-Q value of existing connections is seen to
increase.

Problem conclusion

The time that KDE allows before a connection request is
abandoned is arguably too long, especially because it can impair
the ability of a TEMS to service requests on existing
connections, which can ultimately destabilize the entire ITM
environment.

Therefore, three new environment variables have been introduced
that allow the timeouts to be controlled.

KDEB_MAX_SOCKET_RECEIVE_WAITS controls how many attempts KDE
makes to receive data from an endpoint; each time an attempt
fails, KDE waits for 5 seconds before retrying.
The default number of attempts is 36, meaning that KDE waits for
3 minutes (36 x 5 seconds) before the receive is abandoned.
The limits of this new variable are:
1 <= KDEB_MAX_SOCKET_RECEIVE_WAITS <= 36.

KDEBE_GSK_READ_TIMEOUT controls how long KDE waits for a GSKit
read to complete and, therefore, affects connections that use
TLS (ip.spipe).
The default timeout is 120 seconds.
The limits of this new variable are:
1 <= KDEBE_GSK_READ_TIMEOUT <= 120.

KDEBE_GSK_WRITE_TIMEOUT controls how long KDE waits for a GSKit
write to complete and, therefore, affects connections that use
TLS (ip.spipe).
The default timeout is 30 seconds.
The limits of this new variable are:
1 <= KDEBE_GSK_WRITE_TIMEOUT <= 30.

Warning: KDE is used by every ITM component, including agents,
tacmd, TEPS, RTEMS, hubs, etc.
If these variables are incorrectly configured, the overall
stability of an ITM environment can be adversely affected.
They should be used only after IBM has had an opportunity to
review logs and has confirmed that the associated timeouts are
the cause of a malfunction.

The fix for this APAR is contained in the following maintenance
packages:

Service pack: 6.3.0.7-TIV-ITM-SP0014

Temporary fix

Comments

APAR Information

APAR number: IJ41619
Reported component name: TEMS
Reported component ID: 5724C04MS
Reported release: 630
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Special Attention: NoSpecatt / Xsystem
Submitted date: 2022-08-17
Closed date: 2023-04-17
Last modified date: 2023-04-17

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name: TEMS
Fixed component ID: 5724C04MS

Applicable component levels

Business unit: Software (BU029)
Product: IBM Tivoli Monitoring V6 (SSZ8F3)
Platform: Platform Independent (PF025)
Version: 630

Document Information

Modified date:
17 April 2023
