APAR status
Closed as program error.
Error description
A TEMS becomes unresponsive and agents go off-line. Restarting the TEMS resolves the problem for a while but the symptom eventually reoccurs.
Local fix
NA
Problem summary
A TEMS becomes unresponsive and agents go offline. Restarting the TEMS resolves the problem for a while, but the symptom eventually recurs.
When the symptom occurs, a netstat report shows that the Recv-Q value of multiple connections to the TEMS is increasing. This indicates that the KDE (socket layer) component of the TEMS is not processing inbound traffic.
Such stalls can occur when the KDE layer accepts a connection request but subsequent attempts to receive data from the endpoint fail. If the failure occurs during the TLS handshake, KDE waits 2 minutes for a response before timing out. If data retrieval fails while KDE is exchanging connection details with the endpoint, KDE waits 3 minutes before timing out the request.
If multiple connection requests time out, KDE can exit one timeout only to enter another. KDE then appears stalled because, while it waits for a timeout to expire, it cannot service other sockets that are attempting to communicate with the TEMS. This is why the Recv-Q value of existing connections is seen to increase.
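The netstat symptom described above can be checked by comparing two snapshots taken some time apart. The following is a minimal sketch, not an IBM tool: it assumes the Linux `netstat -ant` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and the function names, the sample addresses and the listening port 3660 are illustrative assumptions that should be replaced with the values of the affected TEMS.

```python
from typing import Dict, List, Tuple

Connection = Tuple[str, str]  # (local address, foreign address)


def parse_recv_q(netstat_text: str, local_port: int) -> Dict[Connection, int]:
    """Map each ESTABLISHED TCP connection on local_port to its Recv-Q value."""
    queues: Dict[Connection, int] = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[1].isdigit() or fields[5] != "ESTABLISHED":
            continue
        local, foreign = fields[3], fields[4]
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue
        queues[(local, foreign)] = int(fields[1])
    return queues


def growing_queues(before: Dict[Connection, int],
                   after: Dict[Connection, int]) -> List[Connection]:
    """Return connections whose Recv-Q is larger in the second snapshot."""
    return sorted(
        conn for conn, depth in after.items()
        if depth > 0 and depth > before.get(conn, 0)
    )


# Made-up sample data standing in for two real netstat reports.
SNAPSHOT_1 = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:3660           10.0.1.20:41822         ESTABLISHED
tcp     2048      0 10.0.0.5:3660           10.0.1.21:50110         ESTABLISHED
tcp        0      0 10.0.0.5:22             10.0.9.9:60000          ESTABLISHED
"""

SNAPSHOT_2 = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp     4096      0 10.0.0.5:3660           10.0.1.20:41822         ESTABLISHED
tcp     9216      0 10.0.0.5:3660           10.0.1.21:50110         ESTABLISHED
tcp      512      0 10.0.0.5:22             10.0.9.9:60000          ESTABLISHED
"""

before = parse_recv_q(SNAPSHOT_1, 3660)
after = parse_recv_q(SNAPSHOT_2, 3660)
for local, foreign in growing_queues(before, after):
    print(f"{foreign} -> {local}: Recv-Q {before.get((local, foreign), 0)} -> {after[(local, foreign)]}")
```

A Recv-Q value that keeps rising on several connections at once is consistent with the stall described in this APAR, but it is not proof on its own; the logs still need to be reviewed.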
Problem conclusion
The time that KDE allows before a connection request is abandoned is arguably too long, especially because it can impair the ability of a TEMS to service requests on existing connections, which can ultimately destabilize the entire ITM environment. Therefore, three new environment variables have been introduced that allow the timeouts to be controlled.
KDEB_MAX_SOCKET_RECEIVE_WAITS controls how many attempts KDE makes to receive data from an endpoint. Each time an attempt fails, KDE waits 5 seconds before retrying. The default is 36 attempts, meaning that KDE waits 3 minutes (36 x 5 seconds) before the receive is abandoned. The limits of this variable are: 1 <= KDEB_MAX_SOCKET_RECEIVE_WAITS <= 36.
KDEBE_GSK_READ_TIMEOUT controls how long KDE waits for a GSKit read to complete and therefore affects connections that use TLS (ip.spipe). The default timeout is 120 seconds. The limits of this variable are: 1 <= KDEBE_GSK_READ_TIMEOUT <= 120.
KDEBE_GSK_WRITE_TIMEOUT controls how long KDE waits for a GSKit write to complete and therefore affects connections that use TLS (ip.spipe). The default timeout is 30 seconds. The limits of this variable are: 1 <= KDEBE_GSK_WRITE_TIMEOUT <= 30.
Warning: KDE is used by every ITM component, including agents, tacmd, TEPS, RTEMS and hubs. If these variables are configured incorrectly, the overall stability of an ITM environment can be adversely affected. They should be used only after IBM has had an opportunity to review logs and has confirmed that the associated timeouts are the cause of a malfunction.
The fix for this APAR is contained in the following maintenance package:
Service pack: 6.3.0.7-TIV-ITM-SP0014
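The documented limits and the retry arithmetic can be sanity-checked before a value is proposed to IBM support. The sketch below is illustrative only: the function names are hypothetical, the limits and defaults are copied from this APAR, and the back-to-back calculation is a simplified model of the "exit one timeout only to enter another" behaviour. The APAR does not state how KDE treats a value outside the limits, so the check simply rejects it.

```python
from typing import Dict, Mapping

RECEIVE_RETRY_INTERVAL_S = 5  # wait after each failed receive attempt, per this APAR

# name -> (minimum, maximum, default), as documented in this APAR
LIMITS = {
    "KDEB_MAX_SOCKET_RECEIVE_WAITS": (1, 36, 36),
    "KDEBE_GSK_READ_TIMEOUT": (1, 120, 120),
    "KDEBE_GSK_WRITE_TIMEOUT": (1, 30, 30),
}


def check_settings(env: Mapping[str, str]) -> Dict[str, int]:
    """Return the value of each variable, falling back to the default when unset.

    Raises ValueError for a non-integer or out-of-range value. This is a
    pre-flight check only; it does not model what KDE does with a bad value.
    """
    values: Dict[str, int] = {}
    for name, (low, high, default) in LIMITS.items():
        raw = env.get(name)
        if raw is None:
            values[name] = default
            continue
        try:
            value = int(raw)
        except ValueError:
            raise ValueError(f"{name}={raw!r} is not an integer") from None
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
        values[name] = value
    return values


def receive_wait_seconds(values: Mapping[str, int]) -> int:
    """Total time before a receive is abandoned: attempts x 5 seconds."""
    return values["KDEB_MAX_SOCKET_RECEIVE_WAITS"] * RECEIVE_RETRY_INTERVAL_S


def back_to_back_wait_seconds(per_connection_wait: int, failing_connections: int) -> int:
    """Simplified model: each failing connection is waited on in turn."""
    return per_connection_wait * failing_connections


defaults = check_settings({})
tuned = check_settings({"KDEB_MAX_SOCKET_RECEIVE_WAITS": "6"})  # example value only

for label, values in (("default", defaults), ("tuned", tuned)):
    wait = receive_wait_seconds(values)
    stall = back_to_back_wait_seconds(wait, failing_connections=4)
    print(f"{label}: {wait} s per receive, {stall} s for 4 failing connections in a row")
```

In practice `check_settings(os.environ)` could be used instead of a literal dictionary. The value 6 is not a recommendation; suitable values depend on the environment and on IBM's review of the logs.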
Temporary fix
Comments
APAR Information
APAR number
IJ41619
Reported component name
TEMS
Reported component ID
5724C04MS
Reported release
630
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-08-17
Closed date
2023-04-17
Last modified date
2023-04-17
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
TEMS
Fixed component ID
5724C04MS
Applicable component levels
IBM Tivoli Monitoring V6 (product code SSZ8F3), version 630, Platform Independent (PF025), business unit Software (BU029)
Document Information
Modified date:
17 April 2023