
IJ41619: A TEMS BECOMES UNRESPONSIVE AND AGENTS GO OFF-LINE

APAR status

  • Closed as program error.

Error description

  • Problem Description:
    A TEMS becomes unresponsive and agents go off-line.
    Restarting the TEMS resolves the problem for a while but the
    symptom eventually reoccurs.
    
    Problem Summary:
    When the symptom occurs, a netstat report reveals that the
    recv-Q buffer of multiple connections to the TEMS is increasing.
    This is an indication that the KDE (socket layer) component of
    the TEMS is not processing inbound traffic.
    
    Such stalls can occur when the KDE layer accepts a connection
    request but subsequent attempts to receive data from the
    endpoint fail.
    If the TLS handshake is involved, KDE will wait for a response
    for 2 minutes before timing out.
    If data retrieval fails while KDE is exchanging connection
    details with the endpoint, KDE will wait for 3 minutes before
    timing out the request.
    
    If multiple connection requests time out, KDE can exit one
    timeout only to enter another.
    In this event, KDE appears stalled because, while waiting for a
    timeout to expire, it is unable to service other sockets that
    are attempting to communicate with the TEMS.
    This is why the recv-Q buffer of existing connections is seen to
    increase.
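The recv-Q growth described above can be checked with a short shell filter over netstat output. This is a sketch only, assuming the common Linux `netstat -ant` column layout (`Proto Recv-Q Send-Q Local-Address Foreign-Address State`); the TEMS port in your environment will differ:

```shell
# Keep established connections whose Recv-Q is non-zero; header lines
# drop out automatically because "Recv-Q" compares numerically as 0.
filter_recvq() { awk '$2 > 0 && $6 == "ESTABLISHED"'; }

# Typical use: run repeatedly and watch whether Recv-Q keeps growing
# on connections to the TEMS.
netstat -ant 2>/dev/null | filter_recvq
```

A steadily increasing Recv-Q on connections to the TEMS matches the stall described above; a transient non-zero value does not.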
    
    Solution:
    The time that KDE allows before a connection request is
    abandoned is arguably too long, especially as it can impact the
    ability of a TEMS to service requests on existing connections,
    which can ultimately lead to instability of the entire ITM
    environment.
    Therefore, three new environment variables have been introduced
    that allow the timeouts to be controlled.
    
    KDEB_MAX_SOCKET_RECEIVE_WAITS affects how many attempts KDE will
    make to receive data from an endpoint; each time an attempt
    fails, it waits for 5 seconds before retrying.
    The default number of retries is 36, meaning that KDE will wait
    for 3 minutes before the receive is abandoned.
    The limits of this new variable are: 1 <=
    KDEB_MAX_SOCKET_RECEIVE_WAITS <= 36.
    
    KDEBE_GSK_READ_TIMEOUT affects how long KDE waits for a GSKit
    read to complete and, therefore, affects connections that use
    TLS (ip.spipe).
    The default timeout is 120 seconds.
    The limits of this new variable are: 1 <=
    KDEBE_GSK_READ_TIMEOUT <= 120.
    
    KDEBE_GSK_WRITE_TIMEOUT affects how long KDE waits for a GSKit
    write to complete and, therefore, affects connections that use
    TLS (ip.spipe).
    The default timeout is 30 seconds.
    The limits of this new variable are: 1 <=
    KDEBE_GSK_WRITE_TIMEOUT <= 30.
    
    Warning: KDE is used by every ITM component, including agents,
    tacmd, TEPS, RTEMS, hubs, etc.
    If incorrectly configured, the overall stability of an ITM
    environment can be adversely affected.
    These variables should only be used after IBM has had an
    opportunity to review logs and has confirmed that the associated
    timeouts are the cause of a malfunction.
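As an illustrative sketch only, and only after IBM has reviewed the logs as described above: in ITM these variables are typically set as plain KEY=VALUE lines in the component's environment file (for a TEMS on UNIX/Linux, ms.ini under <CANDLEHOME>/config is the usual place; the exact file for your platform is an assumption here). The values below are hypothetical examples within the documented limits, not recommendations; for instance, 6 retries at 5 seconds each gives a 30-second receive window instead of the 3-minute default.

```
KDEB_MAX_SOCKET_RECEIVE_WAITS=6
KDEBE_GSK_READ_TIMEOUT=30
KDEBE_GSK_WRITE_TIMEOUT=10
```

The component typically must be restarted for changed environment values to take effect.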
    

Local fix

  • NA
    

Problem summary

  • A TEMS becomes unresponsive and agents go off-line.
    Restarting the TEMS resolves the problem for a while but the
    symptom eventually reoccurs.
    
    When the symptom occurs, a netstat report reveals that the
    recv-Q buffer of multiple connections to the TEMS is increasing.
    This is an indication that the KDE (socket layer) component of
    the TEMS is not processing inbound traffic.
    
    Such stalls can occur when the KDE layer accepts a connection
    request but subsequent attempts to receive data from the
    endpoint fail.
    If the TLS handshake is involved, KDE will wait for a response
    for 2 minutes before timing out.
    If data retrieval fails while KDE is exchanging connection
    details with the endpoint, KDE will wait for 3 minutes before
    timing out the request.
    
    If multiple connection requests time out, KDE can exit one
    timeout only to enter another.
    In this event, KDE appears stalled because, while waiting for a
    timeout to expire, it is unable to service other sockets that
    are attempting to communicate with the TEMS.
    This is why the recv-Q buffer of existing connections is seen to
    increase.
    

Problem conclusion

  • The time that KDE allows before a connection request is
    abandoned is arguably too long, especially as it can impact the
    ability of a TEMS to service requests on existing connections,
    which can ultimately lead to instability of the entire ITM
    environment.
    
    Therefore, three new environment variables have been introduced
    that allow the timeouts to be controlled.
    
    KDEB_MAX_SOCKET_RECEIVE_WAITS affects how many attempts KDE will
    make to receive data from an endpoint; each time an attempt
    fails, it waits for 5 seconds before retrying.
    The default number of retries is 36, meaning that KDE will wait
    for 3 minutes before the receive is abandoned.
    The limits of this new variable are: 1 <=
    KDEB_MAX_SOCKET_RECEIVE_WAITS <= 36.
    
    KDEBE_GSK_READ_TIMEOUT affects how long KDE waits for a GSKit
    read to complete and, therefore, affects connections that use
    TLS (ip.spipe).
    The default timeout is 120 seconds.
    The limits of this new variable are: 1 <=
    KDEBE_GSK_READ_TIMEOUT <= 120.
    
    KDEBE_GSK_WRITE_TIMEOUT affects how long KDE waits for a GSKit
    write to complete and, therefore, affects connections that use
    TLS (ip.spipe).
    The default timeout is 30 seconds.
    The limits of this new variable are: 1 <=
    KDEBE_GSK_WRITE_TIMEOUT <= 30.
    
    Warning: KDE is used by every ITM component, including agents,
    tacmd, TEPS, RTEMS, hubs, etc.
    If incorrectly configured, the overall stability of an ITM
    environment can be adversely affected.
    These variables should only be used after IBM has had an
    opportunity to review logs and has confirmed that the associated
    timeouts are the cause of a malfunction.
    
    
    The fix for this APAR is contained in the following maintenance
    packages:
    
        service pack: 6.3.0.7-TIV-ITM-SP0014
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ41619

  • Reported component name

    TEMS

  • Reported component ID

    5724C04MS

  • Reported release

    630

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-08-17

  • Closed date

    2023-04-17

  • Last modified date

    2023-04-17

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TEMS

  • Fixed component ID

    5724C04MS

Applicable component levels

