IBM Support

IT41652: RDQM: Unexpected failover caused by excessive sensitivity to network delays

APAR status

  • Closed as program error.

Error description

  • A replicated data queue manager (RDQM) instance fails over more
    than expected during periods of network instability.
    

Local fix

  • Note:
    * Before making the two changes below, end all queue managers
    running in the HA group.
    * After making the changes, restart all nodes.
    * Once all nodes have restarted, start each queue manager on
    its preferred primary node.
    
    Change #1
    In the /etc/corosync/corosync.conf file, under the totem
    stanza, add 'token: 3000' to increase the corosync token
    timeout (the value is in milliseconds). For example:
    totem {
      version: 2
      crypto_cipher: none
      crypto_hash: none
      clear_node_high_bit: yes
      token: 3000
      interface {
    
    Change #2
    In the /etc/drbd.d/global_common.conf file, under the net
    stanza, add 'ping-timeout 40;' (the value is in tenths of a
    second, so 40 means 4 seconds) as shown below:
     net {
        ping-timeout 40;
        max-buffers 40k;
    
    
    IBM does not advise changing any other values in these
    configuration files, nor recommend the use of values other than
    those stated here.
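
    After editing both files, it can be useful to confirm that the
    two settings are present before restarting the nodes. The
    following is a minimal sketch only, not an IBM-supplied tool:
    the helper names (stanza_body, has_corosync_token,
    has_drbd_ping_timeout) are hypothetical, and the parsing is a
    simple regular-expression scan that assumes each setting sits on
    its own line, as in the examples above.

```python
import re


def stanza_body(text, name):
    """Return the text inside '<name> { ... }'.

    Returns None if the stanza is absent. If the braces are not
    balanced, everything after the opening brace is returned.
    """
    match = re.search(r"(?m)^\s*" + re.escape(name) + r"\s*\{", text)
    if match is None:
        return None
    start = match.end()
    depth = 1
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i]
    return text[start:]


def has_corosync_token(text, expected=3000):
    """True if the totem stanza contains 'token: <expected>'."""
    body = stanza_body(text, "totem")
    if body is None:
        return False
    # Lines starting with '#' are comments and do not match.
    found = re.search(r"(?m)^\s*token\s*:\s*(\d+)\b", body)
    return found is not None and int(found.group(1)) == expected


def has_drbd_ping_timeout(text, expected=40):
    """True if the net stanza contains 'ping-timeout <expected>;'."""
    body = stanza_body(text, "net")
    if body is None:
        return False
    found = re.search(r"(?m)^\s*ping-timeout\s+(\d+)\s*;", body)
    return found is not None and int(found.group(1)) == expected
```

    To use it, read the contents of /etc/corosync/corosync.conf and
    /etc/drbd.d/global_common.conf on each node and pass the text to
    has_corosync_token and has_drbd_ping_timeout respectively; both
    should return True once the changes are in place.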
    

Problem summary

  • ****************************************************************
    USERS AFFECTED:
    All RDQM users whose networks are subject to delays.
    
    
    Platforms affected:
    Linux on x86-64
    
    ****************************************************************
    PROBLEM DESCRIPTION:
    The default values of internal timeouts used by the corosync and
    drbd libraries underneath IBM MQ RDQM were found to be overly
    sensitive to network delays, and required adjustment.
    

Problem conclusion

  • The drbd ping-timeout and corosync timeout values have been
    increased to make RDQM more tolerant of network and VM delays.
    
    ---------------------------------------------------------------
    The fix is targeted for delivery in the following PTFs:
    
    Version    Maintenance Level
    v9.2 LTS   9.2.0.10
    v9.3 LTS   9.3.0.5
    v9.x CD    9.3.3
    
    The latest available maintenance can be obtained from
    'WebSphere MQ Recommended Fixes'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037
    
    If the maintenance level is not yet available, information on
    its planned availability can be found in 'WebSphere MQ
    Planned Maintenance Release Dates'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
    ---------------------------------------------------------------
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT41652

  • Reported component name

    MQ BASE V9.2

  • Reported component ID

    5724H7281

  • Reported release

    920

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-07-29

  • Closed date

    2023-02-22

  • Last modified date

    2023-02-22

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    MQ BASE V9.2

  • Fixed component ID

    5724H7281

Applicable component levels

Document Information

Modified date:
23 February 2023