
IT28193: MFT monitors stop working after an agent reconnects to its agent queue manager


APAR status

  • Closed as program error.

Error description

  • An IBM MQ V9.0.0.3 Managed File Transfer agent connects to its
    agent queue manager using the CLIENT transport, and has a number
    of resource monitors associated with it that poll directories
    looking for files whose names match a specific pattern. When a
    monitor finds a file that matches the pattern, it submits a
    managed transfer request to transfer that file.
    
    After running for a period of time, the agent loses connectivity
    to its agent queue manager and goes into recovery. Once the
    agent has reconnected to the agent queue manager, the resource
    monitors associated with the agent fail to submit any managed
    transfer requests to the agent, even though files that match the
    specified pattern are placed in the directories that the
    resource monitors are polling.
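
    For illustration, the following is a minimal Java sketch of the
    kind of directory poll a resource monitor performs: list the
    files whose names match a pattern, and submit a transfer request
    for each match. The directory, pattern, and poll interval are
    hypothetical, and the real monitor logic is internal to MFT;
    this only shows the general shape of the behaviour described
    above.

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class MonitorPollSketch {
        public static void main(String[] args)
                throws IOException, InterruptedException {
            Path watchedDir = Paths.get("/var/mft/inbound"); // hypothetical
            String pattern  = "*.csv";                       // hypothetical

            while (true) {
                // One poll: find files whose names match the pattern.
                try (DirectoryStream<Path> matches =
                         Files.newDirectoryStream(watchedDir, pattern)) {
                    for (Path file : matches) {
                        // A real monitor would submit a managed
                        // transfer request for the file here.
                        System.out.println("Would transfer: " + file);
                    }
                }
                Thread.sleep(60_000);   // hypothetical 60-second interval
            }
        }
    }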
    

Local fix

  • The agent needs to be stopped and restarted (for example, using
    the fteStopAgent and fteStartAgent commands) after it has lost
    connectivity to its agent queue manager.
    

Problem summary

  • ****************************************************************
    USERS AFFECTED:
    This issue affects users of IBM MQ Managed File Transfer (MFT)
    who have agents that:
    
    - Connect to their agent queue manager using the CLIENT
    transport.
    - Have been configured to use resource monitors.
    
    
    Platforms affected:
    MultiPlatform
    
    ****************************************************************
    PROBLEM DESCRIPTION:
    If an agent had connected to its agent queue manager using the
    CLIENT transport, and then lost connectivity to the queue
    manager for some reason (such as a network outage), all of the
    internal threads within the agent that were communicating with
    the queue manager at that time would start an internal
    "TriggerRecoveryThread". Each "TriggerRecoveryThread" would:
    
    - Stop any managed transfers that are currently in progress.
    - Stop any resource monitors within the agent.
    
    In addition to this, the first "TriggerRecoveryThread" that was
    started would create another internal thread, called the
    "RecoveryThread". This thread would periodically try to
    reconnect to the agent queue manager. Once the "RecoveryThread"
    had successfully reconnected, it would restart all of the
    monitors associated with the agent, and try to resume all of the
    managed transfers that were stopped when the agent was
    disconnected from the agent queue manager.
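
    The MFT agent is a Java process, and the following is a
    hypothetical Java reconstruction of this pre-fix structure,
    inferred solely from the description in this APAR. The names
    "TriggerRecoveryThread" and "RecoveryThread" come from this
    APAR; "AgentState", the ReentrantLock, and all method names are
    illustrative assumptions, not product code.

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative stand-in for the agent's internal state. The
    // internal lock described below serializes monitor start/stop.
    class AgentState {
        final ReentrantLock monitorLock = new ReentrantLock();
        volatile boolean monitorsRunning = true;

        void stopResourceMonitors() {
            monitorLock.lock();
            try { monitorsRunning = false; }   // monitors -> STOPPED
            finally { monitorLock.unlock(); }
        }

        void startResourceMonitors() {
            monitorLock.lock();
            try { monitorsRunning = true; }    // monitors -> STARTED
            finally { monitorLock.unlock(); }
        }
    }

    // Pre-fix: every internal thread that noticed the connection
    // failure started one of these.
    class TriggerRecoveryThread extends Thread {
        private static final AtomicBoolean firstToRun =
                new AtomicBoolean(true);
        private final AgentState agent;

        TriggerRecoveryThread(AgentState agent) { this.agent = agent; }

        @Override public void run() {
            // (stopping in-progress managed transfers omitted)
            agent.stopResourceMonitors();
            // Only the first TriggerRecoveryThread creates the
            // RecoveryThread; the others simply exit afterwards.
            if (firstToRun.compareAndSet(true, false)) {
                new RecoveryThread(agent).start();
            }
        }
    }

    class RecoveryThread extends Thread {
        private final AgentState agent;

        RecoveryThread(AgentState agent) { this.agent = agent; }

        @Override public void run() {
            // (periodic reconnection attempts omitted)
            agent.startResourceMonitors();
            // (resuming stopped managed transfers omitted)
        }
    }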
    
    The internal locking model within the agent meant that only one
    thread within an agent could start or stop resource monitors at
    any one time. As a result, if an agent lost connectivity to its
    agent queue manager for a very short period of time, then the
    following sequence of events could occur:
    
    - All of the internal threads within the agent that were
    connected to the agent queue manager created a
    "TriggerRecoveryThread".
    - "TriggerRecoveryThread-1" started. After stopping all of the
    managed transfers that were in progress, the thread obtained an
    internal lock and began stopping all of the resource monitors.
    - The other "TriggerRecoveryThreads" ("TriggerRecoveryThread-2",
    "TriggerRecoveryThread-3", and so on....) started up. There were
    no managed transfers to stop (as these had already been stopped
    by "TriggerRecoveryThread-1"), the threads then tried to stop
    the resource monitors running within the agent. Because
    "TriggerRecoveryThread-1" had taken the internal lock, all of
    these threads became blocked.
    - "TriggerRecoveryThread-1" finished stopping all of the
    resource monitors, and released the internal lock.
    - "TriggerRecoveryThread-1" then started a "RecoveryThread",
    before stopping
    
    - "TriggerRecoveryThread-2" now got the internal lock, and tried
    to stop the resource monitors.
    
    - - While this processing was taking place, the "RecoveryThread"
    reconnected to the agent queue manager. It then became blocked
    waiting for internal lock held by "TriggerRecoveryThread-2",
    which it needed in order to restart the resource monitors.
    
    - As the monitors had already been stopped by
    "TriggerRecoveryThread-1", there was nothing for
    "TriggerRecoveryThread-2" to do, and so it released the internal
    lock.
    
    - The "RecoveryThread" got the internal lock, and restarted all
    of the resource monitors. After all of the monitors had been
    restarted, the thread released the internal lock and started
    resuming any managed transfers that were stopped by
    "TriggerRecoveryThread-1".
    
    - Next, "TriggerRecoveryThread-3" got the internal lock, and
    stopped the resource monitors. It then released the lock before
    stopped.
    - All of the remaining "TriggerRecoveryThreads" performed the
    same behaviour, and stopped the resource monitors, before they
    stopped.
    
    After all of this processing had taken place, the agent had
    reconnected to the agent queue manager. However, all of the
    resource monitors associated with it were in a STOPPED state,
    which meant that they were no longer polling their directories
    or submitting managed transfer requests to the agent.
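
    Re-using the hypothetical sketch classes above, the fatal
    interleaving can be re-enacted deterministically (the steps are
    run one after another here so that the outcome is reproducible):

    public class RaceDemo {
        public static void main(String[] args)
                throws InterruptedException {
            AgentState agent = new AgentState();

            // "TriggerRecoveryThread-1": stops the monitors, then
            // hands over to the RecoveryThread.
            agent.stopResourceMonitors();
            Thread recovery = new RecoveryThread(agent);
            recovery.start();
            recovery.join();
            System.out.println("After recovery: "
                    + agent.monitorsRunning);   // true - STARTED again

            // A straggling "TriggerRecoveryThread-3" now gets the
            // lock and stops the monitors once more; no new
            // RecoveryThread is created this time.
            agent.stopResourceMonitors();
            System.out.println("Final state:    "
                    + agent.monitorsRunning);   // false - STOPPED
            // The monitors stay STOPPED until the agent is
            // restarted, which is exactly the reported symptom.
        }
    }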
    

Problem conclusion

  • To resolve this issue, IBM MQ Managed File Transfer agents have
    been updated so that only the first internal thread which
    detects that the agent queue manager is no longer available will
    create an "TriggerRecoveryThread". The "TriggerRecoveryThread"
    will start the "RecoveryThread" before exiting. This ensures
    that once the "RecoveryThread" has restarted the resource
    monitors associated with an agent, they will remain in a STARTED
    state and will only be stopped by another
    "TriggerRecoveryThread" if the agent loses connectivity to its
    agent queue manager again.
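
    One plausible shape for that change, continuing the hypothetical
    sketch above (the actual product code is not published, and
    "ConnectionLossHandler" and its methods are invented names):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Post-fix sketch: detection of a lost connection is funnelled
    // through a single guard, so at most one TriggerRecoveryThread
    // (and hence one stop/restart cycle) runs per outage.
    class ConnectionLossHandler {
        private final AtomicBoolean recoveryInProgress =
                new AtomicBoolean(false);
        private final AgentState agent;

        ConnectionLossHandler(AgentState agent) { this.agent = agent; }

        // Called by every internal thread that finds the agent
        // queue manager unavailable.
        void onConnectionLost() {
            // Only the FIRST detecting thread creates the
            // TriggerRecoveryThread; later detections are ignored.
            if (recoveryInProgress.compareAndSet(false, true)) {
                new TriggerRecoveryThread(agent).start();
            }
        }

        // Called once recovery completes, re-arming the guard in
        // case the agent loses connectivity again.
        void onRecoveryComplete() {
            recoveryInProgress.set(false);
        }
    }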
    
    ---------------------------------------------------------------
    The fix is targeted for delivery in the following PTFs:
    
    Version    Maintenance Level
    v9.0 LTS   9.0.0.7
    v9.1 CD    9.1.3
    v9.1 LTS   9.1.0.3
    
    The latest available MQ maintenance can be obtained from
    'WebSphere MQ Recommended Fixes'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037
    
    If the maintenance level is not yet available, information on
    its planned availability can be found in 'WebSphere MQ
    Planned Maintenance Release Dates'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
    ---------------------------------------------------------------
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT28193

  • Reported component name

    IBM MQ BASE M/P

  • Reported component ID

    5724H7261

  • Reported release

    903

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-02-21

  • Closed date

    2019-04-23

  • Last modified date

    2019-06-17

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    IBM MQ MFT V9.0

  • Fixed component ID

    5724H7262

Applicable component levels

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSYHRD","label":"IBM MQ"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"903","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]

Document Information

Modified date:
17 June 2019