
IT28193: MFT monitors stop working after an agent reconnects to its agent queue manager


APAR status

  • Closed as program error.

Error description

  • An IBM MQ V9.0.0.3 Managed File Transfer agent connects to its
    agent queue manager using the CLIENT transport, and has a number
    of resource monitors associated with it that poll directories
    looking for files whose names match a specific pattern. When a
    monitor finds a file that matches the pattern, it submits a
    managed transfer request to transfer that file.
    
    After running for a period of time, the agent loses connectivity
    to its agent queue manager and goes into recovery. Once the
    agent has reconnected to the agent queue manager, the resource
    monitors associated with the agent fail to submit any managed
    transfer requests to the agent, even though files that match the
    specified pattern are placed in the directories that the
    resource monitors are polling.
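
    For illustration, the following is a minimal Java sketch of the
    kind of directory poll a resource monitor performs: list the
    files whose names match a pattern, and submit a transfer request
    for each match. The directory, pattern, and poll interval are
    hypothetical, and the real monitor logic is internal to MFT;
    this only shows the general shape of the behaviour described
    above.

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class MonitorPollSketch {
        public static void main(String[] args)
                throws IOException, InterruptedException {
            Path watchedDir = Paths.get("/var/mft/inbound"); // hypothetical
            String pattern  = "*.csv";                       // hypothetical

            while (true) {
                // One poll: find files whose names match the pattern.
                try (DirectoryStream<Path> matches =
                         Files.newDirectoryStream(watchedDir, pattern)) {
                    for (Path file : matches) {
                        // A real monitor would submit a managed
                        // transfer request for the file here.
                        System.out.println("Would transfer: " + file);
                    }
                }
                Thread.sleep(60_000);   // hypothetical 60-second interval
            }
        }
    }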
    

Local fix

  • The agent needs to be stopped and restarted (for example, using
    the fteStopAgent and fteStartAgent commands) after it has lost
    connectivity to its agent queue manager.
    

Problem summary

  • ****************************************************************
    USERS AFFECTED:
    This issue affects users of IBM MQ Managed File Transfer (MFT)
    who have agents that:
    
    - Connect to their agent queue manager using the CLIENT
    transport.
    - Have been configured to use resource monitors.
    
    
    Platforms affected:
    MultiPlatform
    
    ****************************************************************
    PROBLEM DESCRIPTION:
    If an agent had connected to its agent queue manager using the
    CLIENT transport, and then lost connectivity to the queue
    manager for some reason (such as a network outage), all of the
    internal threads within the agent that were communicating with
    the queue manager at that time would start an internal
    "TriggerRecoveryThread". Each "TriggerRecoveryThread" would:
    
    - Stop any managed transfers that are currently in progress.
    - Stop any resource monitors within the agent.
    
    In addition to this, the first "TriggerRecoveryThread" that was
    started would create another internal thread, called the
    "RecoveryThread". This thread would periodically try to
    reconnect to the agent queue manager. Once the "RecoveryThread"
    had successfully reconnected, it would restart all of the
    monitors associated with the agent, and try to resume all of the
    managed transfers that were stopped when the agent was
    disconnected from the agent queue manager.
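
    The MFT agent is a Java process, and the following is a
    hypothetical Java reconstruction of this pre-fix structure,
    inferred solely from the description in this APAR. The names
    "TriggerRecoveryThread" and "RecoveryThread" come from this
    APAR; "AgentState", the ReentrantLock, and all method names are
    illustrative assumptions, not product code.

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative stand-in for the agent's internal state. The
    // internal lock described below serializes monitor start/stop.
    class AgentState {
        final ReentrantLock monitorLock = new ReentrantLock();
        volatile boolean monitorsRunning = true;

        void stopResourceMonitors() {
            monitorLock.lock();
            try { monitorsRunning = false; }   // monitors -> STOPPED
            finally { monitorLock.unlock(); }
        }

        void startResourceMonitors() {
            monitorLock.lock();
            try { monitorsRunning = true; }    // monitors -> STARTED
            finally { monitorLock.unlock(); }
        }
    }

    // Pre-fix: every internal thread that noticed the connection
    // failure started one of these.
    class TriggerRecoveryThread extends Thread {
        private static final AtomicBoolean firstToRun =
                new AtomicBoolean(true);
        private final AgentState agent;

        TriggerRecoveryThread(AgentState agent) { this.agent = agent; }

        @Override public void run() {
            // (stopping in-progress managed transfers omitted)
            agent.stopResourceMonitors();
            // Only the first TriggerRecoveryThread creates the
            // RecoveryThread; the others simply exit afterwards.
            if (firstToRun.compareAndSet(true, false)) {
                new RecoveryThread(agent).start();
            }
        }
    }

    class RecoveryThread extends Thread {
        private final AgentState agent;

        RecoveryThread(AgentState agent) { this.agent = agent; }

        @Override public void run() {
            // (periodic reconnection attempts omitted)
            agent.startResourceMonitors();
            // (resuming stopped managed transfers omitted)
        }
    }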
    
    The internal locking model within the agent meant that only one
    thread within an agent could start or stop resource monitors at
    any one time. As a result, if an agent lost connectivity to its
    agent queue manager for a very short period of time, then the
    following sequence of events could occur:
    
    - All of the internal threads within the agent that were
    connected to the agent queue manager created a
    "TriggerRecoveryThread".
    - "TriggerRecoveryThread-1" started. After stopping all of the
    managed transfers that were in progress, the thread obtained an
    internal lock and began stopping all of the resource monitors.
    - The other "TriggerRecoveryThreads" ("TriggerRecoveryThread-2",
    "TriggerRecoveryThread-3", and so on....) started up. There were
    no managed transfers to stop (as these had already been stopped
    by "TriggerRecoveryThread-1"), the threads then tried to stop
    the resource monitors running within the agent. Because
    "TriggerRecoveryThread-1" had taken the internal lock, all of
    these threads became blocked.
    - "TriggerRecoveryThread-1" finished stopping all of the
    resource monitors, and released the internal lock.
    - "TriggerRecoveryThread-1" then started a "RecoveryThread",
    before stopping
    
    - "TriggerRecoveryThread-2" now got the internal lock, and tried
    to stop the resource monitors.
    
    - - While this processing was taking place, the "RecoveryThread"
    reconnected to the agent queue manager. It then became blocked
    waiting for internal lock held by "TriggerRecoveryThread-2",
    which it needed in order to restart the resource monitors.
    
    - As the monitors had already been stopped by
    "TriggerRecoveryThread-1", there was nothing for
    "TriggerRecoveryThread-2" to do, and so it released the internal
    lock.
    
    - The "RecoveryThread" got the internal lock, and restarted all
    of the resource monitors. After all of the monitors had been
    restarted, the thread released the internal lock and started
    resuming any managed transfers that were stopped by
    "TriggerRecoveryThread-1".
    
    - Next, "TriggerRecoveryThread-3" got the internal lock, and
    stopped the resource monitors. It then released the lock before
    stopped.
    - All of the remaining "TriggerRecoveryThreads" performed the
    same behaviour, and stopped the resource monitors, before they
    stopped.
    
    After all of this processing had taken place, the agent had
    reconnected to the agent queue manager. However, all of the
    resource monitors associated with it were in a STOPPED state,
    which meant that they were no longer polling their directories
    or submitting managed transfer requests to the agent.
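
    Re-using the hypothetical sketch classes above, the fatal
    interleaving can be re-enacted deterministically (the steps are
    run one after another here so that the outcome is reproducible):

    public class RaceDemo {
        public static void main(String[] args)
                throws InterruptedException {
            AgentState agent = new AgentState();

            // "TriggerRecoveryThread-1": stops the monitors, then
            // hands over to the RecoveryThread.
            agent.stopResourceMonitors();
            Thread recovery = new RecoveryThread(agent);
            recovery.start();
            recovery.join();
            System.out.println("After recovery: "
                    + agent.monitorsRunning);   // true - STARTED again

            // A straggling "TriggerRecoveryThread-3" now gets the
            // lock and stops the monitors once more; no new
            // RecoveryThread is created this time.
            agent.stopResourceMonitors();
            System.out.println("Final state:    "
                    + agent.monitorsRunning);   // false - STOPPED
            // The monitors stay STOPPED until the agent is
            // restarted, which is exactly the reported symptom.
        }
    }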
    

Problem conclusion

  • To resolve this issue, IBM MQ Managed File Transfer agents have
    been updated so that only the first internal thread which
    detects that the agent queue manager is no longer available will
    create an "TriggerRecoveryThread". The "TriggerRecoveryThread"
    will start the "RecoveryThread" before exiting. This ensures
    that once the "RecoveryThread" has restarted the resource
    monitors associated with an agent, they will remain in a STARTED
    state and will only be stopped by another
    "TriggerRecoveryThread" if the agent loses connectivity to its
    agent queue manager again.
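
    One plausible shape for that change, continuing the hypothetical
    sketch above (the actual product code is not published, and
    "ConnectionLossHandler" and its methods are invented names):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Post-fix sketch: detection of a lost connection is funnelled
    // through a single guard, so at most one TriggerRecoveryThread
    // (and hence one stop/restart cycle) runs per outage.
    class ConnectionLossHandler {
        private final AtomicBoolean recoveryInProgress =
                new AtomicBoolean(false);
        private final AgentState agent;

        ConnectionLossHandler(AgentState agent) { this.agent = agent; }

        // Called by every internal thread that finds the agent
        // queue manager unavailable.
        void onConnectionLost() {
            // Only the FIRST detecting thread creates the
            // TriggerRecoveryThread; later detections are ignored.
            if (recoveryInProgress.compareAndSet(false, true)) {
                new TriggerRecoveryThread(agent).start();
            }
        }

        // Called once recovery completes, re-arming the guard in
        // case the agent loses connectivity again.
        void onRecoveryComplete() {
            recoveryInProgress.set(false);
        }
    }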
    
    ---------------------------------------------------------------
    The fix is targeted for delivery in the following PTFs:
    
    Version    Maintenance Level
    v9.0 LTS   9.0.0.7
    v9.1 CD    9.1.3
    v9.1 LTS   9.1.0.3
    
    The latest available MQ maintenance can be obtained from
    'WebSphere MQ Recommended Fixes'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037
    
    If the maintenance level is not yet available, information on
    its planned availability can be found in 'WebSphere MQ
    Planned Maintenance Release Dates'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
    ---------------------------------------------------------------
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT28193

  • Reported component name

    IBM MQ BASE M/P

  • Reported component ID

    5724H7261

  • Reported release

    903

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-02-21

  • Closed date

    2019-04-23

  • Last modified date

    2019-06-17

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    IBM MQ MFT V9.0

  • Fixed component ID

    5724H7262

Applicable component levels

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSYHRD","label":"IBM MQ"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"903","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]

Document Information

Modified date:
17 June 2019