IT40017: IBM MQ channels might go unresponsive with high CPU usage in channel process if channel synchronization record is corrupted

APAR status

Closed as program error.

Error description

IBM MQ channels might become unresponsive, with high CPU usage
in the channel process (amqrmppa or runmqchl), if the channel
synchronization record is corrupted. If the problem affects
channels on the sender side (for example SDR or CLUSSDR), the
channels send no messages. The "DIS CHS" output is likely to
show no value in the SUBSTATE field:

AMQ8417I: Display Channel Status details.
  CHANNEL(CLUSCHL1)      CHLTYPE(CLUSSDR)
  ...
  RQMNAME(RQM1)            STATUS(RUNNING)
  SUBSTATE( )
  XMITQ(SYSTEM.CLUSTER.TRANSMIT.CLUSCHL1)

If the affected channels are on the receiver side (for example
RCVR or CLUSRCVR), the channel process on the receiver side
consumes high CPU and the corresponding SDR or CLUSSDR channel
goes into retrying state.

The top output for the affected channel process shows high CPU
usage.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 133101 mqm       20   0  265412  13836  11584 R 106.2   0.4   2:08.97 runmqchl
 133101 mqm       20   0  265412  13836  11584 R  99.7   0.4   2:11.97 runmqchl
 133101 mqm       20   0  265412  13836  11584 R  99.3   0.4   2:14.97 runmqchl
 133101 mqm       20   0  265412  13836  11584 R  99.7   0.4   2:17.97 runmqchl

IBM MQ trace shows the channel process repeatedly calling
rflSeekBytes and rflReadBytes with the same pattern of
comparisons and file pointers.

Example:

 19:54:22.511503   133101.1      RSESS:000001   NewFilePointer=716(0x000002cc) <---------
 19:54:22.511528   133101.1      RSESS:000001   NewFilePointer=1072(0x00000430)
 19:54:22.511552   133101.1      RSESS:000001   NewFilePointer=1428(0x00000594)
 19:54:22.511577   133101.1      RSESS:000001   NewFilePointer=1784(0x000006f8)
 19:54:22.511603   133101.1      RSESS:000001   NewFilePointer=2140(0x0000085c)
 19:54:22.511628   133101.1      RSESS:000001   NewFilePointer=2496(0x000009c0)
 19:54:22.511653   133101.1      RSESS:000001   NewFilePointer=2852(0x00000b24)
 19:54:22.511678   133101.1      RSESS:000001   NewFilePointer=716(0x000002cc)  <---------
 19:54:22.511704   133101.1      RSESS:000001   NewFilePointer=1072(0x00000430)
 19:54:22.511728   133101.1      RSESS:000001   NewFilePointer=1428(0x00000594)
 19:54:22.511753   133101.1      RSESS:000001   NewFilePointer=1784(0x000006f8)
 19:54:22.511778   133101.1      RSESS:000001   NewFilePointer=2140(0x0000085c)
 19:54:22.511806   133101.1      RSESS:000001   NewFilePointer=2496(0x000009c0)
 19:54:22.511831   133101.1      RSESS:000001   NewFilePointer=2852(0x00000b24)
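
The repeating file pointers in the trace excerpt can be picked out programmatically. The sketch below is illustrative only: the function name find_pointer_cycle is hypothetical, and it assumes the text passed in contains only the looping portion of the trace, with NewFilePointer=<decimal>(<hex>) entries as shown in the example.

```python
import re


def find_pointer_cycle(trace_text: str, min_repeats: int = 2):
    """Return the repeating sequence of file offsets, or None if there is none."""
    offsets = [int(m) for m in re.findall(r"NewFilePointer=(\d+)\(", trace_text)]
    # Try the shortest period first; require at least min_repeats full passes.
    for period in range(1, len(offsets) // min_repeats + 1):
        if all(offsets[i] == offsets[i - period] for i in range(period, len(offsets))):
            return offsets[:period]
    return None
```

A detected cycle is a hint rather than proof, because the check does not distinguish a genuine loop from a short excerpt that happens to repeat. It is best applied to a trace window that covers many passes.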

Local fix

1. Stop the queue manager.
2. Back up the queue manager.
3. Rename the sync file AMQRSYNA.DAT.
4. Start the queue manager with the -ns option (strmqm -ns QM).
5. Re-create the channel sync file (rcrmqobj -m QM -t syncfile).
6. Stop the queue manager.
7. Start the queue manager.
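
The steps above can be sketched as a small dry-run helper. This is an illustration under stated assumptions, not a supported procedure: the helper names build_plan and execute are hypothetical, the location of AMQRSYNA.DAT is not given in this APAR and must be supplied by the caller, the ".corrupt" suffix is an arbitrary choice, and the use of "endmqm -w" to stop the queue manager is an assumption (the APAR only says to stop it). The backup step is left as a manual action.

```python
import subprocess
from pathlib import Path


def build_plan(qmgr: str, sync_file: Path) -> list[tuple]:
    """Return the ordered steps; the backup is expected to be done by hand."""
    renamed = sync_file.with_name(sync_file.name + ".corrupt")
    return [
        ("run", ["endmqm", "-w", qmgr]),                     # stop (assumed flag)
        ("rename", sync_file, renamed),                      # move the bad file aside
        ("run", ["strmqm", "-ns", qmgr]),                    # start with -ns
        ("run", ["rcrmqobj", "-m", qmgr, "-t", "syncfile"]), # rebuild sync file
        ("run", ["endmqm", "-w", qmgr]),                     # stop again
        ("run", ["strmqm", qmgr]),                           # normal start
    ]


def execute(plan: list[tuple], dry_run: bool = True) -> list[str]:
    """Log every step; only touch the system when dry_run is False."""
    log = []
    for step in plan:
        if step[0] == "run":
            log.append(" ".join(step[1]))
            if not dry_run:
                subprocess.run(step[1], check=True)
        else:
            _, src, dst = step
            log.append(f"rename {src} -> {dst}")
            if not dry_run:
                src.rename(dst)
    return log
```

Running execute(build_plan("QM", path_to_sync_file)) with the default dry_run=True only returns the list of actions, which makes it possible to review the sequence before anything is changed.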

Problem summary

****************************************************************
USERS AFFECTED:
All users of IBM MQ distributed channels who have a corrupted
channel synchronization record in the channel sync file.
Corruption of this file is not an expected or typical usage
pattern, and has not been observed as a result of any known
product defect.

The channel sync file is used by all queue manager channel types
except SVRCONN/CLNTCONN and AMQP channels.


Platforms affected:
MultiPlatform

****************************************************************
PROBLEM DESCRIPTION:
The IBM MQ channel process did not detect corruption in the
channel synchronization record, which caused an infinite loop
and left the channel in an unresponsive state.

Problem conclusion

The IBM MQ code has been modified to prevent an infinite loop if
the channel synchronization record is corrupted.

This APAR does not address the corruption in the channel
synchronization record itself, as the cause of the corruption at
the time this issue was observed remains unknown.

With the fix applied, if the queue manager detects an infinite
loop while searching for a channel record in the channel
synchronization file, it generates the following error message
and the channel goes into retrying state.

------------------------
02/28/2022 10:37:07 PM - Process(181467.1) User(root) Program(runmqchl)
 Host(host1.ibm.com) Installation(Installation1)
 VRMF(9.1.0.7) QMgr(qm1)
 Time(2022-03-01T06:37:07.434Z)
 ArithInsert1(1017)
 CommentInsert1(AMQRSYNA.DAT)

AMQ9516E: File error occurred for file 'AMQRSYNA.DAT'.

EXPLANATION:
The filesystem returned error code 1017 for file 'AMQRSYNA.DAT'.
ACTION:
Record the name of the file and tell the systems administrator, who should
ensure that file is correct and available, for example that the current user
has appropriate access to the file for reading or writing.
------------------------
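
An error log can be scanned for this message with a short script. The sketch below assumes entries are delimited by lines of dashes and carry the Program(...) and ArithInsert1(...) fields as in the sample above. The function name sync_file_errors is hypothetical.

```python
import re


def sync_file_errors(log_text: str, file_name: str = "AMQRSYNA.DAT") -> list[dict]:
    """Return one dict per AMQ9516E entry that names the sync file."""
    hits = []
    # Split the log on separator lines made of five or more dashes.
    for entry in re.split(r"(?m)^-{5,}\s*$", log_text):
        if "AMQ9516E" not in entry or file_name not in entry:
            continue
        code = re.search(r"ArithInsert1\((\d+)\)", entry)
        prog = re.search(r"Program\((\w+)\)", entry)
        hits.append({
            "program": prog.group(1) if prog else None,
            "code": int(code.group(1)) if code else None,
        })
    return hits
```

In the sample entry the reported code is 1017, which is 0x3F9 in hexadecimal.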

The user must then resolve the issue, in this case by
rebuilding the sync file with rcrmqobj as described in the
Local fix section.

The queue manager also generates the following failure data
capture (FDC) record.

AMQ184577.0.FDC 2022/03/01 17:37:07.740247-8 Installation1
runmqchl 184577 1 RM738001 rflFindRecord Unknown(3F9)

Probe Id :- RM738001
Application Name :- MQM
Component :- rflFindRecord
Program Name :- runmqchl
Arguments :- -c "CHL9 " -m "qm1
Major Errorcode :- Unknown(3F9)
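
The FDC header fields above can be matched against this APAR with a few lines of code. The sketch below is a hedged illustration: the function names are hypothetical, the header is assumed to use the "Name :- value" layout shown above, and the value in parentheses after "Unknown" is assumed to be hexadecimal, which is consistent with 0x3F9 being equal to the error code 1017 in the AMQ9516E message.

```python
import re


def parse_fdc_header(text: str) -> dict:
    """Parse 'Name :- value' lines of an FDC header into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":-")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_sync_record_loop_fdc(fields: dict) -> bool:
    """True when the probe and component match the FDC described in this APAR."""
    return (fields.get("Probe Id") == "RM738001"
            and fields.get("Component") == "rflFindRecord")


def major_error_code(fields: dict):
    """Return the major error code as an integer, assuming a hex value."""
    match = re.search(r"\(([0-9A-Fa-f]+)\)", fields.get("Major Errorcode", ""))
    return int(match.group(1), 16) if match else None
```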

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

Version    Maintenance Level
v9.0 LTS   9.0.0.16
v9.1 LTS   9.1.0.12
v9.2 LTS   9.2.0.7
v9.x CD    9.3.2
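
The table above can be turned into a simple check against the VRMF value reported in the error log header. This is a sketch under assumptions: the function name has_fix is hypothetical, a non-zero modification level is taken to mean a Continuous Delivery release, and later CD releases are assumed to carry the fix forward. Streams that the table does not list, such as 9.3.0.x LTS, return None rather than a guess.

```python
# Fix pack at which each listed LTS stream is targeted to receive the fix.
LTS_FIX_LEVELS = {(9, 0, 0): 16, (9, 1, 0): 12, (9, 2, 0): 7}
# First Continuous Delivery release targeted to contain the fix.
CD_FIX_LEVEL = (9, 3, 2)


def has_fix(vrmf: str):
    """Return True/False for listed streams, None when the table is silent."""
    v, r, m, f = (int(part) for part in vrmf.split("."))
    if (v, r, m) in LTS_FIX_LEVELS:
        return f >= LTS_FIX_LEVELS[(v, r, m)]
    if m > 0:  # assumption: non-zero modification level means a CD release
        return (v, r, m) >= CD_FIX_LEVEL
    return None
```

For example, the VRMF(9.1.0.7) shown in the sample error log is below the 9.1.0.12 level listed for the v9.1 LTS stream.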

The latest available maintenance can be obtained from
'WebSphere MQ Recommended Fixes'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available, information on
its planned availability can be found in 'WebSphere MQ
Planned Maintenance Release Dates'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
---------------------------------------------------------------

Temporary fix

Comments

APAR Information

APAR number                IT40017
Reported component name    IBM MQ BASE MP
Reported component ID      5724H7271
Reported release           910
Status                     CLOSED PER
PE                         NoPE
HIPER                      YesHIPER
Special Attention          NoSpecatt / Xsystem
Submitted date             2022-02-19
Closed date                2022-10-11
Last modified date         2023-02-24

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

IT43171

Fix information

Fixed component name       IBM MQ BASE MP
Fixed component ID         5724H7271

Applicable component levels

Document Information

Modified date:
24 February 2023
