Troubleshooting
Problem
This document helps you collect the data required to diagnose performance issues with IBM Order Management Software Certified Containers and share it with IBM Support.
Diagnosing The Problem
Collecting the Data
1. Review the MustGather for environment data and attach all requested diagnostics to the case.
2. Additionally, depending on the nature of your issue, provide the diagnostics requested below:
- Application server is frozen or unresponsive
- Agent or Integration server is frozen or unresponsive
- Application server crash or shut down
- Agent or Integration server crash or shut down
- Agent process is slow or messages not getting consumed
- Database Slowness or Locking
Application server is frozen or unresponsive
- If the Application Server is frozen or unresponsive, provide 3-4 sets of thread dumps collected at intervals of 20 to 30 seconds:
  - Identify the pod where the slowness is observed and exec into the pod by using:
    oc exec -it name-ibm-oms-ent-prod-appserver-om-app-5bb8bb74c6-p5tww -- /bin/bash
  - In the pod, run the ps aux command to identify the process ID. Sample output:
    bash-4.4$ ps aux
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    default 1 5.7 28.0 7458356 2233580 ? SLsl Sep03 173:25 /opt/ibm/java/jre/bin/java -javaagent:/opt/ibm/wlp/bin/tools/ws-javaagent.jar -Djava.awt.headless=true -Djdk.attach.allowAttachSelf=true -Dvendor=websphere -
    default 858 0.0 0.0 12024 2548 pts/0 Ss+ Sep04 0:00 /bin/bash
    default 1624 0.0 0.0 12024 3184 pts/1 Ss+ 14:41 0:00 /bin/bash
    default 1631 0.5 0.0 12024 3300 pts/2 Ss 14:44 0:00 /bin/bash
    default 1637 0.0 0.0 44632 3372 pts/2 R+ 14:44 0:00 ps aux
  - Now run kill -3 <pid> (kill -3 1 in the above example). The javacore.date.time.id.txt file will be generated under the log directory (specified in the values.yaml file).
- Capture Statistics data when the issue is ongoing. Provide an hour's worth of data from regular business hours or time of the issue.
- Provide two sets of logs:
- Apply VERBOSE tracing on the identified Application or API
- Apply SQLDEBUG tracing on the identified Application or API
- For additional details, refer to the video that illustrates how to put components on trace.
- Note: VERBOSE trace is not suitable for the production environment. If you can only reproduce the issue in production, consider a single-user test with UserTracing enabled instead.
- For slowness or OOM issues, please share any generated verbose garbage collection logs and heap dumps.
- If the issue is identified to be due to Database Locking, upload the details requested here.
- Restart the server to mitigate the issue.
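The thread dump steps above can also be driven from a workstation instead of an interactive shell in the pod. The following is a minimal sketch, not an IBM-provided tool: it assumes `oc` is logged in to the project that runs OMS and that you already know the JVM's PID (PID 1 in the application server sample output). The function name `collect_thread_dumps` and the `OC` override (for dry runs, for example `OC=echo`) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: request several javacore (thread dump) files from one pod.
# Assumes `oc` is logged in to the OMS project; OC may be set to another
# command (for example OC=echo) to print the oc commands instead of running them.

collect_thread_dumps() {
  pod=$1              # pod name, e.g. the appserver pod from the steps above
  pid=${2:-1}         # JVM process ID inside the pod (1 in the appserver sample)
  count=${3:-4}       # number of thread dumps (3-4 requested)
  interval=${4:-30}   # seconds between dumps (20-30 requested)
  oc_cmd=${OC:-oc}

  i=1
  while [ "$i" -le "$count" ]; do
    echo "Thread dump $i of $count: $pod, PID $pid" >&2
    # kill -3 sends SIGQUIT; the IBM JVM writes a javacore and keeps running.
    "$oc_cmd" exec "$pod" -- sh -c "kill -3 $pid" || return 1
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `collect_thread_dumps name-ibm-oms-ent-prod-appserver-om-app-5bb8bb74c6-p5tww 1 4 30` requests four javacores 30 seconds apart; the files still land in the log directory configured in values.yaml.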
Note:
- If you are able to reproduce the issue intermittently, set the following properties in values.yaml under appserver.jvm and upgrade the Helm deployment. After deployment, a healthcenter<ts>.hcd file is created every 15 minutes (the value set for -Dcom.ibm.java.diagnostics.healthcenter.headless.run.duration) in the folder specified by -Dcom.ibm.java.diagnostics.healthcenter.headless.output.directory in the application server pod.
params:
- -Xhealthcenter:level:headless
- -Dcom.ibm.java.diagnostics.healthcenter.headless.output.directory=/opt/ibm/wlp/output/defaultServer
- -Dcom.ibm.java.diagnostics.healthcenter.headless.run.duration=15
- -Dcom.ibm.java.diagnostics.healthcenter.data.profiling=off
- -Dcom.ibm.java.diagnostics.healthcenter.allocation.threshold.low=10000000
- -Dcom.ibm.java.diagnostics.healthcenter.stack.trace.depth=20
- -Dcom.ibm.java.diagnostics.healthcenter.headless.files.to.keep=0
- If you are able to reproduce slowness or OOM issues intermittently:
  - Enable verbose garbage collection and upload the logs to the case. Steps to enable verboseGC logs are documented here.
  - Enable heap dumps on user events and upload the generated heap dumps to the case. Steps to enable heap dumps are documented here. Refer to the IBM Documentation link for more details.
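The Health Center files produced by the properties above have to be copied out of the application server pod before they can be attached to the case. A minimal sketch, assuming `oc` access and the output directory used in the params above (`/opt/ibm/wlp/output/defaultServer`); `copy_hcd_files`, `HCD_DIR`, and the `OC` override are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: copy Health Center headless output (healthcenter*.hcd) out of a pod.
# Assumes `oc` is logged in to the OMS project and HCD_DIR matches the
# -Dcom.ibm.java.diagnostics.healthcenter.headless.output.directory value.
# OC may be overridden with a stub for a dry run.

HCD_DIR=${HCD_DIR:-/opt/ibm/wlp/output/defaultServer}

copy_hcd_files() {
  pod=$1
  dest=${2:-./hcd-$1}
  oc_cmd=${OC:-oc}

  mkdir -p "$dest" || return 1
  # List the output directory inside the pod and keep only Health Center files.
  files=$("$oc_cmd" exec "$pod" -- ls "$HCD_DIR" | grep '^healthcenter.*\.hcd$')
  if [ -z "$files" ]; then
    echo "No .hcd files found in $HCD_DIR on $pod" >&2
    return 1
  fi
  for f in $files; do
    "$oc_cmd" cp "$pod:$HCD_DIR/$f" "$dest/$f" || return 1
    echo "$dest/$f"
  done
}
```

Because run.duration is 15, a new file appears roughly every 15 minutes; running the copy after the issue reproduces picks up every file written so far.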
Agent or Integration server is frozen or unresponsive
- If the Agent or Integration Server threads are hung or blocking locks exist, provide 3-4 sets of thread dumps collected at intervals of 20 to 30 seconds:
  - Identify the pod where the slowness is observed and exec into the pod by using:
    oc exec -it name-ibm-oms-ent-prod-scheduleorder-784c8689df-w67gj -- /bin/bash
  - In the pod, run the ps aux command to identify the process ID. Sample output:
    bash-4.4$ ps aux
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    omsuser 1 0.0 0.0 12024 1772 ? Ss 03:17 0:00 /bin/sh /opt/ssfs/runtime/bin/agentserver.sh -jvmargs -Xms512m -Xmx1024m -Xms512m -Xmx1024m -Xhealthcenter:level:headless -Dcom.ibm.java.diagnostics.healthce
    omsuser 118 0.0 0.0 12032 1668 ? S 03:17 0:00 /bin/sh /opt/ssfs/runtime/bin/java_wrapper.sh -Xms512m -Xmx1024m -Djava.io.tmpdir=/opt/ssfs/runtime/tmp -Dfile.encoding=UTF-8 -Dnet.sf.ehcache.skipUpdateChec
    omsuser 138 0.4 6.5 5899152 522580 ? SLl 03:17 4:20 /opt/ssfs/runtime/jdk/bin/java -d64 -Xms512m -Xmx1024m -Djava.io.tmpdir=/opt/ssfs/runtime/tmp -Dfile.encoding=UTF-8 -Dnet.sf.ehcache.skipUpdateCheck=true -DI
  - Now run kill -3 <pid> (kill -3 138 in the above example). The javacore.date.time.id.txt file will be generated under the log directory (specified in the values.yaml file).
- Capture the Statistics data while the issue is ongoing. Provide an hour's worth of data from regular business hours or from the time of the issue.
- Provide two sets of logs:
- Apply VERBOSE tracing on the identified Agent or Integration Server
- Apply SQLDEBUG tracing on the identified Agent or Integration Server
- For additional details, refer to the video that illustrates how to put components on trace.
- Note: VERBOSE trace is not suitable for the production environment. If you can only reproduce the issue in production, consider a single-user test with UserTracing enabled instead.
- For slowness or OOM issues, please share any generated verbose garbage collection logs and heap dumps.
- If the issue is identified to be due to Database Locking, upload the details requested here.
- Restart the servers to mitigate the issue.
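In agent pods, PID 1 is a /bin/sh wrapper (agentserver.sh), so the PID to signal is the process whose command is the java binary (138 in the sample output above). A minimal sketch that finds that PID from `ps aux` output and sends kill -3 through `oc`; it assumes one JVM per pod and that `ps` exists in the image (it is used in the steps above). `find_java_pid`, `dump_agent_threads`, and the `OC` override are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for agent/integration server pods.
# Assumes one JVM per pod and `oc` logged in to the OMS project;
# OC may be overridden with a stub for a dry run.

# Reads `ps aux` output on stdin and prints the PID(s) whose COMMAND
# (field 11) is a java binary, skipping the header line.
find_java_pid() {
  awk 'NR > 1 && $11 ~ /\/bin\/java$/ { print $2 }'
}

# Finds the JVM in the pod and asks it for one javacore.
dump_agent_threads() {
  pod=$1
  oc_cmd=${OC:-oc}
  pid=$("$oc_cmd" exec "$pod" -- ps aux | find_java_pid | head -n 1)
  if [ -z "$pid" ]; then
    echo "No java process found in $pod" >&2
    return 1
  fi
  echo "Sending kill -3 to PID $pid in $pod" >&2
  "$oc_cmd" exec "$pod" -- sh -c "kill -3 $pid"
}
```

Call `dump_agent_threads name-ibm-oms-ent-prod-scheduleorder-784c8689df-w67gj` 3-4 times, 20 to 30 seconds apart, to get the requested sets.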
Note:
- If you are able to reproduce the issue intermittently, set the following properties in values.yaml under omserver.common.jvmArgs or omserver.servers.jvmArgs and upgrade the Helm deployment. After deployment, a healthcenter<ts>.hcd file is created every 15 minutes (the value set for -Dcom.ibm.java.diagnostics.healthcenter.headless.run.duration) in the folder specified by -Dcom.ibm.java.diagnostics.healthcenter.headless.output.directory in the agent or integration server pod.
jvmArgs: "-Xms512m -Xmx1024m -Xhealthcenter:level:headless -Dcom.ibm.java.diagnostics.healthcenter.headless.output.directory=/shared/agents/$(OM_POD_NAME)/hcd -Dcom.ibm.java.diagnostics.healthcenter.headless.run.duration=15 -Dcom.ibm.java.diagnostics.healthcenter.data.profiling=off -Dcom.ibm.java.diagnostics.healthcenter.allocation.threshold.low=10000000 -Dcom.ibm.java.diagnostics.healthcenter.stack.trace.depth=20 -Dcom.ibm.java.diagnostics.healthcenter.headless.files.to.keep=0"
- If you are able to reproduce slowness or OOM issues intermittently:
  - Enable verbose garbage collection and upload the logs to the case. Steps to enable verboseGC logs are documented here.
  - Enable heap dumps on user events and upload the generated heap dumps to the case. Steps to enable heap dumps are documented here. Refer to the IBM Documentation link for more details.
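With the jvmArgs above, each agent or integration server pod writes its Health Center files to /shared/agents/<pod name>/hcd, so the files from all pods can be bundled in one step. A minimal sketch, assuming it runs where the shared volume is mounted (for example inside an OMS pod) and that `tar` supports `-z`; `bundle_agent_hcd` and `SHARED_AGENTS_DIR` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: bundle Health Center (.hcd) files written by agent and
# integration server pods into one archive for the case.
# Assumes the shared volume is mounted where this runs
# (SHARED_AGENTS_DIR defaults to /shared/agents) and tar supports -z.

SHARED_AGENTS_DIR=${SHARED_AGENTS_DIR:-/shared/agents}

bundle_agent_hcd() {
  archive=${1:-agent-hcd.tar.gz}
  # Relative paths such as ./<pod>/hcd/<file>.hcd keep the pod name in the archive.
  list=$(cd "$SHARED_AGENTS_DIR" && find . -path './*/hcd/*.hcd' -type f | sort)
  if [ -z "$list" ]; then
    echo "No .hcd files found under $SHARED_AGENTS_DIR" >&2
    return 1
  fi
  printf '%s\n' "$list" >&2
  # Word splitting of $list is intended; the generated file names contain no spaces.
  (cd "$SHARED_AGENTS_DIR" && tar -czf - $list) > "$archive"
}
```

Keeping the pod directory in each archive entry lets IBM Support tell which agent each .hcd file came from.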
Application server crash or shut down
A server crash can be caused by an Out of Memory (OOM) error when Java heap space is exhausted, or by a StackOverflowError.
- If the crash is due to Java heap space (OOM), share the generated verboseGC logs and heap dumps.
- If the crash is due to a StackOverflowError, share the server error logs and thread dumps.
- Also, upload the Application Server VERBOSE logs. For additional details, refer to the video that illustrates how to put components on trace.
- Note: VERBOSE trace is not suitable for the production environment. If you can only reproduce the issue in production, consider a single-user test with UserTracing enabled instead.
Agent or Integration server crash or shut down
A server crash can be caused by an Out of Memory (OOM) error when Java heap space is exhausted, or by a StackOverflowError.
- If the crash is due to Java heap space (OOM), share the generated verboseGC logs and heap dumps.
- If the crash is due to a StackOverflowError, share the server error logs and thread dumps.
- Also, upload the Agent or Integration Server VERBOSE logs. For additional details, refer to the video that illustrates how to put components on trace.
- Note: VERBOSE trace is not suitable for the production environment. If you can only reproduce the issue in production, consider a single-user test with UserTracing enabled instead.
Agent process is slow or messages not getting consumed
- Apply SQLDEBUG tracing on the related Agent or Integration Server for 5 minutes and upload the logs.
- An AWR report (Oracle) or the output of the oms_db2collect_v2.sh script (DB2). Details are specified here.
- Verbose Garbage Collection logs
Database Slowness or Locking
1. Capture Statistics data when the issue is ongoing. Provide an hour's worth of data from regular business hours or time of the issue.
2. Follow the database specific steps as applicable:
For DB2:
Run the oms_db2collect_v2.sh script to gather DB2 configuration and performance data. The collection is lightweight and provides a high-level overview of the database performance.
- Copy the oms_db2collect_v2.sh script to the DB2 server and grant it appropriate read, write, and execute permissions.
- Run the script. Usage: ./oms_db2collect_v2.sh <dbname>
- The script generates db2collect.<timestamp>.zip in the folder where it is run.
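The DB2 steps above (set permissions, run, pick up the archive) can be wrapped as follows. This is a minimal sketch, assuming it runs on the DB2 server, in the directory that holds the script, as a user that can connect to the database; `run_db2collect` is a hypothetical name and the script's own behavior is unchanged.

```shell
#!/bin/sh
# Hypothetical wrapper: set permissions, run the DB2 collection script,
# and print the archive it produced.
# Assumes it runs on the DB2 server, in the directory holding the script,
# as a user that can connect to the database.

run_db2collect() {
  dbname=$1
  script=${2:-./oms_db2collect_v2.sh}
  if [ -z "$dbname" ]; then
    echo "usage: run_db2collect <dbname> [script]" >&2
    return 2
  fi
  chmod u+rwx "$script" || return 1
  "$script" "$dbname" || return 1
  # The script writes db2collect.<timestamp>.zip into the current directory;
  # print the newest one so it can be attached to the case.
  ls -t db2collect.*.zip 2>/dev/null | head -n 1
}
```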
For Oracle:
- AWR Report
- If the issue is intermittent and there are significant blocking locks, perform the following:
  - Set yfs.yfs.app.identifyconnection=Y in the customer_overrides.properties file.
  - Run the SQL provided in Blocking_Lock_SQLs.txt 3-4 times, 1 minute apart, while the blocking locks are occurring. This SQL provides complete details on all JVMs contributing to the blocking locks.
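Running the blocking-lock SQL 3-4 times, 1 minute apart, is easy to get wrong by hand during an incident. A minimal sketch, assuming SQL*Plus is on the PATH, a hypothetical `ORACLE_CONN` variable holds a connect string such as user/password@service, and Blocking_Lock_SQLs.txt contains complete statements terminated with `;`. `capture_blocking_locks`, `run_sql_file`, and the `SQL_RUNNER` override are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: run the blocking-lock SQL several times, one minute apart,
# saving each run to its own file for upload.
# Assumes SQL*Plus is on the PATH, ORACLE_CONN (hypothetical) holds a connect
# string, and the SQL file contains complete statements terminated with ';'.
# SQL_RUNNER may be overridden (for example with a stub for a dry run).

run_sql_file() {
  # SQL*Plus reads the statements from stdin and exits at end of input.
  sqlplus -s "$ORACLE_CONN" < "$1"
}

capture_blocking_locks() {
  sql_file=${1:-Blocking_Lock_SQLs.txt}
  runs=${2:-4}        # 3-4 runs
  interval=${3:-60}   # 1 minute apart
  runner=${SQL_RUNNER:-run_sql_file}
  i=1
  while [ "$i" -le "$runs" ]; do
    out="blocking_locks.$(date +%Y%m%d.%H%M%S).run$i.txt"
    "$runner" "$sql_file" > "$out" 2>&1 || echo "run $i failed; see $out" >&2
    echo "$out"
    if [ "$i" -lt "$runs" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

Start it while the blocking locks are present, after yfs.yfs.app.identifyconnection=Y is in effect, and attach every printed file to the case.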
References:
OMS Statistics Data
Below are sample SQL queries that you can adapt to the time window of your issue.
DB2
select statistics_detail_key, start_time_stamp, end_time_stamp,
server_name, server_id, hostname, service_name, service_type, context_name, statistic_name, statistic_value
from yfs_statistics_detail
where statistics_detail_key like '2021082514%'
Oracle
select statistics_detail_key, to_char(start_time_stamp, 'YYYY-MM-DD HH24:mi:SS'), to_char(end_time_stamp, 'YYYY-MM-DD HH24:mi:SS'),
server_name, server_id, hostname, service_name, service_type, context_name, statistic_name, statistic_value
from yfs_statistics_detail
where statistics_detail_key like '2021082514%'
order by statistics_detail_key
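The samples filter one hour with a key prefix. A minimal sketch that prints either query for a chosen hour, under an assumption inferred only from the sample `'2021082514%'` (2021-08-25, 14:00 to 14:59): that statistics_detail_key begins with a YYYYMMDDHH timestamp. `stats_key_prefix` and `stats_sql` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: print the sample statistics query for one hour of data.
# Assumption (inferred from the sample '2021082514%'): statistics_detail_key
# begins with a YYYYMMDDHH timestamp, so one LIKE prefix selects one hour.

stats_key_prefix() {
  day=$1    # YYYY-MM-DD
  hour=$2   # two-digit hour, 00-23
  case "$day" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "day must be YYYY-MM-DD" >&2; return 1 ;;
  esac
  case "$hour" in
    [01][0-9]|2[0-3]) ;;
    *) echo "hour must be 00-23" >&2; return 1 ;;
  esac
  year=${day%%-*}
  rest=${day#*-}
  month=${rest%%-*}
  dom=${rest#*-}
  printf '%s%s%s%s\n' "$year" "$month" "$dom" "$hour"
}

stats_sql() {
  db=$1
  prefix=$(stats_key_prefix "$2" "$3") || return 1
  case "$db" in
    oracle)
      cat <<EOF
select statistics_detail_key, to_char(start_time_stamp, 'YYYY-MM-DD HH24:mi:SS'), to_char(end_time_stamp, 'YYYY-MM-DD HH24:mi:SS'),
server_name, server_id, hostname, service_name, service_type, context_name, statistic_name, statistic_value
from yfs_statistics_detail
where statistics_detail_key like '${prefix}%'
order by statistics_detail_key;
EOF
      ;;
    db2)
      cat <<EOF
select statistics_detail_key, start_time_stamp, end_time_stamp,
server_name, server_id, hostname, service_name, service_type, context_name, statistic_name, statistic_value
from yfs_statistics_detail
where statistics_detail_key like '${prefix}%';
EOF
      ;;
    *) echo "database must be oracle or db2" >&2; return 1 ;;
  esac
}
```

For example, `stats_sql oracle 2021-08-25 14 > stats.sql` reproduces the Oracle sample above with a trailing semicolon.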
Generating verboseGC logs
Application Server
1. Add the following JVM arguments under appserver.jvm.params of the values.yaml file
- -verbose:gc
- -Xverbosegclog:/shared/logs/$(OM_POD_NAME)/gclogs/%Y%m%d.%H%M%S.%pid.txt
- -Xgcpolicy:gencon
2. Update the deployment
verboseGC logs will be generated in the path specified in the -Xverbosegclog argument.
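After the deployment is updated, it is worth confirming that the running JVM actually received the new options before waiting for logs. A minimal sketch that reads the process command line from /proc inside the pod (one argument per line) and checks option prefixes; it assumes Linux containers and that the JVM is PID 1 in the application server pod, as in the ps sample earlier. `jvm_args`, `check_args`, and the `OC` override are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: confirm that a running JVM picked up the verbose GC options.
# Assumes `oc` is logged in, Linux containers with /proc, and PID 1 is the JVM
# in the application server pod; OC may be overridden for a dry run.

# Prints the JVM's arguments, one per line (cmdline is NUL-separated).
jvm_args() {
  pod=$1
  pid=${2:-1}
  "${OC:-oc}" exec "$pod" -- cat "/proc/$pid/cmdline" | tr '\0' '\n'
}

# Reads one argument per line on stdin; reports each expected prefix as
# found or missing, and returns non-zero if any is missing.
check_args() {
  args=$(cat)
  missing=0
  for want in "$@"; do
    if printf '%s\n' "$args" | awk -v p="$want" 'index($0, p) == 1 { found = 1 } END { exit !found }'; then
      echo "found:   $want"
    else
      echo "missing: $want"
      missing=1
    fi
  done
  return "$missing"
}
```

For example: `jvm_args name-ibm-oms-ent-prod-appserver-om-app-5bb8bb74c6-p5tww | check_args -verbose:gc -Xverbosegclog: -Xgcpolicy:gencon`.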
Agent Server
1. Add the following JVM arguments under omserver.common.jvmArgs or omserver.servers.jvmArgs of the values.yaml file:
jvmArgs: "-Xms512m -Xmx1024m -verbose:gc -Xverbosegclog:/shared/agents/$(OM_POD_NAME)/gclogs/%Y%m%d.%H%M%S.%pid.txt -Xgcpolicy:gencon"
2. Update the deployment
verboseGC logs will be generated in the path specified in the -Xverbosegclog argument.
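Both -Xverbosegclog paths above point at the shared volume (logs/<pod>/gclogs for application servers, agents/<pod>/gclogs for agent and integration servers), so the GC logs from every pod can be packaged together. A minimal sketch, assuming it runs where that volume is mounted at /shared and `tar` supports `-z`; `bundle_gc_logs` and `SHARED_DIR` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: package verbose GC logs from the shared volume for upload.
# Assumes the volume is mounted at SHARED_DIR (default /shared) where this runs,
# with logs/<pod>/gclogs for application servers and agents/<pod>/gclogs for
# agent and integration servers, as in the -Xverbosegclog paths.

SHARED_DIR=${SHARED_DIR:-/shared}

bundle_gc_logs() {
  archive=${1:-gclogs.tar.gz}
  # Relative paths keep the pod name in each entry, e.g. agents/<pod>/gclogs/<file>.
  list=$(cd "$SHARED_DIR" && find logs agents -path '*/gclogs/*' -type f 2>/dev/null | sort)
  if [ -z "$list" ]; then
    echo "No GC logs found under $SHARED_DIR" >&2
    return 1
  fi
  # Word splitting of $list is intended; the generated file names contain no spaces.
  (cd "$SHARED_DIR" && tar -czf - $list) > "$archive"
}
```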
Enabling Heap Dump generation on User Events
Application Server
1. Add the following JVM arguments under appserver.jvm.params of the values.yaml file
- -Xdump:heap+java:events=user
- -XX:HeapDumpPath=$(LOG_DIR)
2. Update the deployment
Heap dumps will now be generated in the path specified in the -XX:HeapDumpPath argument.
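With -Xdump:heap+java:events=user, the "user" event is the SIGQUIT signal, so the same kill -3 <pid> used for thread dumps now writes a heap dump (a .phd file by default on the IBM JVM) as well as a javacore. A minimal sketch that sends the signal and waits for a new .phd file; it assumes `oc` access and that you pass the -XX:HeapDumpPath directory as it appears inside the pod. `count_phd`, `trigger_heap_dump`, and the `OC` override are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: trigger a user-event heap dump and wait for the .phd file.
# Assumes the -Xdump:heap+java:events=user option is active, `oc` is logged in,
# and dump_dir is the -XX:HeapDumpPath directory inside the pod.
# OC may be overridden with a stub for a dry run.

# Prints how many .phd files the directory in the pod currently holds.
count_phd() {
  "${OC:-oc}" exec "$1" -- sh -c "ls '$2' 2>/dev/null | grep -c '\.phd\$'"
}

trigger_heap_dump() {
  pod=$1
  pid=${2:-1}          # JVM PID in the pod (1 for the application server)
  dump_dir=$3
  timeout=${4:-300}    # seconds to wait; allow extra time for large heaps
  before=$(count_phd "$pod" "$dump_dir")
  before=${before:-0}
  # SIGQUIT is the "user" event: it now produces a javacore and a heap dump.
  "${OC:-oc}" exec "$pod" -- sh -c "kill -3 $pid" || return 1
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    now=$(count_phd "$pod" "$dump_dir")
    if [ "${now:-0}" -gt "$before" ]; then
      echo "New heap dump in $dump_dir on $pod"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "No new .phd file after ${timeout}s" >&2
  return 1
}
```

The new file appearing does not guarantee the JVM has finished writing it; wait for its size to stop changing before copying it off the pod.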
Agent Server
1. Add the following JVM arguments under omserver.common.jvmArgs or omserver.servers.jvmArgs of the values.yaml file:
jvmArgs: "-Xdump:heap+java:events=user -XX:HeapDumpPath=/shared/agents/$(OM_POD_NAME)"
2. Update the deployment
Heap dumps will now be generated in the path specified in the -XX:HeapDumpPath argument.
How to submit diagnostic data to IBM Support
After you have collected the preceding information and the case is opened, refer to How to submit diagnostic data to IBM Support for details on uploading it.
Document Location
Worldwide
Product: Sterling Order Management 10.0.0 (Platform Independent); Category: Install and Deploy
Document Information
Modified date:
01 October 2021
UID
ibm16486387