Troubleshooting Workload Management (WLM) problems in WebSphere Application Server

Troubleshooting

Problem

This document helps you troubleshoot Workload Management (WLM) problems in IBM® WebSphere® Application Server. Working through it can resolve common issues with this component before you call IBM Support, and save you time.

Resolving The Problem


First steps to determine the problem

Workload management problems can occur at various stages, depending on your configuration. The following questions update the ones discussed in Redbook 4308 (section 1.3, page 5). For details, see the Redbook:

WebSphere Application Server V6.1: Workload Management Problem Determination

Additional hints are available in the Redbook:
Approach to Problem Determination in WebSphere Application Server v6

Both Redbooks remain valid for WebSphere Application Server versions 7.0, 8.0, 8.5, and 9.0.

The following questions should help you to determine the component in question:

1. Are you using a clustered environment?
If so, how does the workload currently behave, and how do you expect it to behave?

2. On which exact component does the workload not behave as expected?

Does the problem occur in the Web server plug-in or in the application servers? Make sure that the Web server plug-in is working correctly and spreading work across the application servers as intended. If it is, continue by verifying that requests for EJBs are being distributed among the application servers in the cluster as you intended.

If the plug-in is not working correctly, see the document:
MustGather: WebSphere Application Server HTTP plug-in problems

3. If you find no problem with either WLM or the components checked in the steps above, check whether the problem is with the High Availability Manager:
MustGather: High Availability (HA) and the High Availability Manager (HAM)

Still having problems with WLM? Check the following:

1. Cell names must be unique if any kind of communication happens between different cells, such as remote EJB calls.

2. The High Availability (HA) Manager service must be enabled on all servers.

Look for a message like the following in the SystemOut logs:

HMGR0005I: The Single Server DCS Core Stack transport has been started for core group
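
The check for the HMGR0005I message can be scripted across servers. The following is a minimal sketch, not an IBM tool: it assumes you have already read each server's SystemOut log into a string, and the function name, server names, and sample log text are hypothetical.

```python
def servers_missing_ha_start(logs):
    """Return the names of servers whose log text contains no HMGR0005I message.

    `logs` maps a server name to the text of its SystemOut log (assumed input).
    """
    return sorted(name for name, text in logs.items() if "HMGR0005I" not in text)


# Hypothetical sample data for illustration only.
logs = {
    "server1": "HMGR0005I: The Single Server DCS Core Stack transport has been started for core group",
    "server2": "WSVR0001I: Server server2 open for e-business",
}

print(servers_missing_ha_start(logs))  # ['server2']
```

A server reported by this check either has the HA Manager service disabled or has a log that no longer covers server startup, so confirm against the configuration before drawing conclusions.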

3. Is the High Availability (HA) view stable?

Look for the DCSV8050I message in the SystemOut logs. Each application server, node agent, and deployment manager maintains a view of which members are available and joined to the HA view. The message looks like:

[5/8/15 3:53:54:154 EDT] 00000040 CoreGroupMemb I DCSV8050I: DCS Stack DefaultCoreGroup at Member APAR_DEV\tss5l731n1_0_dmp\dmgr: New view installed, identifier (5:0.APAR_DEV\tss5l731n1_0_asp2\nodeagent), view size is 4 (AV=4, CD=4, CN=4, DF=11)

The four fields are:

AV - number of members currently in the view
CD - connected members that are not on the denied list (CN minus denied members)
CN - members supposed to be in view and are connected
DF - total number of members defined in coregroup.xml

For the HA view to be considered stable, the first three values (AV, CD, and CN) must be equal. That common value is also the number of members available in the core group.
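
The stability rule can be checked mechanically. The following is a minimal sketch, assuming the counters always appear in the form shown in the sample message; the function names and the regular expression are illustrative and not part of any WebSphere tooling.

```python
import re

# Matches the counters at the end of a DCSV8050I line, e.g. "(AV=4, CD=4, CN=4, DF=11)".
VIEW_RE = re.compile(r"DCSV8050I:.*?\(AV=(\d+), CD=(\d+), CN=(\d+), DF=(\d+)\)")


def parse_view(line):
    """Return the four counters as a dict, or None if the line is not a DCSV8050I view message."""
    match = VIEW_RE.search(line)
    if match is None:
        return None
    av, cd, cn, df = (int(group) for group in match.groups())
    return {"AV": av, "CD": cd, "CN": cn, "DF": df}


def is_stable(view):
    """Apply the rule from the text: AV, CD, and CN must all be equal."""
    return view["AV"] == view["CD"] == view["CN"]


# The sample message from this document, as raw strings so backslashes stay literal.
sample = (
    r"[5/8/15 3:53:54:154 EDT] 00000040 CoreGroupMemb I DCSV8050I: DCS Stack "
    r"DefaultCoreGroup at Member APAR_DEV\tss5l731n1_0_dmp\dmgr: New view installed, "
    r"identifier (5:0.APAR_DEV\tss5l731n1_0_asp2\nodeagent), view size is 4 "
    r"(AV=4, CD=4, CN=4, DF=11)"
)

view = parse_view(sample)
print(view, is_stable(view))  # {'AV': 4, 'CD': 4, 'CN': 4, 'DF': 11} True
```

Running the same check on the most recent DCSV8050I line from each process shows whether all members agree on a stable view.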

4. What dependency does WLM have on the HA Manager?

5. Is the workload not balanced correctly?

Despite its name, WLM does not manage load. WLM balances requests in the form of method calls (invocations). If the requests drive varying load on the servers, you may correctly see that the load, as measured by CPU consumption for example, is not balanced. The "pattern problem" occurs when you have an even number of members and an even number of method calls, such as "create" and "invoke". For example, with 2 members, all the lightweight create requests might execute on one server while the heavyweight invoke requests end up on the other. In that case the load on the servers, measured in CPU utilization, is not equal.

A workaround is to set the cluster member weights to non-equal values. Weights of 19 and 23 are typically recommended for normalization.

If you observe uneven load distribution among an even number of cluster members, consider switching to an odd number of members. Note that an odd number of cluster members can also experience pattern problems, depending on the number of method calls.
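
The pattern problem can be illustrated with a small simulation. This is a simplified model that assumes strict round-robin dispatch with equal weights and a client that repeats a fixed sequence of calls; it is not the actual WLM routing algorithm, and the "remove" call is a hypothetical third method added for illustration.

```python
from collections import Counter
from itertools import cycle


def simulate(num_members, call_pattern, num_requests):
    """Dispatch requests strictly round-robin and count call types per member."""
    counts = [Counter() for _ in range(num_members)]
    calls = cycle(call_pattern)
    for i in range(num_requests):
        counts[i % num_members][next(calls)] += 1
    return counts


# Two members, two alternating calls: each member sees only one call type.
two = simulate(2, ["create", "invoke"], 600)

# Three members, two alternating calls: every member sees an even mix.
three = simulate(3, ["create", "invoke"], 600)

# Three members, three repeating calls: the pattern problem returns.
three_calls = simulate(3, ["create", "invoke", "remove"], 600)

for label, result in [("2 members", two), ("3 members", three), ("3 members, 3 calls", three_calls)]:
    print(label, [dict(c) for c in result])
```

In this model, a member receives a single call type whenever the number of members and the length of the call sequence share a common factor that lines them up, which is why an odd member count helps with two alternating calls but not with three.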

If EJB requests always go to the same server, check the "Prefer Local" setting. For details, see the document:
Everything you always wanted to know about WebSphere Application Server, but were afraid to ask

6. For additional troubleshooting information, see the document:
Troubleshooting WLM issues in WebSphere Application Server

Troubleshooting WLM related Exceptions and Symptoms
See the following technote to learn more about the different kinds of WLM exceptions:
Troubleshooting: Workload Management Problems

What to do next?

If you still have an undiagnosed problem after going through this process, open a ticket with IBM Support and collect the data mentioned in the "Collecting data" section.

Related Information

Troubleshooting WLM issues in WAS

WLM component troubleshooting tips
