IBM Support

Troubleshooting Workload Management (WLM) problems in WebSphere Application Server

Troubleshooting


Problem

This document provides troubleshooting steps for Workload Management (WLM) problems in IBM® WebSphere® Application Server. Working through it should help you address common issues with this component before calling IBM Support, and save you time.

Resolving The Problem


First steps to determine the problem

Workload management problems can occur at various stages depending on your configuration. The following questions are an update of the questions that are discussed in Redbook 4308 (section 1.3 on page 5). For details, please check document:

WebSphere Application Server V6.1: Workload Management Problem Determination

Some additional hints are also available in Redbook:
Approach to Problem Determination in WebSphere Application Server v6

Both Redbooks are still valid for WebSphere Application Server versions 7.0, 8.0, 8.5, and 9.0.

The following questions should help you to determine the component in question:

1. Are you using a clustered environment?
If it is clustered, how does the workload currently behave, and how do you expect it to behave?

2. On which exact component do you notice that the workload does not behave as expected?

Does the problem occur in the Web server plug-in or in the application servers? Make sure that the Web server plug-in is working correctly and spreading work across the application servers as intended. If it is, continue by verifying that requests for EJBs are being distributed among the application servers in the cluster as you intended.

If the Web server plug-in is not working correctly, please check the document:
MustGather: WebSphere Application Server HTTP plug-in problems

3. If you determine that there is no problem with either WLM or the steps mentioned above, check whether the problem is with the High Availability Manager:
MustGather: High Availability (HA) and the High Availability Manager (HAM)


Still having problems with WLM? Please check the following:

1. Cell names must be unique if any kind of communication happens between different cells, such as remote EJB calls.

2. The High Availability (HA) Manager service must be enabled on all servers.


    2.1) Look in SystemOut for the HMGR0005I message on every server. If you see that a "Single Server DCS Core Stack" has been started, it means that HAM has been disabled.
    HMGR0005I: The Single Server DCS Core Stack transport has been started for core group
    2.2) Additionally, for environments v7 and up, look in SystemOut for:

    HMGR0010I: The High Availability Manager has been disabled.

    2.3) If SystemOut has not been provided and full config is available, look at the 'enable' field in hamanagerservice.xml for each server. Usually found in '..../profiles/nodename/config/cells/cellname/nodes/nodename/servers/servername'

    Note: The HMGR0011I message is still logged even if the HA Manager has been disabled on that JVM.

    HMGR0011I: The High Availability Manager is configured to be the bulletin board provider.
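The log checks in steps 2.1 and 2.2 can be automated. The following is a minimal sketch, assuming the SystemOut lines are available as plain text; the function name and the dictionary of markers are hypothetical helpers, and only the message IDs and phrases come from the messages quoted above. HMGR0011I is deliberately ignored because, as noted, it appears whether or not the HA Manager is disabled.

```python
# Message IDs and phrases are taken from the messages quoted in this document.
# The helper name and data structure are illustrative assumptions.
DISABLED_MARKERS = {
    "HMGR0005I": "Single Server DCS Core Stack",  # single-server stack means HAM is disabled
    "HMGR0010I": "has been disabled",             # explicit message on v7 and up
}


def ham_disabled_evidence(log_lines):
    """Return the log lines that indicate the HA Manager is disabled."""
    hits = []
    for line in log_lines:
        for msg_id, phrase in DISABLED_MARKERS.items():
            if msg_id in line and phrase in line:
                hits.append(line.rstrip("\n"))
                break  # one match per line is enough
    return hits


sample = [
    "HMGR0005I: The Single Server DCS Core Stack transport has been started for core group",
    "HMGR0011I: The High Availability Manager is configured to be the bulletin board provider.",
    "HMGR0010I: The High Availability Manager has been disabled.",
]

for hit in ham_disabled_evidence(sample):
    print(hit)
```

To check a real log, pass an open file object instead of the sample list, and repeat the check for every server in the cell.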


3. Is the High Availability (HA) view stable?

Look for the DCSV8050I message in the SystemOut logs. Each application server, node agent, and deployment manager maintains a view of which members are available and joined to the HA view. The message looks like this:



[5/8/15 3:53:54:154 EDT] 00000040 CoreGroupMemb I DCSV8050I: DCS Stack DefaultCoreGroup at Member APAR_DEV\tss5l731n1_0_dmp\dmgr: New view installed, identifier (5:0.APAR_DEV\tss5l731n1_0_asp2\nodeagent), view size is 4 (AV=4, CD=4, CN=4, DF=11)

The 4 fields are:

AV - number of members currently in the view
CD - on member denied list but want to connect
CN - members supposed to be in view and are connected
DF - total number of members defined in coregroup.xml

For the HA view to be considered stable, the first three values (AV, CD, and CN) should be equal. That value also represents the number of members available in the core group.
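The stability check can be scripted when many logs need to be reviewed. The sketch below assumes the DCSV8050I message always ends with the `view size is N (AV=…, CD=…, CN=…, DF=…)` layout shown in the sample line above; the regular expression and the function names are illustrative, not part of any IBM tool.

```python
import re

# Pattern derived from the sample DCSV8050I line in this document (an assumption
# about the general layout, not an official message grammar).
VIEW_RE = re.compile(
    r"DCSV8050I:.*?view size is (?P<size>\d+) "
    r"\(AV=(?P<AV>\d+), CD=(?P<CD>\d+), CN=(?P<CN>\d+), DF=(?P<DF>\d+)\)"
)


def parse_view(line):
    """Extract the view counters from a DCSV8050I line, or return None."""
    match = VIEW_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def is_stable(view):
    """A view is treated as stable when AV, CD, and CN are all equal."""
    return view["AV"] == view["CD"] == view["CN"]


sample = (
    r"[5/8/15 3:53:54:154 EDT] 00000040 CoreGroupMemb I DCSV8050I: DCS Stack "
    r"DefaultCoreGroup at Member APAR_DEV\tss5l731n1_0_dmp\dmgr: New view installed, "
    r"identifier (5:0.APAR_DEV\tss5l731n1_0_asp2\nodeagent), view size is 4 "
    r"(AV=4, CD=4, CN=4, DF=11)"
)

view = parse_view(sample)
print(view, "stable" if is_stable(view) else "NOT stable")
```

Because each member logs its own view, it is usually the most recent DCSV8050I entry in each SystemOut log that matters for this check.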

4. What dependency does WLM have on the HA Manager?
    4.1) WLM depends on the HA Manager component's BulletinBoard function to propagate cluster data changes and updates for the core group.

    4.2) The HA Manager environment must be functioning correctly. For example, if the HA views are not stable, propagation of cluster information to the node agents and application servers may be prevented.

    4.3) If there are multiple core groups, they must be bridged so that EJB routing works properly. Many customers "get away with" not bridging, but they may eventually hit a situation where routing fails.

5. Is the workload not balanced correctly?

WLM does not manage load, despite the name. WLM balances requests in the form of method calls/invocations. If the requests drive varying load on the servers, then you may correctly see that the load, as measured by CPU consumption for example, is not balanced. The "pattern problem" occurs when you have an even number of members and an even number of method calls, such as "create" and "invoke". For example, with 2 members the pattern could be that all the lightweight create requests execute on one server, and all the heavyweight invoke requests end up on the other server. In that case the "load" on the servers (measured in CPU utilization) is not equal.

A workaround for this problem is to set the weights of the cluster members to unequal values. Cluster weights of 19 and 23 are typically recommended for normalization.

If you are observing uneven load distribution among an even number of cluster members, it is recommended to switch to an odd number of members. Note that an odd number of cluster members can also experience pattern problems, depending on the number of method calls.
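The pattern problem can be illustrated with a small simulation. This is a deliberately simplified model, assuming strict round-robin routing with equal weights and a single client that repeats a fixed call sequence; it is not WLM's actual routing code and does not model cluster weights. The `simulate` function and the third call name `remove` are hypothetical, introduced only for the illustration.

```python
from collections import Counter
from itertools import cycle, islice


def simulate(member_count, call_pattern, total_calls):
    """Route a repeating call pattern round-robin: call i goes to member i % member_count."""
    per_member = [Counter() for _ in range(member_count)]
    calls = islice(cycle(call_pattern), total_calls)
    for i, call in enumerate(calls):
        per_member[i % member_count][call] += 1
    return per_member


# Two members, two alternating calls: each member only ever sees one call type.
two_members = simulate(2, ["create", "invoke"], 12)

# Three members, same two calls: every member sees a mix of both call types.
three_members = simulate(3, ["create", "invoke"], 12)

# Three members, three repeating calls: the pattern problem returns.
three_calls = simulate(3, ["create", "invoke", "remove"], 12)

for label, result in [
    ("2 members / 2 calls", two_members),
    ("3 members / 2 calls", three_members),
    ("3 members / 3 calls", three_calls),
]:
    print(label, [dict(counts) for counts in result])
```

In this model the request counts per member are always equal, yet in the first and third cases each member receives only one kind of call, so the CPU cost per member differs whenever the calls differ in weight.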

If you notice that EJB requests always go to the same server, please check the "Prefer Local" setting. For details, please check the document:
Everything you always wanted to know about WebSphere Application Server, but were afraid to ask


6. For additional troubleshooting information, please check the document:
Troubleshooting WLM issues in WebSphere Application Server


Troubleshooting WLM related Exceptions and Symptoms
Please check the following technote to learn more about the different kinds of WLM exceptions:
Troubleshooting: Workload Management Problems

What to do next?

If you still have an undiagnosed problem after going through this process, we recommend that you open a ticket with IBM Support and collect the data mentioned in the "Collecting data" section.

Document Information

Modified date:
15 June 2018

UID

swg21993688