Checking for ENQ contention problems

z/OS MVS Planning: Global Resource Serialization, SA23-1389-00
When workload slows down and global resource serialization appears to be operating normally, the problem is often that some part of the workload is dominating ENQ resources in the sysplex. Because many workloads require exclusive access to resources (for example, to update a file), resource contention occurs between different parts of the workload when incompatible requests are made for the same resources. By itself, resource contention is not a sign of a problem. However, contention that persists for a long time among the same resources and requesters might indicate a problem. Global resource serialization provides diagnostic commands to help determine the source of contention.
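As a quick reference, these are the operator commands this topic uses. The one-line descriptions here are paraphrased; see z/OS MVS System Commands for the complete syntax and output formats:

```
DISPLAY GRS,C                    Display outstanding ENQ contention
DISPLAY GRS,ANALYZE,BLOCKER      Rank units of work by how long they have blocked others
DISPLAY GRS,ANALYZE,WAITER       Rank units of work by how long they have waited
DISPLAY GRS,ANALYZE,DEPENDENCY   Follow each waiter's chain of blockers to its end
```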
To illustrate how to use contention analysis, consider the following example. In this example, a three-system sysplex is made up of systems PROD1, PROD2, and TEST.

Figure 1. Example of Three System Sysplex
In the scenario, four different units of work are affected: the production job (PRODJOB) on PROD1; the cleanup job (CLEANUP) that it submits; a task in the master scheduler address space (*MASTER*) on PROD1; and your TSO user session (SYSPROG).
The production job on PROD1 is a multistep process that submits the cleanup job, which is to run after the production run completes. The cleanup job is kept from running by the exclusive data set ENQ [SYSDSN, PROD.DB] owned by the production job.

The scenario begins with the production job running. It reaches the step where it submits the cleanup job. The cleanup job initiates but is blocked in allocation on the global ENQ for [SYSDSN, PROD.DB]. However, as part of allocation, it takes exclusive ownership of [SYSDSN, PROD.PROCS]. The current view of contention is displayed in the following figure. In the figure, units of work are represented by rectangles and resources are represented by ovals. The arrow and text from the unit of work to the resource represent the dependency.

Figure 2. Current View of Contention
While the production job is executing, a task in the master scheduler address space (*MASTER*) fails, but does not end, while holding the system command resource, [SYSIEFSD, Q10]. This resource is required by tasks that need to issue MVS™ system commands. Following this failure, the production job invokes the MGCRE macro to issue a system command. Because the command resource is held indefinitely by the failed *MASTER* task, ownership cannot be granted to PRODJOB. Contention for ENQ resources now looks like the following figure.

Figure 3. Contention for ENQ Resources
You discover that there is a problem with the production database; the production job and the cleanup job seem to be hung. Interactive requests for the database fail with an indication that the database is unavailable. You run an exec from your TSO session that attempts to allocate both the production database, [SYSDSN, PROD.DB], and the production procedures library, [SYSDSN, PROD.PROCS]. This, of course, hangs the TSO session in an ENQ wait. The final state of contention is as follows:

Figure 4. Final State of Contention
Note that on a normal system there is always some level of "background" contention. The preceding example ignores that background contention and displays only the contention that applies to the scenario.

To debug this problem, use the contention analysis features provided by global resource serialization. The first thing that you discover is that commands do not seem to work on PROD1, so any analysis must occur on either PROD2 or TEST. You must determine whether any resources are in contention. If DISPLAY GRS,C were issued on PROD2 or TEST, the result would be as follows:
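The following is a sketch of the information the display conveys, not the verbatim message format. The job and resource names come from the scenario; the ASID values, the exclusive wait attributes, and the placement of SYSPROG on PROD2 are illustrative assumptions:

```
S=SYSTEMS  SYSDSN   PROD.DB
  SYSNAME  JOBNAME   ASID  EXC/SHR    STATUS
  PROD1    PRODJOB   001F  EXCLUSIVE  OWN
  PROD1    CLEANUP   0023  EXCLUSIVE  WAIT
  PROD2    SYSPROG   0028  EXCLUSIVE  WAIT

S=SYSTEMS  SYSDSN   PROD.PROCS
  SYSNAME  JOBNAME   ASID  EXC/SHR    STATUS
  PROD1    CLEANUP   0023  EXCLUSIVE  OWN
  PROD2    SYSPROG   0028  EXCLUSIVE  WAIT
```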
Looking at this output, it would appear that the problem is with PRODJOB; it is blocking both CLEANUP and SYSPROG from continuing. However, because the DISPLAY GRS,C command was issued on PROD2 or TEST, it does not return information about local resources on PROD1. If system commands were working on PROD1, DISPLAY GRS,C from PROD1 would provide a more complete picture:
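Schematically again, the display from PROD1 would include the same two SYSDSN entries plus the local (system-scope) command resource. *MASTER* runs in ASID 0001; the other values remain illustrative:

```
S=SYSTEM   SYSIEFSD Q10
  SYSNAME  JOBNAME   ASID  EXC/SHR    STATUS
  PROD1    *MASTER*  0001  EXCLUSIVE  OWN
  PROD1    PRODJOB   001F  EXCLUSIVE  WAIT
```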
You can see that the local resource [SYSIEFSD, Q10], held by *MASTER*, is really holding up all of the workload.

In a fully loaded system, where a considerable amount of workload is being processed concurrently, the opportunities for contention and the number of units of work involved in that contention can be much higher. It might be impossible to quickly analyze which resources and units of work are part of an ENQ lockout and which ones are not. When this occurs, the DISPLAY GRS,ANALYZE command is much more useful. Additionally, the GRS analysis command options are truly sysplex-wide in scope; the analysis includes local resources on all systems in the sysplex.

The analysis provided by this command is based on the fact that most of the "benign" contention in the sysplex is short term. That is, if you issue the same command over a period of time, the contention that is most affecting the sysplex remains in the output of the command. The output from the command is ordered by the length of time that the contention has been in effect. In a serious resource lockout, where one requester dominates ownership of a resource for a long period of time, or a resource deadlock, where a set of requesters requires resources held by the others in the set such that no request can be granted, the contention quickly rises to the top of the output.

Using the previous lockout scenario, the DISPLAY GRS,ANALYZE,BLOCKER command would return:
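As before, this is a sketch of what the analysis conveys rather than the exact message layout; the elapsed times are illustrative:

```
LONGEST BLOCKING UNITS OF WORK
  ELAPSED   SYSTEM  JOBNAME   RESOURCE
  00:45:10  PROD1   PRODJOB   SYSDSN   PROD.DB
  00:12:03  PROD1   *MASTER*  SYSIEFSD Q10
  00:05:44  PROD1   CLEANUP   SYSDSN   PROD.PROCS
```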
It is clear from this output that PRODJOB has been blocking other requesters for the longest time. However, the display does not tell the complete story. The view obtained from the DISPLAY GRS,ANALYZE,WAITER command shows that PRODJOB is also a waiter:
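In the same schematic form, with illustrative times:

```
LONGEST WAITING UNITS OF WORK
  ELAPSED   SYSTEM  JOBNAME   RESOURCE            BLOCKED BY
  00:45:10  PROD1   CLEANUP   SYSDSN   PROD.DB    PRODJOB
  00:12:03  PROD1   PRODJOB   SYSIEFSD Q10        *MASTER*
  00:05:44  PROD2   SYSPROG   SYSDSN   PROD.DB    PRODJOB
```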
Again, because this is a simple case, it is easy to see that, although PRODJOB has been blocking the longest, PRODJOB is itself blocked by *MASTER* for [SYSIEFSD, Q10]. What if the scenario were far more complicated? Then the DISPLAY GRS,ANALYZE,DEPENDENCY command is very useful in determining whether a single job or a small set of jobs is causing the lockout. The command can also detect a resource allocation deadlock. For this scenario, the output from the command would be:
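A sketch of the dependency analysis: each waiter's chain of blockers is followed to its end, and every chain in this scenario terminates at *MASTER* (layout and times illustrative):

```
DEPENDENCY ANALYSIS
  WAITER: PROD1 CLEANUP   (WAITING 00:45:10)
    WAITS FOR  SYSDSN   PROD.DB   HELD BY PROD1 PRODJOB
    WHICH WAITS FOR  SYSIEFSD Q10   HELD BY PROD1 *MASTER*
  WAITER: PROD2 SYSPROG   (WAITING 00:05:44)
    WAITS FOR  SYSDSN   PROD.DB   HELD BY PROD1 PRODJOB
    WHICH WAITS FOR  SYSIEFSD Q10   HELD BY PROD1 *MASTER*
  WAITER: PROD1 PRODJOB   (WAITING 00:12:03)
    WAITS FOR  SYSIEFSD Q10   HELD BY PROD1 *MASTER*
```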
From this analysis, it is obvious that the problem is with *MASTER* and not PRODJOB. In fact, all of the dependency chains end with *MASTER*. Because you cannot restart *MASTER*, the only way to clear this lockout is to re-IPL PROD1. In the case where the unit of work that all of the other requests depend on is a cancelable job or a subsystem that can be recycled, the operator can take appropriate action against that address space to resume normal system operations. For complete information on the DISPLAY GRS,ANALYZE command, see z/OS MVS System Commands.
Copyright IBM Corporation 1990, 2014