Technical Blog Post
Abstract
AIX: WHO IS LOCKING MY FILE?
Body
File locking is an essential mechanism for ensuring data integrity. Programs commonly 'lock' files to make sure that what they read is accurate, or to prevent anyone else from reading or writing a file while they modify it.
As a result, your program may fail to acquire a lock on a file. Some programs handle this silently by retrying; others report an error, hang while waiting for the lock, or simply terminate.
Most users know the 'fuser' command for checking who is using a file. While it works fine in some cases, the command is very slow, which makes it unsuitable on heavily loaded systems or when the lock is held too briefly for 'fuser' to identify the owner.
This short article will help you find out who is holding a lock on the file you are trying to access. We will walk through a few scenarios with files on both a regular and an NFS mounted file system.
A few words about locking itself before we start. There are various ways to lock a file, but 'fcntl()' is the most common because it works with files on both regular and NFS mounted file systems. 'fcntl()' is also available on most Unix flavors, which is valuable when porting applications across platforms. Finally, 'fcntl()' supports both 'read' and 'write' locks, and it can lock either a 'part' of the file or the whole file, depending on your requirements.
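To make the 'read'/'write' and 'part'/'whole file' distinction concrete, here is a minimal sketch of a non-blocking range lock. The helper names 'try_lock_range()' and 'unlock_range()' are made up for this illustration, and the sketch uses the portable 'F_SETLK' command rather than the AIX 'F_SETLK64' variant that appears later in this article.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the article's tools): try to lock 'len'
 * bytes starting at offset 'start' without blocking. 'type' is F_RDLCK,
 * F_WRLCK or F_UNLCK. A 'len' of 0 means "from 'start' to end of file".
 * Returns 0 on success, -1 with errno set on failure.
 */
int try_lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock lk;

    memset(&lk, 0, sizeof(lk));
    lk.l_type = type;
    lk.l_whence = SEEK_SET;
    lk.l_start = start;
    lk.l_len = len;

    /*
     * F_SETLK returns immediately: it fails with EACCES or EAGAIN when
     * another process holds a conflicting lock. F_SETLKW would wait instead.
     */
    return fcntl(fd, F_SETLK, &lk);
}

/* Release a range previously locked with try_lock_range(). */
int unlock_range(int fd, off_t start, off_t len)
{
    return try_lock_range(fd, F_UNLCK, start, len);
}
```

A caller that gets -1 back is in exactly the situation the rest of this article deals with: somebody else holds a conflicting lock and we want to know who.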
So let's go and visit a few scenarios on AIX...
== Scenario 1 ==
Your program is trying to lock a file that resides on a local file system,
but that file is already locked by another program that also uses 'fcntl()'
for locking. The other program holds the lock long enough for you to
notice the error from your program and run some diagnostics.
In that case we can use a simple C program that also uses 'fcntl()'
to identify who owns the lock blocking yours. The program 'chklock7.c'
below can be compiled using 'cc -q64 chklock7.c -o chklock7'.
/*----------------------------------------------------------------------------
 *
 * chklock7.c: Check who owns a lock on a given file.
 *
 * dalla
 *
 *--------------------------------------------------------------------------*/
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <string.h>
#include <strings.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/statvfs.h>
int
main(int ac, char **av)
{
    char *fname;
    int fd;
    int oflags;
    struct statvfs st;
    struct flock lk;
    /*
     * Check arguments.
     */
    if (ac < 2) {
        printf("[error] Usage: %s <file to check>\n", av[0]);
        exit(1);
    }
    fname = av[1];
    /*
     * Print pid for convenience.
     */
    printf("[info] pid %d running\n", getpid());
    /*
     * First check if file is NFS or not.
     */
    (void) memset((void *) &st, 0, sizeof(struct statvfs));
    if (statvfs(fname, &st) < 0) {
        printf("[error] statvfs(%s), [errno %d]\n", fname, errno);
        exit(1);
    }
    printf("[info] file resides on filesystem of type '%s'\n", st.f_basetype);
    /*
     * Open the file.
     */
    oflags = O_RDWR|O_LARGEFILE;
    if ((fd = open(fname, oflags)) < 0) {
        printf("[error] open(%s), [errno %d]\n", fname, errno);
        exit(1);
    } else {
        printf("[info] open(%s) = %d\n", fname, fd);
    }
    /*
     * Check if file is locked and by whom if it is. We describe a write
     * lock on the whole file (l_len = 0) and ask what would block it.
     */
    (void) memset((void *) &lk, 0, sizeof(struct flock));
    lk.l_start = 0;
    lk.l_whence = SEEK_SET;
    lk.l_len = 0;
    lk.l_type = F_WRLCK;
    if (fcntl(fd, F_GETLK64, &lk) < 0) {
        printf("[error] fcntl(F_GETLK64), [errno %d]\n", errno);
        close(fd);
        exit(1);
    }
    /*
     * l_type is set to F_UNLCK when nothing would block our lock.
     * Otherwise the structure describes the blocking lock.
     */
    if (lk.l_type != F_UNLCK) {
        printf("[info] lock on '%s' held by pid %d system %d\n",
            fname, lk.l_pid, lk.l_sysid);
    } else {
        printf("[info] no lock held on '%s'\n", fname);
    }
    /*
     * Close the file.
     */
    close(fd);
    exit(0);
}
Now let's say you are trying to lock '/home/dalla/tmp/getname' but someone
else already has a lock on it. To check who is holding it run:
# chklock7 /home/dalla/tmp/getname
[info] pid 13566462 running
[info] file resides on filesystem of type 'jfs2'
[info] open(/home/dalla/tmp/getname) = 3
[info] lock on '/home/dalla/tmp/getname' held by pid 37028212 system 0
Now we have the pid of the lock owner. The 'system id' being '0' indicates
that the owner is running and locking from the current system. So we can use
a simple 'ps' command to identify it:
# ps -edf | grep 37028212 | grep -v grep
dalla 37028212 18219270 0 01:56:15 pts/3 0:00 chklock6 /home/dalla/tmp/getname
== Scenario 2 ==
This is very similar to the first scenario, except that this time the file you
are trying to lock is on an NFS file system. NFS locking is internally a bit
more complex. The lock ultimately lives on the machine that owns the real file
system, that is, the machine the file system is exported from. All clients
'forward' their lock requests to the NFS server, which grants the lock or not.
So in this case we use the same program but also check the 'system id'.
Remember that the 'system id' is the one the 'NFS server' sees, because the
lock is granted on the machine where the NFS server runs.
Let's imagine a machine 'machine1' where the real file system is.
The file system is exported and the proper NFS related daemons are running.
We also have a machine 'machine2', an NFS client that has mounted the
file system exported from 'machine1'. This machine also has all NFS related
daemons running. We run 'chklock7' on the NFS server. Here is the first case:
machine1# chklock7 /data1/dalla/tmp/dbvars
[info] pid 16318648 running
[info] file resides on filesystem of type 'nfs'
[info] open(/data1/dalla/tmp/dbvars) = 3
[info] lock on '/data1/dalla/tmp/dbvars' held by pid 64487674 system 0
In this case we see 'system id' being '0'. So this means that the program
holding the lock is running on the local machine, that is 'machine1'.
In that case a local 'ps' will be enough to identify the process. Now below
is another situation:
machine1# chklock7 /data1/dalla/tmp/dbvars
[info] pid 7995422 running
[info] file resides on filesystem of type 'jfs2'
[info] open(/data1/dalla/tmp/dbvars) = 3
[info] lock on '/data1/dalla/tmp/dbvars' held by pid 31653986 system 16374
In this case the system id is '16374', so we need to put a name on that
number. This has to be done as 'root' using the kernel debugger 'kdb':
machine1# echo kdump | kdb -script
read vscsi_scsi_ptrs OK, ptr = 0xF1000000C01E4E20
(0)> kdump
Executing kdump command
NFS KLM sysid list:
sysid  prog    vers  InUse  ip addr      Name      Ref   SmC    SmS
...
16377  100021  4     FALSE  ...........  paris     0001  TRUE   FALSE
16376  100021  4     FALSE  ...........  jabba     0002  TRUE   FALSE
16375  100021  1     FALSE  ...........  machine3  0001  TRUE   FALSE
16374  -       -     -      ...........  machine2  0000  FALSE  TRUE
...
So we find it is 'machine2'. In that case the 'ps' should be run on 'machine2'
in order to identify the process.
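If you have to do this lookup often, the captured 'kdump' text can be searched programmatically. The sketch below is only an illustration: 'sysid_to_name()' is a hypothetical helper, and it assumes the column layout seen in the sample output above (sysid, prog, vers, InUse, ip addr, Name, ...), which may differ between AIX levels.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan text laid out like the 'sysid list' above and
 * copy the Name column of the row whose first column equals 'sysid' into
 * 'name'. Assumes Name is the 6th whitespace-separated field.
 * Returns 1 if the sysid was found, 0 otherwise.
 */
int sysid_to_name(const char *text, long sysid, char *name, size_t namelen)
{
    while (*text) {
        size_t len = strcspn(text, "\n");
        char line[256];
        char host[64];
        long id;

        /* Work on one line at a time so fields never spill across lines. */
        if (len < sizeof(line)) {
            memcpy(line, text, len);
            line[len] = '\0';
            /* %*s skips the prog, vers, InUse and ip addr columns. */
            if (sscanf(line, "%ld %*s %*s %*s %*s %63s", &id, host) == 2
                && id == sysid) {
                snprintf(name, namelen, "%s", host);
                return 1;
            }
        }
        text += len;
        if (*text == '\n')
            text++;
    }
    return 0;
}
```

Header lines and the '...' separators are skipped naturally because they do not start with a number.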
== Scenario 3 ==
This time our program still reports errors saying that the file we want to
lock is already locked, but by the time we are ready to run 'chklock7' the
process that was holding the lock is gone. So now we have to find a way to
catch the process at the time it acquires the lock. To do that we will use
probevue and trace the 'fcntl()' system call, but only for locking requests.
The script is below. We are interested in 'fcntl()' calls for locks, but since
'fcntl()' takes only a 'file descriptor' and not a file name, we also have to
trace 'open()' calls to be able to match the file name we are interested in.
/*
 * chklock7.pb: Track locking activity on a given file.
 *
 * Run as user 'root' using the following command line:
 *
 * probevue -t 10 -e 75 -s 64 -o chklock7.out chklock7.pb
 *
 * In the 'open()' entry probe replace the file name in 'strstr()'
 * by the filename you want to track locks for.
 *
 *
 * dalla
 */
int open(char *, int);
int kfcntl(int, int);
__thread int in_open;
__thread char *open_path;
__thread int open_mode;
__thread int in_kfcntl;
__thread int kfcntl_fd;
__thread int kfcntl_cmd;
/*
 * Note that we check for the 'filename' only. We could check the
 * full path but then any process that would open the same file using
 * a relative path would not be caught.
 */
@@syscall:*:open:entry
{
    __auto String fname[256];
    fname = get_userstring((void *) __arg1, -1);
    if (strstr(fname, "getname")) {
        thread:in_open = 1;
        thread:open_path = __arg1;
        thread:open_mode = __arg2;
    }
}
@@syscall:*:open:exit
    when (thread:in_open == 1)
{
    __auto String fname[256];
    if (__rv >= 0) {
        fname = get_userstring((void *) thread:open_path, -1);
        printf("[%s - %ld - %ld] open(%s, 0x%08x) = %d\n",
            __pname, __pid, __tid, fname, thread:open_mode, __rv);
    }
    thread:in_open = 0;
}
/*
 * Here we are only interested in F_SETLK64 (12) and F_SETLKW64 (13),
 * as defined in fcntl.h.
 */
@@syscallx:*:kfcntl:entry
{
    if ((__arg2 == 12) || (__arg2 == 13)) {
        thread:in_kfcntl = 1;
        thread:kfcntl_fd = __arg1;
        thread:kfcntl_cmd = __arg2;
    }
}
@@syscallx:*:kfcntl:exit
    when (thread:in_kfcntl == 1)
{
    printf("[%s - %ld - %ld] kfcntl(%d, %d) = %d [errno = %d]\n",
        __pname, __pid, __tid, thread:kfcntl_fd, thread:kfcntl_cmd, __rv, __errno);
    thread:in_kfcntl = 0;
}
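The script matches on the file name at 'open()' time because 'fcntl()' only ever sees a descriptor. Inside a C program of your own, a different technique is available for tying a descriptor back to a path: comparing device and inode numbers. This is not what the probevue script does, and 'fd_is_file()' is a hypothetical helper name used only for this sketch.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: report whether descriptor 'fd' refers to the same
 * file as 'path'. Two names designate the same file when both the device
 * and the inode number match, regardless of relative or absolute paths.
 * Returns 1 if same file, 0 if different, -1 on error.
 */
int fd_is_file(int fd, const char *path)
{
    struct stat by_fd;
    struct stat by_path;

    if (fstat(fd, &by_fd) < 0 || stat(path, &by_path) < 0)
        return -1;

    return by_fd.st_dev == by_path.st_dev
        && by_fd.st_ino == by_path.st_ino;
}
```

This sidesteps the relative path problem mentioned in the script comment, but it only works from within a process that owns the descriptor.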
As root we start the 'chklock7.pb' script and let it run until the problem
reproduces. Of course the script could be modified to apply additional filters,
or to print both when we enter 'fcntl()' and when we exit it, so that we
could catch a request to get a lock with the 'wait' flag set. But we are only
interested in locks that are held on the file and prevent our program from
obtaining its lock. The less we dump the more efficient it is...
# probevue -t 10 -e 75 -s 64 -o chklock7.out chklock7.pb
Once the problem has reproduced we interrupt the script and check for the
'chklock7.out' file that will contain the info. In our case our program
failed to obtain a lock on '/home/dalla/tmp/getname'.
[db2sysc - 27066866 - 57737587] kfcntl(6, 13) = 0 [errno = 0]
[db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
[db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
[chklock6 - 13631800 - 75825413] open(/home/dalla/tmp/getname, 0x04000002) = 3
[db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
[db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
[db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
[db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
[db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
[db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
[db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
[db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
[chklock6 - 13631800 - 75825413] kfcntl(3, 13) = 0 [errno = 0]
As we can see, the only process that matches is 'chklock6', which opens the
file we want and gets the lock (last line). Once again, the second argument to
'fcntl()' in the output, here 12 or 13, is the value matching the
F_SETLK64 or F_SETLKW64 command in the 'fcntl.h' header file.
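On a busy system the output file can be long, so the matching done by eye above can be automated. The following sketch is an illustration under assumptions: 'find_locker()' and 'struct lock_hit' are hypothetical names, the line format is the one printed by the script in this article, process names are assumed to contain no spaces, and only the most recent 'open()' of the target file is remembered.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record describing the lock request that was found. */
struct lock_hit {
    char pname[64];
    long pid;
    int  fd;
    int  cmd;
};

/*
 * Hypothetical helper: scan probevue output, remember which (pid, fd)
 * opened a path containing 'target', and report the first successful
 * kfcntl() lock request made by that pid on that fd.
 * Returns 1 and fills 'hit' on a match, 0 otherwise.
 */
int find_locker(const char *text, const char *target, struct lock_hit *hit)
{
    long open_pid = -1;
    int  open_fd = -1;

    while (*text) {
        size_t len = strcspn(text, "\n");
        char line[512], pname[64], path[256];
        long pid, tid;
        unsigned int mode;
        int fd, cmd, rv;

        if (len < sizeof(line)) {
            memcpy(line, text, len);
            line[len] = '\0';
            if (sscanf(line, "[%63s - %ld - %ld] open(%255[^,], %x) = %d",
                       pname, &pid, &tid, path, &mode, &fd) == 6) {
                /* Remember who opened the file we care about. */
                if (strstr(path, target)) {
                    open_pid = pid;
                    open_fd = fd;
                }
            } else if (sscanf(line, "[%63s - %ld - %ld] kfcntl(%d, %d) = %d",
                              pname, &pid, &tid, &fd, &cmd, &rv) == 6) {
                /* A return value of 0 means the lock was granted. */
                if (pid == open_pid && fd == open_fd && rv == 0) {
                    snprintf(hit->pname, sizeof(hit->pname), "%s", pname);
                    hit->pid = pid;
                    hit->fd = fd;
                    hit->cmd = cmd;
                    return 1;
                }
            }
        }
        text += len;
        if (*text == '\n')
            text++;
    }
    return 0;
}
```

Matching on both pid and descriptor matters here: the 'db2sysc' lines in the sample are lock requests too, but on descriptors that were never opened on the target file.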
You can modify 'chklock7.c' and 'chklock7.pb' to identify any process that
might conflict with yours when obtaining a lock on a file.
UID
ibm13286371