Technical Blog Post
Abstract
AIX: TRACK LOAD/UNLOAD OF DYNAMIC LIBRARIES / OBJECTS
Body
Being able to track the list and frequency of dynamic libraries or objects being loaded and unloaded might be handy sometimes.
Loading a library is somewhat 'time consuming' because the dynamic loader will have, at one point or another, to resolve symbols used in the newly loaded object. Also care must be taken when unloading a library that nothing else still references that library. Failure to do so would most often lead to a crash with SIGSEGV or SIGILL.
It is not uncommon for database engines or other applications to make use of third parties products or UDFs (User Defined Functions). One of the most common ways of doing so is using dynamic libraries or objects. While the end user does not always have full control over the load/unload sequences for the dynamic objects, being able to see what is happening might be useful and help resolve issues.
This small article shows how to track the dynamic libraries or objects activity on AIX. While 'truss' might do the job, the way it is designed makes it way too intrusive for a production system. So instead we'll use 'probevue' to obtain the same result but much faster. We will use 2 examples. The first one to find out why a program is running slowly when calling multiple times a function defined in a dynamic library. The second example will help us find out why a program is crashing. Of course those examples here are simple programs but the approach for the case where things would be opaque like using a UDF or third party package would be the same.
== Example 1 ==
Here we have a simple program calling a function from a dynamic library
50 times. That function is simply multiplying an input value by 2 and returning
the result. The execution time shows this:
[/home3/dalla/tmp]->time dlopen1
[info] func(0) = 0
[info] func(1) = 2
[info] func(2) = 4
[info] func(3) = 6
...
[info] func(47) = 94
[info] func(48) = 96
[info] func(49) = 98
real 0m0.72s
user 0m0.01s
sys 0m0.08s
We notice that it is slow and also it seems to consume a lot of system time.
We know that 'func()' is actually from a dynamic library. So we use the following
probevue script to see if anything is wrong with the loading of the dynamic
library. It might be taking too long somehow to load. The script looks like
this and we run it as root:
/*
* dlopen1.pb: Track libraries/modules loaded/unloaded by a program.
* The program to trace (dlopen1 by default here) can be changed
* by modifying the filters 'when (__pname == "dlopen1")'.
*
* Run as user 'root' using the following command line:
*
* probevue -t 10 -e 75 -s 64 -o dlopen1.out dlopen1.pb
*
*
* dalla
*/
unsigned long long kload(char *);
unsigned long long kunload64(unsigned long long);
__thread void *libname;
__thread probev_timestamp_t ts1;
@@syscallx:*:kload:entry
when (__pname == "dlopen1")
{
thread:libname = __arg1;
thread:ts1 = timestamp();
thread:in_kload = 1;
}
@@syscallx:*:kload:exit
when (thread:in_kload == 1)
{
__auto String pathname[256];
__auto probev_timestamp_t ts2;
ts2 = timestamp();
pathname = get_userstring(thread:libname, 255);
printf("[%s - %d] kload(%s) = 0x%016llx [elapsed = %lld ms]\n",
__pname, __pid, pathname, (unsigned long long) __rv,
diff_time(thread:ts1, ts2, MICROSECONDS));
thread:in_kload = 0;
}
@@syscallx:*:kunload64:entry
when (__pname == "dlopen1")
{
printf("[%s - %d] kunload64(0x%016llx)\n",
__pname, __pid, (unsigned long long) __arg1);
}
So we run this as root before starting the 'dlopen1' test program:
# probevue -t 10 -e 75 -s 64 -o dlopen1.out dlopen1.pb
Once the test program has run we interrupt the probevue script and look
at the output file, here 'dlopen1.out':
[dlopen1 - 47644876] kload(mylib.so) = 0x08fffffff0000758 [elapsed = 38301 ms]
[dlopen1 - 47644876] kunload64(0x08fffffff0000758)
[dlopen1 - 47644876] kload(mylib.so) = 0x08fffffff0000808 [elapsed = 17203 ms]
[dlopen1 - 47644876] kunload64(0x08fffffff0000808)
[dlopen1 - 47644876] kload(mylib.so) = 0x08fffffff00008b8 [elapsed = 20170 ms]
[dlopen1 - 47644876] kunload64(0x08fffffff00008b8)
...
We see as many occurrences of 'load/unload' sequence than we have calls
to function 'func()'. So in this case the next step would be to check if there
is a parameter (or changing the program) so that once loaded we leave the
library loaded.
== Example 2 ==
In this case things are a bit different. Our test program still calls 'func()'
from the dynamic library. It works fine the first 10 times and then something
causes the program to crash:
[/home3/dalla/tmp]->dlopen2
calling func() with value 0
[info] func(0) = 0
calling func() with value 1
[info] func(1) = 2
...
calling func() with value 9
[info] func(9) = 18
calling func() with value 10
Illegal instruction(coredump)
So the program received a SIGILL when trying to call 'func()' even though
it seemed to work fine the first 10 times. This could happen because the
function pointer to 'func()' has been corrupted, or, because the library
has been unloaded. In that case as well the small probevue script would help
confirm if the library has been unloaded or not. The output file would show
the unload:
[dlopen2 - 1048862] kload(mylib.so) = 0x08fffffff0000758 [elapsed = 1254 ms]
[dlopen2 - 1048862] kunload64(0x08fffffff0000758)
If you are running AIX 7.1 TL04 or AIX 7.2 or above you can make the script
a bit more precise as probevue has a 'timestamp()' function as well. So you
can modify the probe for the 'unload' to print the time at which the 'unload'
took place:
@@syscallx:*:kunload64:entry
when (__pname == "dlopen2")
{
printf("[%s - %d] kunload64(0x%016llx)[%A]\n",
__pname, __pid, (unsigned long long) __arg1, timestamp());
}
The result would then show like this:
[dlopen2 - 30736748] kunload64(0x08fffffff0000750)[Jun/07/17 22:59:25]
This makes it easier to match times.
UID
ibm13286341