IBM Support

Core files filling important file systems? Want email alerts about each core dump?

How To


Summary

Normally, when a process fails, the AIX kernel saves the content of its memory to a "core" file so the cause of the failure can be investigated.
Oddly, I find many Linux systems have the core dump function switched off. Perhaps Linux core dumps happen so often that the system administrators can't investigate why so many processes simply disappear!

Steps

I was asked these questions recently and had to go and look up the subject again!  I had forgotten some of the details and then I thought I would use some new features of AIX for the second article.  In the distant past, there were various ways to stop core files being dumped into the current working directory of the program that failed. In AIX 5.3, AIX 6 and 7, the "chcore" command does all the hard work for us:

  • Choose a specific directory for core files.  The best option is a separate file system, so important file systems do not get filled (options -p on and -l directory).
  • Get the AIX kernel to rename the core file to include the process ID and time stamp (option -n on).
  • Compress the core file.  Core files can be large, so compressing them makes sense (option -c on).
  • Make these settings the default for the whole system (option -d).

Here is what I used as the root user:

Prepare a file system for the cores

# /usr/sbin/crfs -v jfs2 -a size=1G  -m /corefiles -A yes -p rw
# mount /corefiles
# chmod ugo+w /corefiles
# chown bin:bin /corefiles

Send cores to that directory with renaming and compression as default
# chcore -n on -p on -l /corefiles -c on -d

Check
# lscore
compression: on
path specification: on
corefile location: /corefiles
naming specification: on
#
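
If you want to script that check, here is a minimal sketch. It assumes lscore prints the four "name: value" lines in the layout shown above; the function name check_core_settings is my own invention and not part of AIX.

```shell
#!/bin/sh
# check_core_settings (hypothetical helper): read lscore-style output on
# stdin and confirm compression, path specification and naming are "on"
# and that the core file directory equals $1.
# Prints one line per problem and returns non-zero if anything is wrong.
check_core_settings() {
    awk -v dir="$1" '
        /^compression:/          { comp = $2 }
        /^path specification:/   { path = $3 }
        /^corefile location:/    { loc  = $3 }
        /^naming specification:/ { name = $3 }
        END {
            bad = 0
            if (comp != "on") { print "compression is not on"; bad = 1 }
            if (path != "on") { print "path specification is not on"; bad = 1 }
            if (loc  != dir)  { print "corefile location is not " dir; bad = 1 }
            if (name != "on") { print "naming specification is not on"; bad = 1 }
            exit bad
        }'
}

# lscore only exists on AIX, so only call it when it is present.
if command -v lscore >/dev/null 2>&1; then
    lscore | check_core_settings /corefiles && echo "core settings OK"
fi
```

The function only reads text, so it can also be pointed at saved lscore output from several machines.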

One final point - you need to log in again for subsequent core files to be affected by these new settings.

For the second part of the question, we want to be notified quickly when a core file is created. Normally, a core file means a catastrophic failure of an application, which can cause:

  • User complaints, with annoyed users losing data.
  • Unexplained batch errors in log files.

Rather than ignoring these symptoms, we need to determine where in the program it failed and why.

In AIX 6 (from Technology Level 6 - I think) and AIX 7, we have a monitoring subsystem called the AHA file system (AHAFS). AHA does all sorts of monitoring and alerting, and we can use it to alert us to core files nearly instantaneously.  If you updated to an AIX level that supports AHA, then you need to install it from the AIX media.  Fresh AIX installs get AHA installed by default. Fortunately, there are examples of how to use the /aha files.  Check out the directory /usr/samples/ahafs; the samples used in the next section are in /usr/samples/ahafs/bin.  There we have a file called aha.pl, a Perl script that can take its options from a file (which we use here).  I created a file called /etc/corefile with the following contents (the first three lines are comments that help get the layout right):

# Full-path filename of .mon file of the Event  CHANGED THRS_HI THRS_LO INF_LVL NTFY_CNT BUF_SZ  RE-ARM_INTVL
#                                                                                       (Bytes) (dd:hh:mm:ss)
#============================================== ======= ======= ======= ======= ======= ======= =============
/aha/fs/modDir.monFactory/corefiles.mon             YES    --      --       2       --      --   00:00:00:00


The long file name in the first column selects the event. The modDir.monFactory part means monitor a directory's content for created and removed files, and the rest of the name picks the directory to watch, here /corefiles.

  • The CHANGED column = YES means monitor for directory changes. 
  • The INF_LVL = 2 is the information level of the output. Level 1 does not include the name of the file involved and level 3 adds a stack trace.  The stack trace is cool, as it means you don't have to run the debugger to find the failing function and how the program got there. 
  • The other parameters are defaults that work. 
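
To avoid typing that long line by hand, the entry can be generated. This is a rough sketch under two assumptions that I have not checked against the aha.pl source: that aha.pl splits the columns on white space, and that the monitor file name is simply /aha/fs/modDir.monFactory followed by the directory name and ".mon", as in the /corefiles example. The function names mon_file_for_dir and aha_config_line are hypothetical.

```shell
#!/bin/sh
# mon_file_for_dir (hypothetical helper): map a directory such as
# /corefiles to the modDir monitor file name used in the article.
mon_file_for_dir() {
    printf '/aha/fs/modDir.monFactory%s.mon\n' "$1"
}

# aha_config_line (hypothetical helper): print one aha.pl input line for
# directory $1 at information level $2. The other columns are copied
# from the article's example; the widths follow its "====" ruler.
aha_config_line() {
    mon=$(mon_file_for_dir "$1")
    printf '%-47s %7s %7s %7s %7s %7s %7s %13s\n' \
        "$mon" YES -- -- "$2" -- -- 00:00:00:00
}

# Example: show the line for /corefiles at information level 2.
aha_config_line /corefiles 2
```

Redirecting the output, for example "aha_config_line /corefiles 2 > /etc/corefile", would give a file without the three comment lines.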

I experimented with many options in the settings and found one that generated 500 emails a second, so be careful.


Next, prepare the /aha file, which tells the kernel about the new event to be monitored (as root):

# touch /aha/fs/modDir.monFactory/corefiles.mon

You get an error about not being able to set the file update time. This is normal, as it is not a regular file but an interface to a kernel driver, like the files you find in the /proc file system.  Start the Perl script to report core files arriving in the /corefiles directory with:

# /usr/samples/ahafs/bin/aha.pl -i /etc/corefile -e nag@blue.ibm.com

At startup, the script outputs the following:

Attempting to open the AHAFS configuration file "corefile".
Monitoring the AHAFS event "/aha/fs/modDir.monFactory/corefiles.mon".

To test this alerting system, I just copied a file to /corefiles with: 

cp myfile /corefiles/testing

The Perl script outputs:

AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------

BEGIN_EVENT_INFO
Time          : Fri May 31 16:25:34 2013
Sequence Num  : 1
Process ID    : 15925260
User Info     : userName=root, loginName=root, groupName=system
Program Name  : cp
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
testing
END_EVPROD_INFO
END_EVENT_INFO

Email is sent to nag@blue.ibm.com.
Then the email looks like this:

From root Fri May 31 16:25:34 2013
Date: Fri, 31 May 2013 16:25:34 +0100
From: root@bronze2.ibm.com
To: nag@blue.ibm.com
Subject: AHAFS event has occurred!

AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time          : Fri May 31 16:25:34 2013
Sequence Num  : 1
Process ID    : 15925260
User Info     : userName=root, loginName=root, groupName=system
Program Name  : cp
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
testing
END_EVPROD_INFO
END_EVENT_INFO

Note: the "testing" in the output and email tells us about the new file including the name.
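
If you save the script output to a file, the file names can be pulled out of it afterwards. A minimal sketch, assuming the event layout is exactly as shown above (names on their own lines between BEGIN_EVPROD_INFO and END_EVPROD_INFO); the function name evprod_files is hypothetical.

```shell
#!/bin/sh
# evprod_files (hypothetical helper): read aha.pl event output on stdin
# and print only the lines between BEGIN_EVPROD_INFO and END_EVPROD_INFO,
# which at information level 2 hold the name of the file involved.
evprod_files() {
    awk '
        /^END_EVPROD_INFO/   { inside = 0 }
        inside               { print }
        /^BEGIN_EVPROD_INFO/ { inside = 1 }
    '
}
```

For example, "evprod_files < saved_output" (saved_output being whatever file you captured the output in) would print "testing" for the event above.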

Next, I used a special program that core dumps itself after a second or two. Yes, I wrote it and it was hard work too - none of my programs normally core dump.  I can run it from any directory and the kernel redirects the core dump to /corefiles.  I switched to Information Level (INF_LVL) = 3, so we get a stack trace in the output like the following sample:

AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time          : Fri May 31 16:35:08 2013
Sequence Num  : 1
Process ID    : 16056558
User Info     : userName=root, loginName=root, groupName=system
Program Name  : coredumper
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
core.16056558.31153508.Z
END_EVPROD_INFO
STACK_TRACE
ahafs_evprods+6FC
aha_process_vnop+160
vnop_create_attr+528
openpnp+550
openpath+140
fp_open+9C
open_corefile+614
corex+2F8
core+64
psig+37C
issig+3B4
sig_deliver+1F0
main+28
[FFFFFFFFFFFFFFFC]
END_EVENT_INFO

Comments:

  1. The program is called "coredumper". 
  2. The core file is renamed to "core.16056558.31153508.Z", which is PID=16056558 and date time=31153508 (May 31st at 15:35:08 Greenwich Mean Time, although the machine is running British Summer Time, where it is 16:35:08), and it is compressed, so the file name ends in ".Z".
  3. The part after "STACK_TRACE" is the stack trace!  The program received a memory fault signal while in the "main" function.
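
The file name can be split up in a script too. This is a sketch only, assuming the name always has the core.PID.DDHHMMSS layout seen here with an optional ".Z", and reading the stamp the way comment 2 does (day of month, then GMT time). The function name decode_core_name is hypothetical.

```shell
#!/bin/sh
# decode_core_name (hypothetical helper): split a renamed core file
# name of the form core.PID.DDHHMMSS[.Z] into its parts.
decode_core_name() {
    name=${1##*/}                       # drop any directory part
    compressed=no
    case $name in
        *.Z) compressed=yes; name=${name%.Z} ;;
    esac
    case $name in
        core.*.[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "unrecognised core file name: $1" >&2; return 1 ;;
    esac
    stamp=${name##*.}                   # DDHHMMSS
    pid=${name#core.}
    pid=${pid%.*}
    day=$(printf '%s\n' "$stamp" | cut -c1-2)
    hh=$(printf '%s\n' "$stamp" | cut -c3-4)
    mm=$(printf '%s\n' "$stamp" | cut -c5-6)
    ss=$(printf '%s\n' "$stamp" | cut -c7-8)
    echo "pid=$pid day=$day time=$hh:$mm:$ss compressed=$compressed"
}

decode_core_name core.16056558.31153508.Z
# prints: pid=16056558 day=31 time=15:35:08 compressed=yes
```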

For production servers, we need to automate the running of the aha.pl Perl script from the /etc/rc* files or from inittab.
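
For the inittab route, something like the following might do. Treat it as an untested sketch: the identifier "ahacore", run level 2, the "respawn" action, the console redirection and the example address are all my assumptions, and it presumes /aha is mounted and the .mon file is usable by the time init starts the script. The function aha_inittab_entry is hypothetical; mkitab is the standard AIX command for adding inittab records.

```shell
#!/bin/sh
# aha_inittab_entry (hypothetical helper): build an inittab record that
# runs aha.pl with options file $1 and sends email to address $2.
# "ahacore", run level 2 and "respawn" are assumptions, not from the article.
aha_inittab_entry() {
    printf 'ahacore:2:respawn:/usr/samples/ahafs/bin/aha.pl -i %s -e %s >/dev/console 2>&1\n' \
        "$1" "$2"
}

# Show the record. To add it on AIX, as root, you would run:
#   mkitab "$(aha_inittab_entry /etc/corefile admin@example.com)"
aha_inittab_entry /etc/corefile admin@example.com
```

With "respawn", init restarts the script if it exits; "once" would be the choice if you prefer to restart it by hand.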

Note: this method does not require polling or a crontab job periodically checking the /corefiles directory, so it uses zero CPU time while waiting.


Core dump notifications are also put into the AIX Error Report (errpt), like this:

# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A924A5FC   0531164313 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED

Or the detailed view:

# errpt -a | pg 
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A924A5FC

Date/Time:       Fri May 31 16:43:54 2013
Sequence Number: 68
Machine Id:      000E0A21D900
Node Id:         bronze2
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER 11
USER'S PROCESS ID: 18022570
FILE SYSTEM SERIAL NUMBER 5
INODE NUMBER 2
CORE FILE NAME /corefiles/core.18022570.31154354
PROGRAM NAME   coredumper
STACK EXECUTION DISABLED 0
COME FROM ADDRESS REGISTER main 2C

PROCESSOR ID
  hw_fru_id: 0
  hw_cpu_id: 3

ADDITIONAL INFORMATION
main F8
main 2C
__start 6C

Symptom Data
REPORTABLE 1
INTERNAL ERROR 0
SYMPTOM CODE
PCSS/SPI2 FLDS/coredumpe SIG/11 FLDS/main VALU/f8 FLDS/__start

---------------------------------------------------------------------------

The AIX error report can be redirected into a System Log and transported remotely off the machine - you would then have to monitor the system log for core dump creation events, and the alert would not be instantaneous.
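
If you only want the core file names out of the error report, the detailed output can be filtered. A minimal sketch, assuming the "CORE FILE NAME" label is followed by the path either on the same line (as in the listing above) or on the next line; the function name core_files_from_errpt is hypothetical. The errpt -J flag selects entries by error label.

```shell
#!/bin/sh
# core_files_from_errpt (hypothetical helper): read "errpt -a" output on
# stdin and print the path that follows each "CORE FILE NAME" label,
# whether it is on the same line or on the line after it.
core_files_from_errpt() {
    awk '
        /^CORE FILE NAME/ {
            if (NF >= 4)                  { print $4 }
            else if ((getline line) > 0)  { print line }
        }'
}

# errpt only exists on AIX, so only call it when it is present.
if command -v errpt >/dev/null 2>&1; then
    errpt -a -J CORE_DUMP | core_files_from_errpt
fi
```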

Additional Information


Other places to find content from Nigel Griffiths IBM (retired)

Document Location

Worldwide


Document Information

Modified date:
12 June 2023

UID

ibm11165444