IBM Support

Troubleshooting hanging NIM mksysb operations.

Troubleshooting


Problem

In this document we go through a few of the most common causes of a hanging mksysb operation.

We will dive into the internals of NIM backups and learn how to troubleshoot and fix those problems.

Symptom

  1. The mksysb displays 100% complete on your NIM, but the prompt is never returned.
  2. The mksysb is hanging at a certain percent of the backup.

Cause

1. A hang is usually caused by a network problem. At the beginning of a NIM mksysb, the NIMSH daemon running on the client LPAR opens two TCP sessions to the NIM master: one on client port 3901 and one on client port 3902, each connected to a NIM port in the range 513-1023. The second session is referred to as the auxiliary (AUX) session and is used to relay the mksysb command's success/failure return code when the backup completes.

If the auxiliary session is dropped or interrupted, the NIM master keeps waiting for that return code even after the backup itself has completed successfully.

2. During a mksysb backup, the ‘backbyname’ command is used to back up the data. If the command is unable to access or read a specific file or directory, the process may hang. Normally, this is caused by a hung NFS mount point or by one on which the root user has no read permission.

Additionally, this may be caused by a corrupt file system.

Diagnosing The Problem


Mksysb hanging after 100% complete. 

To diagnose this problem, we need to take an iptrace or tcpdump capture on both the NIM LPAR and the Client LPAR during the hanging operation.

*Before starting, ensure you have at least 500 MB free in the /tmp file system.
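
The free-space check above can be scripted. The following is a minimal sketch, not an IBM-provided tool: the function name has_free_kb is hypothetical, and it assumes your df accepts the POSIX -P and -k flags so that the available space appears in the fourth column of the second output line.

```shell
#!/bin/sh
# Hypothetical helper (not part of AIX or NIM): succeeds when the file system
# that holds DIR has at least REQUIRED_KB kilobytes available.
has_free_kb() {
    dir=$1
    required_kb=$2
    # Assumption: df -P prints one line per file system in POSIX format,
    # with the available space (in KB because of -k) in column 4.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge "$required_kb" ]
}

# 500 MB expressed in KB: 500 * 1024 = 512000
if has_free_kb /tmp 512000; then
    echo "/tmp has enough free space for the trace"
else
    echo "/tmp has less than 500 MB free"
fi
```

Run it on both the NIM LPAR and the Client LPAR, since a trace file is written on each.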

You can capture the trace by following the steps below:

  1. Start iptrace on the client LPAR:
    # startsrc -s iptrace -a "-s <NIM IP> -p 3901,3902 -b -L 1000000000 /tmp/<hostname>.iptrace"

     
  2. Start iptrace on the NIM LPAR:
    # startsrc -s iptrace -a "-s <Client IP> -p 3901,3902 -b -L 1000000000 /tmp/<hostname>.iptrace"

     
  3. Start the mksysb operation:
    # nim -o define -t mksysb -a server=master -a source=<Client LPAR> -a mk_image=yes -a location=<where to save our mksysb file> <NIM resource name>

     
  4. Wait for the process to hang. You can verify it is hanging by checking whether the backup process has finished on the Client LPAR, for example using “ps -ef”:
    # ps -ef | grep backbyname
    root 19529868 18350146 120 06:23:24      -  0:10 backbyname -i -q -v -Z -p -U -f /tmp/20512972.mnt0/kronos.mksysb2
    If this process is gone, it means the backup operation has either failed or completed.
    *Note: this process may start a minute or two after initiating the mksysb command on NIM.
     
  5. Stop the iptrace on both LPARs and analyze the data:
    # stopsrc -s iptrace

You can use a tool like Wireshark to open the trace files and analyze the data. Look for “Retransmission” packets on port 3902 in the client-side trace. These indicate that the client LPAR is trying to send something to NIM but is not getting a reply. You can verify that NIM is not receiving those packets by looking for the same packet ID in the NIM-side trace.

If the packets are missing from the NIM side, they were dropped somewhere between the two LPARs, most likely by a firewall.

Wireshark analysis will show retransmissions of the "FIN" packet on the NIM AUX port 3902:

[Image: Wireshark trace showing FIN retransmissions on port 3902]

The FIX 

This is generally caused by the TCP timeout setting on your firewall. Because the AUX session may remain idle for a long time, depending on how long the mksysb takes to complete, some firewalls may consider the session inactive and drop it.

To fix this, your firewall's TCP timeout window must be increased to match the longest time it may take for an mksysb operation to complete. 

As a workaround, the TCP keepalive settings on AIX may be tuned. The most common one is tcp_keepidle, which controls how long a session must be idle before AIX starts sending keepalive packets on it. The default is 14400 half-seconds, which is 2 hours.

IBM does not offer a recommended value, but it is best not to set it below 15 minutes, or 1800 half-seconds.

To change the value on your AIX system use: 

# no -p -o tcp_keepidle=1800 

To verify the value is set, use:

# no -a | grep keep
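
Because tcp_keepidle is expressed in half-seconds, it is easy to miscalculate the value. The following small sketch shows the arithmetic; the function names are hypothetical helpers and are not AIX commands.

```shell
#!/bin/sh
# Hypothetical helpers for the half-second units used by tcp_keepidle.

# Convert minutes to half-seconds: minutes * 60 seconds * 2 half-seconds.
minutes_to_halfsec() {
    echo $(( $1 * 60 * 2 ))
}

# Convert half-seconds back to whole minutes.
halfsec_to_minutes() {
    echo $(( $1 / 2 / 60 ))
}

minutes_to_halfsec 15      # prints 1800
halfsec_to_minutes 14400   # prints 120 (the 2-hour default)
```

The result of minutes_to_halfsec can then be passed to the no command shown above.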


Mksysb hanging at a certain percent. 

You can confirm the mksysb is hanging when the mksysb file size shows no progress and the “backbyname” process is still running on the Client LPAR:

# ps -ef | grep backby

    root 19267770 20709568 101 07:24:20      -  0:00 backbyname -i -q -v -Z -U -f /tmp/20512946.mnt0/kronos.mksysb2

# ls -l /tmp/20512946.mnt0/kronos.mksysb2

-rw-r--r--    1 root     system    541900800 Apr  1 07:24 /tmp/20512946.mnt0/kronos.mksysb2

Wait 5 minutes and check again:

# ls -l /tmp/20512946.mnt0/kronos.mksysb2

-rw-r--r--    1 root     system    541900800 Apr  1 07:24 /tmp/20512946.mnt0/kronos.mksysb2
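
The two manual checks above can be combined into one helper. This is a minimal sketch under stated assumptions: check_growth is a hypothetical function name, and the file size is taken from the fifth column of "ls -l", as in the output shown above.

```shell
#!/bin/sh
# Hypothetical helper: prints "growing" if FILE got larger within
# WAIT_SECONDS, otherwise prints "stalled".
check_growth() {
    file=$1
    wait_seconds=$2
    # Column 5 of "ls -l" is the file size in bytes.
    before=$(ls -l "$file" | awk '{ print $5 }')
    sleep "$wait_seconds"
    after=$(ls -l "$file" | awk '{ print $5 }')
    if [ "$after" -gt "$before" ]; then
        echo "growing"
    else
        echo "stalled"
    fi
}

# Example call (5 minutes = 300 seconds), using the path from the output above:
#   check_growth /tmp/20512946.mnt0/kronos.mksysb2 300
```

A "stalled" result only matters while “backbyname” is still running. If that process has already exited, the backup has finished or failed rather than hung.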

If the file size has not changed, you can stop the mksysb process, and kill “backbyname” on the Client LPAR if it is still running after mksysb has been stopped.

Then follow the steps below to determine the hang point:

1. Start a new mksysb on NIM with verbose output enabled by adding the “v” flag to -a mksysb_flags=<flags> (the “i” and “p” flags are also recommended), for example:
# nim -o define -t mksysb -a server=master -a mksysb_flags=piv  -a source=kronos -a mk_image=yes -a location=/export/mksysb/kronos.mksysb2 kronos_mksysb2

This will display the files that are being archived on the console, like this:

a          572 ./usr/share/lib/me/local.me
a          331 ./usr/share/lib/me/null.me
a         2477 ./usr/share/lib/me/refer.me
a         1859 ./usr/share/lib/me/sh.me
a         1208 ./usr/share/lib/me/tbl.me
a          650 ./usr/share/lib/me/thesis.me <<<<

2. Next, go to the Client LPAR and locate the archive.list file. This is a temporary file that lists all files to be backed up. It is located in /tmp, in a directory called “mksysb.<PID>”, for example /tmp/mksysb.20512946. Inside you will find a file called “.archive.list.<PID>”, for example .archive.list.20512946. The contents look like this:

./usr/share/lib/me/local.me
./usr/share/lib/me/null.me
./usr/share/lib/me/refer.me
./usr/share/lib/me/sh.me
./usr/share/lib/me/tbl.me
./usr/share/lib/me/thesis.me
./usr/share/lib/ms <<<<<<<< HANG POINT

3. In step 1, the verbose option lets us see the last file that was backed up successfully. This means the last file shown, “./usr/share/lib/me/thesis.me” (size 650), is not the one we are hanging on; the next one is. To find the next file, open the archive.list file and search for the last file shown on the console. The file or directory that follows it is the one “backbyname” is hanging on. In the example in step 2, that is “./usr/share/lib/ms”.
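
The lookup described in step 3 can be done with awk instead of searching by hand. The following is a minimal sketch: next_entry is a hypothetical helper name, and it assumes the archive list holds one path per line, exactly as the path appears on the console.

```shell
#!/bin/sh
# Hypothetical helper: prints the entry that follows LAST_FILE in ARCHIVE_LIST,
# that is, the file or directory backbyname was working on when it hung.
next_entry() {
    archive_list=$1
    last_file=$2
    awk -v last="$last_file" '
        $0 == ""   { next }          # ignore blank lines
        found      { print; exit }   # first entry after the match
        $0 == last { found = 1 }     # remember that the last good file was seen
    ' "$archive_list"
}

# Example call, using the paths from the steps above:
#   next_entry /tmp/mksysb.20512946/.archive.list.20512946 ./usr/share/lib/me/thesis.me
```

With the example list from step 2, this prints ./usr/share/lib/ms. If nothing is printed, the path was not found in the list or it was the last entry.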

The FIX 

The problem here is caused by a hung NFS mount, permission issues for root on an NFS mount, or file system corruption.

Check the file or directory you identified as the hang point. If it is an NFS mount or belongs to one, try remounting or unmounting the NFS share. If it is a normal file, try accessing it. If you cannot, a file system problem is very likely, in which case we recommend opening a case with IBM Support to investigate.

Document Location

Worldwide

Operating System

AIX:All operating systems listed

Document Information

Modified date:
16 December 2021

UID

ibm16149163