IBM Support

QRadar: Troubleshooting disk space usage problems

Troubleshooting


Problem

Disk partitions are critical for the regular functioning of Linux and QRadar® SIEM. This article helps administrators identify the files and directories responsible when a partition triggers disk usage alerts. High disk usage can also cause other problems, such as software upgrades failing their disk space tests and configuration deployments not running.

Cause

By default, the QRadar disk sentry check runs every 60 seconds and looks for high disk usage across the following partitions:
QRadar Partition   Critical Threshold   Critical Services Stop (7.4.2 and later)
/                  Yes, at 95%          Yes, when less than 100GiB
/store             Yes, at 95%          Yes, when less than 100GiB
/transient         Yes, at 95%          Yes, when less than 100GiB
/storetmp          Yes, at 95%          Yes, when less than 100GiB
/opt               Yes, at 95%          Yes, when less than 100GiB
/var               No                   No
/var/log           No                   No, but can cause services to behave unexpectedly.
/var/log/audit     No                   No, but can cause services to behave unexpectedly.
/tmp               No                   No
/home              No                   No

Note: 100GiB is approximately 107.4GB

If any of these partitions exceeds 90% usage, a warning notification is sent to the UI. In /var/log/qradar.log, a log similar to the following appears:

[hostcontext.hostcontext] com.q1labs.hostcontext.ds.DiskSpaceSentinel: [WARN] [-/- -]System disk resources above warning threshold

IMPORTANT: For the partitions listed in the table as critical for system functionality, system services are stopped when usage exceeds the 95% threshold. Stopping services keeps the partition from filling completely and prevents further issues. A maximum threshold notification is sent to the UI and can also be seen in /var/log/qradar.log:

[hostcontext.hostcontext]com.q1labs.hostcontext.ds.DiskSpaceSentinel: [ERROR] [-/- -]Disk usage on at least one disk has exceeded 
the maximum threshold level of 0.95. The following disks have exceeded the maximum threshold level: /transient. Processes are being 
shut down to prevent data corruption. To minimize the disruption in service, reduce disk usage on this system.


For the other partitions, denoted as noncritical, the disk sentry check gives a warning when the threshold is met, but system processes are not stopped and no outage occurs. When the system recovers back under the threshold, a notification is sent to the UI, and the following message is seen in /var/log/qradar.log:

[hostcontext.hostcontext] com.q1labs.hostcontext.ds.DiskSpaceSentinel: [INFO] [-/- -]System disk resources back to normal levels
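The three messages above share the DiskSpaceSentinel class name, so they can be pulled out of the log together. The following is a minimal sketch, not an IBM-provided tool: the function name sentinel_events and the default of 20 lines are illustrative assumptions.

```shell
# Print the most recent disk sentry messages from a QRadar log file.
# sentinel_events is a hypothetical helper name, not a QRadar command.
sentinel_events() {
    local log_file="${1:-/var/log/qradar.log}"
    local count="${2:-20}"
    # -F treats the class name as a literal string, not a regular expression.
    grep -F 'DiskSpaceSentinel' "$log_file" | tail -n "$count"
}

# Example: sentinel_events /var/log/qradar.log 5
```

The WARN, ERROR, and INFO entries appear in chronological order, which shows when a partition crossed a threshold and whether it has recovered since.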

Diagnosing The Problem

The first step in diagnosing the problem is determining which partition has the problem.

Verify the managed host affected
  1. Log in to the QRadar user interface as an admin user.
  2. Click the bell icon and hover over the Disk Sentry alert.
    Figure01
    In the previous image, the affected host has the IP 10.11.12.13 and the partition affected is "/".
  3. Use SSH to log in to the Console. If the affected host is not the Console, SSH from the Console to the affected managed host.
  4. Use the df -Th command to display the disk usage of each partition.
    df -Th
    
    Example output:
    Filesystem                        Size  Used Avail Use% Mounted on
    /dev/mapper/rootrhel-root          13G  2.9G  9.7G  23% /
    devtmpfs                           16G     0   16G   0% /dev
    tmpfs                              16G   20K   16G   1% /dev/shm
    tmpfs                              16G  1.7G   15G  11% /run
    tmpfs                              16G     0   16G   0% /sys/fs/cgroup
    /dev/mapper/rootrhel-var          5.0G  208M  4.8G   5% /var
    /dev/sda3                          32G  4.1G   28G  13% /recovery
    /dev/mapper/rootrhel-home        1014M   33M  982M   4% /home
    /dev/sda2                        1014M  224M  791M  23% /boot
    /dev/mapper/rootrhel-tmp          3.0G   53M  3.0G   2% /tmp
    /dev/mapper/rootrhel-opt           13G  5.1G  7.5G  41% /opt
    /dev/mapper/rootrhel-storetmp      15G   34M   15G   1% /storetmp
    /dev/mapper/rootrhel-varlog        15G  3.6G   12G  24% /var/log
    /dev/mapper/storerhel-transient    40G   40G  236M 100% /transient
    /dev/mapper/rootrhel-varlogaudit  3.0G  205M  2.8G   7% /var/log/audit
    tmpfs                             3.2G     0  3.2G   0% /run/user/0
    /dev/drbd0                        158G   78G   80G  50% /store
    Notice that /dev/mapper/storerhel-transient has 100% in the Use% column. This means /transient is the partition causing the alert.
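On a host with many mount points, the df output from step 4 can be filtered so that only the partitions near a threshold remain. The following is a minimal sketch under stated assumptions: the function name partitions_over is hypothetical, and the input is expected in the six-column layout shown in the example output (as produced by df -P or df -h). The -T flag adds a Type column that shifts the fields, so it is left out here.

```shell
# Read df-style output on stdin and print the mount points whose usage is
# at or above a threshold. partitions_over is a hypothetical helper name.
partitions_over() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "100%" becomes "100"
            if (use + 0 >= limit) print $6, $5
        }'
}

# Example: df -P | partitions_over 90
```

With the example output above, only /transient would be printed, because it is the only partition at or above 90%.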
     
Finding undersized appliances
When QRadar is installed on a virtual machine with less than 256GB of disk storage (the minimum), some partitions might not be created and their directories default to the "/" partition. Use the lsblk command to check whether the disk size is less than 256GB and whether the /store and /transient partitions exist on the system.
Figure01
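The size check described above can be scripted. The following is a minimal sketch with loud assumptions: the function name undersized_disks is hypothetical, the 256GB minimum is interpreted as 256 * 1000^3 bytes, and the minimum is applied to each disk individually. Adjust the value if your sizing guidance uses a different definition.

```shell
# Read "NAME SIZE_IN_BYTES" lines on stdin and print every device smaller
# than a minimum size. undersized_disks is a hypothetical helper name.
# Assumption: 256GB is taken as 256000000000 bytes.
undersized_disks() {
    local min_bytes="${1:-256000000000}"
    awk -v min="$min_bytes" '
        $2 + 0 < min + 0 { printf "%s %.1f GB\n", $1, $2 / 1000000000 }'
}

# Example (sizes in bytes, whole disks only, no header):
#   lsblk -b -d -n -o NAME,SIZE | undersized_disks
# Example (shows whether /store and /transient have their own mount points):
#   lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```

Note that lsblk -d also lists devices such as optical drives, so review the device names before drawing conclusions from the output.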
After the affected managed host and partition are identified, go to the Resolving The Problem section for details about finding large files and directories, and review the linked article for the specific partition.

Resolving The Problem

There are several reasons a QRadar partition might have high disk usage:

  • Undersized appliances not meeting the minimum disk requirements.
  • Large files or directories on the partition causing it to fill.
  • Many small files that build up over time and cause a directory on the partition to grow excessively.
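The third cause is easy to miss because no single file stands out. The following is a minimal sketch for spotting it, assuming a hypothetical helper named file_counts: it counts the regular files under each immediate subdirectory of a path and lists the ten directories with the most files, from fewest to most. Hidden subdirectories are not matched by the glob and are skipped.

```shell
# Count regular files under each immediate subdirectory of a path.
# file_counts is a hypothetical helper name, not a QRadar command.
file_counts() {
    local base="${1%/}"     # strip a trailing slash so "/" works too
    local dir
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue
        # -xdev keeps find from descending into other mounted filesystems.
        printf '%s\t%s\n' "$(find "$dir" -xdev -type f | wc -l | tr -d ' ')" "$dir"
    done | sort -n | tail
}

# Example: file_counts /store
```

A directory with a very high file count and modest total size is a candidate for the slow buildup described above. The df -i command shows inode usage per filesystem and can confirm whether file count, rather than file size, is the limiting factor.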
Identify directories and files with large disk usage
  1. Use the du and find commands to list the largest directories and files.
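    1. The following du command returns the disk usage of each file and directory directly under the given partition or directory, without crossing into other file systems. The sort and tail commands keep the 10 largest entries, sorted from smallest to largest.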
      du -chaxd1 /<partition> | sort -h | tail
      Output Example for "/":
      61M     /usr/sbin
      122M    /usr/local
      444M    /usr/lib64
      589M    /usr/bin
      941M    /usr/lib
      958M    /usr/share
      3.2G    /usr
      7.9G    /root
      12G     /
      12G     total
      In the previous output, the /root directory is the largest.
      NOTE: Sometimes the SSH session can time out before the du command completes. In this case, run the du command inside a screen session, which does not terminate when the SSH session times out and remains accessible until the screen session is ended. From the command prompt, run:
      screen
      Run the previous du command inside the screen session. If the SSH session times out before the command completes, you can reattach to the screen session. To find the screen session ID, run:
      screen -ls
      The session ID is the number at the start of the session name in the output. In this example, the screen ID is 17376. To reattach to the screen session, run:
      screen -r 17376
      After the du command finishes, its output is displayed.
      You can also use the --exclude option with du to identify other large directories when analyzing a partition with a known large directory (such as ariel on the /store partition).
      du -chaxd1 --exclude=ariel /store
      Once finished, the output of the du command is presented, omitting the excluded directory.
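    2. The following find command returns the files larger than 100 MiB in a partition, sorted from smallest to largest. The -exec option passes the file names directly to ls, which handles file names that contain spaces and does not run ls at all when no file matches.
      find /<partition> -xdev -type f -size +100M -exec ls -lhSr {} +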
      Output Example for "/":
      -rw------- 1 root root 596M Jun 13 13:29 /core.26490
      -rw-r--r-- 1 root root 7.9G Jan 10 16:02 /root/scripts/test_file.zip
      In the previous output, test_file.zip is a forgotten file, and core.26490 is a core dump file that resulted from the abnormal exit of a process.
       
  2. Search in the Disk Space 101 portal for specific information about each partition. Alternatively, use the direct links in the following table.
     QRadar Partition   Articles
    /
    /store
    /transient
    /storetmp
    /opt
    /var
    /var/log
    /var/log/audit
    /tmp
    /home


    Result
    The files and directories contributing to the lack of disk space are evident. Administrators can proceed with the troubleshooting steps to remove the files or directories.

    If the files responsible are still not evident, administrators can contact QRadar Support for assistance.
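When the results need to be reviewed later or shared with QRadar Support, the du and find steps above can be captured in a single report file. The following is a minimal sketch, not an IBM-provided tool: the function name disk_report and the report layout are illustrative assumptions. Write the report to a partition other than the one that is full.

```shell
# Save the largest entries and the files over 100 MiB for one partition
# into a report file. disk_report is a hypothetical helper name.
disk_report() {
    local partition="$1"
    local report="$2"
    {
        echo "== Largest entries directly under $partition =="
        # Same du invocation as in the steps above; errors are discarded.
        du -chaxd1 "$partition" 2>/dev/null | sort -h | tail
        echo "== Files larger than 100 MiB on $partition =="
        find "$partition" -xdev -type f -size +100M -exec ls -lhSr {} + 2>/dev/null
    } > "$report"
}

# Example: disk_report /transient /storetmp/transient_report.txt
```

Because the function can run for a long time on a large partition, it is a good candidate for running inside a screen session.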
  

Document Location

Worldwide


Document Information

Modified date:
13 June 2024

UID

ibm10881013