QRadar: Troubleshooting disk space usage problems

Troubleshooting

Problem

Disk partitions are critical for the regular functioning of Linux and QRadar® SIEM. This article helps administrators identify the files and directories responsible when a partition triggers a disk usage alert. High disk usage can also cause other problems, such as software upgrades failing the disk space tests and configuration deployments not running.

Cause

By default, the QRadar disk sentry check runs every 60 seconds and looks for high disk usage across the following partitions:

QRadar Partition	Critical Threshold	Critical Services Stop (7.4.2 and later)
/	Yes, at 95%	Yes, when less than 100GiB
/store	Yes, at 95%	Yes, when less than 100GiB
/transient	Yes, at 95%	Yes, when less than 100GiB
/storetmp	Yes, at 95%	Yes, when less than 100GiB
/opt	Yes, at 95%	Yes, when less than 100GiB
/var	No	No
/var/log	No	No, but can cause services to behave unexpectedly.
/var/log/audit	No	No, but can cause services to behave unexpectedly.
/tmp	No	No
/home	No	No

Note: 100GiB is approximately 107.4GB

If any of these partitions exceeds 90% usage, a warning notification is sent to the UI. In /var/log/qradar.log, a log similar to the following appears:

[hostcontext.hostcontext] com.q1labs.hostcontext.ds.DiskSpaceSentinel: [WARN] [-/- -]System disk resources above warning threshold

IMPORTANT: For the partitions listed in the table as critical for system functionality, system services are stopped to avoid the partition becoming full and prevent further issues. A maximum threshold notification is sent to the UI and can also be seen in /var/log/qradar.log:

[hostcontext.hostcontext]com.q1labs.hostcontext.ds.DiskSpaceSentinel: [ERROR] [-/- -]Disk usage on at least one disk has exceeded 
the maximum threshold level of 0.95. The following disks have exceeded the maximum threshold level: /transient. Processes are being 
shut down to prevent data corruption. To minimize the disruption in service, reduce disk usage on this system.

For the other partitions, denoted as noncritical, the disk sentry check gives a warning when the threshold is met, but system processes are not stopped and no outage occurs. When the system recovers back under the threshold, a notification is sent to the UI, and the following message is seen in /var/log/qradar.log:

[hostcontext.hostcontext] com.q1labs.hostcontext.ds.DiskSpaceSentinel: [INFO] [-/- -]System disk resources back to normal levels
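
To review recent disk sentry activity on a host, the warning, error, and recovery messages shown above can be filtered out of the log. The following is a minimal sketch, assuming the log path /var/log/qradar.log and the message format quoted in this article. The function name sentinel_events is hypothetical and is not a QRadar utility.

```shell
# List the most recent DiskSpaceSentinel messages from a QRadar log.
# sentinel_events is a hypothetical helper name, not a QRadar utility.
# $1: log file (assumed default: /var/log/qradar.log)
# $2: number of lines to show (default: 20)
sentinel_events() {
    local log="${1:-/var/log/qradar.log}"
    local count="${2:-20}"
    # -F matches the class name literally; tail keeps the newest entries.
    grep -F 'DiskSpaceSentinel' "$log" | tail -n "$count"
}
```

For example, sentinel_events /var/log/qradar.log 5 prints the five newest entries, which shows whether the host is still above a threshold or has already returned to normal levels.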

Diagnosing The Problem

The first step in diagnosing the problem is determining which partition has the problem.

Verify the managed host affected

Log in to the QRadar user interface as an admin user.
Click the bell icon and hover over the Disk Sentry alert.

In this example, the alert shows that the affected host has the IP 10.11.12.13 and the affected partition is "/".
SSH to the Console, then to the affected managed host if it is not the Console.

Use the df -Th command to get the output of the partitions.

df -Th

Example output (the Type column that the -T option adds is not shown here):

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/rootrhel-root          13G  2.9G  9.7G  23% /
devtmpfs                           16G     0   16G   0% /dev
tmpfs                              16G   20K   16G   1% /dev/shm
tmpfs                              16G  1.7G   15G  11% /run
tmpfs                              16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/rootrhel-var          5.0G  208M  4.8G   5% /var
/dev/sda3                          32G  4.1G   28G  13% /recovery
/dev/mapper/rootrhel-home        1014M   33M  982M   4% /home
/dev/sda2                        1014M  224M  791M  23% /boot
/dev/mapper/rootrhel-tmp          3.0G   53M  3.0G   2% /tmp
/dev/mapper/rootrhel-opt           13G  5.1G  7.5G  41% /opt
/dev/mapper/rootrhel-storetmp      15G   34M   15G   1% /storetmp
/dev/mapper/rootrhel-varlog        15G  3.6G   12G  24% /var/log
/dev/mapper/storerhel-transient    40G   40G  236M 100% /transient
/dev/mapper/rootrhel-varlogaudit  3.0G  205M  2.8G   7% /var/log/audit
tmpfs                             3.2G     0  3.2G   0% /run/user/0
/dev/drbd0                        158G   78G   80G  50% /store

Notice that /dev/mapper/storerhel-transient has 100% in the Use% column. This means /transient is the partition causing the alert.
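
On a host with many mount points, the same check can be done by filtering the df output instead of reading the Use% column by eye. The following is a minimal sketch, assuming the POSIX output format of df -P (six columns, with the usage percentage in the fifth and the mount point in the sixth). The function name over_threshold is hypothetical.

```shell
# Print mount points whose usage is at or above a threshold percentage.
# Reads `df -P` output on stdin. over_threshold is a hypothetical helper name.
# $1: threshold in percent (default: 90, the warning level described earlier)
over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {                 # skip the header line
            use = $5             # Capacity column, for example "100%"
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, $5
        }'
}
# Usage: df -P | over_threshold 95
```

With the example output above, a threshold of 95 would report only /transient.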

Finding undersized appliances

When QRadar is installed on a virtual machine with less than 256GB of disk storage (the minimum), some partitions can default to the "/" partition. Use the lsblk command to check whether the disk is smaller than 256GB and whether the /store and /transient partitions exist on the system.
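
The disk size check can be sketched as follows. This is an illustration only: it assumes that "256GB" means 256 x 10^9 bytes (adjust the value if the requirement is stated in GiB), and the function name undersized_disks is hypothetical. The lsblk options used are -b (sizes in bytes), -d (whole disks only), -n (no header), and -o (select columns).

```shell
# Flag whole disks smaller than a minimum size.
# Reads "NAME SIZE" pairs (size in bytes) on stdin, as printed by
#   lsblk -b -d -n -o NAME,SIZE
# undersized_disks is a hypothetical helper name.
# $1: minimum size in bytes (assumed default: 256 * 10^9, reading "256GB" as decimal)
undersized_disks() {
    awk -v min="${1:-256000000000}" '
        $2 + 0 < min + 0 { printf "%s %.1fGB\n", $1, $2 / 1000000000 }'
}
# Usage:
#   lsblk -b -d -n -o NAME,SIZE | undersized_disks
#   lsblk -o NAME,SIZE,MOUNTPOINT   # check whether /store and /transient are listed
```

Because -d lists every block device, entries such as optical or loop devices can also appear in the result and can be ignored.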

Once the conflicting managed host is identified, go to the Resolving The Problem section to find details about finding large files and directories, and review the linked article for the specific partition.

Resolving The Problem

There are several reasons a QRadar partition might have high disk usage:

Undersized appliances that do not meet the minimum disk requirements.
Large files or directories on the partition that cause it to fill.
Many smaller files that build up over time and cause a directory on the partition to grow excessively.

Identify directories and files with large disk usage

Use the du and find commands to list the largest directories and files.
1. The following du command returns the recursive disk usage for the /<partition> directory, sorted from smallest to largest.
```
du -chaxd1 /<partition> | sort -h | tail
```
  Output Example for "/":
```
61M     /usr/sbin
122M    /usr/local
444M    /usr/lib64
589M    /usr/bin
941M    /usr/lib
958M    /usr/share
3.2G    /usr
7.9G    /root
12G     /
12G     total
```
  In the previous output, the /root directory is the largest.
  
  NOTE: Sometimes the ssh session can time out before the du command completes. In this case, run the du command inside a screen session, which is not terminated when the ssh session times out and remains accessible until you end it. From the command prompt, run screen.
  
  Run the du command from step 1 inside the screen session. If the ssh session times out before the command completes, log in again and reattach to the screen session. You first need the screen session ID; to find it, run screen -ls. The ID is the number at the start of the session name in the output.
  
  In this example, the screen ID is 17376. To reattach to the screen session, run screen -r 17376.
  
  Once finished, the output of the du command is presented.
  You can also use the --exclude option with du to identify other large directories when analyzing a partition with a known large directory (such as ariel on the /store partition).
```
du -chaxd1 --exclude=ariel /store
```
  Once finished, the output of the du command is presented, omitting the excluded directory.
2. The following find command returns the files larger than 100M found in a partition, sorted from smallest to largest.
```
find /<partition> -xdev -type f -size +100M -print0 | xargs -0 -r ls -lhSr
```
  Output Example for "/":
```
-rw------- 1 root root 596M Jun 13 13:29 /core.26490
-rw-r--r-- 1 root root 7.9G Jan 10 16:02 /root/scripts/test_file.zip
```
  In the previous output, test_file.zip is a forgotten file, and core.26490 is a core dump file that resulted from the abnormal exit of a process.
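
The two steps above can be wrapped in small shell functions so that the same checks are easy to repeat on different partitions. This is a sketch, assuming the GNU versions of du, sort, and find. The function names largest_entries and large_files are hypothetical and are not part of QRadar.

```shell
# List the largest entries directly under a directory, smallest to largest.
# largest_entries and large_files are hypothetical helper names.
# $1: directory; $2: optional name pattern to exclude (passed to du --exclude)
largest_entries() {
    local dir="$1" skip="${2:-}"
    if [ -n "$skip" ]; then
        du -chaxd1 --exclude="$skip" "$dir" | sort -h | tail
    else
        du -chaxd1 "$dir" | sort -h | tail
    fi
}

# List files larger than a size on one file system, smallest to largest.
# $1: directory; $2: size in find -size syntax (default: +100M)
large_files() {
    # -exec ... + handles file names with spaces; ls is not run when nothing matches.
    find "$1" -xdev -type f -size "${2:-+100M}" -exec ls -lhSr {} +
}
```

For example, largest_entries /store ariel summarizes /store without the ariel directory, and large_files /transient +500M raises the size limit for a partition where 100M files are common.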

Search in the Disk Space 101 portal for specific information about each partition. Alternatively, use the direct links in the following table.

QRadar Partition	Articles
/	QRadar: About / partition; QRadar: Delete files or directories to gain space in / partition
/store	QRadar: About /store partition; QRadar: Delete files or directories to gain space in /store partition
/transient	QRadar: About /transient partition; QRadar: Delete files or directories to gain space in /transient partition
/storetmp	QRadar: About /storetmp partition; QRadar: Delete files or directories to gain space in /storetmp partition
/opt	QRadar: About /opt partition; QRadar: Delete files or directories to gain space in /opt partition
/var	QRadar: About /var partition; QRadar: Delete files or directories to gain space in /var partition
/var/log	QRadar: About /var/log partition; QRadar: /var/log and /var/log/audit fills to capacity due to logrotate issue
/var/log/audit	QRadar: About /var/log/audit partition; QRadar: /var/log and /var/log/audit fills to capacity due to logrotate issue
/tmp	QRadar: About /tmp partition; QRadar: Delete files or directories to gain space in /tmp partition
/home	QRadar: About /home partition; QRadar: Delete files or directories to gain space in /home partition

Result
The files and directories that contribute to the lack of disk space are identified. Administrators can proceed with the troubleshooting steps in the linked articles to remove them.

If the cause is still not evident, contact QRadar Support for assistance.

Related Information

QRadar Disk Space 101

QRadar: Resolving high disk usage problems for /var/log partition

QRadar: /var/log and /var/log/audit fills to capacity due to logrotate issue

Document Location

Worldwide
