IBM Support

QRadar: Performance gaps in EPS graphs

Troubleshooting


Problem

Gaps in any EPS-related graph are a major concern because they suggest that events are being lost. However, most of the time the gap is the result of a performance problem with no actual impact on event collection. This article explains how to identify whether that is the case, and provides a workaround to restore the graphs.

Symptom

Gaps or dips in either the EPS graph or the Event Processor Distribution graph on the System Monitoring dashboard.

Cause

There are a few graphs that represent event volume in different ways, and each relies on slightly different mechanisms in QRadar. The two processes of interest are the Event Correlation Service event processor (ecs-ep) and the Event Correlation Service event collector (ecs-ec). This section reviews the most common sources of gaps in these graphs.
Gaps in Event Processor Distribution graph - Potential performance issues at ecs-ep
The most common scenario is gaps observed in the Event Processor Distribution graph located under the System Monitoring dashboard. The Event Processor Distribution graph is used to monitor performance of the custom rule engine, but this is not a true Events Per Second (EPS) graph. This graph plots the number of events written to disk by searching for all events and grouping by event processor. For an event to be counted for this graph, it must go through the complete pipeline and be written to disk. Therefore, gaps in this graph are usually caused by performance degradation on ecs-ep.
You can confirm whether a real issue exists by comparing this graph with the EPS graph. If the EPS graph does not show a similar dip at the same time, then the gap in the Event Processor Distribution graph is due to performance issues at the custom rule engine. The affected host is delaying correlation and the writing of events to disk.
 
If you identify this behavior, open a case with IBM QRadar Support for performance degradation on the custom rule engine.


Gaps in Event Rate EPS Graph

Gaps here are caused by ecs-ec on the Console host.

System notification events are collected on each managed host and sent to the Console, which receives them through the ecs-ec-ingress service. This includes the StatFilter events, which report the EPS count. All of these events go through the complete pipeline on the Console. If the Console has performance degradation at the custom rule engine or at device parsing, these events might not be processed every minute, which results in gaps in the EPS graph. The same happens if the Console exceeds its allocated license and drops events, which is a common scenario because the Console is usually configured with a low license. Therefore, a gap does not mean that events are being dropped on the event processor that shows the gap. Instead, the Console might not be processing the events that report the EPS count.

How can we validate whether performance degradation is impacting graph accuracy?

If you encounter this scenario, check the allocated license and make sure that it is enough for burst handling. For more information, see License management in IBM Documentation.
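
As a rough aid for the license check, the following sketch summarizes a list of EPS samples against an allocated limit. It assumes that you have already collected per-interval EPS values for the Console into a text file with one number per line. Both that file and the function name `summarize_eps` are hypothetical and are not part of QRadar.

```shell
#!/bin/sh
# Sketch only: summarize EPS samples (one number per line on stdin)
# against an allocated EPS limit passed as the first argument.
# summarize_eps is an illustrative helper, not a QRadar utility.
summarize_eps() {
    awk -v limit="$1" '
        $1 > peak  { peak = $1 }     # track the highest sample seen
        $1 > limit { over++ }        # count samples above the limit
        END { printf "peak=%d over_limit=%d of %d samples\n", peak, over, NR }'
}
```

For example, `summarize_eps 1000 < console_eps.txt` reports the peak value and how many samples exceeded 1000 EPS. Frequent samples above the limit suggest that the allocation leaves little room for bursts.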

Accumulator errors on Console

The EPS graph takes its information from accumulated data. The accumulator service aggregates against all events seen in the previous 60 seconds and has 60 seconds to process that interval. If the accumulator service crashes due to an out-of-memory exception, the information for a few minutes might not be available, which results in a gap in the graph.

Another reason the accumulator might cause gaps is when it falls behind or fails to complete accumulation for all configured global views within 60 seconds. For more information, see Accumulator is falling behind in IBM Documentation.
The search used to accumulate EPS-related data can sometimes cause this behavior when the default view is deleted. Inadequate tuning of global views or higher than usual event volume can also cause the accumulator to fall behind. For more information about this issue, see IJ31082: 'ACCUMULATOR FALLING BEHIND' NOTIFICATIONS AFTER DEFAULT GLOBAL VIEWS FOR EVENT RATE AND FLOW RATE HAVE BEEN RECREATED
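
Because the accumulator works in 60-second intervals, a gap appears as two consecutive data points that are more than 60 seconds apart. The following sketch locates such gaps in a sorted list of epoch timestamps, one per line. It assumes that you have exported the timestamps of the graph's data points to a text file. That export and the function name `find_minute_gaps` are hypothetical.

```shell
#!/bin/sh
# Sketch only: read sorted epoch timestamps (seconds, one per line) from
# stdin and report every place where consecutive points are more than
# 60 seconds apart. find_minute_gaps is an illustrative helper name.
find_minute_gaps() {
    awk '
        NR > 1 && $1 - prev > 60 {
            # (delta / 60) - 1 is the number of 60-second intervals skipped
            printf "gap: %d -> %d (%d missing intervals)\n", prev, $1, ($1 - prev) / 60 - 1
        }
        { prev = $1 }'
}
```

Running `find_minute_gaps < timestamps.txt` prints nothing when every interval is present. The reported time ranges can then be compared with accumulator errors logged around the same time.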

Errors at ingress preventing events from reaching ecs-ec

Although this is seen more rarely, make sure that the "Stream" threads are loaded by ingress. These threads take events from ecs-ec-ingress and pass them to ecs-ec. If these threads fail to load, ingress cannot send data to ecs-ec on the same managed host.

To run threadTop.sh:

  1. SSH to the Console.
  2. From the Console, use SSH to log in to the managed host that has the issue.
  3. Type the command:
    /opt/qradar/support/threadTop.sh -p 7787 -e 'Stream*'
    
    System Time: 4/11/2022 at 12:14:56.108
    --------------  -----  ----------  ------------------------------------------
    Server          ID     MSecs  Name
    --------------  -----  ----------  ------------------------------------------
    7787              499      2  StreamProcessorThread
    7787               89      0  StreamListenerThread
    --------------  -----  ----------  ------------------------------------------
                               2  Total (2/2000)
    

If the "Stream" threads do not appear in the output, the behavior is related to APAR IJ36277: QRADAR CAN FAIL TO PASS EVENTS FROM ECS-EC-INGRESS COLLECTION PROCCESS TO THE ECS-EC PROCESS
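
The check can also be scripted. The following sketch counts the "Stream" threads in threadTop output read from standard input. It assumes that data rows look like the sample output above, with a numeric first column and the thread name in the last column. The function name `count_stream_threads` is illustrative and is not part of QRadar.

```shell
#!/bin/sh
# Sketch only: count rows in threadTop output (read from stdin) whose
# first column is numeric and whose last column starts with "Stream".
# Assumes the row layout shown in the sample output: Server, ID, MSecs, Name.
count_stream_threads() {
    awk '$1 ~ /^[0-9]+$/ && $NF ~ /^Stream/ { n++ } END { print n + 0 }'
}
```

On the managed host, the output of the command in step 3 can be piped into the function: `/opt/qradar/support/threadTop.sh -p 7787 -e 'Stream*' | count_stream_threads`. For the sample output above the function prints 2. A result of 0 indicates that no "Stream" threads were listed.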

Environment

QRadar 7.4.3 and 7.5.0 versions.

Resolving The Problem

The workaround for this behavior is:

  1. SSH to the Console as the root user or as a sudo user with sufficient privileges.
  2. From the Console, use SSH to log in to the managed host that has the issue.
    Important:  Event collection might be interrupted while the service restarts. Schedule a maintenance window before you do the next step.
  3. Restart the ecs-ec-ingress service by typing the command:
    systemctl restart ecs-ec-ingress 

    Results
    If you still fail to see the "Stream" threads in threadTop after restarting ecs-ec-ingress, open a case with IBM QRadar Support for further assistance.
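
The restart and the follow-up check can be combined in one script. The following is a minimal sketch, not an IBM-provided tool. The function names and the `WAIT_SECS` variable are illustrative, and the 60-second wait is an assumed settle time rather than a documented value. Because it restarts ecs-ec-ingress, run it only inside a maintenance window.

```shell
#!/bin/sh
# Sketch only: restart ecs-ec-ingress, wait, then look for "Stream" threads.
# WAIT_SECS is an assumed settle time, not a documented QRadar value.
WAIT_SECS="${WAIT_SECS:-60}"

# Same threadTop invocation as in the diagnostic steps of this article.
run_threadtop() {
    /opt/qradar/support/threadTop.sh -p 7787 -e 'Stream*'
}

restart_and_check_stream_threads() {
    # Event collection might be interrupted while the service restarts.
    systemctl restart ecs-ec-ingress || return 2
    sleep "$WAIT_SECS"
    # A data row has a numeric first column and the thread name last.
    if run_threadtop | awk '$1 ~ /^[0-9]+$/ && $NF ~ /^Stream/ { found = 1 } END { exit !found }'; then
        echo "Stream threads are loaded"
    else
        echo "Stream threads are still missing; open a case with IBM QRadar Support"
        return 1
    fi
}
```

Call `restart_and_check_stream_threads` on the managed host. A nonzero exit status means that either the restart failed or the threads did not appear within the wait time.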

Document Location

Worldwide


Document Information

Modified date:
14 December 2022

UID

ibm16540284