QRadar SOAR: Elasticsearch index corruption caused by an OutOfMemoryError

Troubleshooting


Problem

An OutOfMemoryError in Elasticsearch can corrupt the indices that IBM Security QRadar SOAR uses for search. This document describes how to identify and resolve this kind of problem.

Symptom

If search returns errors to the UI, you see error messages in the logs, or IBM Security QRadar SOAR needs to be restarted, it is worth checking whether the indices are corrupted.

Diagnosing The Problem

The two files that are useful in troubleshooting are:
  • /usr/share/co3/logs/client.log
  • /var/log/elasticsearch/elasticsearch.log
You might also find it useful to look at the historical logs in /usr/share/co3/logs/daily and /var/log/elasticsearch, which are compressed, renamed, and dated.
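As a quick way to check for corruption or memory problems, you can search the current and historical logs for the relevant exception names. This is a sketch only; it assumes the rotated logs are gzip-compressed and that the paths above apply to your deployment.
# current Elasticsearch log
grep -iE 'CorruptIndexException|OutOfMemoryError' /var/log/elasticsearch/elasticsearch.log
# rotated (compressed) Elasticsearch logs
zgrep -iE 'CorruptIndexException|OutOfMemoryError' /var/log/elasticsearch/*.gz
# SOAR client log; repeat against the dated copies under /usr/share/co3/logs/daily
grep -iE 'CorruptIndexException|OutOfMemoryError' /usr/share/co3/logs/client.log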
On start-up of Elasticsearch, you might see the following error.
[2020-05-18T20:16:51,653][INFO ][o.e.n.Node               ] [jtlm4Nv] starting ...
[2020-05-18T20:16:51,827][INFO ][o.e.t.TransportService   ] [jtlm4Nv] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2020-05-18T20:16:54,904][INFO ][o.e.c.s.MasterService    ] [jtlm4Nv] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}
[2020-05-18T20:16:54,909][INFO ][o.e.c.s.ClusterApplierService] [jtlm4Nv] new_master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}, reason: apply cluster state (from master [master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2020-05-18T20:16:54,945][INFO ][o.e.h.n.Netty4HttpServerTransport] [jtlm4Nv] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2020-05-18T20:16:54,946][INFO ][o.e.n.Node               ] [jtlm4Nv] started
[2020-05-18T20:16:56,323][INFO ][o.e.g.GatewayService     ] [jtlm4Nv] recovered [27] indices into cluster_state
[2020-05-18T20:17:05,127][WARN ][o.e.i.c.IndicesClusterStateService] [jtlm4Nv] [[attachment][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [attachment][3]: Recovery failed on {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}
    at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2043) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.IndexShard$$Lambda$1839.0000000044018070.run(Unknown Source) ~[?:?]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4.jar:6.2.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) [?:1.8.0]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:1.8.0]
    at java.lang.Thread.run(Thread.java:812) [?:2.9 (09-15-2018)]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
    at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:413) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:94) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.StoreRecovery$$Lambda$1840.000000004802D830.run(Unknown Source) ~[?:?]
    at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:300) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:92) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1607) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2039) ~[elasticsearch-6.2.4.jar:6.2.4]
    ... 5 more
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=573579808 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/C0V55tlzQYO6B524ALQ0wA/3/translog/translog.ckp")))
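To confirm which indices and shards are affected, you can query the Elasticsearch REST API directly. A sketch, assuming Elasticsearch is bound to 127.0.0.1:9200 as in the log output above and that no additional authentication is configured:
# overall cluster status (red indicates one or more unassigned primary shards)
curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'
# per-index health
curl -s 'http://127.0.0.1:9200/_cat/indices?v'
# shards that failed to recover show up as UNASSIGNED
curl -s 'http://127.0.0.1:9200/_cat/shards?v' | grep -i unassigned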
These errors might also appear at other times, not solely at start-up of Elasticsearch. For example, indexing requests fail while the primary shard of the corrupted index is not active:
[2020-05-19T04:55:25,815][WARN ][r.suppressed             ] path: /attachment/_doc/349608, params: {index=attachment, id=349608, type=_doc}
org.elasticsearch.action.UnavailableShardsException: [attachment][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[attachment][3]] containing [index {[attachment][_doc][349608], source[{"inc_id": 315669, "task_id":4844153, "org_id": 201, "inc_create_date": "2020-05-19T02:49:06.186+0000", "attachment": {"content_type":"image/png","size":22050,"created":{"date":"2020-05-19T02:54:25.368+0000"},"name":"Offense summary.png","creator_id":{"mail":"xxxxx","name":"xxxx"}}, "source_data": {"actions":[],"content_type":"image/png","created":1589856865368,"creator_id":{"display_name":"xxxxx","id":55,"name":"xxxxx","type":"user"},"id":349608,"inc_id":315669,"inc_name":"xxxx","inc_owner":{"display_name":"xxxx","id":55,"name":"xxxx","type":"user"},"name":"Offense summary.png","size":22050,"task_at_id":{"id":116,"name":"perform_investigation"},"task_custom":true,"task_id":4844153,"task_members":null,"task_name":"Perform Investigation","type":"task","uuid":"fd831091-87fb-4a0a-952f-368e8a266b4b","vers":12}}]}]]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:944) [elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:781) [elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:734) [elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
You might see similar errors in the client.log.
The indices do not become corrupted without a cause. To find that cause, review the historical elasticsearch.log and client.log files; in many instances, an OutOfMemoryError is the cause of the corruption.
When Elasticsearch exhausts the memory assigned to it, the following error appears in the elasticsearch.log and Elasticsearch is left in an unstable state.
[2020-05-14T20:22:57,080][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[jtlm4Nv][search][T#6]], exiting
java.lang.OutOfMemoryError: Java heap space
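Before resolving the problem, it can be useful to note how much heap Elasticsearch is currently allowed, because the OutOfMemoryError above means that this allocation was exhausted. A quick check, assuming the default configuration file location used in the next section:
# show the minimum (-Xms) and maximum (-Xmx) heap currently configured
grep -E '^-Xm[sx]' /etc/elasticsearch/jvm.options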

Resolving The Problem

When the indices are corrupted, they need to be rebuilt and, in this case, the memory assigned to Elasticsearch needs to be increased.
The following describes the actions you can undertake:
  • Increase the memory as detailed in How to increase the Java heap size of Elasticsearch used by IBM Resilient by amending /etc/elasticsearch/jvm.options (an example of the heap settings is shown after this list)
  • Before you stop IBM Resilient, run
    sudo resutil configset -key elastic_server.init_schema -bvalue true
    On restart, this initiates a reindex of all your data.
  • Stop IBM Resilient by running
    sudo systemctl stop resilient
  • Stop Elasticsearch by running
    sudo systemctl stop elasticsearch
  • Start IBM Resilient and Elasticsearch by running
    sudo systemctl start resilient
  • Follow the guidelines in How to increase the Java heap size of Elasticsearch used by IBM Resilient to check that the memory is set correctly.
  • Tail /usr/share/co3/logs/client.log, looking for the following messages, which indicate that the indices are rebuilt
16:18:01.211 [Thread-12] INFO  com.co3.search.ElasticSearchReindexer - beginning population of ElasticSearch indexes...
16:18:04.453 [Thread-12] INFO  com.co3.search.ElasticSearchReindexer - 100% complete
16:18:04.458 [Thread-12] INFO  com.co3.search.ElasticSearchReindexer - ElasticSearch indexes have been fully populated
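For reference, the Elasticsearch heap is set by the -Xms and -Xmx lines in /etc/elasticsearch/jvm.options, as mentioned in the first step. The values below are placeholders only; size the heap according to How to increase the Java heap size of Elasticsearch used by IBM Resilient and the memory available on your appliance, keeping the two values equal.
# /etc/elasticsearch/jvm.options (excerpt; example values only)
-Xms2g
-Xmx2g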

The length of time to rebuild the indices depends on the amount of data. During the reindex, search might not work as expected.
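Once the reindexer reports that the indexes are fully populated, you can optionally confirm that the cluster is healthy again. A sketch, again assuming Elasticsearch is bound to 127.0.0.1:9200:
curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'
A status other than red indicates that all primary shards are active again; yellow can be normal on a single-node deployment because replica shards cannot be allocated.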

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB24","label":"Security Software"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSA230","label":"IBM Security QRadar SOAR"},"ARM Category":[{"code":"a8m0z0000001grPAAQ","label":"Resilient Core-\u003ESearch"}],"ARM Case Number":"TS003720163","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
19 June 2024

UID

ibm16211016