QRadar SOAR: Elasticsearch index corruption caused by an OutOfMemoryError

Troubleshooting


Problem

An OutOfMemoryError in Elasticsearch can corrupt the indices that IBM Security QRadar SOAR uses for search. This document describes how to identify and resolve this kind of problem.

Symptom

If search returns errors to the UI, you see error messages in the logs, or IBM Security QRadar SOAR needs to be restarted, it is worth checking whether the indices are corrupted.

Diagnosing The Problem

The two files that are useful in troubleshooting are:
  • /usr/share/co3/logs/client.log
  • /var/log/elasticsearch/elasticsearch.log
You might also find it useful to look at the historical logs in /usr/share/co3/logs/daily and /var/log/elasticsearch, which are compressed, renamed, and dated.
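As a quick way to check for corruption or memory problems, you can search the current and historical logs for the relevant exception names. This is a sketch only; it assumes the rotated logs are gzip-compressed and that the paths above apply to your deployment.
# current Elasticsearch log
grep -iE 'CorruptIndexException|OutOfMemoryError' /var/log/elasticsearch/elasticsearch.log
# rotated (compressed) Elasticsearch logs
zgrep -iE 'CorruptIndexException|OutOfMemoryError' /var/log/elasticsearch/*.gz
# SOAR client log; repeat against the dated copies under /usr/share/co3/logs/daily
grep -iE 'CorruptIndexException|OutOfMemoryError' /usr/share/co3/logs/client.log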
On start-up of Elasticsearch, you might see the following error.
[2020-05-18T20:16:51,653][INFO ][o.e.n.Node               ] [jtlm4Nv] starting ...
[2020-05-18T20:16:51,827][INFO ][o.e.t.TransportService   ] [jtlm4Nv] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2020-05-18T20:16:54,904][INFO ][o.e.c.s.MasterService    ] [jtlm4Nv] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}
[2020-05-18T20:16:54,909][INFO ][o.e.c.s.ClusterApplierService] [jtlm4Nv] new_master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}, reason: apply cluster state (from master [master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2020-05-18T20:16:54,945][INFO ][o.e.h.n.Netty4HttpServerTransport] [jtlm4Nv] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2020-05-18T20:16:54,946][INFO ][o.e.n.Node               ] [jtlm4Nv] started
[2020-05-18T20:16:56,323][INFO ][o.e.g.GatewayService     ] [jtlm4Nv] recovered [27] indices into cluster_state
[2020-05-18T20:17:05,127][WARN ][o.e.i.c.IndicesClusterStateService] [jtlm4Nv] [[attachment][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [attachment][3]: Recovery failed on {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}
    at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2043) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.IndexShard$$Lambda$1839.0000000044018070.run(Unknown Source) ~[?:?]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4.jar:6.2.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) [?:1.8.0]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:1.8.0]
    at java.lang.Thread.run(Thread.java:812) [?:2.9 (09-15-2018)]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
    at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:413) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:94) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.StoreRecovery$$Lambda$1840.000000004802D830.run(Unknown Source) ~[?:?]
    at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:300) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:92) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1607) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2039) ~[elasticsearch-6.2.4.jar:6.2.4]
    ... 5 more
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=573579808 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/C0V55tlzQYO6B524ALQ0wA/3/translog/translog.ckp")))
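To confirm which indices and shards are affected, you can query the Elasticsearch REST API directly. A sketch, assuming Elasticsearch is bound to 127.0.0.1:9200 as in the log output above and that no additional authentication is configured:
# overall cluster status (red indicates one or more unassigned primary shards)
curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'
# per-index health
curl -s 'http://127.0.0.1:9200/_cat/indices?v'
# shards that failed to recover show up as UNASSIGNED
curl -s 'http://127.0.0.1:9200/_cat/shards?v' | grep -i unassigned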
These errors might also appear at other times, not solely at start-up of Elasticsearch. For example, indexing requests fail while the primary shard of the corrupted index is not active:
[2020-05-19T04:55:25,815][WARN ][r.suppressed             ] path: /attachment/_doc/349608, params: {index=attachment, id=349608, type=_doc}
org.elasticsearch.action.UnavailableShardsException: [attachment][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[attachment][3]] containing [index {[attachment][_doc][349608], source[{"inc_id": 315669, "task_id":4844153, "org_id": 201, "inc_create_date": "2020-05-19T02:49:06.186+0000", "attachment": {"content_type":"image/png","size":22050,"created":{"date":"2020-05-19T02:54:25.368+0000"},"name":"Offense summary.png","creator_id":{"mail":"xxxxx","name":"xxxx"}}, "source_data": {"actions":[],"content_type":"image/png","created":1589856865368,"creator_id":{"display_name":"xxxxx","id":55,"name":"xxxxx","type":"user"},"id":349608,"inc_id":315669,"inc_name":"xxxx","inc_owner":{"display_name":"xxxx","id":55,"name":"xxxx","type":"user"},"name":"Offense summary.png","size":22050,"task_at_id":{"id":116,"name":"perform_investigation"},"task_custom":true,"task_id":4844153,"task_members":null,"task_name":"Perform Investigation","type":"task","uuid":"fd831091-87fb-4a0a-952f-368e8a266b4b","vers":12}}]}]]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:944) [elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:781) [elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:734) [elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
You might see similar errors in the client.log.
The indices do not become corrupted without a cause. To find that cause, review the historical elasticsearch.log and client.log files; in many instances, an OutOfMemoryError is the cause of the corruption.
When Elasticsearch exhausts the memory assigned to it, the following error appears in the elasticsearch.log and Elasticsearch is left in an unstable state.
[2020-05-14T20:22:57,080][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[jtlm4Nv][search][T#6]], exiting
java.lang.OutOfMemoryError: Java heap space
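Before resolving the problem, it can be useful to note how much heap Elasticsearch is currently allowed, because the OutOfMemoryError above means that this allocation was exhausted. A quick check, assuming the default configuration file location used in the next section:
# show the minimum (-Xms) and maximum (-Xmx) heap currently configured
grep -E '^-Xm[sx]' /etc/elasticsearch/jvm.options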

Resolving The Problem

When the indices are corrupted, they need to be rebuilt and, in this case, the memory assigned to Elasticsearch needs to be increased.
The following describes the actions you can undertake:
  • Increase the memory as detailed in How to increase the Java heap size of Elasticsearch used by IBM Resilient by amending /etc/elasticsearch/jvm.options (an example of the heap settings is shown after this list)
  • Before you stop IBM Resilient, run
    sudo resutil configset -key elastic_server.init_schema -bvalue true
    On restart, this initiates a reindex of all your data.
  • Stop IBM Resilient by running
    sudo systemctl stop resilient
  • Stop Elasticsearch by running
    sudo systemctl stop elasticsearch
  • Start IBM Resilient and Elasticsearch by running
    sudo systemctl start resilient
  • Follow the guidelines in How to increase the Java heap size of Elasticsearch used by IBM Resilient to check that the memory is set correctly.
  • Tail /usr/share/co3/logs/client.log, looking for the following messages, which indicate that the indices are rebuilt
16:18:01.211 [Thread-12] INFO  com.co3.search.ElasticSearchReindexer - beginning population of ElasticSearch indexes...
16:18:04.453 [Thread-12] INFO  com.co3.search.ElasticSearchReindexer - 100% complete
16:18:04.458 [Thread-12] INFO  com.co3.search.ElasticSearchReindexer - ElasticSearch indexes have been fully populated
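For reference, the Elasticsearch heap is set by the -Xms and -Xmx lines in /etc/elasticsearch/jvm.options, as mentioned in the first step. The values below are placeholders only; size the heap according to How to increase the Java heap size of Elasticsearch used by IBM Resilient and the memory available on your appliance, keeping the two values equal.
# /etc/elasticsearch/jvm.options (excerpt; example values only)
-Xms2g
-Xmx2g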

The length of time to rebuild the indices depends on the amount of data. During the reindex, search might not work as expected.
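Once the reindexer reports that the indexes are fully populated, you can optionally confirm that the cluster is healthy again. A sketch, again assuming Elasticsearch is bound to 127.0.0.1:9200:
curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'
A status other than red indicates that all primary shards are active again; yellow can be normal on a single-node deployment because replica shards cannot be allocated.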

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB24","label":"Security Software"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSA230","label":"IBM Security QRadar SOAR"},"ARM Category":[{"code":"a8m0z0000001grPAAQ","label":"Resilient Core-\u003ESearch"}],"ARM Case Number":"TS003720163","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
19 June 2024

UID

ibm16211016