Troubleshooting
Problem
An OutOfMemoryError in Elasticsearch can corrupt the indices that IBM Security QRadar SOAR uses for search. This document describes how to identify and resolve this kind of problem.
Symptom
If search returns error messages to the UI, error messages appear in the logs, or IBM Security QRadar SOAR needs to be restarted, it is worth checking whether the indices are corrupted.
Diagnosing The Problem
The two files that are useful in troubleshooting are:
- /usr/share/co3/logs/client.log
- /var/log/elasticsearch/elasticsearch.log
You might also find it useful to look at the historical logs in /usr/share/co3/logs/daily and /var/log/elasticsearch, which are compressed, renamed, and dated.
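As a quick first check (a hedged sketch; the paths are the defaults named above and might differ on your system), you can search the live and rotated logs for OutOfMemoryError directly:

```shell
# Search the live logs for OutOfMemoryError; paths are the defaults
# described above and might differ on your system.
grep -l 'OutOfMemoryError' /var/log/elasticsearch/elasticsearch.log \
    /usr/share/co3/logs/client.log 2>/dev/null || true

# zgrep reads the compressed, rotated logs without unpacking them first.
zgrep -l 'OutOfMemoryError' /var/log/elasticsearch/*.gz \
    /usr/share/co3/logs/daily/*.gz 2>/dev/null || true
```

Any file name printed by either command is a candidate for closer inspection.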
On start-up of Elasticsearch, you might see the following error.
[2020-05-18T20:16:51,653][INFO ][o.e.n.Node ] [jtlm4Nv] starting ...
[2020-05-18T20:16:51,827][INFO ][o.e.t.TransportService ] [jtlm4Nv] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2020-05-18T20:16:54,904][INFO ][o.e.c.s.MasterService ] [jtlm4Nv] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}
[2020-05-18T20:16:54,909][INFO ][o.e.c.s.ClusterApplierService] [jtlm4Nv] new_master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}, reason: apply cluster state (from master [master {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2020-05-18T20:16:54,945][INFO ][o.e.h.n.Netty4HttpServerTransport] [jtlm4Nv] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2020-05-18T20:16:54,946][INFO ][o.e.n.Node ] [jtlm4Nv] started
[2020-05-18T20:16:56,323][INFO ][o.e.g.GatewayService ] [jtlm4Nv] recovered [27] indices into cluster_state
[2020-05-18T20:17:05,127][WARN ][o.e.i.c.IndicesClusterStateService] [jtlm4Nv] [[attachment][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [attachment][3]: Recovery failed on {jtlm4Nv}{jtlm4NvNSXub1gAvQZbLCA}{AtbMZN4dQ8OU92Qq1zIMbw}{127.0.0.1}{127.0.0.1:9300}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2043) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.IndexShard$$Lambda$1839.0000000044018070.run(Unknown Source) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) [?:1.8.0]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:1.8.0]
at java.lang.Thread.run(Thread.java:812) [?:2.9 (09-15-2018)]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:413) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:94) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.StoreRecovery$$Lambda$1840.000000004802D830.run(Unknown Source) ~[?:?]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:300) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:92) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1607) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2039) ~[elasticsearch-6.2.4.jar:6.2.4]
... 5 more
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=573579808 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/C0V55tlzQYO6B524ALQ0wA/3/translog/translog.ckp")))
Errors such as the following might also appear at other times, not solely at start-up of Elasticsearch.
[2020-05-19T04:55:25,815][WARN ][r.suppressed ] path: /attachment/_doc/349608, params: {index=attachment, id=349608, type=_doc}
org.elasticsearch.action.UnavailableShardsException: [attachment][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[attachment][3]] containing [index {[attachment][_doc][349608], source[{"inc_id": 315669, "task_id":4844153, "org_id": 201, "inc_create_date": "2020-05-19T02:49:06.186+0000", "attachment": {"content_type":"image/png","size":22050,"created":{"date":"2020-05-19T02:54:25.368+0000"},"name":"Offense summary.png","creator_id":{"mail":"xxxxx","name":"xxxx"}}, "source_data": {"actions":[],"content_type":"image/png","created":1589856865368,"creator_id":{"display_name":"xxxxx","id":55,"name":"xxxxx","type":"user"},"id":349608,"inc_id":315669,"inc_name":"xxxx","inc_owner":{"display_name":"xxxx","id":55,"name":"xxxx","type":"user"},"name":"Offense summary.png","size":22050,"task_at_id":{"id":116,"name":"perform_investigation"},"task_custom":true,"task_id":4844153,"task_members":null,"task_name":"Perform Investigation","type":"task","uuid":"fd831091-87fb-4a0a-952f-368e8a266b4b","vers":12}}]}]]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:944) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:781) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:734) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
You might also see similar errors in the client.log.
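To confirm whether a shard failure like the one above has left indices unhealthy, you can query Elasticsearch directly (a sketch assuming the default local endpoint on port 9200; it only works while Elasticsearch is running):

```shell
# Overall cluster status: "red" means at least one primary shard
# (such as [attachment][3] in the log above) is not allocated.
curl -s 'http://localhost:9200/_cluster/health?pretty' || true

# List only the unhealthy indices, if any.
curl -s 'http://localhost:9200/_cat/indices?v&health=red' || true
```

An empty result from the second command means no index is currently in the red state.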
Indices do not become corrupted without a cause. Finding that cause might involve reviewing the historical elasticsearch.log and client.log files. In many instances, an OutOfMemoryError is the cause of the corruption.
When Elasticsearch exhausts the memory assigned to it, the following is seen in the elasticsearch.log and Elasticsearch is left in an unstable state.
[2020-05-14T20:22:57,080][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[jtlm4Nv][search][T#6]], exiting
java.lang.OutOfMemoryError: Java heap space
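When heap exhaustion is the suspected cause, it is worth checking how much heap Elasticsearch is currently allowed (a sketch; /etc/elasticsearch/jvm.options is the default location referenced later in this document):

```shell
# Show the configured minimum (-Xms) and maximum (-Xmx) heap sizes.
grep -E '^-Xm[sx]' /etc/elasticsearch/jvm.options 2>/dev/null || true
```

If the reported maximum is small relative to the data volume, increasing it as described below is the likely fix.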
Resolving The Problem
When the indices are corrupted, they need to be rebuilt and, in this case, the memory assigned to Elasticsearch increased.
The following describes the actions you can undertake:
- Increase the memory as detailed in How to increase the Java heap size of Elasticsearch used by IBM Resilient by amending /etc/elasticsearch/jvm.options
- Before you stop IBM Resilient, run
sudo resutil configset -key elastic_server.init_schema -bvalue true
- Stop IBM Resilient by running
sudo systemctl stop resilient
- Stop Elasticsearch by running
sudo systemctl stop elasticsearch
- Start IBM Resilient and Elasticsearch by running
sudo systemctl start resilient
- Follow the guidelines in How to increase the Java heap size of Elasticsearch used by IBM Resilient to check that the memory is set correctly.
- Tail /usr/share/co3/logs/client.log looking for the following, which indicates that the index rebuild is complete
16:18:01.211 [Thread-12] INFO com.co3.search.ElasticSearchReindexer - beginning population of ElasticSearch indexes...
16:18:04.453 [Thread-12] INFO com.co3.search.ElasticSearchReindexer - 100% complete
16:18:04.458 [Thread-12] INFO com.co3.search.ElasticSearchReindexer - ElasticSearch indexes have been fully populated
The length of time to rebuild the index depends on the amount of data. During the reindex, search might not work as expected.
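The monitoring step above can be narrowed to just the reindexer's progress lines (a sketch; the path is the default named earlier, and `tail -F` keeps following the file across log rotation):

```shell
# One-off check of the reindexer progress recorded so far.
grep 'ElasticSearchReindexer' /usr/share/co3/logs/client.log 2>/dev/null || true

# Or follow it live until the "fully populated" message appears:
#   tail -F /usr/share/co3/logs/client.log | grep --line-buffered 'ElasticSearchReindexer'
```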
Document Location
Worldwide
[{"Type":"MASTER","Line of Business":{"code":"LOB24","label":"Security Software"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSA230","label":"IBM Security QRadar SOAR"},"ARM Category":[{"code":"a8m0z0000001grPAAQ","label":"Resilient Core-\u003ESearch"}],"ARM Case Number":"TS003720163","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]
Document Information
Modified date:
19 June 2024
UID
ibm16211016