Memory Allocation Troubles (of Any Kind)

Linux Out Of Memory Killer (OOM Killer)

Problem

In a Linux environment, the OOM Killer is often enabled. It terminates processes when the system is running low on memory.

Symptoms

Check /var/log/messages for OOM Killer activity involving the ID of the terminated process, or search for oom-kill to find messages similar to the following.

Apr 20 04:50:31 mntplulp kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1079183.slice/session-728522.scope,task=java,pid=3410420,uid=3003095
Apr 20 04:50:31 mntplulp kernel: Out of memory: Killed process 3410420 (java) total-vm:102938344kB, anon-rss:83624480kB, file-rss:0kB, shmem-rss:0kB, UID:3003095 pgtables:170432kB oom_score_adj:0
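The search described above can be sketched in Python. This is a minimal illustration, not a supported tool: the function name find_oom_kills and the regular expression are assumptions based only on the sample message shown here, and other kernel versions may format the line differently.

```python
import re

# Matches the kernel summary line "Out of memory: Killed process <pid> (<name>) ...".
# The pattern is derived from the sample message above and is an assumption.
KILLED_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return (pid, process name, anon-rss in GiB) for each OOM kill found."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            anon_rss_gib = int(match.group("anon_rss_kb")) / (1024 * 1024)
            kills.append(
                (int(match.group("pid")), match.group("name"), round(anon_rss_gib, 1))
            )
    return kills


sample = [
    "Apr 20 04:50:31 mntplulp kernel: Out of memory: Killed process 3410420 (java) "
    "total-vm:102938344kB, anon-rss:83624480kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:3003095 pgtables:170432kB oom_score_adj:0",
]
print(find_oom_kills(sample))
```

To scan a real log, pass an open file object for /var/log/messages instead of the sample list. Reading that file usually requires elevated privileges.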

Solution

Verify that the combined maximum memory allocation for all IBM Manta Data Lineage components does not exceed the physical memory of the server. Follow the recommendations in Manta Flow Memory Settings.
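The check above is simple arithmetic, sketched below in Python. The component names and sizes are hypothetical placeholders, not actual IBM Manta Data Lineage defaults. Take the real values from your own configuration and from Manta Flow Memory Settings.

```python
import os


def physical_memory_mb():
    """Physical memory of this server in MB (Linux only, via sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // (1024 * 1024)


def check_memory_budget(component_max_mb, physical_mb):
    """Return (combined maximum in MB, True if it fits into physical memory)."""
    total_mb = sum(component_max_mb.values())
    return total_mb, total_mb <= physical_mb


# Hypothetical maximum memory settings per component, in MB.
components = {"flow_server": 16384, "cli": 8192, "admin_ui": 2048}

total_mb, fits = check_memory_budget(components, physical_mb=32768)
print(f"combined maximum: {total_mb} MB, fits into physical memory: {fits}")
```

On the server itself, physical_memory_mb() can supply the second argument. Other processes and the operating system also need memory, so a combined maximum that only just fits is still a risk.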

Neo4j: MemoryAllocationLimitException

Problem

The Neo4j database off-heap transaction state memory limit has been exceeded (releases prior to R42.1).

Symptoms

The following exception occurs.

org.neo4j.kernel.impl.util.collection.MemoryAllocationLimitException: Can't allocate extra 512 bytes due to exceeding memory limit; used=2147483392, max=2147483648
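The numbers in the message show how close the limit was. The Python sketch below extracts them. The function name explain_limit and the regular expression are illustrative assumptions based on the message format shown above.

```python
import re

# Pattern derived from the sample exception message above (an assumption).
LIMIT_RE = re.compile(r"allocate extra (\d+) bytes.*used=(\d+), max=(\d+)")


def explain_limit(message):
    """Break a MemoryAllocationLimitException message into readable numbers."""
    match = LIMIT_RE.search(message)
    if match is None:
        return None
    requested, used, limit = (int(group) for group in match.groups())
    return {
        "limit_gib": limit / 2**30,
        "free_bytes": limit - used,
        "requested_bytes": requested,
        "short_by_bytes": requested - (limit - used),
    }


message = (
    "org.neo4j.kernel.impl.util.collection.MemoryAllocationLimitException: "
    "Can't allocate extra 512 bytes due to exceeding memory limit; "
    "used=2147483392, max=2147483648"
)
print(explain_limit(message))
```

For the message above, the limit is 2 GiB (2147483648 bytes), only 256 bytes were still free, and the 512-byte request exceeded that by 256 bytes.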

Solution

In Manta Admin GUI, go to Configuration > Server > Common > Neo4j Configuration > Advanced and increase the value of dbms.tx_state.memory_allocation. Restart Flow Server to apply the change.

Additional Information

The off-heap transaction state memory is the total amount of memory shared across all active transactions. This exception often occurs when an individual script being stored has an extremely large lineage (a merger CSV request larger than 0.5 GB).

As of R42.1, the transaction state memory has moved to on-heap, because off-heap has been deprecated by Neo4j. If Flow Server runs out of memory, the usual Java heap space exception is thrown.

More details are available in the Neo4j Guide (Transaction-State-Memory).

Neo4j: DatabaseNotFoundException + quarantine_marker

Problem

If Flow Server crashes during an extensive Neo4j write operation (Analysis, Delete) due to a heap OOM exception, Neo4j may mark the database as quarantined to prevent further corruption until the problem has been investigated (since R42).

Symptoms

The following exception occurs.

Caused by: org.neo4j.dbms.api.DatabaseNotFoundException: neo4j

The top-level exception is likely to look misleading, because its message is just neo4j. Scroll down to the bottom-most exception in the stack trace to confirm this.
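Finding the bottom-most exception can be sketched in Python as follows. The sample stack trace is made up for illustration: only its last Caused by line comes from this article, and the wrapper exceptions and frames are hypothetical.

```python
def root_cause(stack_trace):
    """Return the last 'Caused by:' line of a Java stack trace, or None."""
    causes = [
        line.strip()
        for line in stack_trace.splitlines()
        if line.strip().startswith("Caused by:")
    ]
    return causes[-1] if causes else None


# Hypothetical trace; real traces contain different wrappers and frames.
sample_trace = """\
java.lang.RuntimeException: neo4j
    at example.Hypothetical.outer(Hypothetical.java:10)
Caused by: java.lang.IllegalStateException: wrapper
    at example.Hypothetical.inner(Hypothetical.java:20)
Caused by: org.neo4j.dbms.api.DatabaseNotFoundException: neo4j
    at example.Hypothetical.lookup(Hypothetical.java:30)
"""
print(root_cause(sample_trace))
```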

The first occurrence of the exception is likely preceded by a heap OOM exception.

There is a binary file at <MANTA_HOME>/server/manta-dataflow-server-dir/data/neo4j/data/databases/neo4j/quarantine_marker containing the quarantine reason "Automatic quarantine because of panic: Java heap space".
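The Python sketch below checks for the marker file and reads the reason from it. The exact binary layout of the file is not documented here, so the sketch assumes that the reason is stored as readable ASCII text and simply extracts the printable runs. The function name quarantine_reason is illustrative, and the demonstration uses a simulated marker in a temporary directory rather than a real installation.

```python
import re
import tempfile
from pathlib import Path

# Path of the marker relative to <MANTA_HOME>, as given above.
MARKER_RELATIVE = Path(
    "server/manta-dataflow-server-dir/data/neo4j/data/databases/neo4j/quarantine_marker"
)


def quarantine_reason(manta_home):
    """Return the readable text in the quarantine marker, or None if it is absent."""
    marker = Path(manta_home) / MARKER_RELATIVE
    if not marker.is_file():
        return None
    # Assumption: the reason is plain ASCII embedded in the binary file.
    runs = re.findall(rb"[\x20-\x7e]{4,}", marker.read_bytes())
    return " ".join(run.decode("ascii") for run in runs)


# Demonstration with a simulated marker file; the leading bytes are invented.
with tempfile.TemporaryDirectory() as fake_home:
    missing = quarantine_reason(fake_home)
    marker_path = Path(fake_home) / MARKER_RELATIVE
    marker_path.parent.mkdir(parents=True)
    marker_path.write_bytes(
        b"\x00\x01\x00\x05Automatic quarantine because of panic: Java heap space"
    )
    reason = quarantine_reason(fake_home)

print(missing)
print(reason)
```

On a real server, pass the actual <MANTA_HOME> directory. The function only reads the file and does not delete it.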

Restarting Flow Server doesn’t fix the exception.

Solution

Make sure that the Flow Server maximum heap setting is configured as expected in <MANTA_HOME>/conf/manta.properties. In particular, confirm that the value was migrated during the upgrade from an earlier version. Increase it if necessary.

Delete the file <MANTA_HOME>/server/manta-dataflow-server-dir/data/neo4j/data/databases/neo4j/quarantine_marker.

Restart Flow Server to apply the updated memory settings and to confirm that removing the quarantine file unblocks the database.

Additionally, if the heap OOM exception was triggered during a delete scenario (Rollback Revision Scenario, Prune Revision Scenario, Delete Last Committed Revision Scenario), the database is now in an inconsistent state. The revision appears to be deleted, but its data is still there.

Create a new revision of the same type as the deleted revision (major or minor), confirm that it received the same revision number, and commit it with no data. The original data from before the deletion is expected to appear in it again. With the updated memory settings in place, repeat the deletion to remove the revision for good.

Additional Information

More details about Neo4j quarantine status are available at https://neo4j.com/docs/operations-manual/current/database-administration/standard-databases/errors/#quarantine.