IBM Support

Spark task lost and failed due to timeout

Troubleshooting


Problem

Spark task lost and failed due to timeout

Symptom

The Spark job fails with a task timeout. The Spark driver log contains the following messages:

19/10/31 18:31:53 INFO TaskSetManager: Starting task 823.0 in stage 2.0 (TID 1116, <hostname>, executor 3-46246ed5-2297-4a85-a088-e133fa202c6b, partition 823, PROCESS_LOCAL, 8509 bytes)

19/10/31 18:32:07 INFO TaskSetManager: [task] [failed] taskName:823.0 taskId:1116 stageId:2.0 executorId:3-46246ed5-2297-4a85-a088-e133fa202c6b

19/10/31 18:32:07 WARN TaskSetManager: Lost task 823.0 in stage 2.0 (TID 1116, <hostname>, executor 3-46246ed5-2297-4a85-a088-e133fa202c6b): ExecutorLostFailure (executor 3-46246ed5-2297-4a85-a088-e133fa202c6b exited caused by one of the running tasks) Reason: remote Rpc client disassociated

Cause

By default, each executor sends a heartbeat to the driver every 10 seconds. This interval is set by spark.executor.heartbeatInterval. Under high network traffic, the driver may not receive an executor's heartbeats in time. It then considers the executor lost and marks the tasks running on it as failed.

Resolving The Problem

Increase the spark.executor.heartbeatInterval value to tolerate network latency on a busy network. Keep the value well below spark.network.timeout.
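
As a minimal sketch of applying this change, the Python helper below builds spark-submit --conf arguments for the heartbeat setting, which Apache Spark's configuration reference names spark.executor.heartbeatInterval. It also checks that the heartbeat stays below spark.network.timeout. The helper names (to_seconds, heartbeat_conf_args) and the example values 30s and 300s are illustrative assumptions, not values recommended by this document, and only whole-number durations with an explicit unit suffix are handled.

```python
import re
from typing import List

# Seconds per unit for a subset of Spark duration suffixes.
# Assumption: whole numbers with an explicit suffix only.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "min": 60, "h": 3600}


def to_seconds(value: str) -> float:
    """Convert a Spark-style duration such as '30s' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|min|m|h)", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]


def heartbeat_conf_args(heartbeat: str, network_timeout: str) -> List[str]:
    """Return spark-submit --conf arguments for the two settings.

    Raises ValueError when the heartbeat interval is not below the
    network timeout.
    """
    if to_seconds(heartbeat) >= to_seconds(network_timeout):
        raise ValueError(
            "spark.executor.heartbeatInterval must be less than "
            "spark.network.timeout"
        )
    return [
        "--conf", f"spark.executor.heartbeatInterval={heartbeat}",
        "--conf", f"spark.network.timeout={network_timeout}",
    ]


if __name__ == "__main__":
    # Hypothetical values for a busy network; tune them for your cluster.
    print(" ".join(heartbeat_conf_args("30s", "300s")))
```

The resulting arguments can be appended to an existing spark-submit command line. The same two properties could instead be set in spark-defaults.conf.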

Document Location

Worldwide

Product: IBM Spectrum Conductor (All Versions), Linux

Document Information

Modified date:
24 December 2019

UID

ibm11163848