IBM Support

QRadar SOAR: Disaster Recovery playbook stops with "async task did not complete within the requested time"

Troubleshooting


Problem

When you run a QRadar SOAR Ansible Disaster Recovery (DR) playbook, for example to enable DR, the playbook might stop with the error "async task did not complete within the requested time."

Symptom

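The playbook stops running and DR is not enabled. In the following log example, the pg_basebackup task times out, and the next task then fails because the timed-out result has no rc attribute.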
2024-05-15 20:19:43,952 p=18081 u=resadmin |  TASK [pg_receiver : create a basebackup of the primary] *************************************************************************************************************************************************
2024-05-15 20:19:43,952 p=18081 u=resadmin |  task path: /usr/share/resilient-dr/ansible/roles/pg_receiver/tasks/main.yml:276
2024-05-15 20:19:44,027 p=18081 u=resadmin |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
2024-05-15 20:19:45,249 p=18081 u=resadmin |  Escalation succeeded
2024-05-15 20:20:01,457 p=18081 u=resadmin |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/utilities/logic/async_status.py
2024-05-15 20:20:02,376 p=18081 u=resadmin |  Escalation succeeded

....... LINES REMOVED .......

2024-05-15 23:32:49,949 p=18081 u=resadmin |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/utilities/logic/async_status.py
2024-05-15 23:32:50,853 p=18081 u=resadmin |  Escalation succeeded
2024-05-15 23:33:06,051 p=18081 u=resadmin |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/utilities/logic/async_status.py
2024-05-15 23:33:07,082 p=18081 u=resadmin |  Escalation succeeded
2024-05-15 23:33:07,267 p=18081 u=resadmin |  fatal: [<RECEIVER SERVER>]: FAILED! => {
    "changed": false, 
    "msg": "async task did not complete within the requested time"
}
2024-05-15 23:33:07,267 p=18081 u=resadmin |  ...ignoring
2024-05-15 23:33:07,287 p=18081 u=resadmin |  TASK [pg_receiver : Restore database archive on pg_basebackup failure (this task is skipped if no failure occurred or skip_receiver_db_backup is specified)] ********************************************
2024-05-15 23:33:07,287 p=18081 u=resadmin |  task path: /usr/share/resilient-dr/ansible/roles/pg_receiver/tasks/main.yml:287
2024-05-15 23:33:07,315 p=18081 u=resadmin |  fatal: [<RECEIVER SERVER>]: FAILED! => {
    "msg": "The conditional check 'pg_basebackup_result.rc != 0' failed. The error was: error while evaluating conditional (pg_basebackup_result.rc != 0): 'dict object' has no attribute 'rc'\n\nThe error appears to have been in '/usr/share/resilient-dr/ansible/roles/pg_receiver/tasks/main.yml': line 287, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n- block:\n  - name: Restore database archive on pg_basebackup failure (this task is skipped if no failure occurred or skip_receiver_db_backup is specified)\n    ^ here\n"
}
2024-05-15 23:33:07,317 p=18081 u=resadmin |  PLAY RECAP **********************************************************************************************************************************************************************************************
2024-05-15 23:33:07,317 p=18081 u=resadmin |  <MASTER SERVER>         : ok=76   changed=45   unreachable=0    failed=0   
2024-05-15 23:33:07,317 p=18081 u=resadmin |  <RECEIVER SERVER>         : ok=80   changed=27   unreachable=0    failed=1   
2024-05-15 23:33:07,317 p=18081 u=resadmin |  localhost                  : ok=6    changed=0    unreachable=0    failed=0 

Cause

The file /usr/share/resilient-dr/ansible/group_vars/all/vault contains a section that defines how long an asynchronous Ansible task can run and how often Ansible checks whether the task has finished.
# Asynchronous Ansible task ssh timout settings
## This is the maximum amount of time to wait for a single Asynchronous Ansible task to complete (in minutes).
## Currently, this is used for long running tasks such as backing up the receiver db and performing a pg_basebackup of the master db to the receiver.
## Defaults to 180 minutes, though very large databases may require longer, depending on db size and in some cases the network conditions.
vault_vars_maximum_async_wait_in_minutes: 180
## This is the poll interval time (in seconds) specifying how long to wait between each new ssh connection to check if the Asynchronous task has completed.
## Defaults to 15 seconds. this means a new ssh connection will be established every 15 seconds to check for task completion until the timeout value is
## reached or the task itself completes.
vault_vars_poll_interval_in_seconds: 15
By default, the maximum wait is 180 minutes. When a task is still running after this period, Ansible returns "async task did not complete within the requested time" and the playbook stops.
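As a rough illustration of how the two settings relate, the following sketch computes how many poll intervals fit inside the maximum wait. The function name is hypothetical and the calculation is only arithmetic on the two default values shown in the vault file; it is not taken from the playbook itself.

```python
# Default values copied from the vault file excerpt.
maximum_async_wait_in_minutes = 180
poll_interval_in_seconds = 15


def max_status_checks(wait_minutes: int, poll_seconds: int) -> int:
    """Return how many poll intervals fit inside the maximum wait (hypothetical helper)."""
    return (wait_minutes * 60) // poll_seconds


checks = max_status_checks(maximum_async_wait_in_minutes, poll_interval_in_seconds)
print(checks)  # 720
```

With the defaults, the task is checked roughly 720 times before the timeout is reported. Each check also takes some time to connect, so the elapsed wall-clock time in the log can be somewhat longer than the configured number of minutes.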

Environment

This might happen on servers that have a large database, because tasks that back up the database take longer to finish.

Diagnosing The Problem

Review /usr/share/resilient-dr/ansible/files/logs/resilient-dr-ansible.log and identify the time that the failing task starts and the time that Ansible stops it.
Review /usr/share/resilient-dr/ansible/group_vars/all/vault and check the value of vault_vars_maximum_async_wait_in_minutes. If the timeout is the cause, the duration of the failing task closely matches the value in the vault file.
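The comparison can be sketched in a few lines of Python. This is a minimal example that assumes the log timestamp format shown under Symptom; the helper names are hypothetical, and the two sample lines are shortened copies of the TASK line and the fatal line from that log.

```python
from datetime import datetime

# Timestamp layout used at the start of each log line, e.g. "2024-05-15 20:19:43,952".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line: str) -> datetime:
    """Parse the 23-character timestamp at the start of a log line."""
    return datetime.strptime(line[:23], LOG_TIME_FORMAT)


def elapsed_minutes(start_line: str, end_line: str) -> float:
    """Return the minutes between the timestamps of two log lines."""
    delta = parse_log_time(end_line) - parse_log_time(start_line)
    return delta.total_seconds() / 60


start_line = "2024-05-15 20:19:43,952 p=18081 u=resadmin |  TASK [pg_receiver : create a basebackup of the primary]"
end_line = "2024-05-15 23:33:07,267 p=18081 u=resadmin |  fatal: [<RECEIVER SERVER>]: FAILED! => {"

print(round(elapsed_minutes(start_line, end_line), 1))  # 193.4
```

In this example the task ran for about 193 minutes, which is close to the configured 180 minutes and points to the asynchronous wait limit as the cause.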

Resolving The Problem

In /usr/share/resilient-dr/ansible/group_vars/all/vault, update vault_vars_maximum_async_wait_in_minutes with a value that gives the task enough time to complete.
Run the playbook again.

Document Location

Worldwide

Document Information

Modified date:
17 May 2024

UID

ibm17153648