APAR status
Closed as program error.
Error description
When Java batch queues the dispatch request, the expected response may not be sent back in a timely manner under load. The dispatch request then times out and the job is dispatched again, even though it was actually dispatched successfully the first time. The re-dispatch may result in exceptions like those below because the log directory already exists:

1) CWLRB3860W: "04/15/21 22:45:31:135 EDT" Job "<JobId>" ended abnormally "and is restartable".
2) java.lang.Exception: Job log part already exists
   at com.ibm.ws.gridcontainer.services.impl.JobLogManagerImpl$JobLogWriter._openCurrentLogPartFile(JobLogManagerImpl.java:1289)

The reporting customer also saw exceptions raised by their batch job application when a dataset the job needed was already gone or modified because the job had already run or was already running:

com.ibm.batch.api.BatchContainerApplicationException: CWLRB2240E: "Grid Execution Environment step setup open Batch Data Stream failed" "jobid <JobId>": <AppExceptionClass>: <AppExceptionMessage>
Local fix
N/A
Problem summary
****************************************************************
* USERS AFFECTED:  All users of IBM WebSphere Application      *
*                  Server Java Batch                           *
****************************************************************
* PROBLEM DESCRIPTION: WebSphere Java Batch job dispatch       *
*                      requests could be dispatched twice      *
*                      under load                              *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************

When the Java batch Job Scheduler sends a job dispatch request via HTTP to the batch endpoint servlet, the request is typically processed by the Channel Framework and queued to WLM, and a response is returned to the Job Scheduler dispatch client relatively quickly so that it can continue on to the next job to dispatch. Rarely, under heavy load, the expected response may not be sent back within 30 seconds, causing the dispatch request to time out with an error like the following:

Retry attempt#0 failed: caught exception during http POST: java.net.SocketTimeoutException: Read timed out

The Channel Framework does eventually finish queuing the original dispatch, but because of the timeout the Job Scheduler retries the dispatch of the same job. If the re-dispatch ends up on the same endpoint, it may result in an exception because the job log directory already exists, along with a CWLRB5815E message:

Job nnn cannot be dispatched when it is in [executing or submitted] state
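The timeout-then-retry sequence described above can be reproduced in miniature. The following is a minimal sketch, not WebSphere code: the class name DispatchTimeoutSketch, the method runScenario, the /dispatch path, and the delay values are all hypothetical and chosen only for illustration. A local HTTP endpoint counts each dispatch it accepts and then responds slowly; a client posts with a read timeout and retries when that timeout expires.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in the product.
class DispatchTimeoutSketch {

    /** Returns how many dispatch requests the slow endpoint actually accepted. */
    static int runScenario(long endpointDelayMs, int readTimeoutMs, int maxAttempts)
            throws Exception {
        AtomicInteger received = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.setExecutor(pool);
        server.createContext("/dispatch", exchange -> {
            // The endpoint accepts the dispatch even if the client later gives up.
            received.incrementAndGet();
            try {
                Thread.sleep(endpointDelayMs); // simulated queuing delay under load
                byte[] body = "queued".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                // The client already timed out; the dispatch was still counted.
            } finally {
                exchange.close();
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            URL url = URI.create("http://127.0.0.1:" + port + "/dispatch").toURL();
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                conn.setConnectTimeout(2000);
                conn.setReadTimeout(readTimeoutMs); // plays the role of the dispatch wait timeout
                try {
                    try (OutputStream out = conn.getOutputStream()) {
                        out.write("jobid=1".getBytes(StandardCharsets.UTF_8));
                    }
                    if (conn.getResponseCode() == 200) {
                        break; // response arrived in time, no retry needed
                    }
                } catch (SocketTimeoutException e) {
                    // No response within the timeout: the client retries the same job.
                    System.out.println("Retry attempt#" + attempt + " failed: " + e.getMessage());
                } finally {
                    conn.disconnect();
                }
            }
        } finally {
            server.stop(1);
            pool.shutdownNow();
        }
        return received.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("short timeout, dispatches received: " + runScenario(600, 200, 2));
        System.out.println("long timeout, dispatches received: " + runScenario(100, 2000, 2));
    }
}
```

With a read timeout shorter than the endpoint delay, the endpoint should receive the same job twice; with a timeout longer than the delay, it should receive it once. This is why waiting longer for the response avoids the duplicate dispatch.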
Problem conclusion
A code update has been made to add a new Job Scheduler custom property, job.dispatch.wait.timeout, with a default value of 60000 milliseconds. The property value can be increased to wait longer before timing out the dispatch request. A documentation update has been made to add this new property to the Job scheduler custom properties documentation.

The fix for this APAR is targeted for inclusion in fix packs 8.5.5.22 and 9.0.5.12. For more information, see 'Recommended Updates for WebSphere Application Server': https://www.ibm.com/support/pages/node/715553
Temporary fix
Comments
APAR Information
APAR number
PH39030
Reported component name
WEBSPHERE FOR Z
Reported component ID
5655I3500
Reported release
850
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-07-15
Closed date
2022-03-18
Last modified date
2022-03-18
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
WEBSPHERE FOR Z
Fixed component ID
5655I3500
Applicable component levels
Document Information
Modified date:
19 March 2022