IBM Support

PH43100: IBM DEVELOPER FOR Z SYSTEMS (IDZ V15.0.1) RSE DAEMON TIMEOUT WAITING FOR SERVER W/SSL=TRUE

A fix is available

APAR status

  • Closed as program error.

Error description

  • With IBM Developer for z Systems (IDz) V15.0.1, the RSE daemon
    times out waiting for the server when SSL=TRUE.
    When attempting to bring up RSED with SSL=TRUE on an LPAR with
    relatively minimal CPU resources, the RSE daemon apparently
    times out while waiting for the server to finish (SSL)
    initialization.
    
    If SSL=FALSE is set in ssl.properties, RSED initializes
    successfully.
    
    This issue occurs during startup of the RSED started task when
    ssl.properties specifies SSL=TRUE.
    
    The initialization of the SSL environment during RSED startup
    takes too long, and the daemon times out waiting for the server
    to finish initializing. As a result, the RSED STC terminates.
    

Local fix

  • A diagnostic TestFix was created to allow the server process to
    complete initialization when SSL=TRUE.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED: 01.All users of a z/OS host system with CPU  *
    *                    resource constraints.                     *
    *                 02.Users who connect using certificate       *
    *                    authentication.                           *
    *                 03.All RSE users.                            *
    *                 04.All RSE and RSE API users when            *
    *                    downloading datasets.                     *
    *                 05.All RSE users.                            *
    *                 06.z/OS Explorer                             *
    *                 07.z/OS Explorer                             *
    *                 08.z/OS Explorer                             *
    *                 09.Users who connect using certificate       *
    *                    authentication.                           *
    ****************************************************************
    * PROBLEM DESCRIPTION: 01.On a CPU-constrained z/OS host       *
    *                         system, ThreadPool may fail to       *
    *                         start up in SSL mode.                *
    *                      02.Customer certificate authentication  *
    *                         succeeds in only 1 out of 6          *
    *                         attempts.                            *
    *                      03.JMON received a PSIRT because it is  *
    *                         an authorized task that binds to a   *
    *                         TCP/IP port and can be started as a  *
    *                         job. RSE API is in the same          *
    *                         situation, so the defect was planned *
    *                         to be resolved to avoid another      *
    *                         PSIRT. It is considered a security   *
    *                         breach because a hacker who has      *
    *                         gained access to the system can      *
    *                         start the authorized server and use  *
    *                         it as an unmonitored entry point,    *
    *                         which gives the hacker more chances  *
    *                         to find a bug in the code and use    *
    *                         RSE API to perform actions with      *
    *                         elevated permissions.                *
    *                      04.When z/OS Explorer downloads dataset *
    *                         members, especially at a high level  *
    *                         of concurrency, the RSE API raw      *
    *                         content download may encounter this  *
    *                         exception:                           *
    * java.lang.NumberFormatException: For input string: " "       *
    *   at java.lang.NumberFormatException.forInputString          *
    *       (NumberFormatException.java:76)                        *
    *   at java.lang.Integer.parseInt(Integer.java:581)            *
    *   at java.lang.Integer.parseInt(Integer.java:627)            *
    *   at com.ibm.ftt.rse.mvs.util.FFSAttributeParser.parse       *
    *       (FFSAttributeParser.java:183)                          *
    *   at com.ibm.rse.rest.adapters.dstore.DStoreMVSFilesAdapter  *
    *       .parseAttributes(DStoreMVSFilesAdapter.java:1941)      *
    *   at com.ibm.rse.rest.adapters.dstore.DStoreMVSFilesAdapter  *
    *       .getMemberAttribute(DStoreMVSFilesAdapter.java:1809)   *
    *   at com.ibm.rse.rest.adapters.dstore.DStoreMVSFilesAdapter  *
    *       .determineResourceForDownload                          *
    *       (DStoreMVSFilesAdapter.java:809)                       *
    *                                                              *
    *                         In this case it came as a result of  *
    *                         calling:                             *
    *   datasets/SYS1.MACLIB(AXREXX)/rawContent                    *
    *                      05.Validation of input from the fgets() *
    *                         call was missing before its use in   *
    *                         some of the RSE core functions.      *
    *                      06.High memory consumption was observed *
    *                         when JESMiner sends data to JMON.    *
    *                      07.Enable JESMiner to use trusted       *
    *                         TCP/IP.                              *
    *                      08.The JESMiner searchPlus command      *
    *                         returns a NullPointerException when  *
    *                         the JMON server is not available.    *
    *                      09.When a leftover user thread in       *
    *                         ThreadPool holds a lock on a file    *
    *                         and a current user of the ThreadPool *
    *                         attempts to query the lock           *
    *                         information for the file, a          *
    *                         NullPointerException (NPE) could     *
    *                         occur. Furthermore, the issue could  *
    *                         trigger a repetition of the query    *
    *                         and could cause an exception and     *
    *                         more leftover threads when the       *
    *                         current user logs off.               *
    ****************************************************************
    01.Due to high CPU consumption of RSE activities during
       startup, especially the ones related to SSL, ThreadPool may
       not be able to compete for the CPU time to complete its
       startup routine within the expected time interval of 10s.
    02.GSK trace shows EWOULDBLOCK when reading the certificate
       within gsk_secure_socket_read() (a single read, as
       originally implemented for the zRSE certificate get).
    03.Since some customers might rely on starting RSE API in
       batch, it is not recommended to allow starting only as an
       STC. Instead, detect whether the RSE API server was started
       other than as a started task and, if so, test a RACF profile
       to see whether starting it as a job under that user ID is
       allowed. This approach is approved by the Z Secure
       Engineering team.
    04.The exception raised when parsing an empty string as an
       integer is now caught, and the record count is reset to -1,
       its initialized value.
    05.As is standard for a software routine, input to some of the
       RSE backend routines needs to be validated to avoid
       malicious use by external callers.
    06.JESMiner calls PrintStream to send data to JMON. Memory
       usage can be improved by sending the data through a buffer
       instead.
    07.JESMiner can use trusted TCP/IP to log on to the JMON server
       with protocol level 12 and up without providing a user name
       and PassTicket.
    08.When the JMON server is not available, the socket, input
       stream, and output stream are closed and set to null.
       However, the message send and read methods used for
       communication between JESMiner and JMON are still called by
       the JESMiner command, which can cause a
       NullPointerException.
    09.The NPE during the lock owner query occurs because the
       ThreadPool could not map the TCB in the lock information to
       any of its current connections.
       The leftover thread might occur when the command is a
       cancelable command and the client attempts to repeat it
       because of the NPE error when the connection is terminated.
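Item 04 above describes catching the parse failure and resetting the record count to -1. A minimal Java sketch of that pattern follows. The class and method names are hypothetical and are not taken from the product source (the actual change lives in FFSAttributeParser, whose code is not shown in this APAR); trimming the input first is also an assumption of this sketch.

```java
// Hypothetical helper: names are illustrative, not the product's API.
final class RecordCountParser {
    /** Initialized value used when no record count could be parsed. */
    static final int UNKNOWN_RECORD_COUNT = -1;

    /**
     * Parses a record-count attribute. Returns -1 when the attribute is
     * null, blank (for example " "), or otherwise not a valid integer.
     */
    static int parseRecordCount(String raw) {
        if (raw == null) {
            return UNKNOWN_RECORD_COUNT;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // A blank attribute lands here instead of failing the download.
            return UNKNOWN_RECORD_COUNT;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRecordCount("120"));
        System.out.println(parseRecordCount(" "));
    }
}
```

With this shape, a caller can treat -1 as "record count not available" and continue with the download rather than propagating the exception.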
    
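Item 08 above describes send and read methods being called after the socket and streams were closed and set to null. The following Java sketch shows one way to return a status instead of dereferencing a null stream. JmonConnection, SendStatus, and the method names are hypothetical and exist only for this illustration; the real JESMiner code is not shown in this APAR.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: class, enum, and method names are illustrative.
final class JmonConnection {
    enum SendStatus { SENT, NOT_CONNECTED, IO_ERROR }

    private OutputStream out; // null once the connection has been closed

    JmonConnection(OutputStream out) {
        this.out = out;
    }

    /** Mirrors the described cleanup: close the stream and set it to null. */
    void close() {
        if (out != null) {
            try {
                out.close();
            } catch (IOException ignored) {
                // Nothing more can be done while closing.
            }
            out = null;
        }
    }

    /** Reports the connection state instead of throwing a NullPointerException. */
    SendStatus send(String message) {
        if (out == null) {
            return SendStatus.NOT_CONNECTED;
        }
        try {
            out.write(message.getBytes(StandardCharsets.UTF_8));
            out.flush();
            return SendStatus.SENT;
        } catch (IOException e) {
            return SendStatus.IO_ERROR;
        }
    }
}
```

A command such as searchPlus could then check for NOT_CONNECTED and report that the JMON server is unavailable.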

Problem conclusion

  • 01.On a system with constrained CPU resources, when starting up
       in SSL mode, ThreadPool may time out when the expiry
       interval of around 10 min elapses. Moving the other
       activities, including the daemon's SSL certificate
       validation and z/OS service startup, ahead of starting the
       ThreadPool (and starting the expiry timer) helps the
       ThreadPool meet its startup deadline.
    02.Per the GSK documentation for gsk_secure_socket_read()
       (https://www.ibm.com/docs/en/zos/2.2.0?topic=reference-gsk-secure-socket-read):
       [GSK_WOULD_BLOCK] A complete SSL record is not available.
       When a socket is in non-blocking mode and a complete SSL
       record is not available, gsk_secure_socket_read() will
       return with GSK_WOULD_BLOCK. No data will be returned in the
       application buffer when GSK_WOULD_BLOCK is returned. The
       application should call gsk_secure_socket_read() again when
       there is data available to be read from the socket.
       The fix is to have the gsk_secure_socket_read() call loop
       (wait for data to be ready, then reread) while it returns
       GSK_WOULD_BLOCK, with a maximum of 3 retries (10 sec timeout
       each).
    03.At RSE API server startup, check whether it was started as a
       job. If it was, check whether the current user ID is allowed
       to access the RACF profile HUH.START.BATCH.jobname.port in
       the FACILITY class. The profile name format has been
       discussed with the RACF team. If the user ID does not have
       proper permission to access the profile, the server exits
       and an error message is logged in the job log.
    04.Parsing should be resilient to the obtained result,
       including in error conditions. The change improves that
       characteristic of the internal parser.
    05.The validation helps to strengthen zRSE product security and
       behavior, especially in error conditions.
    06.The memory consumption is improved by using BufferedWriter
       instead of PrintStream in JESMiner to send data to JMON.
    07.JESMiner can use trusted TCP/IP to log on to the JMON server
       with protocol level 12 and up without providing a user name
       and PassTicket.
    08.The message send and read methods are updated to return a
       proper status and message when the JMON server is down and
       the socket, input stream, and output stream are closed.
    09.The fix is to have the lock info discovery adjust the
       ownerid to the jobname when no current TCB can be matched.
       The cancelable threads should be cleaned up properly to
       avoid exceptions during logoff.
       Note: this fix does not resolve the issue of leftover user
       threads, some of which still hold an exclusive file lock.
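Item 02 above describes rereading while the secure socket read reports GSK_WOULD_BLOCK, with at most 3 retries and a 10 sec wait each. The control flow can be sketched in Java as below. This is only an illustration of the loop: gsk_secure_socket_read() is a C API, and the NonBlockingReader interface, the ReadStatus enum, and all method names here are hypothetical stand-ins rather than System SSL or product APIs.

```java
// Hypothetical sketch of a bounded "wait, then reread" loop.
final class WouldBlockRetry {
    static final int MAX_RETRIES = 3;                 // from the fix description
    static final long RETRY_TIMEOUT_MILLIS = 10_000L; // 10 sec per retry

    /** Outcome of one non-blocking read attempt. */
    enum ReadStatus { OK, WOULD_BLOCK, ERROR }

    /** Stand-in for a non-blocking secure socket. */
    interface NonBlockingReader {
        ReadStatus read();
        /** Waits until data is readable or the timeout elapses. */
        boolean awaitReadable(long timeoutMillis);
    }

    /**
     * Performs one read and, while it reports WOULD_BLOCK, waits and
     * rereads up to MAX_RETRIES times. Returns the last status seen.
     */
    static ReadStatus readWithRetry(NonBlockingReader reader, long timeoutMillis) {
        ReadStatus status = reader.read();
        for (int retry = 0;
             status == ReadStatus.WOULD_BLOCK && retry < MAX_RETRIES;
             retry++) {
            reader.awaitReadable(timeoutMillis);
            status = reader.read();
        }
        return status;
    }
}
```

A caller would pass RETRY_TIMEOUT_MILLIS as the timeout and treat a final WOULD_BLOCK as a failed certificate read.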
    
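Item 06 above replaces PrintStream with BufferedWriter for sending data from JESMiner to JMON. A minimal Java sketch of writing lines through a BufferedWriter is shown below. BufferedSender and sendLines are hypothetical names, and the UTF-8 charset and newline terminator are assumptions of this sketch; the actual JESMiner code and wire format are not shown in this APAR.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sender: names, charset, and line terminator are illustrative.
final class BufferedSender {
    /** Writes each line through a buffer, then flushes once at the end. */
    static void sendLines(OutputStream out, List<String> lines) throws IOException {
        BufferedWriter writer =
                new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        for (String line : lines) {
            writer.write(line);
            writer.write('\n');
        }
        // Push any buffered characters; the caller still owns the stream.
        writer.flush();
    }
}
```

The writer is flushed rather than closed so that the underlying socket stream stays open for further messages.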

Temporary fix

Comments

APAR Information

  • APAR number

    PH43100

  • Reported component name

    DEV FOR Z/OS

  • Reported component ID

    5724T0700

  • Reported release

    320

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-01-04

  • Closed date

    2022-03-07

  • Last modified date

    2022-04-01

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    UI79568

Modules/Macros

  • FEJENF70 FEJJCNFG FEJJJCL  FEJJMON  FEJTSO   FEK1SMPE FEK2RCVE
    FEK3ALOC FEK4ZFS  FEK5MKD  FEK6DDEF FEK7APLY FEK8ACPT FEK@CERR
    FEK@CONE FEK@CONF FEK@CUST FEK@DEB  FEK@DESC FEK@FLOW FEK@GEN
    FEK@GENW FEK@ISPF FEK@IVP  FEK@IVPD FEK@IVPW FEK@JCN1 FEK@JCNE
    FEK@JESJ FEK@MAIN FEK@MIGO FEK@OPTE FEK@OPTG FEK@OPTN FEK@PRIM
    FEK@RSE1 FEK@RSEO FEK@STRT FEK@TAB1 FEK@TAB2 FEK@TAB3 FEK@WRK1
    FEK@WRK2 FEK@WRK3 FEK@WRK4 FEK@WRK5 FEKAPPCC FEKAPPCL FEKAPPCX
    FEKATTR  FEKDSI   FEKEESX0 FEKFASIZ FEKFATT1 FEKFBLD  FEKFCIPH
    FEKFCLIE FEKFCMOD FEKFCMPR FEKFCMSG FEKFCOMM FEKFCOPY FEKFCOR6
    FEKFCORE FEKFDBG  FEKFDBG6 FEKFDBGM FEKFDIR  FEKFDIR6 FEKFDIVP
    FEKFDST0 FEKFDST1 FEKFDST2 FEKFENVF FEKFENVI FEKFENVP FEKFENVR
    FEKFENVS FEKFEPL  FEKFERRF FEKFGDGE FEKFICUL FEKFISPF FEKFIVP0
    FEKFIVPA FEKFIVPD FEKFIVPI FEKFIVPJ FEKFIVPT FEKFJESM FEKFJESU
    FEKFJLIC FEKFJSON FEKFJVM  FEKFLATR FEKFLDSI FEKFLDSL FEKFLEOP
    FEKFLOGS FEKFLPTH FEKFMAI6 FEKFMAIN FEKFMINE FEKFMNTL FEKFNTCE
    FEKFOMVS FEKFPATT FEKFPLUG FEKFPTC  FEKFRIVP FEKFRMSG FEKFRSES
    FEKFRSRV FEKFSCMD FEKFSEND FEKFSSL  FEKFSTUP FEKFT000 FEKFT001
    FEKFT002 FEKFT003 FEKFT004 FEKFT005 FEKFT006 FEKFT007 FEKFT008
    FEKFT009 FEKFT010 FEKFT011 FEKFT012 FEKFT013 FEKFT014 FEKFT015
    FEKFT016 FEKFT017 FEKFT018 FEKFT019 FEKFT020 FEKFT021 FEKFTIVP
    FEKFTSO  FEKFUTIL FEKFVERS FEKFXITA FEKFXITL FEKFZOS  FEKHCONF
    FEKHCUST FEKHDEB  FEKHDESC FEKHFLOW FEKHGEN  FEKHISPF FEKHIVP
    FEKHIVPD FEKHJESJ FEKHMAIN FEKHMIGO FEKHOPTE FEKHOPTN FEKHPRIM
    FEKHRSE1 FEKHRSEO FEKHSTRT FEKHTAB1 FEKHTAB2 FEKINIT  FEKKEYS
    FEKLOCKA FEKLOGR  FEKLOGS  FEKM00   FEKM01   FEKM02   FEKMKDIR
    FEKMOUNT FEKMSGC  FEKMSGS  FEKRACF  FEKRSED  FEKSAPF  FEKSAPPL
    FEKSBPX  FEKSCLAS FEKSCLOG FEKSCMD  FEKSCPYM FEKSCPYU FEKSDSN
    FEKSENV  FEKSETUP FEKSISPF FEKSJCFG FEKSJCMD FEKSJMON FEKSLPA
    FEKSPROG FEKSPTKT FEKSRSED FEKSSERV FEKSSTC  FEKSSU   FEKSUSER
    FEKXCFGE FEKXCFGI FEKXCFGM FEKXCFGT FEKXMAIN FEKXML   HUHFCOR6
    HUHFCORE
    

Fix information

  • Fixed component name

    EXP FOR Z/OS HO

  • Fixed component ID

    5655EXP23

Applicable component levels

  • R320 PSY UI79568

       UP22/04/01 P F203

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
02 April 2022