Reporting reliability

The following examples describe a few ways to use SMF data to report on the reliability of the system.

Approximate system availability

SMF produces records at IPL time and when the operator enters a HALT EOD command preceding the scheduled shutdown of the system. By examining these records and the last SMF record recorded before shutdown of the system, an installation can establish the following information for a given time period:
  • Reporting interval
  • Number of IPLs
  • System up time and system down time
  • Number of scheduled stoppages and the approximate amount of scheduled down time
  • Number of unscheduled stoppages and the approximate amount of unscheduled down time
  • Reasons for system failure
  • Operator name
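
The up time and down time figures in this list can be approximated from record timestamps alone. The following Python sketch assumes that the relevant SMF records have already been extracted into a chronological list of (timestamp, kind) pairs; the kind labels "IPL", "HALT_EOD", and "OTHER" and the function name are hypothetical and are not part of SMF. Each stoppage is taken to be the gap between an IPL record and the record immediately before it, and it is counted as scheduled only when that preceding record is the HALT EOD record.

```python
from datetime import datetime, timedelta

def summarize_availability(events):
    """Approximate availability from (timestamp, kind) pairs.

    kind is a hypothetical label: "IPL", "HALT_EOD", or "OTHER".
    """
    if not events:
        raise ValueError("at least one event is required")
    events = sorted(events)
    summary = {
        "ipls": sum(1 for _, kind in events if kind == "IPL"),
        "scheduled_stops": 0,
        "unscheduled_stops": 0,
        "scheduled_down": timedelta(0),
        "unscheduled_down": timedelta(0),
    }
    # Walk consecutive pairs; a stoppage ends at every IPL after the first record.
    for (prev_time, prev_kind), (cur_time, cur_kind) in zip(events, events[1:]):
        if cur_kind != "IPL":
            continue
        gap = cur_time - prev_time
        if prev_kind == "HALT_EOD":
            summary["scheduled_stops"] += 1
            summary["scheduled_down"] += gap
        else:
            summary["unscheduled_stops"] += 1
            summary["unscheduled_down"] += gap
    # The reporting interval spans the first and last records supplied.
    summary["interval"] = events[-1][0] - events[0][0]
    down = summary["scheduled_down"] + summary["unscheduled_down"]
    summary["up"] = summary["interval"] - down
    return summary

# Illustrative data only; these timestamps are made up.
sample_events = [
    (datetime(2024, 1, 1, 0, 0), "IPL"),
    (datetime(2024, 1, 1, 10, 0), "OTHER"),
    (datetime(2024, 1, 1, 10, 30), "IPL"),
    (datetime(2024, 1, 1, 20, 0), "HALT_EOD"),
    (datetime(2024, 1, 1, 21, 0), "IPL"),
    (datetime(2024, 1, 2, 0, 0), "OTHER"),
]
availability = summarize_availability(sample_events)
```

The down time values are approximate because the system might have continued running for some time after the last record was written.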

In addition, JES2 and JES3 produce the SMF subsystem start (type 43) and subsystem stop (type 45) records. From these records, an installation can further analyze the system's availability by checking the start and stop times and the circumstances under which JES2 or JES3 was started (for example, a cold start versus a warm start).

Abend code summary

SMF reports a system or user abend (abnormal end of task) code for each job (and job step) that abends. By tracking those codes issued by operational procedures (such as codes 122 and 222 for operator cancels), an installation can account for any loss of CPU time due to job reruns. More generally, a summary of the abend codes by program name or code allows an installation to determine which programs are abending frequently and which codes are occurring most often. This information might show the need for software error corrections, JCL revisions, or better operating instructions.
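
As an illustration, the summary described above might be produced as follows. This Python sketch assumes that job step data has already been extracted into dictionaries; the field names (program, abend_code, cpu_seconds), the program names, and the function name are hypothetical. It counts abends by code and by program, and totals the CPU time of steps that ended with the operator cancel codes 122 and 222.

```python
from collections import Counter

# Operator cancel codes named in the text above.
OPERATOR_CANCEL_CODES = {"122", "222"}

def summarize_abends(steps):
    """Count abends by code and by program; total CPU time lost to cancels."""
    by_code = Counter()
    by_program = Counter()
    cancelled_cpu = 0.0
    for step in steps:
        code = step["abend_code"]
        if code is None:  # the step ended normally
            continue
        by_code[code] += 1
        by_program[step["program"]] += 1
        if code in OPERATOR_CANCEL_CODES:
            cancelled_cpu += step["cpu_seconds"]
    return by_code, by_program, cancelled_cpu

# Illustrative data only; the programs and values are made up.
sample_steps = [
    {"program": "PGMA", "abend_code": "222", "cpu_seconds": 12.5},
    {"program": "PGMA", "abend_code": "0C7", "cpu_seconds": 3.0},
    {"program": "PGMB", "abend_code": None, "cpu_seconds": 40.0},
    {"program": "PGMC", "abend_code": "122", "cpu_seconds": 7.5},
    {"program": "PGMA", "abend_code": "0C7", "cpu_seconds": 1.0},
]
by_code, by_program, cancelled_cpu = summarize_abends(sample_steps)
```

Calling by_code.most_common() or by_program.most_common() then lists the most frequent codes or the most frequently abending programs first.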

Direct access VTOC errors

The SMF record type 19 has a VTOC indicator bit that the system sets if there is a failure while updating a VTOC (volume table of contents). By checking the setting of this bit, operations personnel can identify any VTOCs that might have missing tracks or overlapping data sets.
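
A check of this kind might look like the following Python sketch. It assumes that each type 19 record has been reduced to a volume serial and an indicator byte; the field names are hypothetical, and the mask value is only a placeholder, because the actual position of the VTOC indicator bit must be taken from the type 19 record mapping.

```python
# Placeholder only: take the real bit position from the type 19 record mapping.
VTOC_ERROR_MASK = 0x80

def volumes_with_vtoc_errors(records, mask=VTOC_ERROR_MASK):
    """Return the sorted volume serials whose indicator byte has the mask bit set."""
    flagged = {rec["volser"] for rec in records if rec["indicator"] & mask}
    return sorted(flagged)

# Illustrative data only; the volume serials and indicator bytes are made up.
sample_records = [
    {"volser": "VOL001", "indicator": 0x80},
    {"volser": "VOL002", "indicator": 0x00},
    {"volser": "VOL003", "indicator": 0xC0},
    {"volser": "VOL001", "indicator": 0x80},
]
suspect_volumes = volumes_with_vtoc_errors(sample_records)
```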

Tape error statistics

SMF record type 21 provides tape error statistics such as the number of temporary read and write errors, permanent read and write errors, noise blocks, erase gaps, and cleaner actions. By sorting and summarizing these error statistics by tape volume (or tape unit), operations personnel can identify volumes that might need reconditioning or replacement, or point out tape drives that might require cleaning or maintenance.
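
The sorting and summarizing step might be sketched in Python as follows. The sketch assumes that the type 21 statistics have already been extracted into dictionaries; the field names, volume serials, unit addresses, and function names are hypothetical. Totals are accumulated per volume or per unit, and the keys are then ranked with permanent errors weighted ahead of temporary errors, which is one possible ordering rather than a prescribed one.

```python
from collections import defaultdict

# Hypothetical field names for the statistics listed in the text above.
ERROR_FIELDS = (
    "temp_read", "temp_write", "perm_read", "perm_write",
    "noise_blocks", "erase_gaps", "cleaner_actions",
)

def summarize_tape_errors(records, key="volser"):
    """Total each statistic per volume (key="volser") or per drive (key="unit")."""
    totals = defaultdict(lambda: dict.fromkeys(ERROR_FIELDS, 0))
    for rec in records:
        for field in ERROR_FIELDS:
            totals[rec[key]][field] += rec.get(field, 0)
    return dict(totals)

def rank_worst_first(totals):
    """Order keys by permanent errors, then temporary errors, highest first."""
    def severity(name):
        t = totals[name]
        return (t["perm_read"] + t["perm_write"], t["temp_read"] + t["temp_write"])
    return sorted(totals, key=severity, reverse=True)

# Illustrative data only; these values are made up.
sample_tape_records = [
    {"volser": "T00001", "unit": "0A80", "temp_read": 5},
    {"volser": "T00002", "unit": "0A80", "temp_read": 1, "perm_write": 2},
    {"volser": "T00001", "unit": "0A81", "temp_read": 3, "temp_write": 1},
]
by_volume = summarize_tape_errors(sample_tape_records, key="volser")
by_unit = summarize_tape_errors(sample_tape_records, key="unit")
worst_volumes = rank_worst_first(by_volume)
```

Summarizing the same records by unit instead of by volume helps separate a failing tape from a drive that needs cleaning or maintenance.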