Troubleshooting
Problem
Run nzhealthcheck correctly to avoid excessive failures due to data not being collected
Symptom
disk17[spa1.encl8] (from catalog) - spa 1 enclosure 8 slot 17
Netezza System Health Check 2.2
Fatal error.
Reason: System Health Check Daemon is not running
Program terminated.
[nz@nzhost~]$ service nzhealthcheck start
nzhostNPS environment
==================================================
HOME = /export/home/nz
USER = nz
NZ_USER = admin
NZ_DATABASE = system
Starting daemon...
Netezza System Health Check 2.2
Spawning child complete.
Nzhealthcheck started [ OK ]
[nz@nzhost~]$ nzhealthcheck
Netezza System Health Check 2.2
Preparing System Health Check Report
************************************************************************
********************** System Health Check Report **********************
************************************************************************
Report generation date: 2015-05-12 18:30:25
***************************** MINI SYSINFO *****************************
+ Product : IBM PureData System for Analytics N2001-010
+ Model : Q100
+ HPF : 5.3.2
+ FDT : 3.0.5.3
+ NPS : 7.0.2.11-P2 [Build 35921]
+ NPS State : online
+ Online since: 2014-08-09 00:08:38
+ MTM(s) : MTM has not been set
+ NzId :
+ NZ Owner : nz
+ OS : Red Hat Enterprise Linux Server release 6.4 (Santiago)
+ Kernel : 2.6.32-358.el6.x86_64
+ HealthCheck : 2.2 [20140221105807]
+ Hostname : nzhost
+ Up Time : 277 days, 20 hrs, 43 mins, 38 secs
+ Host Up Time: No information
Non-failed (27):
L +- Rule --+--------- Info ---------+----- Result ------+- Severity --
2 | BOM001 | Missing or incorrectly | No data collected | High
| | inserted chassis | |
| | component | |
2 | BOM002a | Missing chassis switch | No data collected | High
2 | BOM004 | Missing or incorrectly | No data collected | High
| | inserted disk | |
| | enclosure component | |
2 | BOM005 | Missing or incorrectly | No data collected | High
| | inserted disk | |
2 | BOM007 | Incorrect number of | No data collected | High
| | DACs | |
2 | BOX001 | Wrong number of CPU | No data collected | High
| | cores detected for a | |
| | blade | |
2 | BOX002 | Incorrect amount of | No data collected | High
| | memory detected for a | |
| | blade | |
2 | BOX003 | Broken FPGA | No data collected | High
2 | BOX005 | Not enabled or wrongly | No data collected | High
| | enabled ESM PHYs | |
2 | BOX006 | Not enabled or wrongly | No data collected | High
| | enabled Blade HBA PHYs | |
2 | BOX007 | Not enabled or wrongly | No data collected | High
| | enabled DAC HBA PHYs | |
2 | BOX008 | Not enabled or wrongly | No data collected | High
| | enabled DAC HBA PHYs | |
2 | BOX009 | Not enabled or wrongly | No data collected | High
| | enabled SAS Switch | |
| | PHYs | |
2 | DM030 | Bad sectors on disk | No data collected | High
2 | DM032 | Bad sectors on new | No data collected | High
| | disk | |
2 | SHC104 | Degraded disk to ESM | No data collected | High
| | link speed | |
2 | SHC110 | Blade ECC memory error | No data collected | High
| | detected | |
2 | DM012 | Multiple SCSI Log page | No data collected | Medium
| | 0x15 events on disk | |
| | head | |
2 | DM015 | Multiple SCSI Log page | No data collected | Medium
| | 0x15 events on disk | |
2 | DM031 | Bad sectors on disk | No data collected | Medium
2 | SHC105 | Disk path not optimal | No data collected | Medium
| | or some paths disabled | |
2 | SHC109 | Blade Throttled | No data collected | Medium
| | Warning detected | |
2 | SHC111 | Suboptimal ethernet | No data collected | Medium
| | interface bonding on | |
| | host | |
2 | DM001 | Disk firmware version | No data collected | Low
| | check | |
2 | DM011 | Multiple SCSI log page | No data collected | Low
| | 0x15 events occuring | |
| | daily | |
2 | DM013 | SCSI rewrite-in-place | No data collected | Low
| | errors | |
2 | SHC106 | Disk path was not | No data collected | Low
| | optimal (historical) | |
L +- Rule --+--------- Info ---------+----- Result ------+- Severity --
Failures (2):
L +- Rule --+-------- Issue ---------+---- Component ----+- Severity --
2 | SHC011 | Broken or reseated | disk8[spa1. (...) | Medium
| | disk or broken | disk9[spa1. (...) |
| | enclosure slot | disk10[spa1 (...) |
| | | ...output omitted |
2 | INT001 | One or more device | ...output omitted | Low
L +- Rule --+-------- Issue ---------+---- Component ----+- Severity --
************************************************************************
********************** System Health Check Report **********************
************************************************************************
Report generation date: 2015-05-12 18:30:25
***************************** MINI SYSINFO *****************************
+ Product : IBM PureData System for Analytics N2001-010
+ Model : Q100
+ HPF : 5.3.2
+ FDT : 3.0.5.3
+ NPS : 7.0.2.11-P2 [Build 35921]
+ NPS State : online
+ Online since: 2014-08-09 00:08:38
+ MTM(s) : MTM has not been set
+ NzId :
+ NZ Owner : nz
+ OS : Red Hat Enterprise Linux Server release 6.4 (Santiago)
+ Kernel : 2.6.32-358.el6.x86_64
+ HealthCheck : 2.2 [20140221105807]
+ Hostname : nzhost
+ Up Time : 277 days, 20 hrs, 43 mins, 38 secs
+ Host Up Time: No information
******************************* WARNINGS *******************************
Warning - these rules could not be evaluated because required data has not been collected yet:
~ BOM001
~ BOM002a
~ BOM004
~ BOM005
~ BOM007
~ BOX001
~ BOX002
~ BOX003
~ BOX005
~ BOX006
~ BOX007
~ BOX008
~ BOX009
~ DM030
~ DM032
~ SHC104
~ SHC110
~ DM012
~ DM015
~ DM031
~ SHC105
~ SHC109
~ SHC111
~ DM001
~ DM011
~ DM013
~ SHC106
******************************** ISSUES ********************************
L2 Rule : SHC011
Issue Detected : Broken or reseated disk or broken enclosure slot
Severity : Medium
Components : disk8[spa1.encl12] (from catalog) - spa 1 enclosure 12 slot 8
disk9[spa1.encl12] (from catalog) - spa 1 enclosure 12 slot 9
...
disk22[spa1.encl10] (from catalog) - spa 1 enclosure 10 slot 22
Expert's Advice :
Inconsistency between system catalog and actual disk inventory was
detected. If there are too many entries in the catalog, information
about some of the disks is duplicated. This may suggest that the
disk or enclosure slot is broken but the failure condition was not
appropriately handled by the NPS software. It may also mean that disk
was reseated or replaced accidentally without appropriate procedure.
In order to fix the problem remove duplicate entries from the system
catalog and diagnose the disk. It may be required to failover the
disk manually. Duplicates can be removed by means of the nzhw
command. Duplicates are usually entries with higher IDs.
------------------------------------------------------------------------
L2 Rule : INT001
Issue Detected : One or more device manager failed
Severity : Low
Components : Devmgr blade_sol for location mm1 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/blade_sol.pl --loc=mm1 reason:Resource temporarily unavailable ]
Devmgr disktopo for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nzpush_wrap disktopo reason:Resource temporarily unavailable ]
Devmgr env for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nz_env reason:Resource temporarily unavailable ]
Devmgr hwinfo_bom for location failed with code -2 and message [ Could not execute: /usr/bin/nzhwinfo -o bom reason:Resource temporarily unavailable ]
Devmgr hwinfo_logtree for location failed with code -2 and message [ Could not execute: /usr/bin/nzhwinfo -o logtree reason:Resource temporarily unavailable ]
Devmgr hwinfo_mtm for location failed with code -2 and message [ Could not execute: /usr/bin/nzhwinfo -o mtm reason:Resource temporarily unavailable ]
Devmgr hwinfo_registry for location failed with code -2 and message [ Could not execute: /usr/bin/nzhwinfo -o registry reason:Resource temporarily unavailable ]
Devmgr ibm_amm_log for location spa1.mm failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_amm --loc=spa1.mm --logs reason:Resource temporarily unavailable ]
Devmgr ibm_ethsw for location spa1.ethsw.slot1 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_ethsw --loc=spa1.ethsw.slot1 --status=stack reason:Resource temporarily unavailable ]
Devmgr ibm_host for location rack1.host1 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_host --loc=rack1.host1 --exceptModules nps::host::hpfhealth nps::host::sysrevcheck nps::host::ptscheck nps::host::concheck --status reason:Resource temporarily unavailable ]
Devmgr ibm_host for location rack1.host2 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_host --loc=rack1.host2 --exceptModules nps::host::hpfhealth nps::host::sysrevcheck nps::host::ptscheck nps::host::concheck --status reason:Resource temporarily unavailable ]
Devmgr ibm_hostx for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_hostx reason:Resource temporarily unavailable ]
Devmgr ibm_mgtsw for location rack1.mgtsw.slot1 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_mgtsw --loc=rack1.mgtsw.slot1 --status reason:Resource temporarily unavailable ]
Devmgr ibm_rpc for location rack1.rpc1 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_rpc --loc=rack1.rpc1 --status reason:Resource temporarily unavailable ]
Devmgr ibm_rpc for location rack1.rpc2 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_rpc --loc=rack1.rpc2 --status reason:Resource temporarily unavailable ]
Devmgr ibm_rpc for location rack1.rpc3 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_rpc --loc=rack1.rpc3 --status reason:Resource temporarily unavailable ]
Devmgr ibm_rpc for location rack1.rpc4 failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/ibm_rpc --loc=rack1.rpc4 --status reason:Resource temporarily unavailable ]
Devmgr nz_nps for location failed with code 254 and message [ Command: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nz_nps returned status: 254 ]
Devmgr nz_spux for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nzpush_wrap spux reason:Resource temporarily unavailable ]
Devmgr nzencl for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nzpush_wrap encl reason:Resource temporarily unavailable ]
Devmgr nzgeninventory for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nzpush_wrap inventory reason:Resource temporarily unavailable ]
Devmgr nzmpath for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nzpush_wrap mpath reason:Resource temporarily unavailable ]
Devmgr nzsasphyesm for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nzpush_wrap sasphyesm reason:Resource temporarily unavailable ]
Devmgr nzsasphyhba for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nzpush_wrap sasphyhba reason:Resource temporarily unavailable ]
Devmgr scsi_ls for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nz_query_logsense reason:Resource temporarily unavailable ]
Devmgr vtabledump for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nz_vtable_dumper hwcomps scsi-errors scsi-logsenses scsi-defects reason:Resource temporarily unavailable ]
Devmgr vtabledump_raw for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nz_vtable_dumper scsi-logsense-raws reason:Resource temporarily unavailable ]
Devmgr vtabletransactions for location failed with code -2 and message [ Could not execute: /nz/kit.7.0.2.11-P2/share/nzhealthcheck/bin/../devmgrs/bin/adm/nz_vtable_transaction reason:Resource temporarily unavailable ]
Expert's Advice :
If this issue repeats, report it to IBM Netezza Support.
------------------------------------------------------------------------
************************************************************************
**************************** End of Report *****************************
************************************************************************
Cause
When nzhealthcheck is run incorrectly, many rules report "No data collected", and the device manager rule (INT001) reports many failures with the reason "Resource temporarily unavailable".
Diagnosing The Problem
In the report, look for a large number of rules with the result "No data collected" and a large number of device manager (Devmgr) failures listed under rule INT001.
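As a rough aid for this check, the two symptom strings can be counted in a saved report or in captured console output. The sketch below is only an illustration: the function name count_shc_symptoms is hypothetical and is not part of nzhealthcheck, and it assumes the report text contains the strings exactly as shown in the sample output above.

```shell
# Hypothetical helper (not shipped with nzhealthcheck): counts the two
# symptom strings in a saved report or in captured console output.
# Usage: count_shc_symptoms <file>
count_shc_symptoms() {
    report="$1"
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function usable under "set -e".
    no_data=$(grep -c 'No data collected' "$report" || true)
    unavailable=$(grep -c 'Resource temporarily unavailable' "$report" || true)
    echo "no_data_rules=$no_data devmgr_unavailable=$unavailable"
}
```

Calling count_shc_symptoms with the path of a report file prints both counts on one line. High values for both suggest that the health check was not run as described in the next section.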
Resolving The Problem
Run nzhealthcheck correctly, in either daemon mode or standalone mode:
Daemon mode:
/nz/kit/bin/adm/nzhealthcheck
Standalone mode (takes a bit longer):
/nz/support/contrib/bin/adm/nzhealthcheck/nzhealthcheck -S
Enter the root password when prompted so that all data can be collected. If the root password is not available, you can press Ctrl-D at the prompt to continue, but the issue described above may occur and data will not be collected for some rules.
The results of nzhealthcheck, with more in-depth explanations, are written to a report file:
/nz/kit.<NPS version>/log/nzhealthcheck/Netezza_System_Health_Check_Report_<date>_<time>.txt
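Because each run writes a new timestamped file, it can be handy to pick out the most recent report. The sketch below is an assumption-laden illustration: latest_shc_report is a hypothetical function name, the directory is passed in by the caller rather than guessed, and the file name pattern is taken from the path shown above.

```shell
# Hypothetical helper (not shipped with nzhealthcheck): prints the newest
# report file found in the given log directory.
# Usage: latest_shc_report <report directory>
latest_shc_report() {
    dir="$1"
    # ls -t sorts by modification time, newest first; head keeps the first
    # entry. Parsing ls output is acceptable here only because the report
    # names shown above contain no whitespace.
    ls -t "$dir"/Netezza_System_Health_Check_Report_*.txt 2>/dev/null | head -n 1
}
```

Pass the nzhealthcheck log directory for your NPS kit as the argument. The function prints nothing if no report is found there.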
Document Information
Modified date:
17 October 2019
UID
swg21903166