Monitoring IBM Z HMC

The IBM Z HMC sensor is automatically deployed and installed after you install the Instana agent.

To monitor IBM Z HMC in a Kubernetes or Red Hat OpenShift cluster, do not install Instana host agents on each node of the cluster. Install host agents on dedicated host machines.

This is an optional feature that is disabled by default in the Instana backend. To enable it, see the page for your Instana deployment: SaaS, Self-Hosted Custom Edition (Kubernetes or Red Hat OpenShift Container Platform), or Self-Hosted Classic Edition (Docker).

Supported versions

The IBM Z HMC sensor is supported as a platform. Support for metrics and configuration data is confirmed for Z HMC API versions 2.x, 3.x, and 4.x.

Configuration

Required permissions

To allow the sensor to connect to Z HMC and collect metrics, ensure that the following conditions are met:

  • API access is enabled. You can enable it by using HMC Management - Customize API settings.
  • The API user has the action/task permission to access the Audit and Log Management and View Security Logs tasks.
  • The Allow access to Web Services management interfaces field is selected in the User Details section for the API user.
  • To monitor types of objects or specific objects, such as CPCs and LPARs, the API user is assigned the roles that grant access to these objects.

Sensor configuration

To connect to the Z HMC server, configure the following fields in the agent configuration file <agent_install_dir>/etc/instana/configuration.yaml.

Note: Only remote monitoring is supported. You can configure multiple HMCs by adding one entry under remote for each HMC:

com.instana.plugin.zhmc:
  remote:
    - host: ''                       # IP address of the HMC
      port: ''                       # HMC port
      user: ''                       # User ID used to log on to the HMC
      password: ''                   # Password for the user ID
      poll_rate: 15                  # Metrics poll rate in seconds; cannot be less than 15
      eventsPollRate: 60             # Event poll rate in seconds (optional); comment out this line to stop event collection
      connectionTimeout: 50          # Timeout in seconds for establishing a connection with the server; default is 50
      connectionRequestTimeout: 50   # Timeout in seconds for obtaining a connection from the connection pool; default is 50
      socketTimeout: 50              # Socket read timeout in seconds; default is 50
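
The constraints stated in the configuration comments (one list entry per HMC, a metrics poll rate of at least 15 seconds) can be checked before the agent is restarted. The following is a minimal Python sketch, not part of Instana. It assumes that the entries under remote have already been loaded into plain dictionaries, for example with a YAML parser. The function name validate_zhmc_remotes, the choice of required fields, and all sample values (addresses, port, user) are hypothetical.

```python
MIN_POLL_RATE = 15  # seconds; the documented lower bound for poll_rate
# Assumption: these four fields must be non-empty for a usable entry.
REQUIRED_FIELDS = ("host", "port", "user", "password")


def validate_zhmc_remotes(remotes):
    """Return a list of human-readable problems found in the remote entries."""
    problems = []
    for index, entry in enumerate(remotes):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"entry {index}: '{field}' is empty or missing")
        # Assumption: an entry without poll_rate is treated as using the minimum.
        poll_rate = entry.get("poll_rate", MIN_POLL_RATE)
        if poll_rate < MIN_POLL_RATE:
            problems.append(
                f"entry {index}: poll_rate {poll_rate} is below {MIN_POLL_RATE} seconds"
            )
    return problems


# Two HMCs, mirroring two list items under `remote:` (all values are placeholders).
remotes = [
    {"host": "192.0.2.10", "port": "6794", "user": "apiuser",
     "password": "secret", "poll_rate": 15},
    {"host": "192.0.2.11", "port": "6794", "user": "apiuser",
     "password": "", "poll_rate": 10},
]

for problem in validate_zhmc_remotes(remotes):
    print(problem)
```

In this sample, the first entry passes and the second entry is reported twice: once for the empty password and once for the poll rate below the minimum.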

Metrics collection

To view the metrics, select Platforms in the sidebar of the Instana UI and click zHMC in the list of platforms. The dashboard shows a list of IBM Z HMC servers in the zHMCs tab and a list of Central Processor Complexes (CPCs) in the Systems tab.

Currently, the sensor supports the following 11 metric groups across the Classic and DPM operational modes.

No. | Metric group name | Mode
1 | cpc-usage-overview | C
2 | logical-partition-usage | C
3 | channel-usage | C
4 | dpm-system-usage-overview | D
5 | partition-usage | D
6 | zcpc-environmentals-and-power | C+D
7 | zcpc-processor-usage | C+D
8 | crypto-usage | C
9 | flash-memory-usage | C
10 | adapter-usage | D
11 | network-physical-adapter-port | D

C = Classic mode, D = DPM mode, C+D = both modes.

Performance metrics

CPC overview (C)

This metric group reports the aggregated processor usage and channel usage, the ambient temperature, and total system power consumption for each system. The cpc-processor-usage is the average of the percentages of processing capacity for all the physical processors in the CPC. The channel-usage is the average of the percentages of I/O capacity for all the channels and adapters in the CPC.
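
As a worked example of the averaging described above, the following Python sketch computes an aggregate value from per-processor percentages. It assumes a plain arithmetic mean. The function name aggregate_usage and the sample values are hypothetical, and the HMC may sample or weight the values differently.

```python
from statistics import mean


def aggregate_usage(percentages):
    """Average a list of per-processor (or per-channel) usage percentages."""
    return mean(percentages)


# Made-up usage percentages for four physical processors in one CPC.
physical_processor_usage = [40.0, 60.0, 80.0, 20.0]
print(aggregate_usage(physical_processor_usage))
```

The same function applies to channel-usage when it is given the per-channel and per-adapter percentages instead.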

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
CPC Processor Usage | The processor percent usage for Central Processor Complex processors. | 15 seconds
Channel Usage | The channel percent usage. | 15 seconds
Power Consumption Watts | The total system power consumption in watts. | 15 seconds
Temperature Celsius | The ambient temperature in degrees Celsius. | 15 seconds
CP Shared Processor Usage | The processor percent usage for shared Central Processors. | 15 seconds
CP Dedicated Processor Usage | The processor percent usage for dedicated Central Processors. | 15 seconds
IFL Shared Processor Usage | The processor percent usage for shared Integrated Facility for Linux processors. | 15 seconds
IFL Dedicated Processor Usage | The processor percent usage for dedicated Integrated Facility for Linux processors. | 15 seconds
ICF Shared Processor Usage | The processor percent usage for shared Internal Coupling Facility processors. | 15 seconds
ICF Dedicated Processor Usage | The processor percent usage for dedicated Internal Coupling Facility processors. | 15 seconds
IIP Shared Processor Usage | The processor percent usage for shared Integrated Information Processors. | 15 seconds
IIP Dedicated Processor Usage | The processor percent usage for dedicated Integrated Information Processors. | 15 seconds
AAP Shared Processor Usage | The processor percent usage for shared Application Assist Processors. | 15 seconds
AAP Dedicated Processor Usage | The processor percent usage for dedicated Application Assist Processors. | 15 seconds
ALL Shared Processor Usage | The processor percent usage for all the shared processors, combined together. | 15 seconds
ALL Dedicated Processor Usage | The processor percent usage for all the dedicated processors, combined together. | 15 seconds
CP ALL Processor Usage | The processor percent usage for all the Central Processors, combined together. | 15 seconds
IFL ALL Processor Usage | The processor percent usage for all the Integrated Facility for Linux processors, combined together. | 15 seconds
ICF ALL Processor Usage | The processor percent usage for all the Internal Coupling Facility processors, combined together. | 15 seconds
IIP ALL Processor Usage | The processor percent usage for all the Integrated Information Processors, combined together. | 15 seconds
CBP Shared Processor Usage | The processor percent usage for shared Container Based Processors. | 15 seconds
CBP Dedicated Processor Usage | The processor percent usage for dedicated Container Based Processors. | 15 seconds
CBP ALL Processor Usage | The processor percent usage for all the Container Based Processors, combined together. | 15 seconds

Logical partitions (C)

This metric group reports the processor usage for each active logical partition (Image, LPAR Image, Zone, PR/SM virtual server) on the system.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Processor Usage | The processor percent usage of the logical partition. | 15 seconds
CP Processor Usage | The processor percent usage for Central Processors. | 15 seconds
IFL Processor Usage | The processor percent usage for Integrated Facility for Linux processors. | 15 seconds
ICF Processor Usage | The processor percent usage for Internal Coupling Facility processors. | 15 seconds
IIP Processor Usage | The processor percent usage for Integrated Information Processors. | 15 seconds
CBP Processor Usage | The processor percent usage for Container Based Processors. | 15 seconds

Channel usage (C)

This metric group reports the channel usage for each channel on the system. An instance of this metric group is created for each channel of a CPC.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Channel Name | The name of the channel in the form channel subsystem path id. | 15 seconds
Shared Channel | True if the channel is shared among logical partitions, and false if it is not. | 15 seconds
Logical Partition Name | The name of the owning logical partition or the value "shared" if the channel is shared. | 15 seconds
Channel Usage | The channel percent usage (0-100%). | 15 seconds

DPM system overview (D)

This metric group reports the aggregated processor usage, network usage, storage usage, accelerator usage, crypto usage, power consumption and temperature for each DPM enabled system.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Processor usage | The processor percent usage. | 15 seconds
Network usage | The network percent usage. | 15 seconds
Storage usage | The storage percent usage. | 15 seconds
Accelerator usage | The accelerator percent usage. | 15 seconds
Crypto usage | The crypto percent usage. | 15 seconds
Power consumption watts | The power consumption in watts. | 15 seconds
Temperature celsius | The ambient temperature in degrees Celsius. | 15 seconds
CP shared processor usage | The processor percent usage for all CP shared processors. | 15 seconds
CP all processor usage | The processor percent usage for all CP processors. | 15 seconds
IFL shared processor usage | The processor percent usage for all IFL shared processors. | 15 seconds
IFL all processor usage | The processor percent usage for all IFL processors. | 15 seconds
All shared processor usage | The processor percent usage for all shared processors. | 15 seconds

Partitions (D)

This metric group reports the processor usage, network usage, storage usage, accelerator usage, and crypto usage for each active partition on a DPM enabled system.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Processor usage | The processor percent usage. | 15 seconds
Network usage | The network percent usage. | 15 seconds
Storage usage | The storage percent usage. | 15 seconds
Accelerator usage | The accelerator percent usage. | 15 seconds
Crypto usage | The crypto percent usage. | 15 seconds

zCPC environmentals and power (C+D)

This metric group reports environmental data and power consumption for the zCPC.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Temperature celsius | The ambient temperature. | 15 seconds
Humidity | The relative humidity. | 15 seconds
Dew point celsius | The dew point. | 15 seconds
Power consumption watts | The power consumption in watts. | 15 seconds
Heat load | The total heat load of the system (heat load forced-air + heat load water). | 15 seconds
Heat load forced air | The heat load covered by forced-air. | 15 seconds
Heat load water | The heat load covered by water. | 15 seconds
Exhaust temperature celsius | The exhaust temperature. | 15 seconds

zCPC processors (C+D)

This metric group reports the processor usage for each physical zCPC processor on the system. This includes the System Assist Processors (SAPs). An instance of this metric group is created for each processor of a CPC.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Processor name | The name of the zCPC processor in the form processor-type + processor ID. | 15 seconds
Processor type | The type of zCPC processor. | 15 seconds
Processor usage | The processor percent usage. | 15 seconds
SMT usage | The percentage of time the processor is running in simultaneous multithreading (SMT) mode. | 15 seconds
Thread 0 usage | The percent usage of thread 0 when the processor is running in SMT mode. | 15 seconds
Thread 1 usage | The percent usage of thread 1 when the processor is running in SMT mode. | 15 seconds

Cryptos (C)

This metric group reports the adapter usage for each crypto on the system. An instance of this metric group is created for each crypto adapter. This metric group is not used for a DPM system. For DPM, crypto adapters are reported in the Adapters metric group.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Channel id | The physical channel identifier of the crypto. | 15 seconds
Crypto id | The crypto identifier of the crypto, decimal 0-15. | 15 seconds
Adapter usage | The adapter percent usage (0-100%). | 15 seconds

Adapters (D)

This metric group reports the adapter usage for each adapter on the DPM enabled system. An instance of this metric group is created for each adapter.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Adapter usage | The adapter percent usage (0-100%). | 15 seconds

Flash memory adapters (C)

This metric group reports the adapter usage for each Flash memory (Flash Express) adapter on the system. An instance of this metric group is created for each Flash memory adapter of the CPC. If a CPC has no Flash memory adapters, no data appears in this metric group for that CPC.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
Channel id | The physical channel identifier of the Flash memory adapter. | 15 seconds
Adapter usage | The adapter percent usage (0-100%). | 15 seconds

Network adapter port metric group (D)

OSA and RoCE network adapters have up to two physical ports that connect to the network. Metrics are collected from these ports on a DPM enabled system. Each entry of this metric group contains the metrics for one physical port.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
network-port-id | Numerical value corresponding to the network adapter's physical port. | 15 seconds
bytes-sent | Number of bytes this physical port sent out to the attached network. | 15 seconds
bytes-received | Number of bytes this physical port received from the attached network. | 15 seconds
packets-sent | Number of unicast packets this physical port sent out to the attached network. | 15 seconds
packets-received | Number of unicast packets this physical port received from the attached network. | 15 seconds
packets-sent-dropped | Number of packets that were dropped when this physical port was sending them out to the attached network. | 15 seconds
packets-received-dropped | Number of packets that were dropped when this physical port was receiving them from the attached network. | 15 seconds
packets-sent-discarded | Number of packets that were discarded when this physical port was sending them out to the attached network. | 15 seconds
packets-received-discarded | Number of packets that were discarded when this physical port was receiving them from the attached network. | 15 seconds
multicast-packets-sent | Number of multicast packets this physical port sent out to the attached network. | 15 seconds
multicast-packets-received | Number of multicast packets this physical port received from the attached network. | 15 seconds
broadcast-packets-sent | Number of broadcast packets this physical port sent out to the attached network. | 15 seconds
broadcast-packets-received | Number of broadcast packets this physical port received from the attached network. | 15 seconds
interval-bytes-sent | Number of bytes sent by this physical port over the collection interval. | 15 seconds
interval-bytes-received | Number of bytes received by this physical port over the collection interval. | 15 seconds
bytes-per-second-sent | Number of bytes sent per second by this physical port over the collection interval. | 15 seconds
bytes-per-second-received | Number of bytes received per second by this physical port over the collection interval. | 15 seconds
utilization | Link utilization expressed as usage percentage of overall link bandwidth. | 15 seconds
mac-address | The MAC address of this uplink, if known. | 15 seconds
flags | Flags indicating the types of metrics that are supported by this interface. | 15 seconds
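
The interval and per-second metrics in this group are related by simple arithmetic, which the following Python sketch illustrates. It is not the HMC's own computation. It assumes that the collection interval equals the 15-second granularity and that the link speed is known from elsewhere, because the link speed is not part of this metric group. The function names, the sample byte count, and the 10 Gbit/s link speed are hypothetical, and the utilization formula covers one direction only.

```python
def bytes_per_second(interval_bytes, interval_seconds):
    """Average byte rate over one collection interval."""
    return interval_bytes / interval_seconds


def utilization_percent(byte_rate, link_speed_bits_per_second):
    """Share of the link bandwidth used by one direction of traffic, in percent."""
    return byte_rate * 8 / link_speed_bits_per_second * 100


# Made-up sample: bytes sent during one 15-second interval on a 10 Gbit/s link.
INTERVAL_SECONDS = 15
LINK_SPEED_BPS = 10_000_000_000

rate = bytes_per_second(187_500_000, INTERVAL_SECONDS)
print(rate)
print(utilization_percent(rate, LINK_SPEED_BPS))
```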

Network interface metric group (D)

This metric group reports metrics for NICs on a DPM enabled system. NICs are network resources associated with DPM partitions. Only activated NICs report metric data. Each entry of this metric group contains the metrics for one NIC. Metrics are collected and provided at each interval, and each value is the cumulative total, not a delta.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
partition-id | The unique identifier for the partition that owns the NIC whose metric is contained within this metric group. | 30 seconds
bytes-per-second-sent | Number of bytes sent per second by this network adapter over the collection interval. | 30 seconds
bytes-per-second-received | Number of bytes per second received by this network adapter over the collection interval. | 30 seconds
packets-sent | Number of unicast packets this network adapter sent out to the attached network. | 30 seconds
packets-received | Number of unicast packets this network adapter received from the attached network. | 30 seconds
packets-sent-dropped | Number of packets that were dropped when this network adapter was sending them out to the attached network. | 30 seconds
packets-received-dropped | Number of packets that were dropped when this network adapter was receiving them from the attached network. | 30 seconds
packets-sent-discarded | Number of packets that were discarded when this network adapter was sending them out to the attached network. | 30 seconds
packets-received-discarded | Number of packets that were discarded when this network adapter was receiving them from the attached network. | 30 seconds
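
Because the counters in this group are cumulative totals, the activity within one interval is the difference between two consecutive samples. The following Python sketch shows that subtraction under stated assumptions: samples arrive as dictionaries keyed by metric name, the interval is the 30-second granularity, and a counter that decreases is treated as a reset. The function name counter_deltas, the reset handling, and the sample values are hypothetical.

```python
def counter_deltas(previous, current):
    """Return the increase of each cumulative counter between two samples.

    A counter that went down (for example after a reset) is reported as None
    because no meaningful delta can be derived from the two samples.
    """
    deltas = {}
    for name, value in current.items():
        if name not in previous:
            continue
        increase = value - previous[name]
        deltas[name] = increase if increase >= 0 else None
    return deltas


# Two consecutive samples of cumulative counters, 30 seconds apart (made-up values).
previous = {"packets-sent": 1000, "packets-received": 5000, "packets-sent-dropped": 7}
current = {"packets-sent": 1600, "packets-received": 5300, "packets-sent-dropped": 3}

INTERVAL_SECONDS = 30
for name, delta in counter_deltas(previous, current).items():
    rate = None if delta is None else delta / INTERVAL_SECONDS
    print(name, delta, rate)
```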

RoCE adapters (C)

This metric group reports the adapter usage for each RoCE (10GbE RoCE) adapter on the system. An instance of this metric group is created for each RoCE adapter of the CPC.

The following metrics are provided in each entry of this metric group:

Metric | Description | Granularity
channel-id | The physical channel identifier of the RoCE adapter. | 15 seconds
adapter-usage | The adapter percent usage (0-100%). | 15 seconds

Manage events from CPC (server) and HMC console

Critical events are triggered on these servers by failures or incidents. The following types of events are sent to Instana and displayed on the Events page:

  • Problematic Hardware Messages from CPC
  • Problematic Hardware Messages from Console
  • Critical Console Audit Events
  • Critical Console Security Events

Troubleshoot

  • Import the self-signed certificate used by the Z HMC server
    If the Z HMC server uses a self-signed certificate, make sure that the certificate is imported into the JVM's cacerts truststore. If the following exception appears in the log, the Z HMC server uses a self-signed certificate that is not yet imported into the JVM's cacerts truststore:
    sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target. PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.

    Solution: Import a self-signed certificate by following the instructions from Self-Signed Certificate.

  • Self-signed certificate with an invalid SAN
    The server certificate is invalid if its Subject Alternative Name (SAN) does not contain the expected IP address.

    Solution: Correct the server certificate and import it again.

  • 403 Forbidden
    The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.

    Solution: Verify the credentials in the configuration.yaml file and confirm that the user has all the required permissions.
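
For the invalid SAN case above, the check that fails is whether the certificate's SAN lists the IP address that is configured as host. The following Python sketch shows that comparison under stated assumptions: the certificate is available as a dictionary in the format returned by Python's ssl.SSLSocket.getpeercert(), where IP entries are labeled "IP Address". The function name san_covers_ip and the sample certificate are hypothetical, and the sketch does not reproduce the JVM's own validation.

```python
import ipaddress


def san_covers_ip(peer_cert, host_ip):
    """Return True if the certificate's SAN lists the given IP address."""
    wanted = ipaddress.ip_address(host_ip)
    for kind, value in peer_cert.get("subjectAltName", ()):
        if kind == "IP Address" and ipaddress.ip_address(value) == wanted:
            return True
    return False


# Made-up certificate data in the getpeercert() dictionary format.
sample_cert = {
    "subjectAltName": (
        ("DNS", "hmc.example.com"),
        ("IP Address", "192.0.2.10"),
    )
}

print(san_covers_ip(sample_cert, "192.0.2.10"))
print(san_covers_ip(sample_cert, "192.0.2.11"))
```

If the configured host address is not covered, the certificate must be reissued with the correct SAN before it is imported again.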