IBM Cloud Private cluster monitoring

You can use the IBM® Cloud Private cluster monitoring dashboard to monitor the status of your cluster and applications.

The monitoring dashboard uses Grafana and Prometheus to present detailed data about your cluster nodes and containers. For more information about Grafana, see the Grafana documentation Opens in a new tab . For more information about Prometheus, see the Prometheus documentation .

Accessing the monitoring dashboard
Metrics collected out of the box
Role-based access
Installing monitoring service in IBM Cloud Private
Configuring Prometheus server
Alerts
Managing Grafana dashboards
Configuring applications to use monitoring service
Logs and metrics management for Prometheus
Accessing monitoring service APIs
Support for custom cluster access URL in monitoring service

Accessing the monitoring dashboard

Log in to the IBM Cloud Private management console.

Note: When you log in to the management console, you have administrative access to Grafana. Do not create more users within the Grafana dashboard or modify the existing users or org.
To access the Grafana dashboard, click Menu > Platform > Monitoring. Alternatively, you can open https://<IP_address>:<port>/grafana, where <IP_address> is the DNS or IP address that is used to access the IBM Cloud Private console. <port> is the port that is used to access the IBM Cloud Private console.
To access the Alertmanager dashboard, click Menu > Platform > Alerting. Alternatively, you can open https://<IP_address>:<port>/alertmanager.
To access the Prometheus dashboard, open https://<IP_address>:<port>/prometheus.
From the Grafana dashboard, open one of the following default dashboards:
- ElasticSearch
  Provides information about ElasticSearch cluster statistics, shard, and other system information.
- Etcd by Prometheus
  Etcd Dashboard for Prometheus metrics scraper.
- Helm Release Metrics
  Provides information about system metrics such as CPU and Memory for each Helm release that is filtered by pods.
- ICP Namespaces Performance IBM Provided 2.5
  Provides information about namespace performance and status metrics.
- Cluster Network Health (Calico)
  Calico hosts workload and system metric performance information.
- ICP Performance IBM Provided 2.5
  Provides TCP system performance information about Nodes, Memory, and Containers.
- Kubernetes Cluster Monitoring
  Monitors Kubernetes clusters that use Prometheus. Provides information about cluster CPU, Memory, and Filesystem usage. The dashboard also provides statistics for individual pods, containers, and systemd services.
- Kubernetes POD Overview
  Monitors pod metrics such as CPU, Memory, Network pod status, and restarts.
- NGINX Ingress controller
  Provides information about NGINX Ingress controller metrics that can be sorted by namespace, controller class, controller, and ingress.
- Node Performance Summary
  Provides information about system performance metrics such as CPU, Memory, Disk, and Network for all nodes in the cluster.
- Prometheus Stats
  Dashboard for monitoring Prometheus v2.x.x.
- Storage GlusterFS Health
  Provides GlustersFS Health metrics such as Status, Storage, and Node.
- Rook-Ceph
  Dashboard that provides statistics about Ceph instances.
- Storage Minio Health
  Provides storage and network details about Minio server instances.
Note: If you configure pods to use host level resource such as host network, the dashboards display the metrics of the host but not the pod itself.

If you want to view other data, you can create new dashboards or import dashboards from JSON definition files for Grafana.

Metrics collected out of the box

IBM Cloud Private provides the following exporters to provide metrics. The exporters expose metrics endpoints as Kubernetes services.

node-exporter
Provides the node-level metrics, including metrics for CPU, memory, disk, network, and other components.
kube-state-metrics
Provides the metrics for Kubernetes objects, including metrics for pod, deployment, statefulset, daemonset, replicaset, configmap, service, job, and other objects.
elasticsearch-exporter
Provides metrics for the IBM Cloud Private Elasticsearch logging service, including the status for Elasticsearch cluster, shards, and other components.
collectd-exporter
Provides metrics that are sent from the collectd netowork plug-in.

Some IBM Cloud Private Kubernetes pods provide metrics endpoints for Prometheus:

calico-node
Provides metrics for the Calico nodes.
nginx-ingress-controller
Provides metrics for the Nginx ingress controller.

In addition, Prometheus has preconfigured scrape targets that communicate with several targets to scrape metrics:

cAdvisor
Provides container metrics that include CPU, memory, network, and other components.
Prometheus
Provides metrics for the Prometheus server that include metrics for request handle, alert rule evaluation, TSDB status, and other components.
kubernetes-apiservers
Provide metrics for the Kubernetes API servers.
etcd
Provides metrics for the IBM Cloud Private etcd.

Prometheus displays scrape targets in its user interface as links. These addresses are typically not accessible from a user's browser as they are on the Kubernetes cluster internal network. Only the Prometheus server needs to be able to access the addresses.

Role-based access

Role-base access for monitoring API

A user with role ClusterAdministrator，Administrator or Operator can access monitoring service. A user with role ClusterAdministrator or Administrator can perform write operations in monitoring service, including deleting Prometheus metrics data, and updating Grafana configurations.

Role-base access for monitoring data

Starting with version 1.2.0, the ibm-icpmonitoring Helm chart introduces an important feature. It offers a new module that provides role-based access controls (RBAC) for access to the Prometheus metrics data.

The RBAC module is effectively a proxy that sits in front of the Prometheus client pod. It examines the requests for authorization headers, and at that point, enforces role-based controls. In general, the rules concerning RBAC are as follows:

A user with the role ClusterAdministrator can access any resource. A user with any other role can only access data in the namespaces for which that user is authorized.

Installing monitoring service in IBM Cloud Private

Monitoring service is installed by default during IBM Cloud Private installation. You can also select to install monitoring service from the Catalog or CLI.

Installing monitoring service from the Catalog

You can deploy more monitoring stacks with customized configurations from the Catalog in the IBM Cloud Private management console.

From the Catalog page, click the ibm-icpmonitoring Helm chart to configure and install it.
Provide required values for the following parameters.
- Helm release name: "monitoring"
- Target namespace: "kube-system"
- Mode of deployment: "Managed"
- Cluster access address: Specify the Domain Name Service (DNS) or IP address that is used to access the IBM Cloud Private console.
- Cluster access port: Specify the port that is used to access the IBM Cloud Private console. The default port is 8443.
- etcd address: Specify the Domain Name Service (DNS) or IP address for etcd node(s).

Installing monitoring service from the CLI

Install the Kubernetes command line (kubectl). For information about the kubectl CLI, see Accessing your cluster from the Kubernetes CLI (kubectl).
Install the Helm command line interface (CLI). For information about the Helm CLI, see Installing the Helm CLI (helm).

Install the ibm-icpmonitoring Helm chart. Run the following command:

helm install -n monitoring --namespace kube-system --set mode=managed --set clusterAddress=<IP_address> --set clusterPort=<port> ibm-icpmonitoring-1.4.0.tgz

<IP_address> is the DNS or IP address that is used to access the IBM Cloud Private console.

<port> is the port that is used to access the IBM Cloud Private console.

For more information about parameters that you can configure during installation, see Parameters.

Data persistence configuration

By default, user data in the monitoring service components such as Prometheus, Grafana, or AlertManager, is not stored in persistent volumes. The user data is lost if the monitoring service component crashes. To store user data in persistent volumes, you must configure related parameters when you install the monitoring service. Use one of the following options to enable persistent volumes:

Use volumes that are dynamically provisioned. You must use a storage provider that supports dynamic provisioning. For example, you can configure GlusterFS to dynamically create persistent volumes. During configuration, select the checkbox for Persistent volume, and provide values for the following parameters:
- Size for the persistent volume
- Name of the storageClass for the persistentVolume
Use the existing PersistentVolume. You must manually create the persistent volumes before you install the monitoring service. When you define the persistent volumes, you must create labels that are used as an identification mechanism for monitoring service components.

During configuration, select the checkbox for Persistent volume, and provide values for the following parameters:

Size for the persistent volume
Name of the storageClass for the persistentVolume
Field to select the volume
Value of the field to select the volume

In the following example, the value of Field to select the volume is component. The value of Value of the field to select the volume is prometheus:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
        name: monitoring-prometheus-pv
        labels:
            component: prometheus
    .......

Use the existing PersistentVolumeClaims. You must manually create persistent volumes and persistent volume claims. During configuration, select the checkbox for Persistent volume, and provide a value for the Name of existing persistentVolumeClaim parameter.

For information about creating storage classes, PersistentVolume, and PersistentVolumeClaim, see Storage.

Configuring Prometheus server

You can configure the following Prometheus server parameters during preinstallation or post installation:

scrape_Interval

The parameter for the frequency to scrape targets. The default value is 1 minute (1m).
evaluation_Interval

The parameter for the frequency to evaluate rules. The default value is 1 minute (1m).
retention

The parameter for the frequency to remove old data. The default value is 24 hours (24h).
resources.limits.memory

The parameter for the memory limitation for the Prometheus container. The default value is 2048Mi. The Prometheus container crashes if the memory limitation is not fulfilled. You must increase the value of this parameter to ensure that the Prometheus container can work correctly.

Preinstallation configuration

For monitoring service installation and IBM Cloud Private, you can configure the parameters in the config.yaml file before installation. For example, your config.yaml file might resemble the following content:

monitoring:
  prometheus:
    scrape_Interval: 1m
    evaluation_Interval: 1m
    retention: 24h
    resources:
      limits:
        memory: 2048Mi

If you choose to install monitoring service from the Catalog, you can configure the parameters in related console fields.

Post installation configuration

You can also update parameters after you install the monitoring service.

Update the parameters for scrape_Interval and evaluation_Interval in the monitoring-prometheus ConfigMap:
1. Open the monitoring-prometheus ConfigMap from either the IBM Cloud Private console or use the kubectl CLI.
2. Update values for scrape_Interval: 1m or evaluation_Interval: 1m and submit your changes. The parameter updates are applied to the active Prometheus server. You do not have to restart the server.
To update retention parameters:
1. Open the monitoring-prometheus deployment from either the IBM Cloud Private console or use the kubectl CLI.
2. Update the value for - --storage.tsdb.retention=24h, and submit your changes.

Notes:

When you update the retention or resources.limits.memory value, the active Prometheus pod is deleted, and a new Prometheus pod is started.
Modifications to the ConfigMap or deployments are lost if you redeploy the monitoring chart. For example, if you upgrade to a new version.

Alerts

Default alerts

Capability to install default alerts is available in version 1.3.0 of the ibm-icpmonitoring chart. Some alerts provide customizable parameters to control the alert frequency. You can configure the following alerts during installation.

Node memory usage
Default alert to trigger when the node memory threshold exceeds 85%. The threshold is configurable and is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
prometheus.alerts.nodeMemoryUsage.nodeMemoryUsage.enabled	true
prometheus.alerts.nodeMemoryUsage.nodeMemoryUsageThreshold	85

High CPU Usage
Default alert to trigger when the CPU threshold exceeds 85%. The threshold is configurable and is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
prometheus.alerts.highCPUUsage.enabled	true
prometheus.alerts.highCPUUsage.highCPUUsageThreshold	85

Failed jobs
Default alert if a job did not compete successfully. Is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
prometheus.alerts.failedJobs	true

Elasticsearch cluster health
Default alert triggered if the Elasticsearch cluster of the system is not green. This alert is not installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
prometheus.alerts.elasticsearchClusterHealth	false

Pods terminated
Default alert if a pod was terminated and did not complete successfully. This alert is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
prometheus.alerts.podsTerminated	true

Pods restarting
Default alert is triggered if a pod is restarting more than 5 times in 10 minutes. This is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
prometheus.alerts.podsRestarting	true

Managing alert rules

You can use the Kubernetes custom resource, AlertRule, to manage alert rules in IBM Cloud Private.

The following sample-rule.yaml file is an example of an AlertRule resource definition.

apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: AlertRule
metadata:
  name: sample-rule
spec:
  enabled: true
  data: |-
    groups:
      - name: a.rules
        rules:
          - alert: NodeMemoryUsage
            expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes))/ node_memory_MemTotal_bytes) * 100 > 5
            annotations:
              DESCRIPTION: '{{ $labels.instance }}: Memory usage is above the 15% threshold.  The current value is: {{ $value }}.'
              SUMMARY: '{{ $labels.instance }}: High memory usage detected'

You must provide the following parameter values:

apiVersion
monitoringcontroller.cloud.ibm.com/v1
kind
AlertRule
spec.data
Contains the content of the alert rule. For detailed information about alert rule files, see Recording Rules .
spec.enabled
Set the flag to specify whether the alert rule is enabled or not enabled.

Use kubectl to manage alert rules.

Create new alert rule.

kubectl apply -f sample-rule.yaml -n kube-system

Edit existing alert rules.

kubectl edit alertrules/sample-rule -n kube-system

Delete existing alert rules.

kubectl delete alertrules/sample-rule -n kube-system

Configuring AlertManager

You can configure Prometheus AlertManager to integrate external alert service receivers, such as Slack or PagerDuty, for IBM Cloud Private.

Important: ConfigMap changes are lost when you upgrade, roll back, or update the monitoring release. In addition, the ConfigMap format can change between releases.

Edit configuration map monitoring-prometheus-alertmanager to update AlertManager configurations.
```
kubectl edit configmap monitoring-prometheus-alertmanager -n kube-system
```
For more information about configuring AlertManager, see Configuration and Notification template examples
Allow several minutes for the updates to take effect. Open the AlertManager dashboard at https://<Cluster Master Host>:<Cluster Master API Port>/alertmanager. Where, <Cluster Master Host>:<Cluster Master API Port> is defined in Master endpoint.
- If you configured alerts, and they are triggered, you can see the alerts in the AlertManager dashboard.
- If you configured an external alert receiver such as Slack or PagerDuty, you can view the alerts in the dashboard for that particular service.
- You can return to the dashboards to view alerts at any time.

Managing Grafana dashboards

You can manage Grafana dashboards by operating on a Kubernetes custom resource MonitoringDashboard in IBM Cloud Private. The following sample-dashboard.yaml file is an example of a MonitoringDashboard resource definition.

apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: MonitoringDashboard
metadata:
  name: sample-dashboard
spec:
  enabled: true
  data: |-
    {
        "id": null,
        "uid": null,
        "title": "Marco Test Dashboard",
        "tags": [ "test" ],
        "timezone": "browser",
        "schemaVersion": 16,
        "version": 1
      }

You must provide the following parameter values:

apiVersion
monitoringcontroller.cloud.ibm.com/v1
kind
MonitoringDashboard
spec.data
Contains the content of the Grafana dashboard definition file. For more information about dashboard files, see Dashboard JSON .
spec-enabled
Set the flag to specify whether the dashboard is enabled or not enabled.

You can use kubectl to manage the dashboard.

Create a new dashboard using the sample .yaml file.

kubectl apply -f sample-dashboard.yaml -n kube-system

Edit the sample dashboard.

kubectl edit monitoringdashboards/sample-dashboard -n kube-system

Delete the sample dashboard.

kubectl delete monitoringdashboards/sample-dashboard -n kube-system

Configure applications to use monitoring service

Modify the application to expose the metrics.

For applications that have a metrics endpoint, you must define the metrics endpoint as a Kubernetes service by using the annotation prometheus.io/scrape: 'true'. The service definition resembles the following code:
```
   apiVersion: v1
   kind: Service
   metadata:
     annotations:
       prometheus.io/scrape: 'true'
     labels:
       app: liberty
     name: liberty
   spec:
     ports:
     - name: metrics
       targetPort: 5556
       port: 5556
       protocol: TCP
     selector:
       app: liberty
     type: ClusterIP
```
Note: For more information about configuring the metrics endpoint for Prometheus, see CLIENT LIBRARIES in the Prometheus documentation.

Applications can have more than one port defined in the service definition. You might not want to expose monitoring metrics on some ports or have the ports be discovered by Prometheus. You can add annotation filter.by.port.name: 'true' so the port whose name does not start with metrics is ignored by Prometheus. In the following service definition, Prometheus collects metrics from port metrics, and ignores metrics from port collector.

      apiVersion: v1
      kind: Service
      metadata:
        annotations:
          prometheus.io/scrape: 'true'
          filter.by.port.name: 'true'
        labels:
          app: liberty
        name: liberty
      spec:
        ports:
        - name: metrics
          targetPort: 5556
          port: 5556
          protocol: TCP
        - name: collector
          targetPort: 8443
          port: 8443
          protocol: TCP
        selector:
          app: liberty
        type: ClusterIP

For applications that have a metrics endpoint with TLS enabled, you must use IBM Cloud Private cert-manager to generate a secret and use it to configure the metrics endpoint.

Use cert-manager to create a certificate resource for a workload.

     apiVersion: certmanager.k8s.io/v1alpha1
     kind: Certificate
     metadata:
       name: {{ .Release.Name }}-foo-certs
       namespace: {{ .Release.Namespace }}
     spec:
       secretName: {{ .Release.Name }}-foo-certs
       issuerRef:
         name: icp-ca-issuer
         kind: ClusterIssuer
       commonName: "foo"
       dnsNames:
            - "*.{{ .Release.Namespace }}.pod.cluster.local"

Mount the secret to your pod. You can retrieve the cert/key from the mounted path. Under the mounted path, there are two files named tls.crt and tls.key. tls.crt includes a workload cert file and a ca cert file that must use to configure the application metrics endpoint.

    containers:
      - image: foo-image:latest
        name: foo
        volumeMounts:
          - mountPath: "/foo/certs"
            name: certs
    volumes:
      - name: certs
        secret:
          # secretName should be the same as the one defined in step 1.
         secretName: {{ .Release.Name }}-foo-certs

Define annotations on workload service to allow Prometheus to use TLS to scrape metrics, prometheus.io/scrape and prometheus.io/scheme.

     apiVersion: v1
     kind: Service
     metadata:
       annotations:
         prometheus.io/scrape: 'true'
         prometheus.io/scheme: 'https'

For applications that use collectd and depend on collectd-exporter to expose metrics, you update collectd configuration file within the application container. In this configuration file, you must add the network plug-in and point to collectd exporter. Add the following text to the configuration file:
```
      LoadPlugin network
      <Plugin network>
          Server "monitoring-prometheus-collectdexporter.kube-system" "25826"
      </Plugin>
```

Logs and metrics management for Prometheus

You can modify the time period for metric retention by updating the storage.tsdb.retention parameter in the config.yaml file. By default this value is set at 24h, which means that the metrics are kept for 24 hours and then purged. See Configuring the monitoring service.

However, if you need to manually remove this data from the system, you can use the rest API that is provided by the Prometheus component.

To delete metrics data, see Delete Series .
To remove the deleted data from the disk, and clean up the disk space, see Clean Tombstones .

The target URL must have the format:

https://<IP_address>:<Port>/prometheus

<IP_address> is the IP address that is used to access the management console.
<Port> is the port that is used to access the management console.
- The command to delete metrics data resembles the following code:
```
 https://<IP_address>:<Port>/prometheus/api/v1/admin/tsdb/delete_series?*******
```
- The command to remove deleted data and clean up the disk, resembles the following code:
```
 https://<IP_address>:<Port>/prometheus/api/v1/admin/tsdb/clean_tombstones
```

Accessing monitoring service APIs

You can access monitoring service APIs such as Prometheus and Grafana APIs. Before you can access the APIs, you must obtain authentication tokens to specify in your request headers. For information about obtaining authentication tokens, see Preparing to run component or management API commands.

After you obtain the authentication tokens, complete the following steps to access the Prometheus and Grafana APIs.

Access the Prometheus API at url, https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/* and get boot times of all nodes.
- $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
- <Cluster Master Host> and <Cluster Master API Port> are defined in Master endpoints.
```
curl -k -s -X GET -H "Authorization:Bearer $ACCESS_TOKEN" https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/api/v1/query?query=node_boot_time_seconds
```
For detailed information about Prometheus APIs, see Prometheus HTTP API .
Access the Grafana API at url, https://<Cluster Master Host>:<Cluster Master API Port>/grafana/* and obtain the sample dashboard.
- $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
- <Cluster Master Host> and <Cluster Master API Port> are defined in Master endpoints.
```
curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN” "https://<Cluster Master Host>:<Cluster Master API Port>/grafana/api/dashboards/db/sample"
```
For detailed information about Grafana APIs, see Grafana HTTP API Reference .

Support for custom cluster access URL in monitoring service

You can customize the cluster access URL. For more information, see Customizing the cluster access URL. After you complete the customization, you must manually edit the deployments for Prometheus and Alertmanager and verify that all external links are correct.

`monitoring-prometheus` deployment

Edit the monitoring-prometheus deployment from the IBM Cloud Private console or by using kubectl. For example:

 kubectl edit deployment monitoring-prometheus -n kube-system

In the monitoring-prometheus deployment, change --web.external-url=* to the following:

 --web.external-url=https://<custom_host>:<custom_port>/prometheus

<custom_host> and <custom_port> are the customized host name and port that you defined in the custom cluster access URL.

`monitoring-prometheus-alertmanager` deployment

Edit the monitoring-prometheus-alertmanager deployment from the IBM Cloud Private console or by using kubectl. For example:

 kubectl edit deployment monitoring-prometheus -n kube-system

In the monitoring-prometheus-alertmanager deployment, change --web.external-url=* to the following:

 https:<custom_host>:<custom_port>/alertmanager