Monitoring service for foundational services versions 3.7 and prior

The IBM Cloud Pak® foundational services monitoring service is built on top of the Prometheus stack. It provides a preconfigured, self-updating monitoring service for clusters and applications.

This topic covers information about how to use the monitoring service in foundational services versions 3.7 and prior.

Features

Metrics visualization

Grafana is installed to query and visualize your metrics. Built-in dashboards for cluster metrics are created by default. You can also create your own custom dashboards.

Multi-tenancy

Monitoring provides isolation at the Kubernetes namespace level. A Grafana organization is created automatically for each Kubernetes namespace. Users can access only the dashboards and metrics for the namespaces to which they have access in Red Hat OpenShift Container Platform.

Alerts

Alerts can be triggered automatically and sent to third-party applications such as Slack and PagerDuty.

Customization

Adopters and users can easily integrate with the monitoring service to query and visualize their application metrics, and to create alerts.

Operators

ibm-monitoring-exporters-operator

This operator installs kube-state-metrics and nodeexporter for cluster metrics collection. Install this operator only when IBM Cloud Pak foundational services Prometheus is configured as the Grafana data source.

ibm-monitoring-prometheusext-operator

This operator installs Prometheus and Alertmanager. It is an extension of the community Prometheus operator. Install this operator only when IBM Cloud Pak foundational services Prometheus is configured as the Grafana data source.

ibm-monitoring-grafana-operator

This operator installs Grafana.

IBM Cloud Pak foundational services monitoring is a single-instance service. Therefore, only one instance of the Grafana pod runs in your cluster.

Red Hat OpenShift Container Platform monitoring mode and IBM Cloud Pak foundational services monitoring mode

Evolution of Red Hat OpenShift Container Platform monitoring

In Red Hat® OpenShift® Container Platform version 4.3 and earlier, Red Hat® OpenShift® Container Platform does not provide monitoring for application metrics. The IBM Cloud Pak foundational services monitoring service installs a full Prometheus stack that includes Exporters, Prometheus, Alertmanager, and Grafana.

Red Hat® OpenShift® Container Platform version 4.4 introduced its monitoring feature as a technology preview. IBM Cloud Pak foundational services version 3.5 also introduced a technology preview that provides a migration path to Red Hat® OpenShift® Container Platform monitoring.

Red Hat® OpenShift® Container Platform monitoring is generally available in version 4.6. IBM Cloud Pak foundational services version 3.6 offers two monitoring modes, one of which configures the Red Hat® OpenShift® Container Platform monitoring Prometheus as the Grafana data source.

Introduction of two monitoring service modes

IBM Cloud Pak foundational services version 3.6 includes two monitoring service modes: IBM Cloud Pak foundational services monitoring and Red Hat® OpenShift® Container Platform monitoring.

Red Hat® OpenShift® Container Platform monitoring means that IBM Cloud Pak foundational services monitoring installs only Grafana, and the Prometheus that Red Hat® OpenShift® Container Platform monitoring provides is configured as the Grafana data source.

IBM Cloud Pak foundational services (CS) monitoring means that IBM Cloud Pak foundational services installs its full Prometheus stack, which is configured as the Grafana data source. You must configure this mode before installation. This mode is the default.

Support for monitoring modes
Mode	Operators	Support for OCP 4.5 and earlier	Support for OCP 4.6 and later
IBM Cloud Pak foundational services Monitoring	Exporter, PrometheusExt, Grafana	Yes	Yes
Red Hat® OpenShift® Container Platform Monitoring	Grafana	No	Yes
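
The support table can be read as a small lookup. The following Python sketch is only an illustration. The mode keys `cs` and `openshift` and the function name are assumptions made for this example, not identifiers that the product defines.

```python
# Minimum OpenShift version per monitoring mode, taken from the support table.
# None means that the table lists no minimum version for the mode.
# The keys "cs" and "openshift" are labels chosen for this sketch.
MIN_OCP_VERSION = {
    "cs": None,
    "openshift": (4, 6),
}


def is_mode_supported(mode: str, ocp_version: str) -> bool:
    """Return True if the table lists the mode as supported on ocp_version."""
    major, minor = (int(part) for part in ocp_version.split(".")[:2])
    minimum = MIN_OCP_VERSION[mode]
    return minimum is None or (major, minor) >= minimum


print(is_mode_supported("openshift", "4.5"))
print(is_mode_supported("cs", "4.5"))
```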

Note: IBM Cloud Pak foundational services monitoring mode is removed from IBM Cloud Pak foundational services version 3.8.x onwards. For more information, see Monitoring service for foundational services versions 3.8 and later.

Accessing the monitoring dashboard
Role-based access control (RBAC)
Installing monitoring service
Configuring monitoring service
Configuring applications to use monitoring service
Managing Grafana dashboards
Alerts
Accessing monitoring service APIs

Accessing the monitoring dashboard

Log in to the IBM Cloud Pak foundational services console.

Note: When you log in to the console, you have administrative access to Grafana. Do not create more users within the Grafana dashboard or modify the existing users or organizations.
To access the Grafana dashboard, click Menu > Monitor Health > Monitoring.

Alternatively, you can open https://<IP_address>:<port>/grafana, where <IP_address> is the DNS or IP address that is used to access the console. <port> is the port that is used to access the console.

Note: If you are logged in as a Cluster Administrator, you can also access the Monitoring dashboard from the Administration panel dashboard. To open the Administration panel, click Home in the main navigation menu. The Administration panel gives Cluster Administrators an overview of clusters, including key metrics for various services and components, and provides links to other dashboards, pages, and consoles for administering those services and components. To open the Grafana dashboard, click the Monitoring link on the Welcome widget. Only Cluster Administrators can access the Administration panel dashboard.
(For CS monitoring mode) To access the Alertmanager dashboard, open https://<IP_address>:<port>/alertmanager.
(For CS monitoring mode) To access the Prometheus dashboard, open https://<IP_address>:<port>/prometheus.
The following default Grafana dashboards are created in the Grafana main-org. To view them, a user must first be granted access to the ibm-common-services namespace.
- Namespaces Performance IBM Provided 2.5
  Provides information about namespace performance and status metrics.
- Performance IBM Provided 2.5
  Provides TCP system performance information about Nodes, Memory, and Containers.
- Kubernetes Cluster Monitoring
  Monitors Kubernetes clusters that use Prometheus. Provides information about cluster CPU, Memory, and Filesystem usage. The dashboard also provides statistics for individual pods, containers, and systemd services.
- Kubernetes POD Overview
  Monitors pod metrics such as CPU, Memory, Network pod status, and restarts.
- NGINX Ingress controller
  Provides information about NGINX Ingress controller metrics that can be sorted by namespace, controller class, controller, and ingress.
- Node Performance Summary
  Provides information about system performance metrics such as CPU, Memory, Disk, and Network for all nodes in the cluster.
- Prometheus Stats
  Dashboard for monitoring Prometheus v2.x.x.

Role-based access control (RBAC)

RBAC for monitoring API

A user with the ClusterAdministrator, Administrator, or Operator role can access the monitoring service. A user with the ClusterAdministrator or Administrator role can perform write operations in the monitoring service, including deleting Prometheus metrics data and updating Grafana configurations.

RBAC for monitoring data

Starting with version 1.2.0, the ibm-icpmonitoring Helm chart introduces an important feature. It offers a new module that provides role-based access controls (RBAC) for access to the Prometheus metrics data.

The RBAC module is effectively a proxy that sits in front of the Prometheus client pod. It examines each request for authorization headers and enforces role-based controls at that point. The general RBAC rules are as follows.

A user with the ClusterAdministrator role can access any resource. A user with any other role can access data in only the namespaces for which that user is authorized.

If metrics data includes the kubernetes_namespace label, the data is treated as belonging to the namespace that the label value names. If metrics data has no such label, it is treated as system-level metrics. Only users with the ClusterAdministrator role can access system-level metrics.
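
The rules for metrics data can be summarized in a few lines of Python. This is a minimal sketch of the decision logic as described here, not the code of the RBAC proxy. The function name and the shape of its arguments are assumptions made for illustration.

```python
def can_read_metric(role: str, authorized_namespaces: set, labels: dict) -> bool:
    """Decide whether a user may read one metric sample.

    role: the user's role, for example "ClusterAdministrator" or "Operator".
    authorized_namespaces: namespaces that the user is authorized to access.
    labels: the label set of the metric sample.
    """
    # A ClusterAdministrator can access any resource.
    if role == "ClusterAdministrator":
        return True

    namespace = labels.get("kubernetes_namespace")
    # Without the label, the sample counts as a system-level metric,
    # which only a ClusterAdministrator can access.
    if namespace is None:
        return False

    # Any other role sees only data in namespaces it is authorized for.
    return namespace in authorized_namespaces


sample = {"__name__": "http_requests_total", "kubernetes_namespace": "test"}
print(can_read_metric("Operator", {"test"}, sample))
```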

In an IBM Multicloud Manager hub cluster environment, users can access metrics from managed clusters. A user with the ClusterAdministrator role can access data from all managed clusters. A user with any other role can access data from only the managed clusters whose related namespaces that user is authorized to access.

RBAC for monitoring dashboards

Starting with version 1.5.0, the ibm-icpmonitoring Helm chart offers a new module that provides role-based access controls (RBAC) for access to the monitoring dashboards in Grafana.

In Grafana, users can belong to one or more organizations. Each organization contains its own settings for resources such as data sources and dashboards. For the Grafana running in your product, each namespace in your product has a corresponding organization with the same name. For example, if you create a new namespace that is named test in your product, an organization that is named test is generated in Grafana. If you delete the test namespace, the test organization is also removed. The only exception is the ibm-common-services namespace. The corresponding organization for ibm-common-services is the Grafana default of Main Org.

When you log in to your product, you can access a Grafana organization only if you are authorized to access the corresponding namespace. If you have access to more than one Grafana organization, use the Grafana console to switch to a different organization. The message UNAUTHORIZED appears when you do not have access to a Grafana organization.

Different users access Grafana organizations by using different organization roles. In the corresponding namespace, if you are assigned the role of ClusterAdministrator or Administrator, you have Admin access to the Grafana organization. Otherwise, you have Viewer access to the Grafana organization.
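
The namespace-to-organization and role mappings can be sketched as follows. This is an illustrative sketch only. The function names are hypothetical, and the string `Main Org.` is an assumption based on the name that Grafana gives its default organization.

```python
# Assumed name of the Grafana default organization.
MAIN_ORG = "Main Org."


def grafana_org_for_namespace(namespace: str) -> str:
    """Map a namespace to the Grafana organization that holds its dashboards."""
    # ibm-common-services is the one exception: it maps to the default organization.
    if namespace == "ibm-common-services":
        return MAIN_ORG
    return namespace


def grafana_org_role(product_role: str) -> str:
    """Map a role in the namespace to a Grafana organization role."""
    if product_role in {"ClusterAdministrator", "Administrator"}:
        return "Admin"
    return "Viewer"


print(grafana_org_for_namespace("test"), grafana_org_role("Operator"))
```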

When you access Grafana as a user of your product, a user with the same name is created in Grafana. If the user in your product is deleted, the corresponding user is not deleted from Grafana. The user account becomes stale. Run the following command to request the removal of stale users:

  curl -k -s -X POST -H "Authorization:Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/grafana/check_stale_users"
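
If you prefer to script the request, the same call can be sketched with the Python standard library. This sketch assumes that the endpoint accepts a Bearer token, as the API examples in this topic do. The function names, the host name, and the port in the usage line are placeholders, not product values.

```python
import ssl
import urllib.request


def build_stale_user_request(host: str, port: int, access_token: str) -> urllib.request.Request:
    """Build the POST request that asks Grafana to remove stale users."""
    url = f"https://{host}:{port}/grafana/check_stale_users"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {access_token}"},
    )


def send_without_verification(request: urllib.request.Request):
    """Send the request and skip certificate verification, like curl -k."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.status, response.read().decode("utf-8")


# Building the request does not contact the cluster.
request = build_stale_user_request("master.example.com", 8443, "my-token")
print(request.get_method(), request.full_url)
```

Call `send_without_verification(request)` only against a cluster that you trust, because it disables certificate checks.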

For information about Grafana APIs, see Accessing monitoring service APIs.

Note: Monitoring service does not provide RBAC support for Prometheus and Alertmanager alerts.

Installing monitoring service

Prerequisites

Foundational service
The monitoring service depends on other services that are provided by IBM Cloud Pak foundational services. If IBM Cloud Pak foundational services is not installed in your OpenShift cluster, see Installing IBM Cloud Pak foundational services online to install the bootstrap operator and initial custom resource (CR) instances in the ibm-common-services namespace.
Dynamic volume provisioning and storage class for CS monitoring

Prometheus and Alertmanager, which are included in the IBM Cloud Pak foundational services monitoring service, store metrics and alerts on persistent volumes (PV). A storage class that supports the ReadWriteOnce (RWO) access mode, and its corresponding provisioner, are required. The cluster default storage class is used by default.
Monitoring for user-defined projects must be enabled and configured for Red Hat® OpenShift® Container Platform.

If you want to use Red Hat OpenShift Container Platform monitoring as a data source for IBM Cloud Pak foundational services Grafana on OpenShift version 4.6, you must first enable monitoring for user-defined projects. For more information, see Enabling monitoring for user-defined projects.

For configuration information, see the OpenShift documentation.

Installing IBM Cloud Pak foundational services

Complete the following steps to install IBM Cloud Pak foundational services. For more information, see Installing IBM Cloud Pak foundational services online.

Create or edit the OperandRequest CR.

The following example resembles a CR for Red Hat OpenShift Container Platform monitoring.

apiVersion: operator.ibm.com/v1alpha1
kind: OperandRequest
metadata:
  name: common-service
  namespace: ibm-common-services
spec:
  requests:
    - operands:
        - name: ibm-monitoring-grafana-operator
      registry: common-service

For IBM Cloud Pak foundational services version 3.6, you must edit the OperandConfig CR to enable Red Hat OpenShift Container Platform monitoring, as shown in the following example. You must enable Red Hat OpenShift Container Platform monitoring before you create the OperandRequest CR in Step 1.

   - name: ibm-monitoring-grafana-operator
     spec:
       grafana:
         datasourceConfig:       ### this is the configuration
           type: "openshift"     ### to enable OCP monitoring
       operandRequest: {}

Example CR for CS monitoring:

apiVersion: operator.ibm.com/v1alpha1
kind: OperandRequest
metadata:
  name: common-service
  namespace: ibm-common-services
spec:
  requests:
    - operands:
        - name: ibm-monitoring-exporters-operator
        - name: ibm-monitoring-prometheusext-operator
        - name: ibm-monitoring-grafana-operator
      registry: common-service

Run the following command to check the status of your pods.

oc get po -n ibm-common-services | grep monitoring

For Red Hat OpenShift Container Platform monitoring:

Your output might resemble the following example, which shows that all pods are Running and all containers are available; for example, 4/4 for Grafana.

   ibm-monitoring-grafana-5b9bbdcd-495dg              4/4     Running     15         3d21h
   ibm-monitoring-grafana-operator-76bc8bbdc8-5vsns   1/1     Running     0          3d22h

Note: Four containers are running in the Grafana pod for Red Hat OpenShift Container Platform monitoring, and three containers are running in the Grafana pod for CS monitoring.

For CS monitoring:

   alertmanager-ibm-monitoring-alertmanager-0                3/3       Running                      0          6m46s
   ibm-monitoring-collectd-694dd7868-wsvss                   2/2       Running                      0          6m48s
   ibm-monitoring-exporters-operator-55fd6c876d-44h67        1/1       Running                      0          9m37s
   ibm-monitoring-grafana-7cbc65885f-gnsgk                   3/3       Running                      4          8m55s
   ibm-monitoring-grafana-operator-c8867db64-7b4lj           1/1       Running                      0          9m22s
   ibm-monitoring-kube-state-6f588b8dfd-fl447                2/2       Running                      0          6m47s
   ibm-monitoring-mcm-ctl-6647759b47-2qfv8                   1/1       Running                      0          6m46s
   ibm-monitoring-nodeexporter-6qlhg                         2/2       Running                      0          6m48s
   ibm-monitoring-nodeexporter-7jgsb                         2/2       Running                      0          6m48s
   ibm-monitoring-nodeexporter-nw5qg                         2/2       Running                      0          6m48s
   ibm-monitoring-prometheus-operator-6bbb48d8cb-wd5r5       1/1       Running                      0          7m18s
   ibm-monitoring-prometheus-operator-ext-86cdbc7644-qb4ph   1/1       Running                      0          9m20s
   prometheus-ibm-monitoring-prometheus-0                    4/4       Running                      4          6m47s
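
To check the pod list in a script instead of by eye, you can parse the command output. The following Python sketch assumes the default column layout of oc get po (NAME, READY, STATUS, and so on). The function name is hypothetical.

```python
def pods_not_ready(oc_output: str) -> list:
    """Return the names of pods that are not Running with all containers ready."""
    problems = []
    for line in oc_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, if it is present.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[:3]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


sample = """\
ibm-monitoring-grafana-5b9bbdcd-495dg              4/4     Running     15         3d21h
ibm-monitoring-grafana-operator-76bc8bbdc8-5vsns   1/1     Running     0          3d22h
"""
print(pods_not_ready(sample))
```

An empty list means that every listed pod is Running with all of its containers available.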

Configuring monitoring service

You can configure the monitoring service by editing the Operand Deployment Lifecycle Manager (ODLM) OperandConfig CR. Following is an example of a default CR.

apiVersion: operator.ibm.com/v1alpha1
kind: OperandConfig
metadata:
  name: common-service
  namespace: ibm-common-services
spec:
  services:
    - name: ibm-monitoring-exporters-operator
      spec:
        exporter: {}
    - name: ibm-monitoring-prometheusext-operator
      spec:
        prometheusExt: {}
    - name: ibm-monitoring-grafana-operator
      spec:
        grafana: {}

You can update the configuration parameters. For more information, see Configuring IBM Cloud Pak foundational services by using the CommonService custom resource.

Configuring applications to use the monitoring service

You can configure your applications in any namespace to expose metrics to the monitoring service.

Create a Service object and add specified annotations to it. This step is required for CS monitoring.
- prometheus.io/scrape: 'true'
  
  Required.
- prometheus.io/scheme: 'https'
  
  Optional. Use this parameter when TLS is enabled for your metrics endpoint. Prometheus is configured to skip certificate verification so you can use any certificate to secure your endpoint. For example, you can use Red Hat OpenShift Container Platform annotation, service.beta.openshift.io/serving-cert-secret-name or the IBM Certificate Manager service.
- prometheus.io/path
  
  Optional. Use this parameter when the path of your metrics endpoint is not the default, /metrics.
- prometheus.io/port
  
  Optional. Use this parameter to specify the port for metrics.

The following example illustrates annotations for metrics. It also illustrates how to create certificates by using the Red Hat OpenShift Container Platform service.beta.openshift.io/serving-cert-secret-name annotation.

  apiVersion: v1
  kind: Service
  metadata:
    name: prometheus-metrics-server-demo
    namespace: default
    labels:
      name: prometheus-metrics-server-demo
    annotations:
      ## Generate certificate secret which is used by metrics pod. Only works on OpenShift
      service.beta.openshift.io/serving-cert-secret-name: prometheus-metrics-server-demo
      ## enable cs monitoring metrics scrape
      prometheus.io/scrape: "true"
      ## it uses 8443 port which is https. 
      ## comment it out to use 8080 port
      prometheus.io/scheme: "https"
  spec:
    ports:
    - name: https
      port: 8443
      protocol: TCP
      targetPort: 8443
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
    selector:
      name: prometheus-metrics-server-demo
    type: ClusterIP

You can choose to add the annotations to Pod objects instead of Service objects. However, Service objects are recommended because they support TLS.
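
To see how the four annotations combine, the following Python sketch derives the URL that a scrape would target. It illustrates the annotation semantics only and is not how Prometheus itself resolves targets. The function name is hypothetical, and the http default for the scheme is an assumption based on common Prometheus conventions.

```python
from typing import Optional


def scrape_url(host: str, annotations: dict) -> Optional[str]:
    """Return the metrics URL implied by the annotations, or None if not scraped."""
    # prometheus.io/scrape is required; without it the object is ignored.
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    scheme = annotations.get("prometheus.io/scheme", "http")  # assumed default
    path = annotations.get("prometheus.io/path", "/metrics")
    port = annotations.get("prometheus.io/port")
    netloc = f"{host}:{port}" if port else host
    return f"{scheme}://{netloc}{path}"


annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/scheme": "https",
    "prometheus.io/port": "8443",
}
print(scrape_url("prometheus-metrics-server-demo.default.svc", annotations))
```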

Create ServiceMonitor or PodMonitor CRs in the same namespace as your Service object. This step is required for Red Hat OpenShift Container Platform monitoring. For more information, see the Prometheus design documentation.

Following is an example of a ServiceMonitor CR.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-metrics-server-demo
  namespace: default
spec:
  selector:
    matchLabels:
      name: prometheus-metrics-server-demo
  endpoints:
  - scheme: https
    port: https
    tlsConfig:
      insecureSkipVerify: true

Following is an example of a PodMonitor CR. PodMonitor CRs are not recommended because they do not support TLS.

  apiVersion: monitoring.coreos.com/v1
  kind: PodMonitor
  metadata:
    name: prometheus-metrics-server-demo
    namespace: default
  spec:
    selector:
      matchLabels:
       name: prometheus-metrics-server-demo
    podMetricsEndpoints:
    - scheme: http
      targetPort: 9157

Managing Grafana dashboards

You can create custom Grafana dashboards by creating MonitoringDashboard CRs. You can create the CRs in any namespace, and the dashboards appear in the corresponding Grafana organization.

Notes:

You must switch to the Grafana organization before you browse the dashboard.
Dashboards that are created directly in Grafana are lost when you restart pods.

Create a dashboard on Grafana, and then generate a JSON string for the dashboard. From the dashboard, click Dashboard settings > JSON Model. For more information about dashboard files, see Dashboard JSON.

Create the MonitoringDashboard CR in the following format:

apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: MonitoringDashboard
metadata:
  name: sample-dashboard
spec:
  enabled: true
  data: |-
     {
        ...
     }

Copy the generated JSON string and use it as the value in the spec.data field of the MonitoringDashboard CR from Step 2.

Note: Remove id and uid fields of the top-level object.

Following is an example of the MonitoringDashboard CR.

   apiVersion: monitoringcontroller.cloud.ibm.com/v1
   kind: MonitoringDashboard
   metadata:
     name: dashboard-demo
     namespace: default
   spec:
     enabled: true
     data: |-
      {
        "annotations": {
          "list": [
            {
              "builtIn": 1,
              "datasource": "-- Grafana --",
              "enable": true,
              "hide": true,
              "iconColor": "rgba(0, 211, 255, 1)",
              "name": "Annotations & Alerts",
              "type": "dashboard"
            }
          ]
        },
        "editable": true,
        "gnetId": null,
        "graphTooltip": 0,
        "links": [],
        "panels": [
          {
            "cacheTimeout": null,
            "colorBackground": false,
            "colorValue": false,
            "colors": [
              "#299c46",
              "rgba(237, 129, 40, 0.89)",
              "#d44a3a"
            ],
            "datasource": "prometheus",
            "format": "none",
            "gauge": {
              "maxValue": 100,
              "minValue": 0,
              "show": false,
              "thresholdLabels": false,
              "thresholdMarkers": true
            },
            "gridPos": {
              "h": 9,
              "w": 12,
              "x": 0,
              "y": 0
            },
            "id": 2,
            "interval": null,
            "links": [],
            "mappingType": 1,
            "mappingTypes": [
              {
                "name": "value to text",
                "value": 1
              },
              {
                "name": "range to text",
                "value": 2
              }
            ],
            "maxDataPoints": 100,
            "nullPointMode": "connected",
            "nullText": null,
            "options": {},
            "postfix": "",
            "postfixFontSize": "50%",
            "prefix": "",
            "prefixFontSize": "50%",
            "rangeMaps": [
              {
                "from": "null",
                "text": "N/A",
                "to": "null"
              }
            ],
            "sparkline": {
              "fillColor": "rgba(31, 118, 189, 0.18)",
              "full": false,
              "lineColor": "rgb(31, 120, 193)",
              "show": false,
              "ymax": null,
              "ymin": null
            },
            "tableColumn": "",
            "targets": [
              {
                "expr": "sum(kube_pod_info{namespace=~\"ibm-common-services\"})",
                "refId": "A"
              }
            ],
            "thresholds": "",
            "timeFrom": null,
            "timeShift": null,
            "title": "Demo Panel",
            "type": "singlestat",
            "valueFontSize": "80%",
            "valueMaps": [
              {
                "op": "=",
                "text": "N/A",
                "value": "null"
              }
            ],
            "valueName": "avg"
          }
        ],
        "schemaVersion": 21,
        "style": "dark",
        "tags": [],
        "templating": {
          "list": []
        },
        "time": {
          "from": "now-6h",
          "to": "now"
        },
        "timepicker": {
          "refresh_intervals": [
            "5s",
            "10s",
            "30s",
            "1m",
            "5m",
            "15m",
            "30m",
            "1h",
            "2h",
            "1d"
          ]
        },
        "timezone": "",
        "title": "Demo Dashboard",
        "version": 0
      }

Save the YAML string as a file and run the oc apply -f <file location> command.
Log in to Grafana and switch to the default organization, which corresponds to the namespace of the CR, to check the new dashboard.
To delete the dashboard, run the oc delete monitoringdashboards/dashboard-demo -n default command.
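
The steps that remove the top-level id and uid fields and embed the JSON model in spec.data can be sketched in Python as follows. The function name is hypothetical, and the input dictionary stands in for the JSON model that you copy from Grafana.

```python
import json


def to_monitoring_dashboard(name: str, namespace: str, dashboard: dict) -> dict:
    """Wrap a Grafana dashboard JSON model in a MonitoringDashboard CR."""
    # Remove id and uid from the top-level object only; panel ids are kept.
    cleaned = {key: value for key, value in dashboard.items() if key not in ("id", "uid")}
    return {
        "apiVersion": "monitoringcontroller.cloud.ibm.com/v1",
        "kind": "MonitoringDashboard",
        "metadata": {"name": name, "namespace": namespace},
        # spec.data holds the dashboard as a JSON string.
        "spec": {"enabled": True, "data": json.dumps(cleaned, indent=2)},
    }


exported = {"id": 7, "uid": "abc123", "title": "Demo Dashboard", "panels": [{"id": 2}]}
print(json.dumps(to_monitoring_dashboard("dashboard-demo", "default", exported), indent=2))
```

The printed JSON can be saved to a file and applied with oc apply -f in the same way as a YAML file.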

Alerts

Default alerts created by IBM Cloud Pak foundational services monitoring

Capability to install default alerts is available in version 1.3.0 of the ibm-icpmonitoring chart. Some alerts provide customizable parameters to control the alert frequency. You can configure the following alerts during installation.

Node memory usage

Default alert that is triggered when node memory usage exceeds the threshold, which is 85% by default. The threshold is configurable, and the alert is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
`prometheus.alerts.nodeMemoryUsage.nodeMemoryUsage.enabled`	True
`prometheus.alerts.nodeMemoryUsage.nodeMemoryUsageThreshold`	85

High CPU Usage

Default alert that is triggered when CPU usage exceeds the threshold, which is 85% by default. The threshold is configurable, and the alert is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
`prometheus.alerts.highCPUUsage.enabled`	True
`prometheus.alerts.highCPUUsage.highCPUUsageThreshold`	85

Failed jobs

Default alert that is triggered if a job did not complete successfully. This alert is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
`prometheus.alerts.failedJobs`	True

Pods terminated

Default alert that is triggered if a pod was terminated and did not complete successfully. This alert is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
`prometheus.alerts.podsTerminated`	True

Pods restarting

Default alert that is triggered if a pod restarts more than five times in 10 minutes. This alert is installed by default. If you use the CLI, the following values control this alert:

Field	Default Value
`prometheus.alerts.podsRestarting`	True

Managing alert rules

You can use the Kubernetes custom resource, PrometheusRule, to manage alert rules in your product.

The following sample-rule.yaml file is an example of a PrometheusRule resource definition:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
    labels:
        component: icp-prometheus
    name: sample-rule
spec:
    groups:
      - name: a.rules
        rules:
          - alert: NodeMemoryUsage
            expr: ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes))/ node_memory_MemTotal_bytes) * 100 > 5
            annotations:
              DESCRIPTION: '{{ $labels.instance }}: Memory usage is greater than the 5% threshold.  The current value is: {{ $value }}.'
              SUMMARY: '{{ $labels.instance }}: High memory usage detected'

You must provide the following parameter values:

apiVersion

monitoring.coreos.com/v1

kind

PrometheusRule

metadata.labels.component

icp-prometheus

spec

Contains the content of the alert rule. For more information, see Recording Rules.
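
The arithmetic in the sample rule's expr can be checked outside Prometheus. The following Python sketch repeats the same formula on plain numbers. The function name and the input values are made up for illustration.

```python
def node_memory_usage_percent(total: float, free: float, buffers: float, cached: float) -> float:
    """Mirror the sample expr: (total - (free + buffers + cached)) / total * 100."""
    return (total - (free + buffers + cached)) / total * 100


# Illustrative byte counts, not measurements from a real node.
usage = node_memory_usage_percent(total=1024, free=256, buffers=128, cached=128)
threshold = 5  # the comparison value at the end of the sample expr
print(usage, usage > threshold)
```

The alert fires for an instance when the computed percentage is greater than the comparison value in the expression.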

Accessing monitoring service APIs

You can access monitoring service APIs such as Prometheus and Grafana APIs. Before you can access the APIs, you must obtain authentication tokens to specify in your request headers. For information about obtaining authentication tokens, see Preparing to run component or management API commands.

After you obtain the authentication tokens, complete the following steps to access the Prometheus and Grafana APIs.

(For CS monitoring mode only) Access the Prometheus API at the URL https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/*. The following example gets the boot times of all nodes.
- $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
- <Cluster Master Host> and <Cluster Master API Port> are defined in Master endpoints.
```
curl -k -s -X GET -H "Authorization:Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/prometheus/api/v1/query?query=node_boot_time_seconds"
```
For more information, see Prometheus HTTP API.
Access the Grafana API at the URL https://<Cluster Master Host>:<Cluster Master API Port>/grafana/*. The following example obtains the sample dashboard.
- $ACCESS_TOKEN is the variable that stores the authentication token for your cluster.
- <Cluster Master Host> and <Cluster Master API Port> are defined in Master endpoints.
```
curl -k -s -X GET -H "Authorization: Bearer $ACCESS_TOKEN" "https://<Cluster Master Host>:<Cluster Master API Port>/grafana/api/dashboards/db/sample"
```
For more information, see Grafana HTTP API Reference.
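
The Prometheus query can also be issued from a script. The following Python sketch builds the request and parses a response in the instant-query format of the Prometheus HTTP API. The function names, the host, and the port are placeholders for illustration, and sending the request is left to the caller.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(host: str, port: int, access_token: str, promql: str) -> urllib.request.Request:
    """Build a GET request for an instant query against the Prometheus API."""
    # urlencode escapes characters such as braces and quotation marks in PromQL.
    query = urllib.parse.urlencode({"query": promql})
    url = f"https://{host}:{port}/prometheus/api/v1/query?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})


def parse_instant_vector(response_body: str) -> dict:
    """Map each instance label to its sample value from an instant-vector result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


request = build_query_request("master.example.com", 8443, "my-token", "node_boot_time_seconds")
print(request.full_url)
```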