Configuring Prometheus in Kubernetes from the command line

This procedure describes how to configure Prometheus to point to the probe's webhook running on Red Hat OpenShift Container Platform (OCP). It can also be used with the on-premises version of the probe.

Modifying Prometheus Alert Manager and Alert Rules Configuration for OCP Monitoring

  1. Determine the secret that holds the Prometheus Alert Manager configuration in the cluster. By default, the Alert Manager configuration is stored in a secret in the openshift-monitoring namespace. For details, see Applying custom Alertmanager configuration. One way to extract the current configuration is shown below.
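    The following command is a sketch that assumes the default alertmanager-main secret name; adjust the secret name if your cluster uses a different one.

    $ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode > alertmanager.yaml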
  2. A sample Alert Manager configuration with the probe webhook config applied is shown below. The sample uses the endpoint http://<probehost>:80/probe/webhook/prometheus; replace <probehost> with the host name where the probe webhook is running.
    global:
      resolve_timeout: '5m'
    receivers:
    - name: 'null'
    - name: 'netcool_probe'
      webhook_configs:
      - url: 'http://<probehost>:80/probe/webhook/prometheus'
        send_resolved: true
    route:
      group_by:
      - alertname
      group_interval: 5m
      group_wait: 30s
      receiver: netcool_probe
      repeat_interval: 5s
      routes:
      - receiver: netcool_probe
        match:
          alertname: Watchdog
  3. Apply the updated Alert Manager configuration file, for example using the command shown below.
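    One way to do this, assuming the configuration lives in the default alertmanager-main secret in openshift-monitoring, is to regenerate the secret from the edited file and replace it:

    $ oc -n openshift-monitoring create secret generic alertmanager-main --from-file=alertmanager.yaml --dry-run -o yaml | oc -n openshift-monitoring replace -f -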
  4. For details about applying custom alerting rules, see Managing cluster alerts.
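    Custom rules are typically expressed as a PrometheusRule resource (API group monitoring.coreos.com/v1). The following is a minimal sketch only; the resource name, namespace, and expression are placeholders, and the labels and annotations match the attributes that the probe rules file expects (see the attribute list in the next procedure). Follow Managing cluster alerts for the supported procedure on your cluster version.
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: netcool-example-rules
      namespace: openshift-monitoring
    spec:
      groups:
      - name: netcool-example.rules
        rules:
        - alert: example_high_node_load
          expr: node_load1 > 1.5
          for: 30s
          labels:
            severity: warning
          annotations:
            summary: Node load is high
            type: Server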
  5. Verify that your probe is receiving the OCP Monitoring alerts and that events appear in the Netcool/OMNIbus Event List.
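    If no events appear, you can send a test payload directly to the probe webhook to confirm that it is reachable. The payload below is a minimal, hand-written imitation of the Alert Manager webhook format; all values are placeholders, and real alerts sent by Alert Manager contain additional fields.

    $ curl -X POST -H 'Content-Type: application/json' \
        'http://<probehost>:80/probe/webhook/prometheus' \
        -d '{"version":"4","status":"firing","receiver":"netcool_probe","alerts":[{"status":"firing","labels":{"alertname":"test_alert","severity":"warning","instance":"test-node"},"annotations":{"summary":"Probe connectivity test","type":"Service"}}]}'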

Modifying Prometheus Alert Manager and Alert Rules on IBM Cloud Platform Common Services in Red Hat OCP 4.2

To modify the default CS Monitoring configuration, use the following steps:

  1. Determine the Prometheus Alert Manager config map in the kube-system namespace. In a default deployment, it is monitoring-prometheus-alertmanager.
  2. Edit the Prometheus Alert Manager config map to add a new receiver in the receivers section. If a separate Prometheus or CS Monitoring instance is deployed, determine its Alert Manager config map and add the new receiver there instead. To do this from the command line, configure the kubectl client and follow the steps below.
  3. Load the config map into a file using the following command:

    kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml > alertmanager.yaml

  4. Edit the alertmanager.yaml file and add the configuration as shown below:
    route:
        receiver: 'netcool_probe'
    
    receivers:
    - name: 'netcool_probe'
      webhook_configs:
      - url: 'http://<probehost>:80/probe/webhook/prometheus'
        send_resolved: true

    Replace the url parameter with the probe's webhook URL. The probe can be deployed either on Kubernetes or on-premises.

  5. Save the changes in the file and replace the config map using the following command:

    $ kubectl replace configmaps monitoring-prometheus-alertmanager --namespace=kube-system -f alertmanager.yaml

    configmap "monitoring-prometheus-alertmanager" replaced
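    To confirm that the new receiver is present in the replaced config map, you can read it back and search for the receiver name:

    $ kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml | grep netcool_probe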

  6. Review the sample alert rules CRD YAML below. You can update these rules or add more rules to generate additional alerts for monitoring your cluster. The Message Bus Probe rules file expects the following attributes in the alerts generated by Prometheus Alert Manager:
    • labels.severity: The severity of the alert. Should be set to critical, major, minor, or warning. This is mapped to the Severity field in the ObjectServer alerts.status table.
    • labels.instance: The instance generating the alert. This is mapped to the Node field in the ObjectServer alerts.status table.
    • labels.alertname: The alert rule name. This is mapped to the AlertGroup field in the ObjectServer alerts.status table.
    • annotations.description: (Optional) The full description of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table.
    • annotations.summary: A short description or summary of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table if annotations.description is unset.
    • annotations.type: The alert type. For example, "Container", "Server", or "Service". This is mapped to the AlertKey field in the ObjectServer alerts.status table.
    • labels.release: (Optional) If set, this is mapped to the ScopeId field in the ObjectServer alerts.status table, which is used as the first-level group for related events.
    • labels.job: (Optional) If set, this is mapped to the SiteName field in the ObjectServer alerts.status table, which is used as the sub-group for related events.
    Note: Sample alert-rules CRD. This file is also available in the included CloudPak under pak_extensions/prometheus-rules.
    # File: netcool-rules.yaml
    # Please modify these rules to monitor specific workloads,
    # containers, services or nodes in your cluster
    apiVersion: monitoringcontroller.cloud.ibm.com/v1
    kind: AlertRule
    metadata:
      name: netcool-rules
    spec:
      enabled: true
      data: |-
        groups:
        - name: alertrules.rules
          rules:
          ## Sample workload monitoring rules
          - alert: jenkins_down
            expr: absent(container_memory_usage_bytes{pod_name=~".*jenkins.*"})
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Jenkins container is down for more than 30 seconds.
              summary: Jenkins down
              type: Container
          - alert: jenkins_high_cpu
            expr: sum(rate(container_cpu_usage_seconds_total{pod_name=~".*jenkins.*"}[1m]))
              / count(node_cpu_seconds_total{mode="system"}) * 100 > 70
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins CPU usage is {{ humanize $value}}%.
              summary: Jenkins high CPU usage
              type: Container
          - alert: jenkins_high_memory
            expr: sum(container_memory_usage_bytes{pod_name=~".*jenkins.*"}) > 1.2e+09
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins memory consumption is at {{ humanize $value}}.
              summary: Jenkins high memory usage
              type: Container
          ## End - Sample workload monitoring rules.
          ## Sample container monitoring rules
          - alert: container_restarts
            expr: delta(kube_pod_container_status_restarts_total[1h]) >= 1
            for: 10s
            labels:
              severity: warning
            annotations:
              description: The container {{ $labels.container }} in pod {{ $labels.pod }}
                has restarted at least {{ humanize $value}} times in the last hour on instance
                {{ $labels.instance }}.
              summary: Containers are restarting
              type: Container
          ## End - Sample container monitoring rules.
          ## Sample node monitoring rules
          - alert: high_cpu_load
            expr: node_load1 > 1.5
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Docker host is under high load, the avg load 1m is at {{ $value}}.
                Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server under high load
              type: Server
          - alert: high_memory_load
            expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes
              + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host memory usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server memory is almost full
              type: Server
          - alert: high_storage_load
            expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"})
              / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host storage usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server storage is almost full
              type: Server
          - alert: monitor_service_down
            expr: up == 0
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Service {{ $labels.instance }} is down.
              summary: Monitor service non-operational
              type: Service
          ## End - Sample node monitoring rules.
  7. Use the following command to create a new AlertRule in the kube-system namespace.

    $ kubectl apply -f netcool-rules.yaml --namespace kube-system

    Note: It usually takes a couple of minutes for Prometheus to reload the updated config maps and apply the new configuration.
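    To confirm that the resource was created, you can list the AlertRule resources in the namespace. The plural resource name alertrules is an assumption based on the kind AlertRule; adjust it if the CRD registers a different name:

    $ kubectl get alertrules --namespace kube-system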
  8. Verify that Prometheus events appear in the Netcool/OMNIbus Event List.
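    If no events arrive, one way to troubleshoot is to check the probe's log for incoming webhook requests. The pod name and namespace below are placeholders for your probe deployment:

    $ kubectl logs <probe-pod-name> --namespace <probe-namespace> | grep -i prometheus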