Configuring Prometheus in Kubernetes from the command line

This procedure describes how to configure Prometheus to point to the probe's webhook running on Red Hat OpenShift Container Platform (OCP). It can also be used with the on-premises version of the probe.

Modifying Prometheus Alert Manager and Alert Rules Configuration for OCP Monitoring

  1. Determine the secret that holds the Prometheus Alert Manager configuration in the cluster. By default, the Alert Manager configuration is stored in a secret in the openshift-monitoring namespace. For details, see Applying custom Alertmanager configuration. One way to extract the current configuration is shown below.
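    The following command is a sketch that assumes the default alertmanager-main secret name; adjust the secret name if your cluster uses a different one.

    $ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode > alertmanager.yaml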
  2. A sample Alert Manager configuration with the probe webhook config applied is shown below. The sample uses the endpoint http://<probehost>:80/probe/webhook/prometheus; replace <probehost> with the host name where the probe webhook is running.
    global:
      resolve_timeout: '5m'
    receivers:
    - name: 'null'
    - name: 'netcool_probe'
      webhook_configs:
      - url: 'http://<probehost>:80/probe/webhook/prometheus'
        send_resolved: true
    route:
      group_by:
      - alertname
      group_interval: 5m
      group_wait: 30s
      receiver: netcool_probe
      repeat_interval: 5s
      routes:
      - receiver: netcool_probe
        match:
          alertname: Watchdog
  3. Apply the updated Alert Manager configuration file, for example using the command shown below.
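    One way to do this, assuming the configuration lives in the default alertmanager-main secret in openshift-monitoring, is to regenerate the secret from the edited file and replace it:

    $ oc -n openshift-monitoring create secret generic alertmanager-main --from-file=alertmanager.yaml --dry-run -o yaml | oc -n openshift-monitoring replace -f -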
  4. For details about applying custom alerting rules, see Managing cluster alerts.
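    Custom rules are typically expressed as a PrometheusRule resource (API group monitoring.coreos.com/v1). The following is a minimal sketch only; the resource name, namespace, and expression are placeholders, and the labels and annotations match the attributes that the probe rules file expects (see the attribute list in the next procedure). Follow Managing cluster alerts for the supported procedure on your cluster version.
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: netcool-example-rules
      namespace: openshift-monitoring
    spec:
      groups:
      - name: netcool-example.rules
        rules:
        - alert: example_high_node_load
          expr: node_load1 > 1.5
          for: 30s
          labels:
            severity: warning
          annotations:
            summary: Node load is high
            type: Server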
  5. Verify that your probe is receiving the OCP Monitoring alerts and that events appear in the Netcool/OMNIbus Event List.
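    If no events appear, you can send a test payload directly to the probe webhook to confirm that it is reachable. The payload below is a minimal, hand-written imitation of the Alert Manager webhook format; all values are placeholders, and real alerts sent by Alert Manager contain additional fields.

    $ curl -X POST -H 'Content-Type: application/json' \
        'http://<probehost>:80/probe/webhook/prometheus' \
        -d '{"version":"4","status":"firing","receiver":"netcool_probe","alerts":[{"status":"firing","labels":{"alertname":"test_alert","severity":"warning","instance":"test-node"},"annotations":{"summary":"Probe connectivity test","type":"Service"}}]}'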

Modifying Prometheus Alert Manager and Alert Rules on IBM Cloud Platform Common Services in Red Hat OCP 4.2

To modify the default CS Monitoring configuration, use the following steps:

  1. Determine the Prometheus Alert Manager config map in the kube-system namespace. In a default deployment, it is monitoring-prometheus-alertmanager.
  2. Edit the Prometheus Alert Manager config map to add a new receiver in the receivers section. If a separate Prometheus or CS Monitoring instance is deployed, determine its Alert Manager config map and add the new receiver there instead. To do this from the command line, configure the kubectl client and follow the steps below.
  3. Load the config map into a file using the following command:

    kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml > alertmanager.yaml

  4. Edit the alertmanager.yaml file and add the configuration as shown below:
    route:
        receiver: 'netcool_probe'
    
    receivers:
    - name: 'netcool_probe'
      webhook_configs:
      - url: 'http://<probehost>:80/probe/webhook/prometheus'
        send_resolved: true

    Replace the url parameter with the probe's webhook URL. The probe can be deployed either on Kubernetes or on-premises.

  5. Save the changes in the file and replace the config map using the following command:

    $ kubectl replace configmaps monitoring-prometheus-alertmanager --namespace=kube-system -f alertmanager.yaml

    configmap "monitoring-prometheus-alertmanager" replaced
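    To confirm that the new receiver is present in the replaced config map, you can read it back and search for the receiver name:

    $ kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml | grep netcool_probe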

  6. Review the sample alert rules CRD YAML below. You can update these rules or add more rules to generate additional alerts for monitoring your cluster. The Message Bus Probe rules file expects the following attributes in the alerts generated by Prometheus Alert Manager:
    • labels.severity: The severity of the alert. Should be set to critical, major, minor, or warning. This is mapped to the Severity field in the ObjectServer alerts.status table.
    • labels.instance: The instance generating the alert. This is mapped to the Node field in the ObjectServer alerts.status table.
    • labels.alertname: The alert rule name. This is mapped to the AlertGroup field in the ObjectServer alerts.status table.
    • annotations.description: (Optional) The full description of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table.
    • annotations.summary: A short description or summary of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table if annotations.description is unset.
    • annotations.type: The alert type. For example, "Container", "Server", or "Service". This is mapped to the AlertKey field in the ObjectServer alerts.status table.
    • labels.release: (Optional) If set, this is mapped to the ScopeId field in the ObjectServer alerts.status table, which is used as the first-level group for related events.
    • labels.job: (Optional) If set, this is mapped to the SiteName field in the ObjectServer alerts.status table, which is used as the sub-group for related events.
    Note: Sample alert-rules CRD. This file is also available in the included CloudPak under pak_extensions/prometheus-rules.
    # File: netcool-rules.yaml
    # Please modify these rules to monitor specific workloads,
    # containers, services or nodes in your cluster
    apiVersion: monitoringcontroller.cloud.ibm.com/v1
    kind: AlertRule
    metadata:
      name: netcool-rules
    spec:
      enabled: true
      data: |-
        groups:
        - name: alertrules.rules
          rules:
          ## Sample workload monitoring rules
          - alert: jenkins_down
            expr: absent(container_memory_usage_bytes{pod_name=~".*jenkins.*"})
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Jenkins container is down for more than 30 seconds.
              summary: Jenkins down
              type: Container
          - alert: jenkins_high_cpu
            expr: sum(rate(container_cpu_usage_seconds_total{pod_name=~".*jenkins.*"}[1m]))
              / count(node_cpu_seconds_total{mode="system"}) * 100 > 70
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins CPU usage is {{ humanize $value}}%.
              summary: Jenkins high CPU usage
              type: Container
          - alert: jenkins_high_memory
            expr: sum(container_memory_usage_bytes{pod_name=~".*jenkins.*"}) > 1.2e+09
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Jenkins memory consumption is at {{ humanize $value}}.
              summary: Jenkins high memory usage
              type: Container
          ## End - Sample workload monitoring rules.
          ## Sample container monitoring rules
          - alert: container_restarts
            expr: delta(kube_pod_container_status_restarts_total[1h]) >= 1
            for: 10s
            labels:
              severity: warning
            annotations:
              description: The container {{ $labels.container }} in pod {{ $labels.pod }}
                has restarted at least {{ humanize $value}} times in the last hour on instance
                {{ $labels.instance }}.
              summary: Containers are restarting
              type: Container
          ## End - Sample container monitoring rules.
          ## Sample node monitoring rules
          - alert: high_cpu_load
            expr: node_load1 > 1.5
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Docker host is under high load, the avg load 1m is at {{ $value}}.
                Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server under high load
              type: Server
          - alert: high_memory_load
            expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes
              + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host memory usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server memory is almost full
              type: Server
          - alert: high_storage_load
            expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"})
              / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
            for: 30s
            labels:
              severity: warning
            annotations:
              description: Docker host storage usage is {{ humanize $value}}%. Reported by
                instance {{ $labels.instance }} of job {{ $labels.job }}.
              summary: Server storage is almost full
              type: Server
          - alert: monitor_service_down
            expr: up == 0
            for: 30s
            labels:
              severity: critical
            annotations:
              description: Service {{ $labels.instance }} is down.
              summary: Monitor service non-operational
              type: Service
          ## End - Sample node monitoring rules.
  7. Use the following command to create a new AlertRule in the kube-system namespace.

    $ kubectl apply -f netcool-rules.yaml --namespace kube-system

    Note: It usually takes a couple of minutes for Prometheus to reload the updated config maps and apply the new configuration.
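    To confirm that the resource was created, you can list the AlertRule resources in the namespace. The plural resource name alertrules is an assumption based on the kind AlertRule; adjust it if the CRD registers a different name:

    $ kubectl get alertrules --namespace kube-system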
  8. Verify that Prometheus events appear in the Netcool/OMNIbus Event List.
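    If no events arrive, one way to troubleshoot is to check the probe's log for incoming webhook requests. The pod name and namespace below are placeholders for your probe deployment:

    $ kubectl logs <probe-pod-name> --namespace <probe-namespace> | grep -i prometheus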