Configuring Prometheus in Kubernetes from the command line
Procedure for configuring Prometheus to point to the probe's webhook running on Red Hat OpenShift Container Platform (OCP). This procedure can also be used with the on-premises version of the probe.
Modifying Prometheus Alert Manager and Alert Rules Configuration for OCP Monitoring
- Determine the Prometheus Alert Manager configuration secret in the cluster. The default Secret that contains the Alert Manager configuration is in the openshift-monitoring namespace. See Applying custom Alertmanager configuration.
- A sample Alert Manager configuration with the probe webhook config applied is shown below. The sample uses the endpoint http://<probehost>:80/probe/webhook/prometheus.
global:
  resolve_timeout: '5m'
receivers:
- name: 'null'
- name: 'netcool_probe'
  webhook_configs:
  - url: 'http://<probehost>:80/probe/webhook/prometheus'
    send_resolved: true
route:
  group_by:
  - alertname
  group_interval: 5m
  group_wait: 30s
  receiver: netcool_probe
  repeat_interval: 5s
  routes:
  - receiver: netcool_probe
    match:
      alertname: Watchdog
- Apply the updated Alert Manager configuration file. One way to do this from the command line is sketched after this list.
- For details about applying custom alerting rules, see Managing cluster alerts.
- Verify that your probe is receiving the OCP Monitoring alerts and that events appear on the Netcool/OMNIbus Event List.
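The command sketch below shows one way to extract the current Alert Manager configuration and re-apply it from the command line. It assumes the default Secret name alertmanager-main in the openshift-monitoring namespace, which is not named in this procedure; verify the Secret name in your cluster before running these commands.
# Extract the current Alert Manager configuration from the Secret
$ oc -n openshift-monitoring get secret alertmanager-main \
    --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode > alertmanager.yaml

# After adding the netcool_probe receiver to alertmanager.yaml, replace the Secret
$ oc -n openshift-monitoring create secret generic alertmanager-main \
    --from-file=alertmanager.yaml --dry-run=client -o yaml \
    | oc -n openshift-monitoring replace -f -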
Modifying Prometheus Alert Manager and Alert Rules on IBM Cloud Platform Common Services in Red Hat OCP 4.2
To modify the default CS Monitoring configuration, use the following steps:
- Determine the Prometheus Alert Manager config map in the kube-system namespace. In the default deployment, it is monitoring-prometheus-alertmanager.
- Edit the Prometheus Alert Manager config map to add a new receiver in the receivers section. The default Prometheus deployment config map name is monitoring-prometheus-alertmanager in the kube-system namespace. If a separate Prometheus or CS Monitoring instance is deployed, determine its Alert Manager config map and add the new receiver there. To do this from the command line, configure the kubectl client and follow the steps below.
- Load the config map into a file using the following command:
kubectl get configmap monitoring-prometheus-alertmanager --namespace=kube-system -o yaml > alertmanager.yaml
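The exported file contains the entire ConfigMap object, so the Alert Manager configuration to edit sits under its data section. A minimal sketch of the layout is shown below; the data key name (alertmanager.yml here) is an assumption based on the default CS Monitoring deployment, so check your own export if it differs.
# Layout of the exported alertmanager.yaml (illustrative; values abbreviated)
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-prometheus-alertmanager
  namespace: kube-system
data:
  alertmanager.yml: |-     # assumed key name; this is the Alert Manager configuration to edit
    global:
      resolve_timeout: 5m
    # add the route and receivers changes from the next step inside this block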
- Edit the alertmanager.yaml file and add the configuration as shown below:
route:
  receiver: 'netcool_probe'
receivers:
- name: 'netcool_probe'
  webhook_configs:
  - url: 'http://<probehost>:80/probe/webhook/prometheus'
    send_resolved: true
Replace the url parameter with the probe's webhook URL. This can be the webhook URL of a probe deployed either on Kubernetes or on-premises.
- Save the changes in the file and replace the config map using the following command:
$ kubectl replace configmaps monitoring-prometheus-alertmanager --namespace=kube-system -f alertmanager.yaml
configmap "monitoring-prometheus-alertmanager" replaced
- Review the sample alert rules CRD YAML below. You may update the rules or add more rules to generate more alerts to monitor your cluster. The Message Bus Probe rules file expects the following attributes from the alerts generated by Prometheus Alert Manager (a minimal rule that exercises all of these attributes is sketched after the sample):
  - labels.severity: The severity of the alert. Should be set to critical, major, minor, or warning. This is mapped to the Severity field in the ObjectServer alerts.status table.
  - labels.instance: The instance generating the alert. This is mapped to the Node field in the ObjectServer alerts.status table.
  - labels.alertname: The alert rule name. This is mapped to the AlertGroup field in the ObjectServer alerts.status table.
  - annotations.description: (Optional) The full description of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table.
  - annotations.summary: A short description or summary of the alert. This is mapped to the Summary field in the ObjectServer alerts.status table if annotations.description is unset.
  - annotations.type: The alert type, for example "Container", "Service", or "Server". This is mapped to the AlertKey field in the ObjectServer alerts.status table.
  - labels.release: (Optional) If set, this is mapped to the ScopeId field in the ObjectServer alerts.status table, which is used as the first-level group for grouping related events.
  - labels.job: (Optional) If set, this is mapped to the SiteName field in the ObjectServer alerts.status table, which is used as the sub-group for grouping related events.
Note: Sample alert-rules CRD. This file is also available in the included CloudPak under pak_extensions/prometheus-rules.
# File: netcool-rules.yaml
# Please modify these rules to monitor specific workloads,
# containers, services or nodes in your cluster
apiVersion: monitoringcontroller.cloud.ibm.com/v1
kind: AlertRule
metadata:
  name: netcool-rules
spec:
  enabled: true
  data: |-
    groups:
    - name: alertrules.rules
      rules:
      ## Sample workload monitoring rules
      - alert: jenkins_down
        expr: absent(container_memory_usage_bytes{pod_name=~".*jenkins.*"})
        for: 30s
        labels:
          severity: critical
        annotations:
          description: Jenkins container is down for more than 30 seconds.
          summary: Jenkins down
          type: Container
      - alert: jenkins_high_cpu
        expr: sum(rate(container_cpu_usage_seconds_total{pod_name=~".*jenkins.*"}[1m])) / count(node_cpu_seconds_total{mode="system"}) * 100 > 70
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Jenkins CPU usage is {{ humanize $value}}%.
          summary: Jenkins high CPU usage
          type: Container
      - alert: jenkins_high_memory
        expr: sum(container_memory_usage_bytes{pod_name=~".*jenkins.*"}) > 1.2e+09
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Jenkins memory consumption is at {{ humanize $value}}.
          summary: Jenkins high memory usage
          type: Container
      ## End - Sample workload monitoring rules.
      ## Sample container monitoring rules
      - alert: container_restarts
        expr: delta(kube_pod_container_status_restarts_total[1h]) >= 1
        for: 10s
        labels:
          severity: warning
        annotations:
          description: The container {{ $labels.container }} in pod {{ $labels.pod }} has restarted at least {{ humanize $value}} times in the last hour on instance {{ $labels.instance }}.
          summary: Containers are restarting
          type: Container
      ## End - Sample container monitoring rules.
      ## Sample node monitoring rules
      - alert: high_cpu_load
        expr: node_load1 > 1.5
        for: 30s
        labels:
          severity: critical
        annotations:
          description: Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server under high load
          type: Server
      - alert: high_memory_load
        expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 85
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server memory is almost full
          type: Server
      - alert: high_storage_load
        expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
        for: 30s
        labels:
          severity: warning
        annotations:
          description: Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.
          summary: Server storage is almost full
          type: Server
      - alert: monitor_service_down
        expr: up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          description: Service {{ $labels.instance }} is down.
          summary: Monitor service non-operational
          type: Service
      ## End - Sample node monitoring rules.
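For reference only, the fragment below is not part of the shipped sample; it shows a single rule that sets the attributes listed above, with a hypothetical my-release value for labels.release. In practice, labels.instance, labels.job, and labels.alertname are normally attached automatically from the scraped metric and the rule name rather than set by hand.
## Illustrative fragment only: one rule exercising the mapped attributes.
## labels.instance, labels.job, and labels.alertname are inherited from the
## underlying metric and the rule name, so only the remaining fields are set here.
- alert: example_mapping_rule
  expr: up == 0
  for: 30s
  labels:
    severity: major          # -> Severity
    release: my-release      # -> ScopeId (optional; hypothetical value)
  annotations:
    summary: Example service is down                             # -> Summary (used if description is unset)
    description: Example service has been down for 30 seconds.   # -> Summary
    type: Service                                                # -> AlertKey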
- Use the following command to create a new AlertRule in the kube-system namespace.
$ kubectl apply -f netcool-rules.yaml --namespace kube-system
Note: It usually takes a couple of minutes for Prometheus to reload the updated config maps and apply the new configuration.
- Verify that Prometheus events appear on the OMNIbus Event List.
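As an optional sanity check, you can confirm that the AlertRule resource was created. This sketch assumes the CRD registers the plural resource name alertrules; adjust the resource name if your cluster uses a different one.
# List the custom AlertRule resources in the kube-system namespace (plural name assumed)
$ kubectl get alertrules --namespace kube-system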