
How to set up monitoring tools for IBM® Cloud Infrastructure Center

How To


Summary

This blog describes how to integrate Prometheus and Grafana to monitor IBM Cloud Infrastructure Center nodes. Prometheus is an open source monitoring solution that stores time series data such as metrics, and Grafana, an open source interactive data-visualization platform, visualizes the data stored in Prometheus.

Environment

s390x architecture node (IBM Z® or IBM® LinuxONE)

Steps

Introduction

Cloud Infrastructure Center can manage a large number of nodes across its clusters. To check the status of these nodes continuously, we implemented a Prometheus infrastructure that monitors the nodes and collects and aggregates critical metric data. We also defined several alert rules that send notifications to inform users in time when something goes wrong.

  • Data collection tools (exporters), such as Node Exporter, the RabbitMQ exporter, the OpenStack exporter, and the MySQL exporter, export the data in a format that can be loaded into Prometheus.

  • The data management tool Prometheus collects and manages the data from the different data collection tools and the different nodes in the Cloud Infrastructure Center clusters.

  • The data visualization tool Grafana provides a dashboard for the user that aggregates and visualizes the key metrics, like CPU, memory, and network usage.



Set up steps on the monitor node

Steps to install and configure Prometheus and Grafana on the monitor node.

Prometheus

Prometheus is an open source monitoring and alerting solution written in Go that collects metrics data and stores that data in a database.


Download and Installation

a. Log in to an s390x architecture monitor node (IBM Z® or IBM® LinuxONE) as root, for example

ssh root@172.26.XXX.XXX

b. Create a new folder for the download file

sudo mkdir downloads && cd downloads

c. Get the installation package from https://github.com/prometheus/prometheus/releases

d. Copy the link for the s390x architecture version (IBM Z or IBM® LinuxONE) and download it

sudo wget https://github.com/prometheus/prometheus/releases/download/v2.37.6/prometheus-2.37.6.linux-s390x.tar.gz

e. Extract the archive with the command

sudo tar -xzvf prometheus-2.37.6.linux-s390x.tar.gz

f. Copy the binary files

sudo cp prometheus-2.37.6.linux-s390x/prometheus /usr/local/bin

sudo cp prometheus-2.37.6.linux-s390x/promtool /usr/local/bin

g. Rename the folder, move it to /etc/, and prepare the data directory

sudo mv prometheus-2.37.6.linux-s390x prometheus

sudo mv prometheus /etc/

sudo mkdir -p /var/lib/prometheus

sudo chmod 777 /var/lib/prometheus/
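
Optionally, verify that the binaries are available on the path:

prometheus --version

promtool --version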

Configuration

a. Create the systemd service file:

vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target
[Service]
User=root
Restart=on-failure
ExecStart=/bin/sh -c '/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090'
[Install]
WantedBy=multi-user.target

b. Edit the Prometheus configuration file:

vi /etc/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node_exporter"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9100","172.26.XXX.XXX:9100"]

Enable and start the service

sudo systemctl daemon-reload

sudo systemctl enable prometheus.service

sudo systemctl start prometheus.service

Test the status

sudo systemctl status prometheus.service

sudo firewall-cmd --permanent --add-port=9090/tcp

sudo firewall-cmd --reload

Access the service URL, for example http://172.26.3.XXX:9090/
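
The health endpoint and the list of scrape targets can also be checked from the command line, for example:

curl http://localhost:9090/-/healthy

curl http://localhost:9090/api/v1/targets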


Prometheus alert rules

Alerting rules let you define alert conditions, based on the Prometheus expression language, and send notifications about firing alerts to an external service.


Create an alert rule

Create the rule file on the monitor node at:

/etc/prometheus/alert.rules.yml

Add the following rules to the rule file:

groups:
- name: alert.rules
  rules:
  - alert: Node_High_CPU_Usage
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[2m])) *100) >75
    for: 2m
    labels:
      severity: "critical"
      namespace: monitoring
    annotations:
      summary: "CPU usage on node is over 75%\n Value = {{ $value }}\n Instance = {{ $labels.instance }}\n"
      description: "CPU usage of {{$labels.instance}} is is over 75% "

  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: "critical"
      namespace: monitoring
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."

  - alert: Node_High_Memory_Usage
    expr: 100 - (avg by (instance)((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) ) * 100 > 75
    for: 2m
    labels:
      severity: warning
      namespace: monitoring
    annotations:
      summary: "Host out of memory (instance {{ $labels.instance }})"
      description: "Node memory is filling up (> 25%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: Node_High_Disk_Usage
    expr: (100 - (avg by (instance)(node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"})) * 100) >75
    for: 1s
    labels:
      severity: warning
      namespace: monitoring
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (> 75%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Add the rule file path to the Prometheus configuration file:

/etc/prometheus/prometheus.yml
rule_files:
 - "/etc/prometheus/alert.rules.yml"

Restart the Prometheus service and verify that the rule is displayed correctly in the Prometheus menu.

systemctl restart prometheus


Prometheus Alertmanager

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.


Download and Installation

a. Log in to the s390x architecture monitor node (IBM Z/IBM® LinuxONE) as root, for example

ssh root@9.XXX.XXX.XXX

b. Create a new folder for the download file

sudo mkdir downloads && cd downloads

c. Get the installation package from https://github.com/prometheus/alertmanager/releases

d. Copy the link for the s390x architecture version (IBM Z/IBM® LinuxONE) of the release and download it

sudo wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-s390x.tar.gz

e. Extract the archive with the command

sudo tar -xzvf alertmanager-0.25.0.linux-s390x.tar.gz

f. Copy the binary file

sudo cp alertmanager-0.25.0.linux-s390x/alertmanager /usr/local/bin 

Configuration

a. Create the systemd service file:

vi /etc/systemd/system/alertmanager.service
[Unit]
Description=Alert Manager
After=network-online.target

[Service]
User=root
Restart=on-failure
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/pmalertmanager/email.yml \
  --web.external-url http://9.152.85.201:9093 \
  --cluster.advertise-address=9.152.85.201:9093

[Install]
WantedBy=multi-user.target

b. Create the Alertmanager configuration file (create the /etc/pmalertmanager folder first if it does not exist):

sudo mkdir -p /etc/pmalertmanager
vi /etc/pmalertmanager/email.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'send_email'
receivers:
  - name: 'send_email'
    email_configs:
    - to: 'test@cn.ibm.com'
      from: 'test@cn.ibm.com'
      smarthost: smtpav03.dal12v.mail.ibm.com:25
      auth_username: 'test@cn.ibm.com'
      auth_password: 'yourpassword'
      tls_config:
        insecure_skip_verify: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
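
Optionally, the configuration can be validated with amtool, which ships in the same Alertmanager tarball (this assumes you also copy it to /usr/local/bin):

sudo cp alertmanager-0.25.0.linux-s390x/amtool /usr/local/bin

amtool check-config /etc/pmalertmanager/email.yml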

Enable and start the service

sudo systemctl daemon-reload

sudo systemctl enable alertmanager.service

sudo systemctl start alertmanager.service

sudo systemctl restart alertmanager.service

Test the status

sudo systemctl status alertmanager.service

sudo firewall-cmd --permanent --add-port=9093/tcp

sudo firewall-cmd --reload

Access the service URL, for example http://9.XXX.XXX.XXX:9093/
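
For Prometheus to deliver alerts to this Alertmanager, the alerting section of /etc/prometheus/prometheus.yml must point to it. A minimal sketch, assuming Alertmanager runs on the monitor node and listens on port 9093:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

Restart the Prometheus service after changing the configuration.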


Grafana

Grafana is the open source analytics & monitoring solution for databases.


Installation on the monitor node

a. Refer to the detailed steps at: https://grafana.com/docs/grafana/latest/setup-grafana/installation/rpm/

b. Note: Do not use Docker to install Grafana; it fails with “no s390x version image found” because no s390x image is available.

Enable and start the service

sudo systemctl daemon-reload

sudo systemctl enable grafana-server

sudo systemctl start grafana-server

Test the status

sudo systemctl status grafana-server

sudo firewall-cmd --add-port=3000/tcp

sudo firewall-cmd --reload

Access the service URL, for example http://172.26.XXX.XXX:3000/

Add Prometheus as the data source, for example http://172.26.XXX.XXX:9090

Import a dashboard; we suggest using the dashboard ID 11074.
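
Alternatively, the data source can be provisioned from a file instead of through the web UI. A minimal sketch, assuming the default Grafana provisioning directory and that Prometheus runs on the same monitor node:

vi /etc/grafana/provisioning/datasources/prometheus.yaml
# Grafana data source provisioning file
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true

Restart the grafana-server service after adding the file.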


Set up steps on the IBM Cloud Infrastructure Center management node and compute node

There are several data collection tools available to choose from; we use Node Exporter.

Node Exporter

The Prometheus Node Exporter is an open source software tool that collects and exposes metrics about a machine's hardware and operating system. It is used with Prometheus, a monitoring system that can collect and store metrics from various sources.


Download and Installation

a. Log in to an s390x architecture management or compute node (IBM Z/IBM® LinuxONE) as root, for example

ssh root@172.26.XXX.XXX

b. Create a new folder for the download file

sudo mkdir downloads && cd downloads

c. Get the installation package from https://github.com/prometheus/node_exporter/releases

d. Copy the link for the s390x architecture version (IBM Z/IBM® LinuxONE) and download it

sudo wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-s390x.tar.gz

e. Extract the archive with the command

sudo tar -xzvf node_exporter-1.5.0.linux-s390x.tar.gz

f. Copy the binary file

sudo cp node_exporter-1.5.0.linux-s390x/node_exporter /usr/local/bin

Configuration

a. Create the systemd service file:

vi /etc/systemd/system/node-exporter.service
[Unit]
Description=Node Exporter
After=network-online.target
[Service]
User=root
Group=root
Type=simple
ExecStart=/bin/sh -c '/usr/local/bin/node_exporter'
[Install]
WantedBy=multi-user.target

Enable and start the service

sudo systemctl daemon-reload

sudo systemctl enable node-exporter.service

sudo systemctl start node-exporter.service

Test the status

sudo systemctl status node-exporter.service

sudo firewall-cmd --permanent --add-port=9100/tcp

sudo firewall-cmd --reload

Access the service URL, for example http://172.26.XXX.XXX:9100/metrics
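
Each management or compute node that runs Node Exporter must also be listed in the node_exporter job on the monitor node, in the scrape_configs section of /etc/prometheus/prometheus.yml, for example:

  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100","172.26.XXX.XXX:9100"]

Restart the prometheus service on the monitor node after adding a target.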


OpenStack Exporter

The OpenStack exporter exposes metrics from a running OpenStack cloud for consumption by Prometheus.

Download and Installation

a. Log in to the s390x architecture management node (IBM Z/IBM® LinuxONE) as root, for example

ssh root@9.152.85.201

b. Create a new folder for the download file

sudo mkdir downloads && cd downloads

c. Get the installation package from https://github.com/openstack-exporter/openstack-exporter/releases

d. Copy the link for the s390x architecture version (IBM Z/IBM® LinuxONE) and download it

sudo wget https://github.com/openstack-exporter/openstack-exporter/releases/download/v1.6.0/openstack-exporter_1.6.0_linux_s390x.tar.gz

e. Extract the archive with the command

sudo tar -xzvf openstack-exporter_1.6.0_linux_s390x.tar.gz

f. Copy the binary file

sudo cp openstack-exporter /usr/local/bin

Copy the security certificate file, for example

scp 9.114.16.XXX:/etc/pki/tls/certs/icic.crt  /etc/ssl/certs/large-monitor.crt

Edit the configuration file

mkdir -p /etc/openstack/
vi /etc/openstack/clouds.yaml
clouds:
  large-monitor:
    auth:
      auth_url: https://9.114.16.120:5000/v3
      password: dfltpass
      project_name: ibm-default
      username: root
      user_domain_name: default
      project_domain_name: default
    region_name: RegionOne
    cert: /etc/ssl/certs/large-monitor.crt
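
Optionally, run the exporter once in the foreground to confirm that authentication against the clouds.yaml entry works before creating the service, then stop it with Ctrl+C:

/usr/local/bin/openstack-exporter --os-client-config /etc/openstack/clouds.yaml large-monitor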

Create the systemd service file:

vi /etc/systemd/system/openstack-exporter.service
[Unit]
Description=OpenStack exporter
After=network-online.target

[Service]
User=root
Restart=on-failure
ExecStart=/bin/sh -c '/usr/local/bin/openstack-exporter \
    --os-client-config /etc/openstack/clouds.yaml \
    large-monitor'

[Install]
WantedBy=multi-user.target

Enable and start the service

sudo systemctl daemon-reload

sudo systemctl enable openstack-exporter.service

sudo systemctl start openstack-exporter.service

Test the status

sudo systemctl status openstack-exporter.service

sudo firewall-cmd --permanent --add-port=9180/tcp

sudo firewall-cmd --reload

Access the service URL, for example http://9.152.XXX.XXX:9180/metrics

http://9.152.XXX.XXX:9180/probe?cloud=large-monitor
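
To have Prometheus collect these metrics, add a scrape job for the exporter to /etc/prometheus/prometheus.yml on the monitor node. A minimal sketch, assuming the exporter listens on its default port 9180:

  - job_name: "openstack_exporter"
    static_configs:
      - targets: ["9.152.XXX.XXX:9180"]

Restart the prometheus service afterwards.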

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB35","label":"Mainframe SW"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSLL2F","label":"IBM Cloud Infrastructure Center"},"ARM Category":[{"code":"a8m0z0000001iPmAAI","label":"Chatbot Used"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
24 July 2023

UID

ibm17011721