Monitoring job throughput and back pressure

Monitor the throughput and back pressure of the processing jobs so that you can determine when the system needs to be scaled.

About this task

Flink exposes a number of metrics about the jobs that it processes. Typically, you use the Flink web user interface to monitor the rate at which the processing jobs process events, the lag between this throughput and the number of events in the Kafka bus, the existence of back pressure in the system, and similar indicators.

Procedure

  1. Access the Flink web user interface.
    1. Start a Kubernetes proxy by running the kubectl proxy command.

      The Flink web user interface is then available at the following URL: http://localhost:8001/api/v1/namespaces/<my-namespace>/services/https:<my-release>-bai-flink-jobmanager:ui-rest/proxy/

    2. Replace <my-namespace> and <my-release> with the namespace and release name that you used to deploy IBM® Business Automation Insights.
  2. Click Running Jobs, and then click the job you want to monitor.
  3. Click Task Metrics.
  4. Add the metrics that you want to monitor.
    1. Select an operator box in the Job graph that is displayed in the middle of the page.
    2. To monitor metrics related to incoming events, select the box that contains Ingress.
    3. Back in the Task Metrics tab, click the Add Metric option that was added.

      A searchable list presents all available metrics.

    4. Add the metric containing ingress. records-consumed-rate to monitor the average number of records consumed per second by the processing job.
    5. Add the metric containing ingress. records-lag-max to monitor the maximum lag, in number of records, across the Kafka topic partitions.

      An increasing value over time is your best indication that the processing job is not keeping up with the messages that are sent to the Kafka bus. In this case, consider scaling your work configuration. For more information, see Scaling an IBM Business Automation Insights system.

  5. Navigate to the Back Pressure tab.

    From this tab, you can monitor the back pressure behavior of the processing job. For example, a High back pressure warning for an operator means that the operator is producing data faster than its downstream operators can consume it. High back pressure is a typical case where you should consider scaling your system. For more information about back pressure, see the Monitoring Back Pressure page of the Flink documentation.
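
Example

The checks in this procedure can also be sketched in a short script. The following Python code is only an illustration under stated assumptions, not part of the product. The function names flink_ui_url, lag_is_growing, and high_back_pressure_subtasks are hypothetical. The "strictly increasing over at least three samples" rule is one simple way to interpret "an increasing value over time". The back pressure payload is a simplified structure loosely modeled on the JSON that the Flink REST API returns; field names can differ between Flink versions, so verify them against your deployment.

```python
from typing import Dict, List, Sequence


def flink_ui_url(namespace: str, release: str,
                 proxy: str = "http://localhost:8001") -> str:
    """Build the proxied Flink web UI URL from step 1 of the procedure."""
    service = f"https:{release}-bai-flink-jobmanager:ui-rest"
    return f"{proxy}/api/v1/namespaces/{namespace}/services/{service}/proxy/"


def lag_is_growing(samples: Sequence[float], min_samples: int = 3) -> bool:
    """Return True when each lag sample is larger than the previous one.

    Assumption: 'samples' are records-lag-max readings taken at regular
    intervals, oldest first. Fewer than 'min_samples' readings is treated
    as not enough evidence.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


def high_back_pressure_subtasks(payload: Dict) -> List[int]:
    """Return the indexes of subtasks whose back pressure level is 'high'.

    Assumption: 'payload' has a 'subtasks' list whose entries carry a
    'subtask' index and a 'backpressure-level' string.
    """
    return [
        entry["subtask"]
        for entry in payload.get("subtasks", [])
        if str(entry.get("backpressure-level", "")).lower() == "high"
    ]


if __name__ == "__main__":
    # Placeholder namespace and release name, for illustration only.
    print(flink_ui_url("my-namespace", "my-release"))
    print(lag_is_growing([120, 450, 900, 2100]))
```

If lag_is_growing returns True for successive readings, or high_back_pressure_subtasks returns a non-empty list, that corresponds to the situations in steps 4 and 5 where you should consider scaling the system.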