Workload Controller Scale Actions

Actions associated with a workload controller scale replicas horizontally to maintain Service Level Objectives (SLOs) for your applications. Representing these actions at the workload controller level is natural because it is the parent controller's replica count that is modified. The workload controller then rolls out the changes in the running environment.

For example, when the current response time for an application violates its SLO, Turbonomic recommends increasing the number of replicas to improve response time. If the application can meet its SLOs with fewer resources, Turbonomic recommends reducing the replica count to improve infrastructure efficiency.

Action Generation Requirements

Turbonomic generates workload controller scale actions under the following conditions:

Note:

For GenAI LLM inference workloads, see the next section.

  • Services are discovered by the Kubeturbo agent that you deployed to your cluster.

  • Application performance metrics for services are collected through the Instana or Dynatrace target or the Prometurbo metrics server.

    Prometurbo collects application performance metrics from Prometheus, and then exposes the applications and metrics in JSON format through its REST API. The Data Ingestion Framework (DIF) reads that REST API and converts the JSON output into a DTO (data transfer object) that Turbonomic consumes. A sketch of the payload shape appears after this list.

    To collect metrics through Prometurbo, deploy Prometurbo and enable metrics collection.

  • You have created service policies and configured SLOs for Response Time and Transactions in those policies.
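
The exact JSON that Prometurbo serves depends on the metrics you collect, but the DIF topology payload generally follows a simple shape: a list of discovered entities, each with an identifier, a type, and the metrics that map to Turbonomic commodities. The sketch below is illustrative only, shown as YAML for readability (Prometurbo emits the equivalent JSON); the field names follow typical DIF examples and are assumptions, not this product's exact schema.

```yaml
# Illustrative sketch of a DIF topology payload (assumed field names;
# Prometurbo serves the equivalent structure as JSON over its REST API).
version: v1
updateTime: 1718000000           # epoch seconds of the last collection
topology:
  - uniqueId: checkout-service   # hypothetical service
    type: application
    name: checkout-service
    metrics:
      responseTime:
        - average: 35.0          # milliseconds, evaluated against the Response Time SLO
      transaction:
        - average: 120.0         # transactions per second, evaluated against the Transactions SLO
```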

Scale Actions for GenAI LLM Inference Workloads

For GenAI large language model (LLM) inference workloads that use GPU resources and are deployed in a Kubernetes cluster, Turbonomic generates workload controller scale actions to maintain SLOs for the following GPU metrics:

  • Concurrent Queries

  • Queueing Time

  • Service Time

  • Response Time

  • Transactions

Note:

It is assumed that you have set up an LLM inference service on a cluster that has an NVIDIA GPU attached.

The following diagram illustrates how the components described below work together to support scale actions for LLM inference workloads. To configure these components properly, review the requirements listed for each component.

Components that support scaling of LLM inference workloads

NVIDIA DCGM (Data Center GPU Manager)

NVIDIA DCGM is deployed as DaemonSet pods and collects GPU metrics. DCGM exposes these metrics as APIs.

DCGM exporter for Prometheus

DCGM exporter for Prometheus is deployed as DaemonSet pods and collects GPU metrics from DCGM.

DCGM exporter exposes the data for Prometheus to scrape, connects to the Kubelet pod resources API to identify GPU devices associated with a container pod, and then appends the GPU devices to the metrics.
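
In most environments the DCGM exporter is installed through the NVIDIA GPU Operator or its Helm chart rather than by hand. The pared-down DaemonSet sketch below only illustrates the two details that matter here: the exporter's metrics port and the kubelet Pod Resources socket it reads to attribute GPUs to pods. The namespace, image tag, and the DCGM_EXPORTER_KUBERNETES environment variable are assumptions; follow the NVIDIA documentation for a supported deployment.

```yaml
# Pared-down sketch of a DCGM exporter DaemonSet (illustrative only; the GPU
# Operator or the official Helm chart is the usual way to deploy this).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring                                # assumed namespace
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:<supported-tag>   # substitute a real tag
          env:
            - name: DCGM_EXPORTER_KUBERNETES               # assumed setting: enables pod attribution
              value: "true"
          ports:
            - name: metrics
              containerPort: 9400                          # default exporter metrics port
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources    # kubelet Pod Resources API socket
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
```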

TGI (Text Generation Inference) or vLLM metrics

TGI or vLLM metrics are exposed directly by the LLM-serving services on predefined ports.
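
Which port is "predefined" depends on how you serve the model: the vLLM OpenAI-compatible server exposes Prometheus metrics at /metrics on its serving port (8000 by default), and TGI exposes /metrics on its HTTP port. The minimal Service sketch below assumes a vLLM deployment labeled app: llm-inference listening on port 8000; the names, namespace, and labels are placeholders.

```yaml
# Minimal Service sketch for an LLM-serving deployment (names and labels are assumed).
apiVersion: v1
kind: Service
metadata:
  name: llm-inference            # hypothetical Service name
  namespace: llm                 # hypothetical namespace
spec:
  selector:
    app: llm-inference           # must match the TGI or vLLM pod labels
  ports:
    - name: http
      port: 8000                 # vLLM default serving port; /metrics is served on the same port
      targetPort: 8000
```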

Prometheus server

The Prometheus server is configured to scrape both GPU and TGI metrics from the DCGM exporter and the TGI or vLLM service endpoints.

The Prometheus server makes these metrics available through PromQL queries.
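
How you wire up scraping depends on your Prometheus installation (the Prometheus Operator uses ServiceMonitor objects instead), but a plain scrape_configs sketch makes the intent clear. The job names, Service names, namespace, and target address below are assumptions that must match your actual DCGM exporter and LLM-serving Service.

```yaml
# Illustrative scrape_configs for the Prometheus server (names and addresses are assumed).
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: dcgm-exporter                        # keep only the DCGM exporter endpoints
        action: keep
  - job_name: llm-inference
    metrics_path: /metrics
    static_configs:
      - targets: ['llm-inference.llm.svc:8000']     # the TGI or vLLM Service endpoint
```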

Kubeturbo agent

Kubeturbo is deployed to your cluster.

Kubeturbo monitors container platform entities and collects standard metrics for these entities.

Prometurbo agent

Prometurbo is deployed to your cluster and Prometurbo metrics collection is enabled.

Prometurbo connects to the Prometheus server and sends PromQL queries to collect GPU and TGI metrics.

Prometurbo requires these custom resources (CRs):

  • PrometheusQueryMapping – specifies the GPU and TGI metrics to collect

  • PrometheusServerConfig – specifies settings for your Prometheus server
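
As a rough sketch, the two CRs look like the following. The apiVersion and spec fields shown here follow published Prometurbo samples and may differ between releases; start from the PrometheusQueryMapping samples that ship with Prometurbo, including the GPU and TGI or vLLM mappings, rather than writing the queries by hand.

```yaml
# Sketch only: field names follow published Prometurbo samples and may vary by release.
apiVersion: metrics.turbonomic.io/v1alpha1
kind: PrometheusServerConfig
metadata:
  name: prometheus-server-config
spec:
  address: http://prometheus-server.monitoring.svc:9090   # assumed Prometheus address
---
apiVersion: metrics.turbonomic.io/v1alpha1
kind: PrometheusQueryMapping
metadata:
  name: llm-inference-metrics     # hypothetical name
spec:
  # Declares which entities to discover and the PromQL queries that populate
  # each metric (for example, responseTime and transaction for a service).
  # Use the GPU and TGI/vLLM sample mappings provided with Prometurbo as the
  # starting point; the exact schema is intentionally omitted here.
  entities: []
```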

Turbonomic supply chain and charts

Turbonomic stitches the entities discovered from Prometurbo and Kubeturbo into the supply chain. When you set the scope to container platform entities, charts show GPU and TGI metrics.

  • GPU metrics include GPU (utilization of Tensor cores) and GPU memory (utilization of framebuffer memory).

  • TGI metrics include Concurrent Queries, Queueing Time, Service Time, Response Time, and Transactions.

Turbonomic calculates 10-minute and 1-hour moving averages for these metrics, and then uses the maximum of the two. This allows for faster generation of scale-up actions and slower, more conservative generation of scale-down actions. For example, if the 10-minute average of Response Time rises above your SLO while the 1-hour average is still below it, the higher 10-minute value drives an immediate scale-up; after the spike subsides, the still-elevated 1-hour average delays the corresponding scale-down.

Turbonomic service policies

Service policies are created for the services associated with the LLM inference workloads. In these policies:

  • Scope to the relevant services.

    Tip:

    Create a group of services from Settings > Groups and then specify the group as your scope.

  • Turn on Horizontal Scale Down and Horizontal Scale Up actions.

  • Enable SLOs for Concurrent Queries, Queueing Time, Service Time, Response Time, and Transactions, and then specify your preferred SLO values.

Turbonomic generates workload controller scale actions to maintain the SLOs that you defined in the policies. See the next section for information about the generated actions.

Action Visibility

Turbonomic shows and executes SLO-driven scale actions through workload controllers. A single scale action represents the total number of replicas that you need to scale in or out to meet your SLOs.

Action Center page with Scale highlighted

When you examine an action, SLO is indicated as the reason for the action.

Action Details page with graphs and reason for action highlighted