Monitoring the platform

From the IBM Cloud Pak for Data web client, you can monitor the services that are running on the platform, understand how you are using cluster resources, and be aware of issues as they arise. You can also set quotas on the platform, on individual services, on projects, on data planes, and on physical locations to help mitigate unexpected spikes in resource use.

Accessing the Monitoring page

Required permissions:
To access the Monitoring page, you must have one of the following permissions:
  • Administer platform
  • Manage platform health
  • View platform health
To access the Monitoring page:
  1. Log in to the Cloud Pak for Data web client.
  2. From the navigation menu, select Administration > Monitoring.
From the Monitoring page, you can:
  • See the current resource use (vCPU and memory) for the platform

    If you click the View status and use data arrow on the Platform resource overview card, you can see a breakdown by services, service instances, tool runtimes, data planes, physical locations, pods, and projects.

  • Review the platform resource use for the last 12 hours

    If you click the View historical data arrow on the Platform resource use card, you can see a breakdown by services, service instances, tool runtimes, data planes, physical locations, pods, and projects. You can also view historical data beyond 12 hours. By default, the platform stores up to 30 days of data. However, you can adjust the length of time that data is retained. For details, see Changing the retention period for IBM Cloud Pak for Data monitoring data.

    Platform resource use information includes pods from physical locations, if there are physical locations associated with the Cloud Pak for Data instance.

  • Access at-a-glance platform monitoring
  • View events and alerts
  • Configure and enforce quotas
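
Cloud Pak for Data runs on Red Hat OpenShift, so the pod-level numbers behind the Platform resource overview card can be approximated with the Kubernetes API. The following Python sketch is illustrative only, not the platform's own implementation. It assumes that the kubernetes Python client is installed and that the platform is installed in a namespace named cpd-instance, which is a placeholder that you would replace with your own namespace.

    # Approximate the platform's current vCPU and memory requests by summing
    # the resource requests of every pod in the Cloud Pak for Data namespace.
    # The namespace name "cpd-instance" is a placeholder.
    from kubernetes import client, config

    def parse_cpu(value: str) -> float:
        """Convert a Kubernetes CPU quantity ("500m" or "2") to vCPU."""
        return float(value[:-1]) / 1000 if value.endswith("m") else float(value)

    def parse_memory(value: str) -> float:
        """Convert a Kubernetes memory quantity to GiB (handles Ki, Mi, Gi)."""
        units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
        for suffix, factor in units.items():
            if value.endswith(suffix):
                return float(value[: -len(suffix)]) * factor / 2**30
        return float(value) / 2**30  # plain bytes

    config.load_kube_config()  # use load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    cpu = mem = 0.0
    for pod in v1.list_namespaced_pod("cpd-instance").items:
        for container in pod.spec.containers:
            requests = container.resources.requests or {}
            cpu += parse_cpu(requests.get("cpu", "0"))
            mem += parse_memory(requests.get("memory", "0"))

    print(f"Requested vCPU: {cpu:.2f}, requested memory: {mem:.2f} GiB")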

At-a-glance platform monitoring

From the Status summary column on the Monitoring page, you can see the status of the following items on the platform. Each item has a card that summarizes its status; click the card to get more detailed information.
Services

Services are software that is installed on the platform. Services consume resources as part of their regular operations.

From the Monitoring page, you can see:
  • How many services are installed on the platform
  • The number of services that have either:
    • A service instance in a failed state
    • A pod in a failed or unknown state
  • The number of services that have either:
    • A service instance in a pending state
    • A pod in a pending state
Click the Services card to see:
  • The historical vCPU and memory use for all services

    You can optionally filter the graph to show a single service.

  • The status (or health) of each service
  • The number of service instances, environments, and jobs that are associated with the service (if applicable)
  • The vCPU quota status and the memory quota status (if set)
You can optionally configure the table to show:
  • The current vCPU use, requests, and limits
  • The current memory use, requests, and limits
You can select a service to see:
  • The historical vCPU and memory use for the service
  • The pods that are associated with the service
  • The pods with the highest vCPU and memory use for the service
  • The vCPU and memory quotas for the service (if set)
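
The per-service breakdown in this view depends on how the platform labels its pods, and the exact label scheme is not documented here. As an illustration only, the following sketch groups pods by the standard app.kubernetes.io/name label; the labels on your cluster might differ.

    # Group pods in the Cloud Pak for Data namespace by a label to get a
    # rough per-service pod count. The label key is the Kubernetes
    # recommended label; the platform's real labels might differ.
    from collections import Counter
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    by_service = Counter()
    for pod in v1.list_namespaced_pod("cpd-instance").items:
        labels = pod.metadata.labels or {}
        by_service[labels.get("app.kubernetes.io/name", "<unlabeled>")] += 1

    for service, count in by_service.most_common():
        print(f"{service}: {count} pods")
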
Service instances

Some services can be deployed multiple times after they are installed. Each deployment is called a service instance.

Service instances consume resources as part of their normal operations.

From the Monitoring page, you can see:
  • How many service instances are deployed on the platform
  • The number of service instances where either:
    • The instance is in a failed state
    • A pod is in a failed or unknown state
  • The number of service instances where either:
    • The instance is in an unknown state
    • A pod is in a pending state
Click the Service instances card to see:
  • The historical vCPU and memory use for all service instances

    You can optionally filter the graph to show a single service instance.

  • The status (or health) of each service instance
  • The service that the service instance is associated with
  • Who provisioned the instance and when
  • The number of users who have access to the service instance
  • The number of pods associated with the service instance
You can optionally configure the table to show:
  • The current vCPU use, requests, and limits
  • The current memory use, requests, and limits
You can select a service instance to see:
  • The historical vCPU and memory use for the service instance
  • The pods that are associated with the service instance
You can click the Options icon for a service instance to:
  • Manage access to the instance
  • Delete the instance
However, to complete either of these tasks, you must be an administrator of the service instance or you must have the Administer platform permission.
Tool runtimes

Runtime environments specify the hardware and software configurations that are used to run analytical assets and jobs. Environments consume resources as part of their regular operations.

By default, this card is not displayed on the platform. It is displayed only if you install a service that uses environments.

From the Monitoring page, you can see:
  • How many environments are running on the platform
  • The number of environments with at least one pod in a failed state
Click the Tool runtimes card to see:
  • The status (or health) of each environment
  • Who started the environment and when
  • The project or deployment space where the environment is running
  • The number of GPU requests
  • The current resource use for the environment

You can select an environment to see the pods that are associated with it.

You can optionally click the Stop runtime instance icon to stop the environment.

Pods

Services are composed of Kubernetes pods.

If a pod is in a failed or unknown state, it can impact the health of the service. If a pod is pending, the service might not be able to process specific requests until the pod is running.

From the Monitoring page, you can see:
  • How many pods are associated with the platform
  • The number of pods in a failed or unknown state
  • The number of pods that are pending
    Kubernetes is attempting to create and schedule these pods. The pods might remain in the pending state if:
    • Kubernetes is waiting for a process to complete or doesn't have sufficient resources to fulfill the pod requests
    • The platform or service quota settings are preventing new pods from being created
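
To see why a pod remains in the pending state, you can inspect the pod's PodScheduled condition, where the Kubernetes scheduler records a reason (such as Unschedulable) and a message that describes the missing resources. A minimal sketch, again assuming the placeholder namespace cpd-instance:

    # List pending pods and print the scheduler's reason for each.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # The field selector filters server-side to pods in the Pending phase.
    pending = v1.list_namespaced_pod(
        "cpd-instance", field_selector="status.phase=Pending"
    )
    for pod in pending.items:
        for cond in pod.status.conditions or []:
            if cond.type == "PodScheduled" and cond.status == "False":
                print(f"{pod.metadata.name}: {cond.reason} - {cond.message}")
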
Click the Pods card to see:
  • The status (or health) of each pod
  • The number of containers in the ready state compared to the number of containers defined for the pod
  • The service the pod is associated with
  • Whether the pod is associated with a fixed resource, service instance, job, or environment
  • The function or application of the pod
  • The service instance that the pod is associated with
  • When the pod was started
  • How many times the pod has restarted
You can optionally configure the table to show:
  • The Red Hat® OpenShift® project (namespace) where the pod is running
  • The environment, job, project, or deployment space that the pod is associated with
  • The current vCPU use, requests, and limits
  • The current memory use, requests, and limits
You can click the Options icon for a pod to:
  • See the details of the pod
  • View the pod logs
  • Restart the pod
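
The log and restart actions are also available through the Kubernetes API. The following sketch is illustrative; the pod name is a placeholder, and restarting a managed pod means deleting it so that its owning controller recreates it.

    # View the last lines of a pod's log, then restart the pod by deleting
    # it so that its controller (Deployment, StatefulSet, ...) recreates it.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    namespace, pod_name = "cpd-instance", "example-pod-0"  # placeholders

    print(v1.read_namespaced_pod_log(pod_name, namespace, tail_lines=50))
    v1.delete_namespaced_pod(pod_name, namespace)
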
Data planes

A data plane is a logical grouping of one or more physical locations. You can deploy workloads to a data plane. The workload is scheduled on one of the physical locations that are associated with the data plane.

From the Monitoring page, you can see:
  • How many data planes are currently on the platform
Click the Data planes card to see:
  • The current vCPU use and requests for each data plane
  • The current memory use and requests for each data plane
You can select a data plane to see:
  • The historical vCPU and memory use for the data plane
  • The pods that are associated with the data plane
Projects

Projects are collaborative workspaces where you work with data and other assets to accomplish a particular goal.

By default, this card is not displayed on the platform. It is displayed only if you install a service that uses the Cloud Pak for Data common core services.

From the Monitoring page, you can see:
  • How many projects are currently on the platform
Click the Projects card to see:
  • The current vCPU use and requests for each project
  • The current memory use and requests for each project
You can optionally configure the table to show:
  • The current vCPU limits
  • The current memory limits
You can select a project to see:
  • The project collaborators
  • The historical vCPU and memory use for the project
  • The pods that are associated with the project
Physical locations

A remote physical location is processing infrastructure on a remote cluster. When you set up a remote physical location, you install Cloud Pak for Data agents on a remote cluster.

After you set up a remote physical location, you can register the physical location with the instance of Cloud Pak for Data that you want to expand. Then, you can add the physical location to a data plane. You can optionally add the same remote physical location to multiple data planes.

From the Monitoring page, you can see:
  • How many physical locations are registered with the platform
Click the Physical locations card to see:
  • The current vCPU use and requests for each physical location
  • The current memory use and requests for each physical location
You can select a physical location to see:
  • The historical vCPU and memory use for the physical location
  • The pods that are associated with the physical location

Events and alerts

An alert is triggered by an event or a series of events. The severity of an event indicates whether an issue occurred or whether there is a potential issue.

From the Monitoring page, you can see:
  • The number of critical alerts
  • The number of warning alerts

If you click either of these entries, you are taken to a list of alerts that is filtered based on the entry that you selected.

If you click the View all events and alerts arrow on the Alerts card, you can see a complete list of events and alerts.

You can optionally customize the events that trigger alerts. For details, see Monitoring and alerting in Cloud Pak for Data.
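
The alerts on this page are Cloud Pak for Data records, and they are managed through the alerting configuration described above. At the Kubernetes level, you can list recent Warning events in the platform namespace as a rough analogue; the following sketch uses the placeholder namespace cpd-instance.

    # List Kubernetes Warning events in the Cloud Pak for Data namespace.
    # These are cluster-level events, not the platform's own alert records.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    events = v1.list_namespaced_event("cpd-instance", field_selector="type=Warning")
    for ev in events.items:
        obj = ev.involved_object
        print(f"{ev.last_timestamp} {obj.kind}/{obj.name}: {ev.reason} - {ev.message}")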

Setting and enforcing quotas

A quota is a way for you to specify the maximum amount of memory and vCPU that you want the platform, a specific service, or a project to use. A quota acts as a benchmark against which you can measure your actual memory and vCPU use, letting you know when your use is approaching or surpassing your target.

Note: Setting a quota is not the same thing as scaling.

Scaling impacts the overall capacity of a service by adjusting the number of pods in the service. (You can also scale the Cloud Pak for Data control plane.) When you scale a service up, the service becomes more resilient. Additionally, the service might have increased parallel processing capacity.

Setting a quota on a service does not change the scale. Scale and quota are independent settings.

In addition to setting a quota, you can optionally enable quota enforcement. When you enforce quotas, new pods cannot be created if the pods would push your resource use above your quota.

The behavior of the quota enforcement feature depends on whether you set your quotas on pod requests or limits. (For an in-depth explanation of requests and limits, see Managing Resources for Containers in the Kubernetes documentation.)
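
For example, a container that reserves 0.5 vCPU and 1 GiB of memory but is allowed to use up to 2 vCPU and 4 GiB would declare the following resources. This sketch uses the Kubernetes Python client models; the image name is a placeholder.

    # Requests are what the scheduler reserves for the container; limits are
    # the hard cap. A container that exceeds its memory limit is terminated,
    # and CPU use above the limit is throttled.
    from kubernetes import client

    resources = client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "1Gi"},
        limits={"cpu": "2", "memory": "4Gi"},
    )
    container = client.V1Container(
        name="example",
        image="registry.example.com/app:latest",  # placeholder image
        resources=resources,
    )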

Enforcing quotas on pod requests
A request is the amount of vCPU or memory that the pod expects to use as part of its normal operations.
When you set quotas on pod requests, you have more flexibility in how your resources are allocated:
  • If you enforce platform quotas, the control plane and any services that are running on this instance of Cloud Pak for Data are prevented from creating new pods if the requests in the new pod would push the platform over either the platform memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. However, the existing pods can use more memory or vCPU than the platform quota.
  • If you enforce a service quota, the service is prevented from creating new pods if the requests in the new pod would push the service over either the memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. However, the existing pods can use more memory or vCPU than the service quota.
  • If you enforce a project quota, the project is prevented from creating new pods if the requests in the new pods would push the project over either the memory quota or the vCPU quota. The pods remain in the pending state until there are sufficient resources available. However, the existing pods can use more memory or vCPU than the project quota.
Enforcing quotas on pod limits
A limit is the absolute maximum amount of vCPU or memory that the pod can use. If the pod tries to consume additional resources, the pod is terminated. In most cases, the requested resources (the requests) are less than the limits.
When you set quotas on pod limits, you have more control over your resources:
  • If you enforce platform quotas, the control plane and any services that are running on this instance of Cloud Pak for Data are prevented from creating new pods if the limits in the new pods would push the platform over either the platform memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. When you enforce platform quotas on pod limits, the quota is a cap on the total resources that existing pods can use.
  • If you enforce a service quota, the service is prevented from creating new pods if the limits in the new pod would push the service over either the memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. When you enforce service quotas on pod limits, the quota is a cap on the total resources that the existing pods can use.
  • If you enforce a project quota, the project is prevented from creating new pods if the limits in the new pod would push the project over either the memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. When you enforce project quotas on pod limits, the quota is a cap on the total resources that the existing pods can use.

If you don't enforce quotas, the quota has no impact on the behavior of the platform or services. If you are approaching or surpassing your quota settings, it's up to you whether you want to allow processes to consume resources or whether you want to stop processes to release resources.
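
Conceptually, enforcement is an admission check: before a new pod is created, its requests (or limits) are added to the running total, and the pod is held in the pending state if the total would exceed the quota. The following sketch illustrates that logic; it is not the platform's implementation.

    # Illustrative admission check for quota enforcement.
    def admits(total_vcpu: float, new_pod_vcpu: float, quota_vcpu: float,
               enforce: bool) -> bool:
        """Return True if the new pod can be created."""
        if not enforce:
            return True  # the quota is only a benchmark; nothing is blocked
        return total_vcpu + new_pod_vcpu <= quota_vcpu

    # Example: 18 vCPU in use, the new pod adds 4 vCPU, and the quota is 20.
    print(admits(18.0, 4.0, 20.0, enforce=True))   # False: pod stays pending
    print(admits(18.0, 4.0, 20.0, enforce=False))  # True: quota not enforced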

Setting the platform quota

Note: Memory and vCPU requests and limits that are displayed with platform quotas include pods from physical locations, if physical locations are associated with the Cloud Pak for Data instance. However, platform quota enforcement applies only to pods on the hub for the Cloud Pak for Data instance.

To set the platform quota:

  1. From the Monitoring page, click the Edit platform quotas icon on the Platform quotas card.
  2. Select Monitor platform resource use against your target use.
  3. Specify whether you want to set quotas on pod Requests or Limits.
  4. Specify your vCPU quota. This is the target maximum amount of vCPU you want the platform to use.
  5. Specify your vCPU alert threshold. When you reach the specified percent of vCPU in use, the platform alerts you based on your alert settings.
  6. Specify your Memory quota. This is the target maximum amount of memory you want the platform to use.
  7. Specify your Memory alert threshold. When you reach the specified percent of memory in use, the platform alerts you.
  8. If you want to automatically enforce the platform quota settings, select Enforce quotas.
  9. Click Save.
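
The alert thresholds in steps 5 and 7 are percentages of the quota. For example, with a vCPU quota of 50 and an alert threshold of 80%, the platform alerts you when 40 vCPU are in use. The same arithmetic applies to the service, project, data plane, and physical location quotas that are described in the following sections.

    # Alert-threshold arithmetic: the platform alerts when use reaches the
    # given percentage of the quota.
    def alert_point(quota: float, threshold_percent: float) -> float:
        return quota * threshold_percent / 100

    print(alert_point(50, 80))  # vCPU quota 50, threshold 80% -> alerts at 40.0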

Setting service quotas

Note: Memory and vCPU requests and limits that are displayed with service quotas include pods from physical locations, if physical locations are associated with the Cloud Pak for Data instance. However, service quota enforcement applies only to pods on the hub for the service.

To set service quotas:

  1. On the Monitoring page, click Services on the Quotas card.
  2. Locate the service for which you want to edit the quota, and click the Edit icon.
  3. Select Monitor service resource use against your target use.
  4. Specify whether you want to set quotas on pod Requests or Limits.
  5. Specify your vCPU quota. This is the target maximum amount of vCPU you want the service to use.
  6. Specify your vCPU alert threshold. When you reach the specified percent of vCPU in use, the platform alerts you based on your alert settings.
  7. Specify your Memory quota. This is the target maximum amount of memory you want the service to use.
  8. Specify your Memory alert threshold. When you reach the specified percent of memory in use, the platform alerts you.
  9. If you want to automatically enforce the service quota settings, select Enforce quotas.
  10. Click Save.

Setting project quotas

To set project quotas:

  1. On the Monitoring page, click Projects on the Quotas card.
  2. Locate the project for which you want to edit the quota and click the Edit icon.
  3. Select Monitor project resource use against your target use.
  4. Specify whether you want to set quotas on pod Requests or Limits.
  5. Specify your vCPU quota. This is the target maximum amount of vCPU you want the project to use.
  6. Specify your vCPU alert threshold. When you reach the specified percent of vCPU in use, the platform alerts you based on your alert settings.
  7. Specify your Memory quota. This is the target maximum amount of memory you want the project to use.
  8. Specify your Memory alert threshold. When you reach the specified percent of memory in use, the platform alerts you.
  9. If you want to automatically enforce the project quota settings, select Enforce quotas.
  10. Click Save.

Setting data plane quotas

To set data plane quotas:

  1. On the Monitoring page, click Data planes on the Quotas card.
  2. Locate the data plane for which you want to edit the quota and click the Edit icon.
  3. Select Monitor data plane resource use against your target use.
  4. Specify whether you want to set quotas on pod Requests or Limits.
  5. Specify your vCPU quota. This is the target maximum amount of vCPU you want the data plane to use.
  6. Specify your vCPU alert threshold. When you reach the specified percent of vCPU in use, the platform alerts you based on your alert settings.
  7. Specify your Memory quota. This is the target maximum amount of memory you want the data plane to use.
  8. Specify your Memory alert threshold. When you reach the specified percent of memory in use, the platform alerts you.
  9. Click Save.

Setting physical location quotas

To set physical location quotas:

  1. On the Monitoring page, click Physical locations on the Quotas card.
  2. Locate the physical location for which you want to edit the quota and click the Edit icon.
  3. Select Monitor physical location resource use against your target use.
  4. Specify whether you want to set quotas on pod Requests or Limits.
  5. Specify your vCPU quota. This is the target maximum amount of vCPU you want the physical location to use.
  6. Specify your vCPU alert threshold. When you reach the specified percent of vCPU in use, the platform alerts you based on your alert settings.
  7. Specify your Memory quota. This is the target maximum amount of memory you want the physical location to use.
  8. Specify your Memory alert threshold. When you reach the specified percent of memory in use, the platform alerts you.
  9. Click Save.