How we resolved Kubernetes node memory issues


Developers who work with Kubernetes know that a node is a worker machine. Each node contains the services necessary to run pods and is managed by the control plane components.

If you are working on site reliability and managing cloud systems in a Kubernetes environment, node resources and node memory can be a challenge. Our team faced an ongoing issue in one of our cloud systems. In this blog post, I share the steps we used to determine the cause and how we resolved the issue.

When you manage a cloud environment, you might see frequent error messages from your monitoring or notification tools.

You might see messages about a particular node with “high memory usage”:

Alert:
Cluster_Name
Node has high memory usage > 70%. XX.XX.XX.XX:XXXX has high memory usage of xx.xx.

When a node lacks resources, you also see many alerts that pods on the node are failing and restarting:

Alert:
Cluster_Name
“Pod has restarted frequently in last hour. Pod xxx-xxx-xxxx/xxx-xxx-xxx-xxxxx was restarted 11 times.”
“Pod has restarted frequently in last hour. Pod yyy-yyy-yyyy/yyy-yyy-yyy-yyyy was restarted 10 times.”

If you don’t fix this issue, users of your applications experience slow performance and longer response times.

How we found the cause of these node resource issues

To check the overall status of your Kubernetes nodes, use the following command:

$ kubectl top nodes

You see results like the following example. The memory usage of nodes xx.xxx.xx.x4 and xx.xxx.xx.x6 is very high:

NAME            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
xx.xxx.xx.x1    1121m        28%    9748Mi          73%
xx.xxx.xx.x2    404m         21%    1926Mi          67%
xx.xxx.xx.x3    1350m        34%    8200Mi          61%
xx.xxx.xx.x4    460m         23%    3140Mi          109%
xx.xxx.xx.x5    1117m        28%    6369Mi          47%
xx.xxx.xx.x6    403m         20%    3200Mi          111%
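
One small, optional shortcut: newer kubectl releases let you sort this output by memory usage so the heavily loaded nodes are easier to spot. If your kubectl version supports the --sort-by flag for kubectl top, you can run:

$ kubectl top nodes --sort-by=memory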

To get more detailed information for each node, use the following command:

$ kubectl describe node <node_name>

Here’s what the details look like for node xx.xxx.xx.x4:

…
Namespace    Name                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
---------    ----                  ------------  ----------  ---------------  -------------  ---
ns-xxxxx     pod-xxx-xxx-xxxxxx    1 (52%)       4 (208%)    500Mi (17%)      2Gi (71%)      15d
ns-yyyyy     pod-yyy-yyy-yyyyyy    20m (1%)      0 (0%)      500Mi (17%)      500Mi (17%)    36d
ns-zzzzz     pod-zzz-zzz-zzzzzs    10m (0%)      100m (5%)   16Mi (0%)        64Mi (2%)      40d
…

Here’s what the details look like for node xx.xxx.xx.x6:

…
Namespace    Name                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
---------    ----                 ------------  ----------  ---------------  -------------  ---
ns-aaaaa     pod-aaa-aaa-aaaaa    1 (52%)       4 (208%)    500Mi (17%)      2Gi (71%)      15d
ns-bbbbb     pod-bbb-bbb-bbbbb    100m (5%)     0 (0%)      512Mi (17%)      1Gi (35%)      35h
ns-ccccc     pod-ccc-ccc-ccccc    100m (5%)     500m (26%)  64Mi (2%)        128Mi (4%)     40d
…

In these examples, you can see that one pod on each node is taking a large share of the resources. On node xx.xxx.xx.x4, it is pod-xxx-xxx-xxxxxx in the namespace ns-xxxxx. On node xx.xxx.xx.x6, it is pod-aaa-aaa-aaaaa in the namespace ns-aaaaa.

These pods consume so much CPU and memory on their nodes that they trigger the “high memory usage” alerts and the frequent pod restarts.
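
The describe output only shows what each pod has requested, not what it is actually using. As an extra, optional check (it assumes the cluster's metrics server is running), you can compare the requests against live usage in the suspect namespace; ns-xxxxx here is just the placeholder namespace from the example above:

$ kubectl top pods -n ns-xxxxx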

How we resolved the node resource issues

We confirmed that the issue was caused by a specific application, so we could take action on that application directly.

To get more information about the pod's application and confirm its owner and current configuration, we used the following command:

$ kubectl get pods <pod_name> -n <namespace> -o json

It returned the following information. Compared with the resources available on each node, this application is assigned a large share of them. With this information, our team could adjust the resource configuration and redeploy the application.

…
     "resources": {
         "limits": {
            "cpu": "4",
            "memory": "2Gi"
         },
         "requests": {
            "cpu": "1",
            "memory": "500Mi"
         }
      },
…
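
One way to apply a smaller resource configuration, if the pod is managed by a Deployment, is to update the requests and limits directly with kubectl instead of editing the manifest. This is only a sketch; the deployment name, namespace, and the lower values below are placeholders, not the values our team actually chose:

# Hypothetical example values; set requests and limits that fit your workload
$ kubectl set resources deployment <deployment_name> -n <namespace> \
    --requests=cpu=250m,memory=512Mi \
    --limits=cpu=1,memory=1Gi

Kubernetes then rolls out new pods with the updated settings, and the memory requests reported by kubectl describe node should drop accordingly.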

We also needed to check whether the pod was assigned to the correct node group. If an application is scheduled onto the ingress worker node group, it can overload the ingress worker nodes.

…
        "nodeName": "xx.xx.xx.xxx",
…
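
To see which node group a node belongs to, you can list the nodes together with the label that distinguishes the groups. This assumes the groups are identified by a worker-type label, matching the nodeSelector example that follows; your cluster might use a different label key:

# Show the worker-type label as an extra column for each node
$ kubectl get nodes -L worker-type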

To assign the pod to the correct node group, we added a nodeSelector setting to the deployment YAML file. This setting schedules the pod onto the intended node group:

…
        "nodeSelector": {
            "worker-type": "application"
        },
…
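
If you would rather not edit the deployment YAML by hand, the same change can be applied as a patch. This is a sketch that assumes the pod is managed by a Deployment and that the target nodes are already labeled worker-type=application:

# Add the nodeSelector to the Deployment's pod template
$ kubectl patch deployment <deployment_name> -n <namespace> \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"worker-type":"application"}}}}}'

Either way, the resulting rollout reschedules the pods onto nodes with the matching label and takes the load off the ingress worker nodes.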

Finally, to clean up the specific pod application, we needed to find the corresponding Kubernetes resources and delete them with the following commands. (Before deleting anything, make sure you know what you are deleting and that it is safe to do so.)

$ kubectl delete pod <pod_name>
$ kubectl delete service <service_name>
$ kubectl delete deployment <deployment_name>
$ ibmcloud cr image-rm <image_name>:<tag>
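
As a final, optional check that nothing related to the application was left behind, you can list what remains in its namespace. Keep in mind that kubectl get all covers common workload resources but not everything (for example, ConfigMaps and Secrets), so check those separately if the application created any:

$ kubectl get all -n <namespace>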

Summary

After we found and tested the steps for resolving these Kubernetes node memory issues, we wanted to share them with other teams who manage cloud resources in a Kubernetes environment. We hope these tips help your team better manage your cloud environments.

You can find more information about configuring out-of-resource handling in the Kubernetes documentation. If you are working with the IBM Cloud Kubernetes Service, see the worker node troubleshooting information in the Debugging your cluster documentation.