February 24, 2021 | By Kean Kuiper, Saju Mathew and Rei Odaira | 5 min read

In the second part of this five-part blog series, we will explain the Linux network virtualization layers and introduce a tool called SystemTap.

We used SystemTap to narrow down the scope of the packet loss issue described in Part 1. This series is intended for network administrators and developers who are interested in diagnosing packet loss in the Linux network virtualization layers.

Linux network virtualization layers

To understand what caused the packet loss, it is important to know the Linux network virtualization layers. Figure 1 shows how received packets flow through the network interface card (NIC) and the host Linux kernel, up to the guest operating system. It assumes Linux as the guest operating system, but the entire study is applicable to any guest operating system.

When it comes to handling incoming packets, network virtualization consists of three top-level layers:

  • NIC hardware
  • Host Linux kernel
  • Guest Linux kernel

The host Linux kernel is composed of two sub-layers, or two groups of threads — the host ksoftirqd threads and the vhost-net threads. The layers are connected by queues: from bottom to top, these are the virtual function (VF) queues, the macvtap queues and the virtqueues:

Figure 1: How received packets flow from the NIC through the host Linux kernel up to the guest operating system

Virtual function queues

The NIC hardware dispatches a received packet to one of the VF queues. Traditionally, there was only one queue between the NIC and the kernel, which meant that a single kernel thread handled all the received packets. In modern systems, however, a single kernel thread can no longer keep up with the pace of received packets, because network throughput has increased drastically over the decades while CPU frequency has not. This is why modern scalable network systems typically have multiple queues.

In our cloud environment, each VF is configured with four queues. The NIC hardware dispatches a received packet based on the hashed value of the fields in the packet header.

Associated with the four VF queues are four IRQ handlers and four corresponding host ksoftirqd threads. A received packet triggers an interrupt (IRQ) to the host Linux kernel, and the corresponding host ksoftirqd thread communicates with the VF through the NIC driver to process the packet.

In a non-virtualized Linux system, the ksoftirqd thread would then call the TCP/IP functions and would eventually notify a user application of the received packet. In a virtualized Linux system, however, the host Linux kernel must instead pass the received packet to a guest operating system.
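
With SystemTap, the instrumentation tool introduced later in this post, you can observe this dispatching directly. The following script is only a minimal sketch: it assumes that __netif_receive_skb(), an internal function that the receive paths funnel through, has not been inlined away on your kernel build. It tallies received packets by the handling kernel thread and CPU:

global rx

probe kernel.function("__netif_receive_skb")
{
    rx[execname(), cpu()]++
}

probe end
{
    foreach ([name, c] in rx)
        printf("%s on CPU %d handled %d packets\n", name, c, rx[name, c])
}

On our system, we would expect such a tally to be dominated by the four host ksoftirqd threads that serve the four VF queues.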

Macvtap

Macvtap is an interface between the host Linux kernel and the guest operating system (Linux, in this example). It is a device driver for virtualized bridged networking across guests and the host, and it allows the host to expose a virtual NIC to the guest in a configurable manner. In our simplified example, its sole purpose is to enqueue the received packet to one of the macvtap queues, which are exposed to the upper layer as a macvtap character device. In our case, this is where packets were dropped; the details will be explained in the subsequent sections.

Corresponding to each macvtap queue, there is a dedicated kernel thread called vhost-net that dequeues the packets. It passes them to the guest operating system through another set of queues called virtqueues. The virtqueues are exposed to the guest operating system as a virtual NIC interface. Readers are referred to this very interesting series of blogs for the details of the vhost-net threads and virtqueues.
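
The vhost-net threads can be watched in a similar fashion. The probe below is only a sketch: it assumes that vhost_net is loaded as a kernel module and that its internal handle_rx() function is visible to SystemTap, which requires debuginfo and may fail if the compiler inlined the function:

probe module("vhost_net").function("handle_rx")
{
    # Each vhost-net kernel thread shows up with a name like vhost-<pid>.
    printf("%s dequeuing on CPU %d\n", execname(), cpu())
}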

For exactly the same reason that there are multiple VF queues, it is common to have multiple macvtap queues so that the guest operating system can process the received packets in multiple threads. It is important to note that the number of macvtap queues does not necessarily match the number of VF queues.

In our configuration, there are four VF queues and only three macvtap queues. Based on our measurements, this configuration best balances the network throughput against CPU utilization. As a result, the four host ksoftirqd threads that correspond to the four VF queues can dispatch the received packets to any of the three macvtap queues, based on the hashed values of the packet headers.
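
One way to see how packets spread across the macvtap queues is to probe the queue-selection function. The following sketch assumes that drivers/net/tap.c is built as the tap kernel module and that tap_get_queue() survived inlining; it counts how many packets each queue receives, keyed by the queue's address:

global qcount

probe module("tap").function("tap_get_queue").return
{
    qcount[$return]++
}

probe end
{
    foreach (q in qcount)
        printf("queue %p received %d packets\n", q, qcount[q])
}

With our configuration, three distinct queue addresses should appear in the output.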

Source code analysis

Let’s consider the Linux source code to understand where the packet loss occurred. Our analysis was based on Linux version 4.15.0. As explained in Part 1, we found that the RX dropped counter of the macvtap device exactly matched the number of times the 200-millisecond delays occurred. The RX dropped counter is incremented in macvlan_count_rx() in include/linux/if_macvlan.h:

static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
                                    unsigned int len, bool success,
                                    bool multicast)
{
        ...
}

Analysis of the source code revealed that macvlan_count_rx() was called from macvtap_count_rx_dropped() in drivers/net/macvtap.c, which is a simple wrapper around macvlan_count_rx():

static void macvtap_count_rx_dropped(struct tap_dev *tap)
{
        ...
        macvlan_count_rx(vlan, 0, 0, 0);
}

Before investigating where macvtap_count_rx_dropped() itself was called from, we wanted to ensure that no other code path incremented the RX dropped counter. For this purpose, we utilized an instrumentation tool called SystemTap.

Introducing SystemTap

SystemTap is a tool for gathering information about a running Linux system. It provides a command-line interface and a scripting language to instrument a running kernel. Under the hood, it compiles a user-written script into a loadable kernel module (or, with alternative backends, into a Berkeley Packet Filter (BPF) program) and inserts it into the kernel.

A very simple SystemTap script would look like this:

probe module("macvtap").function("macvtap_count_rx_dropped").call
{
    printf("dropped\n")
}

This script prints out a message every time macvtap_count_rx_dropped() is called. Because macvtap_count_rx_dropped() is defined not in the core kernel but in a kernel module called macvtap, you need to qualify it with module("macvtap").
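
If you are unsure which module a function lives in, or how SystemTap will resolve a probe point, the stap command can list all matching probe points without running anything, provided that the debuginfo for the module is installed:

# stap -l 'module("macvtap").function("*")'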

We named this script macvtap_dropped.stp and executed it on the host Linux, using the stap command of SystemTap. Every time the host Linux kernel called macvtap_count_rx_dropped(), the SystemTap script printed a new "dropped" message on the console:

# stap -v macvtap_dropped.stp
dropped
dropped
dropped
dropped
dropped
dropped
dropped
dropped
...

We counted the number of messages and confirmed that it exactly matched the RX dropped counter. SystemTap was a great tool for us to check that we were on the right path toward the root cause of packet loss.
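
Counting messages by hand becomes tedious when drops are frequent. As a minimal variant of the same probe, the script can keep the count in a global variable and print the total when it exits:

global drops

probe module("macvtap").function("macvtap_count_rx_dropped").call
{
    drops++
}

probe end
{
    printf("total dropped: %d\n", drops)
}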

Narrowing down the scope

In our system, macvtap_count_rx_dropped() is called only from tap_handle_frame() in drivers/net/tap.c. The following is a version of tap_handle_frame(), simplified for brevity:

 1: rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
 2: {
    ...
 3:         q = tap_get_queue(tap, skb);
    ...
 4:         if (__skb_array_full(&q->skb_array))
 5:                 goto drop;
    ...
 6:         if (netif_needs_gso(skb, features)) {
 7:                 struct sk_buff *segs = __skb_gso_segment(skb, features, false);
 8:                 if (IS_ERR(segs))
 9:                         goto drop;
    ...
10:         } else {
    ...
11:                 if (skb_checksum_help(skb))
12:                         goto drop;
13:                 if (skb_array_produce(&q->skb_array, skb))
14:                         goto drop;
15:         }
    ...
16: drop:
17:         if (tap->count_rx_dropped)
18:                 tap->count_rx_dropped(tap);
    ...
19: }

The tap_handle_frame() function is executed in the context of a host ksoftirqd thread and is called for each received packet. The parameter pskb represents the received packet. The tap_handle_frame() function calls macvtap_count_rx_dropped() at line 18 through a function pointer. Line 18 is reachable only through the drop label at line 16. Within tap_handle_frame(), four places jump to the drop label: lines 5, 9, 12, and 14.

Before understanding the details of tap_handle_frame(), we wanted to narrow down the scope of the analysis. This was exactly where SystemTap came in handy. We first checked whether packets were dropped at line 9 for Generic Segmentation Offload (GSO) or at line 12 for checksum. Immediately before reaching lines 9 and 12, the execution must call functions __skb_gso_segment() and skb_checksum_help(), respectively. We added instrumentation for these two function calls to the previous SystemTap script, as follows:

probe module("macvtap").function("macvtap_count_rx_dropped").call
{
    printf("dropped\n")
}
 
probe kernel.function("skb_checksum_help").call
{
    printf("checksum\n")
}
 
probe kernel.function("__skb_gso_segment").call
{
    printf("gso\n")
}

As shown below, by executing this script, we observed only "dropped" messages, and no "checksum" or "gso" messages. This means that no packet was lost at line 9 or 12 of tap_handle_frame(). We successfully narrowed down the scope and no longer needed to consider these execution paths:

# stap -v macvtap_gso_or_checksum.stp
dropped
dropped
dropped
dropped
dropped
dropped
dropped
dropped
dropped
dropped
...
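
As an aside, SystemTap can also probe individual statements rather than whole functions, which offers another way to pinpoint which of the remaining jumps to the drop label fires. The following sketch uses a wildcard over the source lines of tap_handle_frame(), so it requires debuginfo for the tap module and produces verbose output, one line per executed statement:

probe module("tap").statement("tap_handle_frame@drivers/net/tap.c:*")
{
    printf("%s\n", pp())
}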

Summary

In this post, we explained the Linux network virtualization layers to understand where the packet loss occurred. We then introduced SystemTap, a scripting language and runtime tool for instrumenting a running Linux kernel. SystemTap saved us diagnosis time by letting us narrow down the scope of the analysis.

In the next post of this five-part blog series, we will focus on the remaining possibility of packet loss in tap_handle_frame() — that is, a queue overflow. We will present our very weird, but interesting, observations of what was going on in the macvtap queues.
