64-bit DMA concepts

On new IBM® Power Systems servers running Linux®, some of the PCIe slots support a feature called 64-bit direct memory access (DMA). DMA improves I/O operations and, therefore, system performance.

Adapters and PCIe slots that are enabled for 64-bit DMA allow I/O traffic to take place with less operating system usage, improving latency (time delay for data transfer) and throughput (average rate of successful data transfer).

Following are some key terms for 64-bit DMA.

PCIe
Peripheral Component Interconnect Express®, or PCI Express. PCIe is a high-speed serial computer expansion bus standard for connecting hardware devices to a system board. PCIe is one of the primary buses that are used to attach peripheral devices to an IBM Power Systems server.
DMA
Direct memory access. DMA allows an I/O adapter to access a limited amount of memory directly, without involving the CPU for memory transfers. Both the device driver for the adapter and the operating system must recognize and support DMA.
RDMA
Remote direct memory access. RDMA supports direct memory access from the memory of one system into another system's memory, without increasing operating system usage. To accomplish this access, data is transferred directly between the network adapter and the application memory area, which eliminates copies through the operating system's network stack. Removing the operating system from the data path promotes high-throughput, low-latency communication. RDMA is often used in high-performance computing (HPC).
IOMMU
Input/output memory management unit. An IOMMU connects DMA-capable I/O buses to the main memory and manages the I/O memory addresses. On IBM Power Systems servers, a Translation Control Entry (TCE) translates the addresses that I/O devices generate into physical addresses.
DMA window
Direct memory access window. A DMA window is the range of addresses that the adapter is allowed to access. A typical DMA window is relatively small, around 2 GB, but can be as large as 1 TB. Addresses in the DMA window are mapped to physical memory by using a Translation Control Entry (TCE) table in the IOMMU.
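
For illustration, the following C program sketches the translation arithmetic that a TCE table conceptually performs: the I/O page number selects a table entry, and the offset within the page is reattached to the translated address. The page size, table size, and address values here are simplified assumptions for the example, not the actual Power firmware layout.

    #include <stdint.h>
    #include <stdio.h>

    #define TCE_PAGE_SHIFT 12                /* assume 4 KB I/O pages */
    #define TCE_PAGE_SIZE  (1UL << TCE_PAGE_SHIFT)
    #define TCE_ENTRIES    8                 /* toy table: an 8-page DMA window */

    /* Each entry maps one I/O page of the DMA window to a physical page. */
    static uint64_t tce_table[TCE_ENTRIES];

    /* Translate a bus (DMA window) address to a physical address. */
    static uint64_t tce_translate(uint64_t bus_addr)
    {
        uint64_t page   = bus_addr >> TCE_PAGE_SHIFT;
        uint64_t offset = bus_addr & (TCE_PAGE_SIZE - 1);

        if (page >= TCE_ENTRIES)
            return UINT64_MAX;               /* outside the DMA window */
        return tce_table[page] + offset;
    }

    int main(void)
    {
        /* Pretend the OS mapped I/O page 2 to physical page frame 0x12345. */
        tce_table[2] = (uint64_t)0x12345 << TCE_PAGE_SHIFT;

        printf("bus 0x2010 -> phys 0x%llx\n",
               (unsigned long long)tce_translate(0x2010));
        return 0;
    }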

In the normal mode of DMA operation, device drivers must request mappings from the operating system for every I/O operation and remove those mappings after they are used. Some I/O patterns allow the driver to cache and reuse mappings. The performance advantage of using an IOMMU is that data is delivered directly to, or read directly from, memory that is part of the application's address space. Typically, this approach eliminates extra memory copies for the I/O.
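
The following kernel-style C fragment sketches this per-operation mapping pattern with the Linux DMA API (dma_map_single and dma_unmap_single). The function and buffer names are hypothetical, and error handling is abbreviated.

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Hypothetical driver fragment: map one buffer for a single DMA
     * transfer, let the device use it, then remove the mapping. On
     * Power, each mapping request can require the OS to update TCEs. */
    static int example_do_io(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t handle;

        /* Ask the OS for a mapping inside the device's DMA window. */
        handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
            return -ENOMEM;

        /* ... program the adapter with 'handle' and wait for I/O ... */

        /* Remove the mapping when the transfer completes. */
        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
        return 0;
    }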

64-bit DMA
64-bit direct memory access. 64-bit DMA is a PCIe slot capability on IBM Power Systems servers that enables a wider DMA window, possibly allowing all the partition memory to be mapped for DMA. This feature avoids extra operating system usage when the driver requests DMA mappings, because all the system memory that is assigned to the partition is already mapped. Consequently, data transfer between an I/O card in such a slot and the system memory is more efficient and has lower latency.

This capability is also known as Huge Dynamic DMA Window in some Linux kernel patches and discussions.

Not all PCIe slots or PCIe adapters support 64-bit DMA. If the card or the device driver does not support the 64-bit DMA feature, the PCIe slot works in the standard way and is not differentiated from the other slots.
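
As a sketch of how a Linux driver opts in to this capability, the standard pattern is to request a 64-bit DMA mask and fall back to 32-bit addressing if the platform or slot cannot provide it. The probe function below is hypothetical.

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>
    #include <linux/pci.h>

    /* Hypothetical probe fragment: request 64-bit DMA addressing first.
     * If the slot or platform cannot provide it, fall back to the
     * standard 32-bit DMA window; the device still works, but without
     * the wider mapping. */
    static int example_probe(struct pci_dev *pdev)
    {
        if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)) == 0)
            dev_info(&pdev->dev, "using 64-bit DMA\n");
        else if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)) == 0)
            dev_info(&pdev->dev, "falling back to 32-bit DMA\n");
        else
            return -EIO;

        return 0;
    }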

64-bit DMA benefits

With a wider DMA window, the entire memory address space can be mapped, so there is a direct mapping between the I/O address space and the memory address space. This direct mapping eliminates the operating system usage involved in dynamically mapping and unmapping memory for each I/O operation. Moreover, 64-bit DMA support allows RDMA-aware applications and libraries to transfer data directly to any location in a remote system's memory, subject to appropriate access restrictions. This capability, in turn, results in a complete offload of all protocol processing to the RDMA adapter. The result is lower-latency, higher-bandwidth communication and a significant reduction in the number of CPU cycles that are needed to move data from one system to another.
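
To make the RDMA path concrete, the following user-space C sketch registers a buffer with an RDMA adapter through the libibverbs API so that a remote peer can write into it directly. Most error handling is omitted, and connection setup and the actual RDMA write operation are outside the scope of this example.

    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);
        if (list == NULL || num == 0)
            return 1;                        /* no RDMA adapter present */

        struct ibv_context *ctx = ibv_open_device(list[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        size_t len = 1 << 20;                /* 1 MB application buffer */
        void *buf = malloc(len);

        /* Registration pins the buffer and creates the DMA mappings the
         * adapter needs; the returned keys identify the memory region. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (mr == NULL)
            return 1;

        /* A remote peer that knows rkey and the buffer address can now
         * issue RDMA writes into 'buf' without involving this CPU. */
        printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(list);
        free(buf);
        return 0;
    }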