Optimizing Hardware Platform Performance: A Case Study

Before rolling out a new hardware platform to the IBM Cloud, we thoroughly evaluate its performance.

As you can see below, we stress systems to their limits, uncovering performance bottlenecks and addressing them before they impact your business.

This case study describes an issue that we discovered and resolved on certain multi-socket systems using 2nd-Gen Intel® Xeon® Scalable Processors (formerly code-named “Cascade Lake”).

System topology

In multi-socket systems, contemporary Intel processors use Ultra Path Interconnect (UPI) links to logically tie sockets together. The system in this study has four sockets, using processors with three UPI links each, allowing a fully connected topology:

Each processor (P0 – P3) has locally attached memory (M) and a direct UPI link to every other processor. Processors use UPI links to remotely access the memory attached to other processors.
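As a rough illustration of why three UPI links per processor are enough here, the sketch below enumerates the socket pairs in a fully connected four-socket system and counts how many links end at each socket. The function names are our own and purely illustrative; nothing in it queries real hardware.

```python
from itertools import combinations


def fully_connected_links(sockets):
    """Return every socket pair that needs a direct link in a fully connected topology."""
    return list(combinations(sockets, 2))


def links_per_socket(sockets, links):
    """Count how many links terminate at each socket."""
    counts = {s: 0 for s in sockets}
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return counts


sockets = ["P0", "P1", "P2", "P3"]
links = fully_connected_links(sockets)
per_socket = links_per_socket(sockets, links)

print(len(links))    # 6 point-to-point links in total
print(per_socket)    # every socket terminates 3 links
```

With four sockets, each processor needs one link to each of the other three, which matches the three UPI links available per processor in this system.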

Memory throughput collapse

We use a variety of workloads to characterize platform performance. This study focuses on the STREAM Triad benchmark, which we use to measure memory throughput in a variety of configurations that assess both local and remote memory performance. The following configuration places maximum stress on remote memory serving:

Memory attached to P3 is remotely accessed by three other processors simultaneously. In its memory serving role, P3 has no active cores. We vary the number of active cores on the other processors, starting with one on each, then two, and so on.
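For readers unfamiliar with the benchmark, here is a minimal sketch of the Triad kernel, the byte accounting STREAM conventionally uses for it (two arrays read and one written per element), and the load sweep described above. This is not our measurement framework: the real benchmark is compiled code, pure Python would not produce meaningful bandwidth numbers, and `sweep_plan` with its `cores_per_socket` parameter is a hypothetical helper introduced only for illustration.

```python
def triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, element_size=8):
    """Bytes counted per Triad pass: two arrays read and one written."""
    return 3 * element_size * n


def throughput_gb_per_s(n, seconds, element_size=8):
    """Throughput in GB/s (10^9 bytes per second) for one timed Triad pass."""
    return triad_bytes(n, element_size) / seconds / 1e9


def sweep_plan(cores_per_socket, clients=("P0", "P1", "P2"), server="P3"):
    """Yield one placement per load step: k active cores per client, none on the server."""
    for k in range(1, cores_per_socket + 1):
        plan = {p: k for p in clients}
        plan[server] = 0
        yield plan


a = [0.0] * 4
triad(a, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], 3.0)
print(a)                        # [31.0, 62.0, 93.0, 124.0]
print(triad_bytes(1_000_000))   # 24000000

# cores_per_socket=2 is an arbitrary value chosen to keep the output short.
for plan in sweep_plan(cores_per_socket=2):
    print(plan)
```

In an actual run, each step of the sweep would pin the benchmark threads to the client sockets and bind all of their memory to the serving socket, so that every access crosses a UPI link.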

This is a plot of measured throughput versus increasing load:

Things look normal on the left, where throughput rises with more active cores, reaching an expected peak consistent with the limits of the system. However, as the load continues to increase, throughput collapses, falling to around 6% of the peak.
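One simple way to quantify a collapse like this is to locate the peak of the sweep and express the throughput at the highest load as a fraction of that peak. The sketch below does so on made-up numbers, shaped only to mimic the curve described above; they are not our measurements, and `collapse_report` is a hypothetical helper.

```python
def collapse_report(throughputs):
    """Return (index of the peak, final throughput as a fraction of the peak)."""
    peak_index = max(range(len(throughputs)), key=lambda i: throughputs[i])
    peak = throughputs[peak_index]
    return peak_index, throughputs[-1] / peak


# Synthetic sweep in arbitrary units, one entry per load step.
# These values are illustrative only and were NOT measured.
sweep = [20.0, 40.0, 60.0, 80.0, 100.0, 70.0, 30.0, 6.0]

peak_index, final_fraction = collapse_report(sweep)
print(peak_index)       # 4
print(final_fraction)   # 0.06
```

A healthy system would report a final fraction close to 1.0, meaning throughput stays near its peak as load keeps rising.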

Potential impacts

Intense remote memory traffic like this is typically avoided by NUMA (Non-Uniform Memory Access) aware applications, but some applications are not suited to such tuning and others may not be configured properly. Further, while we studied this throughput collapse in steady state, we recognize that transient conditions can trigger this collapse temporarily.

A dramatic drop in memory throughput can obviously affect application performance, but even temporary occurrences of this issue can easily produce spikes in application response latency.

Identifying the root cause

In our first conversation with Intel about this issue, they suggested that we vary a BIOS parameter called Local/Remote Threshold (part of UPI Configuration under North Bridge) to see if that might help. The parameter has various settings including High, Medium, and Low. In this case, changing it made no difference.

We then worked with Intel to help them reproduce the issue, extracting the essential elements from our measurement framework. Further work by the Intel team revealed that the issue was tied to how Local/Remote Threshold was mapped internally for our hardware topology.

The Local/Remote Threshold setting configures undocumented registers in the processor. Intel showed us how to inspect these registers and to modify one of them, which we will call register R here. Adjusting R allowed us to solve the memory collapse issue, as we show below.

Our earlier attempts to alter the Local/Remote Threshold had no effect because all available choices on the BIOS menu mapped to the same value of register R. This mapping from BIOS menu choice to R depends on the hardware topology. For example, the mappings for eight-socket systems and those for four-socket “ring” topologies (with two UPI links per processor) both vary R with menu choice. In this case, the BIOS vendor had correctly implemented the recommended mapping for our four-socket fully connected topology.
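The effect can be pictured as a lookup table keyed by topology. In the sketch below, the table layout, the topology names, and the values `r0` through `r3` are all placeholders we invented for illustration; the real register and its values are undocumented, and the menu may offer more choices than the three shown.

```python
# Placeholder mapping from BIOS menu choice to a value of register R.
# Every value here is hypothetical; only the pattern (identical vs. varying) matters.
MENU_TO_R = {
    "four_socket_fully_connected": {"High": "r0", "Medium": "r0", "Low": "r0"},
    "four_socket_ring": {"High": "r1", "Medium": "r2", "Low": "r3"},
}


def menu_has_effect(topology):
    """True when at least two menu choices map to different values of R."""
    return len(set(MENU_TO_R[topology].values())) > 1


print(menu_has_effect("four_socket_fully_connected"))  # False
print(menu_has_effect("four_socket_ring"))             # True
```

When every menu choice resolves to the same value, changing the setting cannot alter behavior, which is what we observed on the fully connected system.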

New choices for register R

This shows memory throughput in the most challenging case of the problem workload, where all cores are active on each of the three processors running STREAM:

We chose register R values X, Y, and Z as possible alternatives to the original value.

Here is the problem workload with the original value of register R and the three new candidates across a range of active cores:

Clearly, all three new values address the original problem. After reaching its peak, throughput is now nearly flat in the face of increasing load.

Zooming in more closely, we see small differences in throughput between the new choices:

Note that the magnification exaggerates the drop from the peak: even value Z sustains more than 95% of the best memory throughput.

Validating changes to register R

We then circled back to our full suite of memory throughput measurements to see how they were affected by the new candidate values for register R. Each was examined at the level of detail discussed above. For brevity, we present the geometric mean of all these workloads, which gives a good sense of the overall impact of varying R:

Value Y yields a balanced improvement, but each new choice significantly outperforms the original.
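For readers who want to reproduce this kind of summary, the sketch below computes a geometric mean per candidate value using Python's standard library. The geometric mean is a common choice for combining workloads with very different magnitudes, because no single high-throughput case dominates the result. The workload numbers are invented for illustration and are not our measurements; `summarize` is a hypothetical helper.

```python
from statistics import geometric_mean  # available in Python 3.8 and later


def summarize(results):
    """Map each register R setting to the geometric mean of its workload throughputs."""
    return {setting: geometric_mean(values) for setting, values in results.items()}


# Synthetic throughputs in arbitrary units, one entry per workload.
# These values are illustrative only and were NOT measured.
results = {
    "original": [100.0, 4.0],
    "Y": [90.0, 40.0],
}

summary = summarize(results)
for setting, value in summary.items():
    print(setting, round(value, 3))
```

In this made-up example, the candidate gives up a little on one workload but gains a great deal on the other, so its geometric mean comes out about three times higher than the original's.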

Delivering the solution

Our BIOS vendor provided a new version that enables the preferred choices for register R in systems with four fully connected sockets.

If you have read this far, we hope you have a better understanding of the IBM Cloud approach to performance.


Kean Kuiper

Senior Engineer
