October 16, 2020 By Kean Kuiper

Before rolling out a new hardware platform to the IBM Cloud, we thoroughly evaluate its performance.

As you can see below, we stress systems to their limits, uncovering performance bottlenecks and addressing them before they impact your business.

This post describes an issue that we discovered and resolved on certain multi-socket systems using 2nd-Gen Intel® Xeon® Scalable Processors (formerly code-named “Cascade Lake”).

System topology

In multi-socket systems, contemporary Intel processors use Ultra Path Interconnect (UPI) links to logically tie sockets together. The system in this study has four sockets, and each processor has three UPI links, which is exactly enough for a fully connected topology.

Each processor (P0 – P3) has locally attached memory (M) and a direct UPI link to every other processor. Processors use UPI links to remotely access the memory attached to other processors.
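As a quick check on the link arithmetic, here is a minimal sketch that enumerates the links of a fully connected topology and counts how many each socket needs. The function names are hypothetical and exist only for this illustration; nothing here queries real hardware.

```python
from itertools import combinations

def fully_connected_links(sockets):
    """Return one (a, b) pair for every link in a fully connected topology."""
    return list(combinations(range(sockets), 2))

def links_per_socket(sockets, links):
    """Count how many links terminate at each socket."""
    counts = {socket: 0 for socket in range(sockets)}
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return counts

links = fully_connected_links(4)
print(len(links))                   # 6 links in total
print(links_per_socket(4, links))   # {0: 3, 1: 3, 2: 3, 3: 3}
```

With four sockets, every processor ends up with three links, which matches the three UPI links per processor described above.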

Memory throughput collapse

We use a variety of workloads to characterize platform performance. This study focuses on the STREAM Triad benchmark, which we run in a variety of configurations to measure both local and remote memory throughput. The configuration examined here is the one that places maximum stress on remote memory serving.
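Before looking at that configuration in detail, a brief reminder of what the Triad kernel does: it computes a[i] = b[i] + scalar * c[i] over large arrays, and STREAM counts three 8-byte words of traffic per element (two reads and one write). The sketch below is a pure-Python illustration of the kernel and its byte accounting only. It is interpreter-bound, so the rate it prints says nothing about memory bandwidth, and the function names and array size are assumptions made for the example.

```python
import time
from array import array

def triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]

def triad_bytes(n, word_bytes=8):
    """Bytes STREAM counts for one Triad pass: two arrays read, one written."""
    return 3 * n * word_bytes

n = 100_000  # tiny compared with a real run, which must exceed cache size
a = array("d", [0.0]) * n
b = array("d", [1.0]) * n
c = array("d", [2.0]) * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

# Interpreter overhead dominates here; this is not a bandwidth measurement.
print(f"{triad_bytes(n) / elapsed / 1e6:.1f} MB/s")
```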

Memory attached to P3 is remotely accessed by three other processors simultaneously. In its memory serving role, P3 has no active cores. We vary the number of active cores on the other processors, starting with one on each, then two, and so on.
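A sweep like this could be driven with `numactl`, which can bind a process to specific CPUs (`--physcpubind`) and to a specific memory node (`--membind`). The sketch below only builds the command lines; it is not the measurement framework used in this study. It assumes NUMA nodes 0 to 3 correspond to P0 to P3, an OpenMP build of STREAM at the hypothetical path `./stream`, and a made-up core numbering. It also leaves per-thread pinning within the allowed CPU set to the runtime.

```python
def sweep_commands(cores_by_node, memory_node, binary="./stream"):
    """Build one command per load step: k active cores on every compute node."""
    compute_nodes = sorted(n for n in cores_by_node if n != memory_node)
    max_cores = min(len(cores_by_node[n]) for n in compute_nodes)
    commands = []
    for k in range(1, max_cores + 1):
        # Take the first k cores of each compute node; none from the memory node.
        cpus = [c for n in compute_nodes for c in cores_by_node[n][:k]]
        cpu_list = ",".join(str(c) for c in cpus)
        commands.append(
            f"OMP_NUM_THREADS={len(cpus)} "
            f"numactl --physcpubind={cpu_list} --membind={memory_node} {binary}"
        )
    return commands

# Hypothetical layout: 4 nodes with 4 cores each, numbered consecutively.
layout = {node: list(range(node * 4, node * 4 + 4)) for node in range(4)}
for command in sweep_commands(layout, memory_node=3):
    print(command)
```

On a real system, the core-to-node layout would come from `numactl --hardware` or `lscpu` rather than being hard-coded.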

This is a plot of measured throughput versus increasing load:

Things look normal on the left, where throughput rises with more active cores, reaching an expected peak consistent with the limits of the system. However, as the load continues to increase, throughput collapses, falling to around 6% of the peak.
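One simple way to quantify a curve of this shape is to express each point as a fraction of the peak and flag the first point after the peak that drops below a chosen threshold. The sketch below does that on placeholder numbers that are invented for illustration and are not the measurements from this study; the function names and the 50% threshold are likewise assumptions.

```python
def fraction_of_peak(throughput):
    """Scale each sample so that the highest one equals 1.0."""
    peak = max(throughput)
    return [value / peak for value in throughput]

def collapse_onset(throughput, threshold=0.5):
    """Index of the first sample after the peak below threshold * peak, or None."""
    fractions = fraction_of_peak(throughput)
    peak_index = fractions.index(1.0)
    for index in range(peak_index + 1, len(fractions)):
        if fractions[index] < threshold:
            return index
    return None

# Placeholder values in arbitrary units, one per load step (NOT measured data).
samples = [10.0, 20.0, 30.0, 40.0, 38.0, 12.0, 2.4]
print(collapse_onset(samples))                  # 5
print(f"{fraction_of_peak(samples)[-1]:.0%}")   # 6%
```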

Potential impacts

Intense remote memory traffic like this is typically avoided by NUMA (Non-Uniform Memory Access) aware applications, but some applications are not suited to such tuning and others may not be configured properly. Further, while we studied this throughput collapse in steady state, we recognize that transient conditions can trigger this collapse temporarily.

A sustained drop in memory throughput of this size obviously hurts application performance, and even temporary occurrences can easily produce spikes in application response latency.

Identifying the root cause

In our first conversation with Intel about this issue, they suggested that we vary a BIOS parameter called Local/Remote Threshold (part of UPI Configuration under North Bridge) to see if that might help. The parameter has various settings including High, Medium, and Low. In this case, changing it made no difference.

We then worked with Intel to help them reproduce the issue, extracting the essential elements from our measurement framework. Further work by the Intel team revealed that the issue was tied to how Local/Remote Threshold was mapped internally for our hardware topology.

The Local/Remote Threshold setting configures undocumented registers in the processor. Intel showed us how to inspect these registers and to modify one of them, which we will call register R here. Adjusting R allowed us to solve the memory collapse issue, as we show below.

Our earlier attempts to alter the Local/Remote Threshold had no effect because every choice on the BIOS menu mapped to the same value of register R. The mapping from BIOS menu choice to R depends on the hardware topology. For example, the mappings for eight-socket systems and for four-socket “ring” topologies (with two UPI links per processor) both vary R with the menu choice. In our case, the BIOS vendor had correctly implemented the recommended mapping for a four-socket fully connected topology; that mapping simply did not vary R.

New choices for register R

This shows memory throughput in the most challenging case of the problem workload, where all cores are active on each of the three processors running STREAM:

We chose register R values X, Y, and Z as possible alternatives to the original value.

Here is the problem workload with the original value of register R and the three new candidates across a range of active cores:

Clearly, all three new values address the original problem. After reaching its peak, throughput is now nearly flat in the face of increasing load.

Zooming in more closely, we see small differences in throughput between the new choices:

Note that the magnification exaggerates the drop from the peak: even value Z sustains more than 95% of the best memory throughput.

Validating changes to register R

We then circled back to our full suite of memory throughput measurements to see how they were affected by the new candidate values for register R. Each was examined at the level of detail discussed above. For brevity, we present the geometric mean of all these workloads, which gives a good sense of the overall impact of varying R:

Value Y yields a balanced improvement, but each new choice significantly outperforms the original.
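For readers who want to build a similar summary, here is a minimal sketch of a geometric mean over a suite of workloads. It assumes each workload's throughput is first divided by its throughput under the original setting, a normalization step this article does not spell out. The workload names and numbers are placeholders, not results from this study.

```python
from statistics import geometric_mean

def relative_scores(candidate, baseline):
    """Per-workload throughput relative to the baseline setting."""
    return {name: candidate[name] / baseline[name] for name in baseline}

def overall_score(candidate, baseline):
    """Geometric mean of the per-workload ratios (1.0 means no change)."""
    return geometric_mean(list(relative_scores(candidate, baseline).values()))

# Placeholder throughputs in arbitrary units (NOT measured data).
baseline = {"local": 100.0, "remote": 50.0}
candidate = {"local": 100.0, "remote": 200.0}
print(overall_score(candidate, baseline))
```

The geometric mean suits ratios like these because a workload that doubles and another that halves cancel out to 1.0, which an arithmetic mean would not report.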

Delivering the solution

Our BIOS vendor provided a new version that enables the preferred choices for register R in systems with four fully connected sockets.

If you have read this far, we hope you have a better understanding of the IBM Cloud approach to performance.
