IBM Support

POWER CPU Memory Affinity 1 - local, near and far RAM Architecture

How To


Summary

Power Systems gain their massive performance from a lot of technology; this series details many of those technologies.

Objective


Originally written in 2012 for the DeveloperWorks AIXpert Blog for POWER7, and updated in 2019 for POWER8 and POWER9.

Steps

The largest POWER6 based machine was the Power 595, with 64 physical CPUs (cores) across eight CPU books - each CPU book having four POWER6 chips and so 8 CPUs (POWER6 is a dual-CPU chip design). With POWER7 that stepped up to 256 CPUs across the same eight CPU books: each book still has four chips, but with 8 CPUs each, so that is 32 CPUs per book.
  •  I might be stating the obvious, but access to memory directly attached to the Power chip on which your process is running is slightly faster than memory access via a nearby Power chip, which in turn is faster than access to memory attached to a Power chip that is further away - for example, in a different CPU book in the Power 595 or 795, or a different CEC drawer if you are using a Power 570, 770 or 780 model.
On the POWER6 595, a 32-way virtual machine (LPAR) would require the use of four CPU books, so you are using CPU book to CPU book memory access a lot of the time. However, on the POWER7 795, all 32 CPUs can be in a single CPU book - thus the memory is closer and a bit faster. And that is not counting the fact that POWER7 memory is faster as well.
Update:
POWER8 and POWER9 based servers have up to 12 CPU cores per chip and SMT=8. They range from a small handful of CPUs up to 192, and up to 32 TB of memory. The memory transfer bandwidth between the chips is far higher at the same time as the memory sizes are much larger, so the speed difference between local and far memory has been greatly reduced - thus supporting large LPARs (virtual machines).
This introduces the different distances to memory. Before I go further, I had better state that this distance (and time) difference is not massive. We are not talking "the good, the bad and the ugly" of memory access here, but more like "the good, the excellent and the blisteringly fast" memory access. In addition to the CPU performance jump, each generation has had to add similar jumps in memory performance for the memory to keep up.
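If you want to put a number on "slightly faster", a pointer-chasing loop is the usual way to measure average load latency. Below is a minimal, hypothetical C sketch (plain POSIX, nothing AIX- or POWER-specific, and the array and iteration sizes are arbitrary): run it in an LPAR known to fit in one chip and again in one that spans drawers, or pin it with your platform's affinity tools, and compare the reported nanoseconds.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_ELEMENTS (32 * 1024 * 1024)   /* 32M pointers = 256 MB on a 64-bit system */
#define N_ACCESSES (50 * 1000 * 1000)

int main(void)
{
    size_t *chain = malloc(N_ELEMENTS * sizeof(size_t));
    if (chain == NULL) {
        perror("malloc");
        return 1;
    }

    /* Build one big random cycle (Sattolo's algorithm) so the hardware
     * prefetcher cannot guess the next address. */
    for (size_t i = 0; i < N_ELEMENTS; i++)
        chain[i] = i;
    srand(42);
    for (size_t i = N_ELEMENTS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;           /* 0 <= j < i */
        size_t tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }

    /* Chase the pointers; every load depends on the previous one. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (long n = 0; n < N_ACCESSES; n++)
        idx = chain[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.1f ns (sink=%zu)\n", ns / N_ACCESSES, idx);

    free(chain);
    return 0;
}

Built with optimisation on (for example cc -O2), the loop is a single chain of dependent loads, so the printed figure approximates the average memory latency for whatever mix of local, near and far pages the LPAR happens to have been given.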
Definitions of memory access:
  • Local
    • This means your program and memory are on the same Power Chip.
    • This is for memory that is directly attached to the Power Chip's memory controller.
  • Near
    • Your memory is on the next door Power Chip and joined by copper tracks in a Multi-Chip Module.
    • This uses the Power Chip to Power Chip Intra-node communication paths.
    • Each Power Chip has a direct path to every other Power Chip - this is called one hop.
    • The Power Chip running your program asks the nearby Power Chip to transfer the data, cache to cache.
  • Far
    • Your memory is on a Power Chip in the next "CPU and Memory" drawer (also called a CEC drawer, or a CPU book on the 595), connected via SMP cable or the 595 system back-plane.
    • This uses the Node to Node Inter-node communication paths.
    • Each node has a direct path to every other node. In POWER8 and POWER9 each chip also has a path to the matching POWER Chip in the other drawers, making this a 2-hop design with multiple route options to avoid bottlenecks.
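To see why the difference is "excellent versus blisteringly fast" rather than "good versus ugly", it helps to look at the average latency an LPAR sees for a given mix of Local, Near and Far accesses. The figures in this sketch are invented placeholders purely to show the arithmetic - they are not measured POWER numbers - so substitute values from your own system (for example from the pointer-chase sketch above).

#include <stdio.h>

/* Illustrative only: placeholder latencies for each class of access. */
#define LAT_LOCAL 100.0   /* ns - memory on the same chip            */
#define LAT_NEAR  130.0   /* ns - one hop to a chip in the same node */
#define LAT_FAR   180.0   /* ns - hop(s) to another drawer or book   */

struct mix {
    const char *description;
    double local, near, far;   /* fractions of all memory accesses */
};

int main(void)
{
    const struct mix mixes[] = {
        { "LPAR fits in one chip (all Local)      ", 1.00, 0.00, 0.00 },
        { "LPAR fits in one drawer (Local + Near) ", 0.80, 0.20, 0.00 },
        { "LPAR spread over two drawers           ", 0.50, 0.25, 0.25 },
    };

    for (size_t i = 0; i < sizeof(mixes) / sizeof(mixes[0]); i++) {
        double avg = mixes[i].local * LAT_LOCAL
                   + mixes[i].near  * LAT_NEAR
                   + mixes[i].far   * LAT_FAR;
        printf("%s -> average %.0f ns per access\n", mixes[i].description, avg);
    }
    return 0;
}

With these placeholder numbers, even the worst mix is only about a quarter slower on average; a cache-friendly workload would notice it less and a random-access workload somewhat more.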
So how does that map onto the POWER7, POWER8 and POWER9 servers:
Model         | No. of POWER chips | Local     | Near                           | Far
Power 710     | 1                  | Same Chip | -                              | -
Power 730     | 2                  | Same Chip | -                              | Other Chip
Power 720     | 1                  | Same Chip | -                              | -
Power 740     | 2                  | Same Chip | -                              | Other Chip
Power 750/755 | 4                  | Same Chip | -                              | Other Chip
Power 770/780 | 8                  | Same Chip | Other Chip but same CEC Drawer | Different CEC Drawer
Power 795     | 32                 | Same Chip | Other Chip but same CPU book   | Different CPU Book
S822          | 2                  | Same Chip | -                              | Other Chip
S824          | 2                  | Same Chip | -                              | Other Chip
E850          | 4                  | Same Chip | -                              | Other Chip
E870/E880     | 16                 | Same Chip | Other Chip but same CEC Drawer | Different CEC Drawer
S922          | 2                  | Same Chip | -                              | Other Chip
S924          | 2                  | Same Chip | -                              | Other Chip
E950          | 4                  | Same Chip | -                              | Other Chip
E980          | 16                 | Same Chip | Other Chip but same CEC Drawer | Different CEC Drawer
 Note: the above chart was updated in 2019
The Power 710 to 740 use either a single chip (by definition using Local memory) or two POWER7 processor chips using Local and Far memory. You might think that is wrong - why not Local and Near? Well, the POWER Chips have two buses for memory transfers, one for Near and one for Far; in a two to four chip server only one bus is needed. The designers of these smaller servers decided the Far bus was easier to implement, reduces components and has the necessary bandwidth.
The Power 750 - if you open the top of the machine it looks as if the CPUs are on different mini "CPU books", because this model allows you to insert one to four CPU and memory "cards", but they are in fact joined together by simple and fast copper tracks - hence the large number of connectors. So the memory access is Local or Far.
The same goes for the E850 and E950, but the POWER Chips are all directly on the system planar (motherboard).
From the above table you can see that it is only the Power 770, Power 780 and Power 795 that make use of Far memory access across the inter-node (Central Electronic Complex (CEC)) communications. Assuming the regular 8 CPUs (cores) per POWER7 chip, this means on the Power 770/780 memory access is Local and Near if your virtual machine is up to 16 CPUs and placed in a single CEC, and on the Power 795 the memory can be Local or Near for up to 32 CPUs. Let me remind you that Far memory is not an issue - it is designed into the architecture to be fast and lets these large machines give us excellent performance on very large virtual machines, while Local/Near can give you a speed boost for smaller virtual machines that don't span nodes. Plus the 8-core design of the POWER7 processor lets you avoid this even more than on POWER6.
The same goes for the E870, E880 and E980. Within a single CPU and Memory Drawer the server uses Local memory access (of course) and Near memory access between the POWER Chips in the drawer - four POWER8 or POWER9 chips with up to 48 CPU cores per drawer. If you have 2, 3 or 4 CPU Drawers, memory transfers between drawers use the Far bus.
So how does this affect me?
  • Virtual Machines (LPARs) with fewer CPU cores than a single POWER Chip should be placed on a single POWER Chip, and they get an effective speed bump from all-local CPU cores and all-local memory - and a further one from sharing the Level 3 cache.
  • On larger models, Virtual Machines that are smaller than a CPU Drawer and get placed on a single node get an effective speed bump by avoiding Far memory (see the sizing sketch after this list):
    • For the Power 770/780 this is 16 CPUs, or 12 if you have the 6-core POWER7 chips.
    • For the Power 795 this is 32 CPUs, or 24 if you have the 6-core POWER7 chips.
    • For the POWER8 E870/E880 and POWER9 E980, a drawer has 48 CPU cores with 12-core POWER chips.
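As a rough aid, the check below applies the chip and drawer core counts quoted above to decide whether a given LPAR size stays Local, stays within one drawer (Local and Near) or will need Far access. It is a sketch only - the counts are the examples from the text (8-core POWER7 chips with two per 770/780 CEC drawer, 12-core POWER8/POWER9 chips with four per drawer) and should be adjusted to your machine's actual configuration.

#include <stdio.h>

/* Classify an LPAR by where its CPUs (and therefore most of its memory)
 * can be placed.  placement_cpus is the dedicated CPU count or, for a
 * shared-processor LPAR, the Virtual Processor count. */
static void check(const char *lpar, int placement_cpus,
                  int cores_per_chip, int chips_per_drawer)
{
    int cores_per_drawer = cores_per_chip * chips_per_drawer;

    if (placement_cpus <= cores_per_chip)
        printf("%-24s fits in one chip   -> Local memory only\n", lpar);
    else if (placement_cpus <= cores_per_drawer)
        printf("%-24s fits in one drawer -> Local and Near memory\n", lpar);
    else
        printf("%-24s spans drawers      -> some Far memory access\n", lpar);
}

int main(void)
{
    /* Power 770/780: 8-core POWER7 chips, 2 chips per CEC drawer (16 cores). */
    check("770/780 LPAR, 12 CPUs:", 12, 8, 2);
    check("770/780 LPAR, 24 CPUs:", 24, 8, 2);

    /* E870/E880/E980: 12-core chips, 4 chips per drawer (48 cores). */
    check("E980 LPAR, 40 CPUs:",    40, 12, 4);
    check("E980 LPAR, 60 CPUs:",    60, 12, 4);
    return 0;
}

The same arithmetic applies to the Power 795 CPU books (32 or 24 cores per book) - only the per-chip and per-book counts change.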

Please note: Going slightly over the above virtual machine sizes, i.e. not having the virtual machines placed neatly, means more memory is accessed from further away. That is not a disaster in itself, but it is not making the best use of the architecture. The effect depends on how well the processes and their memory can be closely coupled - this is called CPU and Memory Affinity. AIX has a few clever techniques to optimise this automatically; more about this in a later blog.
 
Warning:
  • For Dedicated CPU virtual machines the size is obviously the number of CPUs.
  • For Shared CPU virtual machines the size is the Virtual Processor (VP) number and not, as many people might assume, the Entitlement (E). This makes the choice of VP much more important on large machines than was previously thought (see the sketch below). We will look at this more in part 2.
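To make the warning concrete, here is a tiny sketch of that rule with made-up LPAR figures: for placement purposes a shared-processor LPAR counts as its VP number, so an LPAR with an entitlement of 4.0 but 20 Virtual Processors has to be placed as a 20-CPU machine.

#include <stdio.h>

struct lpar {
    const char *name;
    int dedicated;        /* 1 = dedicated CPUs, 0 = shared processor pool */
    int cpus;             /* dedicated CPU count (dedicated LPARs)         */
    double entitlement;   /* processing units    (shared LPARs)            */
    int virtual_procs;    /* Virtual Processors  (shared LPARs)            */
};

/* The number of CPUs the placement has to find room for. */
static int placement_size(const struct lpar *l)
{
    return l->dedicated ? l->cpus : l->virtual_procs;
}

int main(void)
{
    const struct lpar a = { "dedicated, 8 CPUs",    1, 8, 0.0,  0 };
    const struct lpar b = { "shared, E=4.0, VP=20", 0, 0, 4.0, 20 };

    printf("%-22s -> placement size %d\n", a.name, placement_size(&a));
    printf("%-22s -> placement size %d\n", b.name, placement_size(&b)); /* 20, not 4 */
    return 0;
}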

Additional Information


Other places to find content from Nigel Griffiths IBM (retired)

Document Location

Worldwide

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SWG10","label":"AIX"},"Component":"","Platform":[{"code":"PF002","label":"AIX"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB08","label":"Cognitive Systems"}},{"Business Unit":{"code":"BU054","label":"Systems w\/TPS"},"Product":{"code":"HW1W1","label":"Power -\u003EPowerLinux"},"Component":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"","label":""}},{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SWG60","label":"IBM i"},"Component":"","Platform":[{"code":"PF012","label":"IBM i"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB57","label":"Power"}}]

Document Information

Modified date:
13 June 2023

UID

ibm11125999