When I wrote about the PS4’s hardware in the April edition of NAG Magazine, I mentioned that the APU inside it would be sold to the public later on, probably for the low-power desktop market and for gamers who like to feel the PS4 connection. The PS4’s AMD APU is unusual in that it has a “Unified Memory Architecture”, and many bloggers theorised about how on earth this would work. Well, yesterday AMD lifted the NDA on some details about Kaveri for journalists (not me, sadly) and also gave clues about how the PS4’s unified memory works.
The roadmap above shows AMD’s three main goals this year, namely: 1) create a unified address space for the CPU and GPU; 2) have the GPU use pageable memory via CPU pointers; 3) establish a fully coherent memory space between the GPU and CPU.
Kaveri is on its way
Now Kaveri is part of the new Fusion APU family that is expected to debut at Computex 2013 (AMD still hasn’t actually confirmed the date) and it changes a lot of things. It’s a mobile SoC (system-on-chip) based on AMD’s Steamroller architecture, with four CPU cores and AMD Radeon HD8000 series graphics. Kaveri will be a pretty big disruptor for Intel’s integrated graphics, because the VLIW4-based HD7660D inside the AMD A10-5800K is already much faster than HD4000 and still faster than Intel’s upcoming HD4600.
The GCN-based graphics cores will also appear in the smaller chips for other markets, with Kabini servicing the value ultrathin market and Temash bringing GCN goodness to cheap tablet designs. All three are designed to knock Intel off its high horse and throw a spanner in the works for products based on ARM chip designs, like Nvidia’s Tegra. Kaveri is also destined for the desktop, and note the label AMD has added next to it: “HSA features arrive on Kaveri.” What that meant was a mystery to bloggers, fans and journalists at CES 2013. Today, I think we can figure it out.
The mystery unravels…
It’s called “hUMA”, or Heterogeneous Uniform Memory Access. A little context is required here: in 2003 AMD released its new processor, the Athlon 64. The chip changed a number of things for both the company and the industry, and one feature in particular was so far ahead of its time that it took Intel over five years to copy it – the integrated memory controller (IMC). AMD’s IMC took the memory controller off the motherboard, where it relied on third-party chip-makers to provide it, and put it straight on the processor. That gave the company more control over how its chips behaved and worked with RAM, culminating in the Athlon 64 crushing Intel’s Pentium 4 at every turn.
UMA, or Uniform Memory Access, was also very different from previous memory controller architectures. For the first time on a commercially available processor (we’re not counting the very strange Transmeta Crusoe here) you could have multiple cores share the same memory space. Memory pointers, which tell a processor where to look for information in paged or non-paged memory and which developers reference by name for easy access, were also shared. That meant data and instructions for one core were simultaneously available to and readable by the other cores.
Later on, we also got “NUMA”, or Non-Uniform Memory Access. This allowed a GPU built into the motherboard to share the same memory as the processor, with the caveat that it would reserve address space for itself. Intel, Nvidia and AMD used and still use NUMA designs today in many products. Intel expanded on this with the HD graphics family, letting it increase or decrease its memory allocation dynamically; if you look at the specs for any PC with HD3000 graphics or higher, they say the GPU can have “up to 1750MB of system memory.”
Introducing hUMA, the new standard
In 2013, AMD brings about a new design and appears to have leapfrogged Intel again with hUMA: Heterogeneous Uniform Memory Access. This is a technology that only comes with the desktop version of Kaveri, and there are various reasons why it’s limited to that, all of which end in the words “PlayStation 4.” Kaveri for the desktop is expected to be the APU that AMD says it is intent on selling to consumers, the same APU design that would be seen in the PlayStation 4.
hUMA now allocates the GPU’s memory in system memory but shares its contents and pointer information with the CPU cores. The GPU now has access to potentially huge amounts of memory, giving it more space to work with for whatever calculations it’s given. The memory the GPU uses isn’t reserved, and the amount in use can be adjusted as and when needed. This solves a big problem with AMD’s NUMA implementations on Llano and Trinity, where the integrated Radeon graphics couldn’t address more memory than it had been allocated, putting a cap on its theoretical potential. It’s also worlds away from Intel’s current implementation with Intel HD.
The console connection
Those of you who have read an earlier piece I wrote, titled “More next-gen console speculation hits the net”, will see another pattern emerging – this time from the Xbox 360. Microsoft’s Xbox 360 had a critical advantage over Sony’s PS3 in hardware. The PS3 was massively powerful for its time but it was hamstrung – the eight-core IBM Cell processor and the Nvidia RSX Reality Synthesizer were each handed 256MB of memory. No more, no less. That’s the reason why in today’s games you’ll see that the PS3 has some “texture optimisations”, apparently a developer’s nicer way of saying “low-resolution crappy textures.”
But the 360 was different. It had a triple-core IBM PowerPC processor called “Xenon” and a graphics core made by AMD based on a combination of technologies from the Radeon X1900XT and the HD2900XT graphics cards. Like the PS3, it also had 512MB of RAM but this wasn’t split – it was a dynamically addressable, shared memory space between the CPU and GPU.
And it was ATi that brought the technology to the table. Ding ding ding!
Once again, we see evidence of AMD using the Xbox 360 as a development mule to figure out their own foray into the APU space. It was the first example of a commercially available APU in 2010, when the “Trinity” revision of the console put both the CPU and GPU through the 45nm process and integrated other features on the board that didn’t need their own chips, like the Northbridge logic. “Corona”, which is the Xbox 360 Slim, finally integrated the Southbridge logic into the chip, making it a complete system-on-chip.
The sheer scale at which these chips are mass-produced has given AMD enormous amounts of data to use when developing, selling and supporting their own version of the APU for the x86 desktop. It’s worth noting that AMD originally announced Llano in early 2008, but shipped it only in 2011 after numerous delays, financial setbacks and mountains of documents in their court case against Intel. Had Llano launched in 2008, we would have been past all this a long time ago.
How hUMA brings it all together, finally
The diagram above illustrates how pointers and data sharing currently work in NUMA implementations. In the diagram, the CPU and the GPU have their own memory and their own pageable addresses. The CPU feeds information to the GPU, the GPU runs through it and gives it back, and the CPU then takes the results and carries on, sometimes repeating the process to complete whatever it is you’re doing. Note that the GPU cannot address or access any memory the CPU keeps in its cache or RAM, even if it could somehow navigate the data structure.
The main reason we even have NUMA is that it was meant as a stop-gap measure. AMD and Intel assumed when they started integrating graphics chips on their boards that prices for memory chips would drop drastically as time went on, making the lack of uniformly addressable memory less of an issue because you could simply have so much more memory. It was still only ever a stop-gap.
With hUMA, you can now have both the CPU and GPU working from the same addressable pool of data from the same memory banks. If the GPU needs something that it knows is on the CPU’s side of things, it checks the CPU pointer to see where it is in the structure and goes to fetch it – no interrupts needed, no wasting time for commands, it just goes. The same applies for the CPU – it can now access information the GPU works with and manipulate it. Can both work on the same thing at the same time? Probably not, but the CPU now no longer needs to feed the GPU with information, needlessly wasting power and CPU cycles.
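To make the shared-pool idea a bit more concrete, here is a minimal Python sketch. It is only an analogy, not AMD’s API: `cpu_produce`, `gpu_kernel` and `run_shared` are hypothetical names, everything runs on the CPU, and a `memoryview` stands in for a pointer that both processors are able to follow.

```python
def cpu_produce(n):
    """The 'CPU' fills a buffer with the values 0..n-1 (n must be <= 256)."""
    return bytearray(range(n))


def gpu_kernel(view):
    """The 'GPU' doubles every element in place, through the shared view."""
    for i in range(len(view)):
        view[i] = (view[i] * 2) % 256


def run_shared(n=8):
    data = cpu_produce(n)        # one allocation, made by the 'CPU'
    shared = memoryview(data)    # a second handle to the SAME bytes, no copy
    gpu_kernel(shared)           # the 'GPU' works directly on that memory
    return data                  # the 'CPU' sees the result straight away


print(list(run_shared()))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The point of the sketch is that only one buffer ever exists: the “GPU” receives a handle to the CPU’s data rather than a duplicate of it.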
It’s like being in a library. The current situation with NUMA is like when you’re younger and you ask the librarian to find and fetch a book for you, because you have no idea where it is and it’s too high up for you anyway. Because the librarian is busy, it takes him/her some time to finish what he/she’s doing and get to your request. But then you grow up, and when you go back to the same library later on you can bypass the librarian entirely: you can read the signs that tell you where the book is and, for bonus points, you can actually fetch the damn thing yourself. Then you can walk back to the librarian and they won’t be able to resist your absolute sexiness because you can get stuff done, damnit!
The hUMA solution also addresses a key issue – latency disparity. With NUMA, if you were working with data on the CPU that needed to be processed by the GPU, you would have to copy that data into the GPU’s addressable memory and wait for it to finish working through it. Once done, the GPU writes the result to memory, the processor copies that from the GPU-reserved memory into its own memory and then carries on. If the GPU needs something, it’s the same long, convoluted set of steps.
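The copy-in, compute, copy-out routine described above can be sketched as a toy model in Python. This is an illustration under stated assumptions rather than real driver behaviour: `square_all`, `run_with_copies` and `run_without_copies` are hypothetical names, and the “bytes moved” figure only counts the staging copies, not the work itself.

```python
def square_all(buf):
    """Stand-in for GPU work: square each value in place (mod 256)."""
    for i in range(len(buf)):
        buf[i] = (buf[i] * buf[i]) % 256


def run_with_copies(cpu_mem):
    """Reserved-memory model: copy in, compute, copy out."""
    gpu_mem = bytearray(cpu_mem)       # step 1: copy CPU data into GPU memory
    square_all(gpu_mem)                # step 2: GPU works on its private copy
    cpu_mem[:] = gpu_mem               # step 3: copy the result back to the CPU
    return 2 * len(cpu_mem)            # bytes moved: once in, once out


def run_without_copies(cpu_mem):
    """Shared-address model: the GPU follows the CPU's pointer."""
    square_all(memoryview(cpu_mem))    # same bytes, no staging copy
    return 0                           # bytes moved


a = bytearray([1, 2, 3, 4])
b = bytearray([1, 2, 3, 4])
print(run_with_copies(a), list(a))     # 8 [1, 4, 9, 16]
print(run_without_copies(b), list(b))  # 0 [1, 4, 9, 16]
```

Both paths produce the same answer; the difference is the traffic spent shuttling data back and forth, which grows with the size of the buffer in the first model and stays at zero in the second.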
The disparity required programmers to have advanced knowledge of CPU and GPU architectures and how they worked with memory and how their pointers differed. This is a part of why more multi-platform games reach the Xbox 360 first, because it has already solved this issue somewhat with dynamically adjustable memory reserves and because the PS3’s hardware setup is very limiting when porting a game from other architectures and environments.
Groundbreaking stuff on the horizon
It gets more interesting, though. Now we’re onto coherency, goal #3 on AMD’s hit-list in the 2013 roadmap. One of the key features in hUMA is coherent memory. The CPU and the GPU are allowed, at a hardware level, to access each other’s on-die caches. This means that once the CPU is done with something and puts it in the L2 or L3 cache, the GPU can immediately fetch that information and begin using it for its own calculations without being prompted. The two chips can then use their shared pointers to access information held in RAM – the CPU can access the GPU’s RAM and vice versa.
They can also, in addition, access each other’s virtual memory, which is hosted in the page file on the hard drive. This is groundbreaking stuff, people. The processor can compute where a zombie is in the game you’re playing and add it into the virtual memory intended for the GPU. The GPU can then fetch that information along with the decompressed textures and other information in the RAM, figure out where the zombie goes and draw it in.
The CPU, knowing now that the zombie is almost done, can add in particle effects to its clothing WITHOUT (in theory) interrupting the GPU’s processes. You can have both chips working on the same thing to finish it so much quicker – the CPU handles single or lightly-threaded parts of the job while the GPU gets the grunt work done. Offloading more and more parallel work to the GPU is AMD’s key to lowering power consumption and CPU usage on the system.
The speed benefits it brings have the potential to be tremendous. Complex scenes in games that have to be decompressed into RAM from the hard drive can be pre-loaded and pre-rendered ahead of time, so the system can allocate more resources to rendering the game, applying textures and other things, and reduce any chance of hard disk thrashing as a result of heavy pagefile usage. It’s literally game-changing.
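Why coherency matters can be shown with a small Python model. To be clear, this is a textbook-style write-invalidate scheme used purely for illustration, not a description of how AMD’s hardware does it; the `Cache` class, `gpu_view_of_zombie` and the `zombie_x` address are all made-up names.

```python
class Cache:
    """A tiny write-through cache in front of a shared memory dict."""

    def __init__(self, memory, coherent):
        self.memory = memory
        self.coherent = coherent
        self.lines = {}    # address -> cached value
        self.peers = []    # other caches sitting on the same memory

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value           # write through to memory
        if self.coherent:
            for peer in self.peers:         # drop stale copies held by peers
                peer.lines.pop(addr, None)


def gpu_view_of_zombie(coherent):
    memory = {"zombie_x": 10}
    cpu, gpu = Cache(memory, coherent), Cache(memory, coherent)
    cpu.peers, gpu.peers = [gpu], [cpu]
    gpu.read("zombie_x")           # GPU caches the old position
    cpu.write("zombie_x", 42)      # CPU moves the zombie
    return gpu.read("zombie_x")    # what does the GPU see now?


print(gpu_view_of_zombie(coherent=False))  # 10 (stale)
print(gpu_view_of_zombie(coherent=True))   # 42 (fresh)
```

Without coherency the “GPU” keeps drawing from its stale cached value until software steps in to flush it; with coherency the hardware keeps both views in agreement on its own.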
This extends way beyond games as well
Although this all benefits the PS4 and Microsoft’s Xbox 720, in a much bigger way it will change how games are optimised on different platforms, especially on the PC. AMD says that their hUMA magic is all done in hardware and with machine code, so it doesn’t need an operating system for oversight or new APIs to program the functionality in. hUMA chops out so many steps in the sharing of information between the CPU and GPU that we could see big efficiency and performance increases. Just how big, we won’t know until the official launch of Kaveri for the desktop. We’re more likely to see the benefits on consoles first than actual hardware on the desktop.
AMD also says that this technology is available for virtualised environments as well. That means that the company is in a position to heavily challenge Nvidia’s Tesla server market. While Tesla requires CUDA-capable software and won’t have any hUMA improvements, AMD could use this technology to turn an entire server farm into one single, massively powerful entity capable of enormous computational power and throughput, all without having to rely on proprietary software and OS-dependent APIs. That could be interesting to exploit. We could see Skynet and Judgement Day in our lifetimes.
AMD does say, though, that there are limitations to their technology. For one, with discrete GPUs the memories for the GPU and CPU are still separate, but similarly addressable. AMD says that hUMA for Kabini on the desktop won’t directly unify the two memory pools, but they will still be uniformly addressable. The APU can put something on the much faster GDDR5 memory and the GPU can put something in the DDR3 system RAM. That means textures, future game engine code to work with, pre-rendered environments…
It’s all coming together now and this is the fun part to watch. The issues AMD’s been having with the memory manager in the HD7000 series, the delays in the HD8000 consumer lineup to focus on getting the PS4 and whatever they’re doing for the Xbox out of the gate, the focus on the mobile segment, the focus on reducing graphics latency with frame rating, abandoning the race for x86 superiority against Intel and the focus on OpenCL – these are all in preparation for a future with HSA (mind you, we knew it was coming, just not this quickly).
Kaveri replaces Richland which replaces Trinity which replaces Llano and they were all incremental updates towards the same goal – a heterogeneous system architecture where you can use the memory and compute resources on the APU and the discrete GPU together to create a huge, single pool of compute resources. This is perhaps the real reason why AMD changed to socket FM2 to accommodate Trinity processors. This is perhaps the reason why the HD8000 series is delayed as they prepare to launch a double-whammy on the gaming world.
The move to processors sharing floating-point units with a scalable design starting with Bulldozer may have been part of the plan all along. Putting together the pieces of the puzzle is fun, but stepping back to see how far you’ve come in assembling the giant jigsaw together is quite something. What does this mean for AMD’s products next year?
When developers from major studios commented that the PS4 was going to be a dream to code on, I don’t think anyone besides those people and the engineers inside AMD, Sony and Microsoft realised quite what that meant. This is quite a big change. For game and app developers it means that in the future they’ll have access to more hardware resources on their targeted platforms. OpenCL hardware acceleration won’t have as many stumbling blocks to trip over and may actually properly take off this year or the next.
For Sony and Microsoft, it means more and more developers will be interested in a system that’s easy to code for, offers obscene amounts of power with a fully coherent HSA design and never changes its internal structure. If both platforms are more or less identical, that speeds up multi-platform development by orders of magnitude, giving developers more time to polish their titles and to ensure that the visuals are top-notch as well. For about a year or so, I don’t expect a PC to have the same levels of performance the next-gen consoles will be capable of.
However, one thing to keep in mind is that hUMA is about the APU sharing memory coherently. hUMA with a discrete GPU is possible, but has memory latency issues and will have scaling issues. The PS4 has a “Unified Memory Architecture”, which means that the APU works on GDDR5 RAM and not regular DDR3. That benefits the console environment because you can take the latency issues of GDDR5 into account and adjust your program so they matter less. hUMA merely gives the CPU the option to access GPU-held data and vice versa, but the two are still slightly separated. The PS4 throws it all under one roof.
For AMD, hUMA means an opportunity to once again push for the performance crown in a range of markets, especially the budget ones that APUs currently serve. hUMA might be exclusive to Kaveri for now, but there will be more products available in the future. No doubt they’ll be milking the console connection for all it’s worth. With their past history of delivering 15% performance increases year-on-year to their chips, you can also bet that performance may keep on improving in the future.