Nvidia finally explains the GTX 970 memory issue

If you’re a current or prospective Geforce GTX 970 owner, you may have heard of, or been following, the whole fuss on the internet about how the card’s memory is allocated. Almost overnight, the BuildaPC subreddit and several tech threads on NeoGAF and Beyond3D exploded with people discovering that once VRAM utilisation on the GTX 970 goes over 3.5GB, performance drops like a stone, whereas the same issue does not plague the GTX 980. But why does that happen, and is it related to the problems of old that also hindered the Geforce GTX 550Ti and the GTX 660Ti? There’s a full explanation from Nvidia after the jump, as well as my own thoughts on the saga.

Firstly, to get a proper perspective on these matters, we need to compare the GTX 970 to the card it is based on. Nvidia tripped over themselves a bit when revealing the GTX 970, and their marketing department may have gotten some of the specifications downright wrong.

Geforce GTX 980 vs GTX 970 (official) vs GTX 970 (actual)

Specification      GTX 980    GTX 970 (official)   GTX 970 (actual)
GPU Cores          2048       1664                 1664
Texture Units      128        104                  104
ROP count          64         64                   56
L2 cache           2048 KB    2048 KB              1792 KB
Memory Bandwidth   224GB/s    224GB/s              ~224GB/s

When Nvidia launched the GTX 970, they told reviewers that it was almost a GTX 980, with a few key changes to differentiate performance. The GM204 GPU in the GTX 980 is composed of 16 Streaming Multiprocessor Modules (SMMs) with 128 CUDA cores each. Enable only 13 of those and you get the GTX 970, with 1664 CUDA cores. With its full stack of SMMs, the GTX 980 also has four ROPs for each SMM, increasing throughput at higher resolutions and improving performance overall.

In previous architectures, there’d always be a slight mismatch between the number of ROPs and the number of shader modules, mostly to preserve performance at high resolutions. In the GTX 970’s official specs, just under five ROPs are available per SMM when you work out the math (64 divided by 13), but the reality in the chip design is a little different, and you’ll see why shortly. With specifications corrected to reflect the GTX 970’s actual ROP count of 56, that gives us closer to four ROPs per SMM (about 4.3), sticking as close as possible to the GTX 980’s configuration.
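The per-SMM figures are easy to check for yourself. The short Python sketch below simply redoes the arithmetic using the numbers from the table; the function names are made up for illustration and don't come from Nvidia.

```python
CORES_PER_SMM = 128  # CUDA cores per SMM on GM204, as quoted in the article


def cuda_cores(active_smms):
    """Total CUDA cores for a given number of enabled SMMs."""
    return active_smms * CORES_PER_SMM


def rops_per_smm(rops, active_smms):
    """Average number of ROPs available per enabled SMM."""
    return rops / active_smms


print("GTX 980 cores:", cuda_cores(16))
print("GTX 970 cores:", cuda_cores(13))
print("GTX 980 ROPs per SMM:", rops_per_smm(64, 16))
print("GTX 970 (official) ROPs per SMM:", round(rops_per_smm(64, 13), 2))
print("GTX 970 (actual) ROPs per SMM:", round(rops_per_smm(56, 13), 2))
```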

Note also that the memory bandwidth of the GTX 970 now changes from a guaranteed 224GB/s to something around 224GB/s. We’ll get into that later, but suffice to say it’s not related to the bus width or Nvidia’s colour compression tricks.

When PC Perspective asked Jonah Alben, Nvidia’s Senior VP of GPU Engineering, about the memory issues surrounding the GTX 970, Alben revealed that this was something Nvidia already knew about. The company mostly kept quiet about its mistake, and was only ready to explain why performance degrades once more than 3.5GB of the 4GB frame buffer is in use after the internet collectively blew up about it. Nvidia’s official excuse for any misrepresentation of the GTX 970’s specifications is apparently down to their marketing people not knowing how GPUs work, and to miscommunication that resulted in mistakes in the reviewer’s guide that never got rectified.

Not an optimal setup?

Nvidia Geforce GTX 970 memory subsystem
Credit: PC Perspective, Jonah Alben

Alben provided PC Perspective with the above diagram, which is the first in-depth look at how the memory subsystem in the Maxwell architecture works. In it we can count 16 possible SMMs with only 13 enabled, a full complement of L2 cache (with 1/8th of it disabled), four 64-bit memory controllers (each split into two 32-bit channels, eight in total) for a 256-bit memory bus and 4.0GB of GDDR5 VRAM, organised into a 3.5GB pool and a 512MB pool.

Firstly, though three SMMs are disabled, the GTX 980 and GTX 970 are very close in raw performance. This is because the GTX 970 has fewer active components, so it can boost to higher frequencies and maintain them for longer because there’s less heat output. Where previous GPUs from the Kepler line based GPU Boost on TDP limits, Maxwell does it based on temperature. All SMMs can also access resources through the crossbar, so there’s no bottleneck on the computational side of things.

Secondly, Maxwell is the first GPU architecture from Nvidia that is able to selectively turn off parts of the L2 cache while keeping the memory controller linked to them working and functional. With Kepler, whole sections of the GPU were often disabled in order to drive performance down to a lower price point, but it meant that a lot of silicon went unused and, more often than not, it was a larger chunk than Nvidia would have liked. Sub-dividing the L2 cache into chunks that can be selectively turned off is one of Maxwell’s strengths, as it improves yields.

Unfortunately, disabling that portion of L2 cache does drop the ROP count to 56 instead of 64, so in that respect Nvidia has misrepresented the GTX 970’s specifications. Whether that’s enough to send a lawsuit after them for this fiasco, I don’t know.
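To see where 56 ROPs and 1792 KB come from, it helps to treat the full GM204 as eight equal slices of L2 cache and ROPs. That eight-way split is an assumption, inferred from the "1/8th disabled" figure and the table (2048 KB / 8 = 256 KB and 64 / 8 = 8 ROPs per slice), and the names in this sketch are purely illustrative.

```python
FULL_L2_KB = 2048  # L2 cache on a fully enabled GM204 (GTX 980)
FULL_ROPS = 64     # ROPs on a fully enabled GM204
SLICES = 8         # assumption: L2 and ROPs are disabled together in eighths


def with_slices_disabled(disabled):
    """Return (l2_kb, rops) left after disabling some L2/ROP slices."""
    active = SLICES - disabled
    l2_kb = FULL_L2_KB // SLICES * active
    rops = FULL_ROPS // SLICES * active
    return l2_kb, rops


print("GTX 980:", with_slices_disabled(0))
print("GTX 970:", with_slices_disabled(1))
```

Under the same assumption, disabling two slices would leave 1536 KB of L2 and 48 ROPs.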

Probably not.

So what’s happening here?

Nvidia Geforce GTX 970 Nai benchmarks

So, people discovered the memory bug when playing games like Max Payne 3, Elder Scrolls V: Skyrim or Hitman: Absolution, all of which can be incredibly taxing on the VRAM you have on your card. In the case of Hitman and Max Payne 3, both games would only report 3.5GB of VRAM being available and in use, which shouldn’t have been the case according to official specifications. Once Reddit got wind of this story, the community set out to test this using Nai’s benchmark and discovered that any memory accesses to the last 512MB of VRAM on the card effectively ran at about 1/7th of the normal memory speed. If you’d like to do your own testing, you’ll have to run your GPU headless (not connected to a monitor) to get the correct performance stats.

What you’re seeing above is the output of Nai’s benchmark, showing that main memory in the GTX 980 has no issues with speed or consistency, but the GTX 970 sticks mostly to 150GB/s bandwidth with the last 512MB running at below 30GB/s on average. As soon as the benchmark gets to that point, the performance of the available L2 cache also drops dramatically.

In an ideal situation for the GTX 970, games would utilise less than 3.5GB of VRAM as offered up by the OS. If they needed more, the idea is that the drivers and operating system work together to move assets that aren’t needed frequently into the slower memory, which is still about four times as fast as system memory. Essentially, that 512MB of VRAM works as a cache for the GPU, and it’s definitely not something that should be used for performance-sensitive workloads like a modded Skyrim install.
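As a way of picturing that behaviour, here is a toy placement model in Python. To be clear, this is not Nvidia's driver logic, which isn't public; it is a hypothetical greedy scheme that puts the most frequently accessed assets in the 3.5GB pool first, spills into the 512MB pool, and falls back to system memory last. The asset names, sizes and access counts are invented.

```python
FAST_POOL_MB = 3584  # the 3.5GB pool
SLOW_POOL_MB = 512   # the slower 512MB pool


def place_assets(assets):
    """Greedily place (name, size_mb, accesses_per_frame) tuples.

    Hypothetical policy: hottest assets first, fast pool before slow pool,
    system memory only when neither VRAM pool has room.
    """
    free = {"fast": FAST_POOL_MB, "slow": SLOW_POOL_MB}
    placement = {}
    hottest_first = sorted(assets, key=lambda a: a[2], reverse=True)
    for name, size_mb, _accesses in hottest_first:
        for pool in ("fast", "slow"):
            if size_mb <= free[pool]:
                free[pool] -= size_mb
                placement[name] = pool
                break
        else:
            # No VRAM pool could hold the asset.
            placement[name] = "system"
    return placement


example = [
    ("textures", 3000, 100),
    ("geometry", 500, 80),
    ("rarely_used", 400, 1),
    ("huge_cold_asset", 600, 0),
]
print(place_assets(example))
```

In this made-up example the hot textures and geometry land in the fast pool, the rarely used asset ends up in the slow 512MB, and the last one is pushed out to system memory.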

It’s important to note that graphics cards also operate similarly to solid state drives when it comes to accessing memory chips: they do it in parallel. In the GTX 970, all 3.5GB of VRAM in the first pool is worked on by seven of the eight 32-bit memory channels, but the 512MB pool is accessible and addressable only through the one remaining channel it is attached to.
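A rough back-of-the-envelope model shows why the last 512MB is so much slower. It assumes the quoted 224GB/s peak is shared evenly between eight memory channels, and that a pool's peak bandwidth scales with the number of channels it is striped across. These are theoretical peaks under those assumptions; measured figures, like the ones from Nai's benchmark, will come in lower.

```python
PEAK_GBPS = 224.0  # quoted peak bandwidth for the full 256-bit bus
CHANNELS = 8       # assumption: eight equal channels share that peak


def pool_bandwidth(channels_used):
    """Theoretical peak bandwidth of a pool striped over some channels."""
    return PEAK_GBPS / CHANNELS * channels_used


fast_pool = pool_bandwidth(7)  # the 3.5GB pool
slow_pool = pool_bandwidth(1)  # the 512MB pool

print("3.5GB pool peak (GB/s):", fast_pool)
print("512MB pool peak (GB/s):", slow_pool)
print("Slow pool as a fraction of fast pool:", slow_pool / fast_pool)
```

The ratio works out to 1/7, which lines up with what the community measured.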

But, and this is an important point, there is still 4GB of GDDR5 memory available on the card and all of it is open and accessible to the system and software; it’s just that the last portion is much slower. Nvidia would have been in a much better situation overall today had they published the correct ROP count for the GTX 970 and called the sectioned-off memory L3 cache, which would be somewhat correct (or super-cache or whatever marketing whizzkids will try to come up with).

Alternatively, Nvidia could have avoided this entirely by sectioning off the entire affected L2 cache area along with the 64-bit memory controller, chopping 1GB of VRAM off the specs and giving that to us with 48 ROPs, 104 Texture Units (TMUs), 1.5MB of L2 cache and 1664 CUDA cores. Funnily enough, they did almost exactly that with the GTX 970M.

So, if you have a GTX 970, treat it as a 3.5GB card for all intents and purposes. If you want to go through the legal framework to claim back your money and try to replace it with a GTX 980, which doesn’t have these issues, then you’re probably well within your rights to do that. Just don’t expect Nvidia to care enough to make amends for it on a global scale.

They’ve also done this before…

Nvidia Geforce 660Ti

In the last four years, Nvidia has released no fewer than 13 graphics cards with very similar issues, most notably the GTX 460 V2, the GTX 550Ti and the GTX 660Ti. All three of these cards were on 192-bit memory buses and had mismatched memory allocations. That is, on the GTX 460 V2, 768MB of VRAM was accessible normally on the memory bus, but 256MB was shared between two of the controllers, running at slower speeds and with higher latencies. Though it ended up being the better choice overall compared to the original GTX 460, that 256MB of VRAM was more or less left on the table to sit idle.
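The general pattern can be sketched in a few lines of Python. The idea is that memory can only be interleaved across all controllers up to the capacity of the smallest one; whatever is left over hangs off fewer controllers and is slower. The per-controller capacities used here are hypothetical values picked to reproduce the GTX 460 V2 figures quoted above, not a verified board layout.

```python
def split_pools(per_controller_mb):
    """Split mismatched controller capacities into two pools.

    Returns (interleaved_mb, leftover_mb, controllers_holding_leftover).
    """
    smallest = min(per_controller_mb)
    # Memory that can be striped evenly across every controller.
    interleaved = smallest * len(per_controller_mb)
    # Whatever each controller holds beyond that even share.
    extras = [capacity - smallest for capacity in per_controller_mb]
    leftover = sum(extras)
    holders = sum(1 for extra in extras if extra > 0)
    return interleaved, leftover, holders


# Hypothetical layout: three 64-bit controllers, two carrying extra memory.
print(split_pools([384, 384, 256]))
```

With evenly populated controllers the leftover pool disappears entirely, which is why cards with matched memory don't show this behaviour.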

The problem was exacerbated by the GTX 550Ti. This was now Fermi, take two, and it shipped with 1024MB of VRAM on a 192-bit memory bus. Changes to the behaviour of Fermi meant that it could now address memory chips of varying density, but the extra memory was still much slower than the main pool and was mostly performance left on the table, unused. You could still use all 1GB of it, but it wasn’t pretty.

Nvidia Geforce GTX 550Ti Memory

When Nvidia made the jump to Kepler, they did so with a GTX 660 and GTX 660Ti in their stable, both using 192-bit memory buses and 2GB of VRAM as standard. As before, a pair of the memory controllers was saddled with extra RAM stacked on to them, but performance was much better. Still, many games running on these cards will report 1.5GB of VRAM being available. In fact, this was something that Nvidia was open about to reviewers at the launch of the GTX 660Ti, but that was just over two years ago. We’ve long forgotten about it for the most part.

With their previous experience segmenting off memory into pools, Nvidia simply made the necessary optimisations in their drivers to avoid using the second pool of 512MB VRAM unless absolutely necessary.

Nvidia Geforce GTX 660Ti Memory

With Kepler and Fermi, however, dropping to the 192-bit bus was a cost issue. They could either sell the cards as 1.5GB models, giving the advantage on paper to AMD, or they could put on enough 2Gb chips to make up 3GB of VRAM, which would hand the advantage back to them but lower the overall bandwidth available, because the memory controllers now had twice as much memory to work through. Another disadvantage was that, by bringing their designs down to the 192-bit interface, they had to leave out a more sizeable section of ROPs and TMUs, bringing performance down by about 25%.
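The 1.5GB and 3GB figures fall straight out of the bus width. Here is a small Python sketch of the arithmetic, assuming each GDDR5 chip sits on a 32-bit slice of the bus and that doubling up means hanging two chips off each slice; the function name and parameters are just for illustration.

```python
CHIP_INTERFACE_BITS = 32  # assumption: one GDDR5 chip per 32-bit slice of the bus


def capacity_gb(bus_width_bits, chip_density_gbit, chips_per_slice=1):
    """Total VRAM in GB for a bus populated with identical chips."""
    slices = bus_width_bits // CHIP_INTERFACE_BITS
    total_gbit = slices * chips_per_slice * chip_density_gbit
    return total_gbit / 8  # 8 gigabits per gigabyte


print("192-bit, 2Gb chips, one per slice:", capacity_gb(192, 2), "GB")
print("192-bit, 2Gb chips, two per slice:", capacity_gb(192, 2, 2), "GB")
```

A 2GB card on a 192-bit bus sits awkwardly between those two options, which is how the mismatched pools come about.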

In conclusion

If you’re a GTX 970 owner, I feel bad for you. Not only did you buy a product that was falsely advertised by Nvidia, it’s dogged by a problem they’ve had in GPU generations before Maxwell, and it’s nothing that we haven’t seen before. You essentially have a 3.5GB card. You’ll have to use applications like MSI Afterburner or GPU-Z to analyse your memory usage in the games you play and lower the quality settings you’re using until you fall under 3.5GB utilisation, or use the recommendations from Geforce Experience to ensure you never run off that performance cliff.
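If you'd rather script that check, something similar can be cobbled together with nvidia-smi, the command-line tool that ships with Nvidia's drivers. This is a swapped-in approach rather than anything Afterburner or GPU-Z does, and it is only a sketch: the 3584 MiB threshold is this article's 3.5GB figure, the function names are hypothetical, and nvidia-smi reports memory in use across the whole GPU, not just by your game.

```python
import subprocess

FAST_POOL_MIB = 3584  # the 3.5GB pool, used here as the warning threshold


def parse_used_mib(smi_output):
    """Parse one memory.used value (in MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def over_fast_pool(used_mib, limit_mib=FAST_POOL_MIB):
    """True when usage has spilled past the fast pool."""
    return used_mib > limit_mib


def query_used_mib():
    """Ask nvidia-smi for the memory currently in use on each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_used_mib(result.stdout)


def report():
    """Print a one-line status per GPU; requires nvidia-smi on the PATH."""
    for index, used in enumerate(query_used_mib()):
        status = "over 3.5GB, expect slowdowns" if over_fast_pool(used) else "ok"
        print(f"GPU {index}: {used} MiB used ({status})")
```

Calling report() while a game is running in the background gives a quick indication of whether your current settings are pushing you past the 3.5GB mark.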

So, things like MSAA or Dynamic Super-Resolution (DSR) will most definitely chew up the available RAM and give you poor game performance if you push it too far. If you’re playing at UltraHD 4K already, you’ll have to be far more selective about any games like Skyrim that have ultra high-definition texture packs available. You can really only hope to solve this by moving to a GTX 980, or buying a Radeon R9 290 or R9 290X from AMD, both of which come with 4GB of VRAM entirely accessible to any and all software.