Earlier this week I had a deeper look into the GeForce GTX680 and decided it was definitely the contender we needed to give the Radeon HD7970 a run for its money. Kepler is an advancement of Fermi, so much so that the GTX680’s core is closer to the GTX460 than any other card in Nvidia’s lineup. Today we look at the performance of Kepler, and try to see where and in what systems it should comfortably fit.
Firstly though, we need to talk about the third generation of PCI Express. While PCI Express 3.0 is no doubt an improvement over v2.0, it bears reminding that it is only being marketed extensively on Sandy Bridge platforms that have fully compliant hardware to meet the new standard. All motherboards with a full 16-lane PCI Express port can, in theory, move up to PCI Express 3.0 through BIOS upgrades. The only drawback here is older motherboards that support multiple graphics cards. That’s right – your Rampage Gene 2 motherboard that you spent thousands on potentially won’t get the proper PCI Express 3.0 update, regardless of how up-to-date you think the board may be.
The reason is that the older chipsets that controlled SLI and Crossfire had to use up eight lanes to allow for switching the slot from a full-size 16-lane port to a half-size 8-lane port. Add the lack of lanes to the 20% encoding overhead that PCI Express 2.0 incurs, and it was easy to see why more expensive boards that had 16 lanes in each port even while running SLI provided better performance. If you upgraded your BIOS to support the new connection standard without compliant hardware, all your PCI Express slots would be reduced to eight lanes wide and you wouldn’t get the benefits anymore.
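To put rough numbers on the lane and overhead argument, here is a minimal Python sketch. It assumes the published per-lane signalling rates (5 GT/s for PCI Express 2.0, 8 GT/s for 3.0) and their line encodings (8b/10b, which is where the 20% overhead comes from, and 128b/130b). The helper name is made up for illustration, and the result is a theoretical ceiling rather than a measured figure.

```python
# Rough one-direction PCI Express bandwidth per generation and lane count.
# Assumption: transfer rates and encodings are the publicly documented ones;
# real-world throughput is lower because of protocol overhead.
PCIE_GENERATIONS = {
    "2.0": {"gigatransfers_per_s": 5.0, "payload_bits": 8, "encoded_bits": 10},
    "3.0": {"gigatransfers_per_s": 8.0, "payload_bits": 128, "encoded_bits": 130},
}

def link_bandwidth_gb_s(generation, lanes):
    """Theoretical usable GB/s in one direction (hypothetical helper)."""
    spec = PCIE_GENERATIONS[generation]
    efficiency = spec["payload_bits"] / spec["encoded_bits"]
    usable_gbit_per_lane = spec["gigatransfers_per_s"] * efficiency
    return usable_gbit_per_lane * lanes / 8  # bits to bytes

for generation in ("2.0", "3.0"):
    for lanes in (16, 8):
        bandwidth = link_bandwidth_gb_s(generation, lanes)
        print(f"PCIe {generation} x{lanes}: {bandwidth:.2f} GB/s")
```

Under these assumptions an 8-lane PCI Express 3.0 slot lands within a whisker of a 16-lane 2.0 slot, which is why compliant switching hardware matters more than the raw lane count.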
Motherboards that do have the correct new chipsets with PCI Express 3.0-compliant switches don’t have this issue – they can switch from a lone 16x slot to an 8x slot with another card in SLI or Crossfire without hassle. Motherboards that don’t support multiple graphics cards will likely still run at full speed, and they might get the update as well. So that’s why, if you’ve been reading other reviews online, Nvidia says that the GTX680 is PCI-E 2.0 and PCI-E 3.0 compliant.
See, Nvidia wants as many people as possible to buy its cards. If it sold a card that was only PCI Express 3.0 compliant, you’d have trouble running that on older motherboards that don’t support the standard yet. So all GTX680s are, essentially, PCI Express 2.0 cards. But don’t worry, because it doesn’t actually impact performance that much, yet. Once Ivy Bridge launches with PCI Express 3.0 support, Nvidia will provide an optional update to folk who want the faster and more advanced standard. Radeon HD7000 owners have a similar issue, and will get a similar BIOS update to support the new standard. Remember: a PCI-E 2.0 card will happily run in a PCI-E 3.0 board, but it will only ever run at PCI-E 2.0 speeds there.
Further into the card, we have to make a quick stop at chip design level. Since it’s based on the GTX460/GF114, we have to go back (rather, left) and look at the load-out for the GTX460’s shader module architecture. Each Shader Module had 48 CUDA cores with 16 Load/Store units (for threads that were required in dependent code, as I discussed on Monday), 16 Interpolation units, eight Special Function units and eight Texture units. You see those orange blocks at the top? Those are called Warp Schedulers, and they were responsible for telling the CUDA cores what to do and when.
Each Warp Scheduler has two dispatch units, making four per Shader Module. These dispatch units took the workload for the cores and split it into 64-thread segments (just a reminder: one thread = one line of code). That segment was then divided into four 16-thread bits, one for each of the four dispatch units. At any given time, the module could go through 32 threads in one clock cycle (running the code from Initialisation through to the Issue, refer back to part one if you’re confused), so running all 64 threads would require two clock cycles. This is why overclocking a card brings more benefits – the higher the Shader clock speed, the more cycles can be run in the same amount of time.
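The cycle arithmetic above can be sketched in a few lines of Python. The segment size and threads-per-cycle figures are the simplified ones used in this article, and the two shader clocks in the loop are purely illustrative values rather than measurements from any card.

```python
import math

THREADS_PER_SEGMENT = 64  # segment size as described in the article
THREADS_PER_CYCLE = 32    # threads processed per clock, as described in the article

def cycles_for_segment(threads=THREADS_PER_SEGMENT, per_cycle=THREADS_PER_CYCLE):
    """Clock cycles needed to push one segment through a Shader Module."""
    return math.ceil(threads / per_cycle)

def segments_per_second(shader_clock_mhz):
    """Segments one module completes per second at a given shader clock."""
    return shader_clock_mhz * 1_000_000 / cycles_for_segment()

# Illustrative clocks only: a stock-style speed and an overclocked one.
for clock_mhz in (1350, 1500):
    print(f"{clock_mhz} MHz: {segments_per_second(clock_mhz):,.0f} segments per second")
```

Because the cycle count per segment is fixed, throughput in this toy model scales linearly with the shader clock, which is the whole case for overclocking in one line.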
Kepler, or GK104, changes this by improving the efficiency of the Shader Module and dropping the Shader clock. I know I’m going to be oversimplifying here, but it beats you getting lost on me after 750 words! GK104 increases the number of cores in each module to 192, doubles the Load/Store Units to 32, the Interpolation Units to 32 and the Texture Units to 16, and quadruples the Special Function Units to 32. It now has four Warp Schedulers with eight Dispatch Units per Shader Module. Each Dispatch Unit can now run up to 16 threads simultaneously, which means that GK104 can go through double the workload of GF114 in half the time. Overclocked, we should see GTX680 variants easily pushing past the ageing GTX590. If this is what the flagship is like, what’s the GTX690 going to do to benchmarks?
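For a side-by-side view of how much each part of the Shader Module grew, here is a small Python sketch. It uses only the per-module unit counts quoted in this article for GF114 and GK104; the dictionary and function names are invented for the example.

```python
# Per-Shader-Module unit counts as quoted in this article.
GF114_MODULE = {
    "CUDA cores": 48,
    "Load/Store units": 16,
    "Interpolation units": 16,
    "Special Function units": 8,
    "Texture units": 8,
}
GK104_MODULE = {
    "CUDA cores": 192,
    "Load/Store units": 32,
    "Interpolation units": 32,
    "Special Function units": 32,
    "Texture units": 16,
}

def scaling(old, new):
    """Growth factor for every unit type present in the old module."""
    return {name: new[name] / old[name] for name in old}

for name, factor in scaling(GF114_MODULE, GK104_MODULE).items():
    print(f"{name}: {GF114_MODULE[name]} -> {GK104_MODULE[name]} ({factor:g}x)")
```

Laid out like this, the growth is uneven: the cores and Special Function units grow fourfold per module while everything else doubles.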
Just as an aside, some of you may be asking if I failed math somewhere. If there are 192 CUDA cores, doesn’t that mean that Kepler actually runs through 128 threads in a single cycle, thereby making it four times as fast? Not really. See, about half of those CUDA cores are there to make up for the missing shader clock that modern applications will no doubt go on a goose chase to find. They are used to do the checking that makes sure each thread passes through right on time.
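Taking the simplified figures in this article at face value, the "four times versus two times" question works out as follows. This is a back-of-the-envelope Python sketch, not a model of the real hardware: the 0.5 share is just the "about half" estimate from the paragraph above, and all constant names are made up.

```python
FERMI_THREADS_PER_CYCLE = 32      # GF114 figure used earlier in the article
KEPLER_DISPATCH_UNITS = 8         # per Shader Module, per the article
THREADS_PER_DISPATCH_UNIT = 16    # per the article
SHARE_OF_CORES_DOING_WORK = 0.5   # assumption: the article's "about half"

def kepler_threads_per_cycle(share=1.0):
    """Threads per cycle if only `share` of the module does thread work."""
    return KEPLER_DISPATCH_UNITS * THREADS_PER_DISPATCH_UNIT * share

naive = kepler_threads_per_cycle()
effective = kepler_threads_per_cycle(SHARE_OF_CORES_DOING_WORK)

print(f"Naive:     {naive:g} threads per cycle, {naive / FERMI_THREADS_PER_CYCLE:g}x GF114")
print(f"Effective: {effective:g} threads per cycle, {effective / FERMI_THREADS_PER_CYCLE:g}x GF114")
```

The naive count gives the fourfold figure, and halving it brings the result back in line with the "double the workload" claim made earlier.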
If you’re still not convinced, there’s an earlier hardware stage called the Register File which holds all the necessary code in 32-bit segments ready for action. It’s only doubled in size for Kepler and thus doesn’t have work for the other half of the cores, so there’s no issue of CUDA cores going to waste that could be used for more threads.
But enough of that, let’s take a look at some benchmarks; we’ll start with Tom’s Hardware’s run of 3DMark 11. Yes, I know it’s synthetic, but the latest version has grown to be a much better fortune teller for those looking for a new card. As you’ll see below, the GTX680 should beat the GTX580 by a comfortable margin and should find itself slightly ahead of the HD7970. It should also trail a bit behind the GTX590, which you’ll see is no mean feat on its own. The Physics score is more interesting because of Kepler’s simpler approach to the way threads are run in the new, smaller CUDA cores. Even without the dependency checking and Register table in Fermi, it more than keeps up with the new pack, and beats the GTX590 by a hair’s breadth.
Moving over to games, here we look at Battlefield 3. Strangely enough, most reviews up already don’t benchmark the GTX680 with FXAA enabled in Battlefield 3. I’ll be dealing with FXAA in my final analysis, but not showing how big the benefits of FXAA are is a big let-down. I’ve added in a second 1080p result from Anandtech which enabled it, and a third review by Hardware.info, which managed to get a quad-SLI setup running in no time for testing at 5760 x 1080, the default configuration for Nvidia Surround setups. Battlefield is a shader-heavy game, and the GTX680 squeezes into third place here, just as 3DMark predicted. It still provides playable performance at 30” resolutions with everything maxed out. With FXAA it streaks ahead of the HD6990, and easily provides smooth gameplay in Nvidia Surround. Unfortunately, not enabling FXAA brought down performance at those high resolutions, and paints the GTX680 in a different light. Looking at the difference between the Quad SLI and SLI scores, adding in two cards results in a gain of only 16fps, a poor showing for the platform.
Moving to my current favourite title, Crysis 2, we can see one area where the GTX680 suffers – bandwidth. Although the card pairs its 256-bit bus with much faster memory, that’s not enough to lift the amount of bandwidth much beyond the GTX580, which had a wider 384-bit bus but significantly lower-clocked memory. If there should ever be a non-reference board with a 384-bit bus, it would chew up everything all the way up to the GTX590 in Crysis 2. That same trend continues up to 2560 x 1600, where it’s powerful enough to muscle past the HD6990 but doesn’t get comfortably playable framerates. Quad SLI takes the cake, using the extra computing power to make up for the bandwidth loss.
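The bandwidth point is easy to check with a little arithmetic, sketched here in Python. The bus widths come from this article; the effective memory data rates (6008 MT/s for the GTX680, 4008 MT/s for the GTX580) are the reference-card figures as I understand them and should be treated as assumptions, as should the hypothetical 384-bit GTX680 at the end.

```python
def memory_bandwidth_gb_s(bus_width_bits, effective_rate_mt_s):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_mt_s / 1000

# (bus width in bits, effective data rate in MT/s)
# Data rates are assumed reference specifications, not taken from the article.
CARDS = {
    "GTX680": (256, 6008),
    "GTX580": (384, 4008),
    "GTX680 with a 384-bit bus (hypothetical)": (384, 6008),
}

for name, (bus_width, data_rate) in CARDS.items():
    print(f"{name}: {memory_bandwidth_gb_s(bus_width, data_rate):.1f} GB/s")
```

With those assumed clocks the two real cards come out almost identical, while the imagined 384-bit board would have roughly half as much bandwidth again.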
And then, there are some interesting results at the end here. When taking a look at the Metro 2033 scores, we see the GTX680 shrink back to its usual third place finish. Metro’s engine puts a strain on both shader performance and bandwidth, essentially mating Crysis with Battlefield and instead landing up with a Russian brat. Even when turning on AA, which is traditionally Nvidia’s strong point, the HD7970 sneaks past with a win. Perhaps this is down to drivers, or Metro just prefers AMD cards. This is certainly apparent at 5760 x 1080, and even Quad SLI drops down to unplayable framerates with everything turned up. This isn’t a driver problem, then – it isn’t apparent where the hardware limitation lies, but I’m betting it’s to do with the re-arranging of the Polymorph engine and not increasing the size of the L1 cache. It may have double the Texture units, but that doesn’t do squat if it’s already running out of memory at 2560 x 1600. Then again, no-one’s using FXAA here for comparison, so perhaps that’s something to do with it.
I’ll end off the Performance Preview with a double-up: Aliens vs. Predator and Civilization V. Both games make heavy use of the CUDA cores as GPGPUs, exploiting the DirectCompute ability first demoed in Fermi. With Tessellation enabled, Aliens vs. Predator uses it to draw up the Alien in its entirety. In Fermi, the Multi-Port Decoder Queue would sort out threads that were dependent or independent of other code, and schedule the dependent code for later threading. In Kepler it just zooms right through, sorting out the dependent code first, then doing the rest of the code that’s both dependent on those results and independent of everything else. In a way it makes sense, but that brings up one issue.
If the GPGPU code using CUDA cores is pushed through and calculated first, that should bring a performance hit to the code in the game that’s required immediately. In this case, it’s up to the developers to push out a patch that will change the way code is scheduled in the game’s engine. But thankfully Nvidia saw this before they made the scheduling change, and designed Kepler to be far more intelligent. This is why, despite the change, the GTX680 pulls back into third place in Aliens vs. Predator at 1080p with full AA (once again, notice the lack of forced FXAA). At 5760 x 1080 it still offers playable performance in Nvidia Surround – just don’t turn on AA. As with Metro, Aliens vs. Predator enjoys the stronger Compute performance in the Radeon HD7000 family, which easily bests Quad SLI with 4x AA enabled.
However, given that the scheduling has now changed inside the chip itself, does it affect the performance of the CUDA cores directly? There’s no comprehensive review out right now that tests CUDA performance in the GTX680, but there is one game that can give you an indication – Civilization V. The game uses DirectCompute to decompress textures in the game folder for use on buildings and environment. In the graph below the GTX680 is clearly at the bottom rung, trailing even the HD7870. Yes, it beats the GTX590, but that’s essentially composed of two underclocked GTX580 cores anyway.
The GTX680, then, is by and large a gamer’s card, built by gamers for gamers. The improvements to heat, Tessellation performance and DirectCompute ability inside an application are worthy of a recommendation on their own, thanks to the number of design wins Kepler has under its belt. I’d wholeheartedly tell any gamer worth his salt to have a long, hard look at it. Sure, the Radeon HD7970 may be the better performer in many scenarios, but once you enable 4x AA they’re both at the same level. 30″ screens notwithstanding, both cards are equally matched, with the GTX680 winning thanks to its lower price and extra feature set. It’s a good time for Nvidia, and a great time for the consumer.
In my final assessment of the GTX680, I’ll be looking at the other tidbits that I haven’t yet covered. There’s so much to discuss: a deeper dive into the performance hit to DirectCompute, Adaptive V-Sync, GPU Boost, FXAA and TXAA. There’s a lot more to digest in the last of this three-part analysis. Be sure to hit the links below if it’s already live, or trawl our Forums and follow my thread to get the latest updates.