Over the past year I’ve been very vocal about AMD’s seeming inability to catch up with Intel’s latest and greatest processors and their in-app performance. When Bulldozer was released for socket AM3+, it was generally on par with a CPU that was a generation old – Intel’s Core i7-920. The i7-920 today is still one of the powerhouses that throws around its weight with gusto, putting in performances in modern games and apps that make one question if the price you paid for the upgrade to Ivy Bridge was really worth it, considering there wasn’t a lot wrong with Nehalem in the first place. And if you were one of the few fans that saw the good in the Bulldozer family, you could argue that there wasn’t a lot wrong with it either.
Piledriver aims to bring performance up a notch and when it was announced on AMD’s roadmap, the company promised improvements to power consumption and up to 15% more performance than the outgoing FX range. Have they delivered on their promise?
First, this is an improvement on the Bulldozer architecture, which could be taken as a “Tick” product in the same “Tick-Tock” strategy that Intel employs for developing their product lines. “Ticks” are usually refinements of a new product or a new process node being used to improve an existing design. While Piledriver’s cores are still made using 32nm technology, there are architectural tweaks for efficiency and performance. Most of what you’re about to see was actually present in AMD’s Trinity family as well and the A10-5800K was recently given a thumbs-up by NAG Magazine’s Neo Sibeko in their latest November issue.
SO WHAT’S CHANGED SINCE WE LAST MET, BULLDOZER?
In Piledriver, the first major change is in the way the processor logic handles branch prediction. A large part of making a particular workload parallelised is to have multiple threads that rely on each other’s outcomes as little as possible in order to complete a given amount of work in tandem. It’s a major hurdle to having applications run on more than three or four cores and all of the CPU and GPU manufacturers have done their bit to improve branch prediction engines inside their processors.
Intel’s Ivy Bridge improved its branch prediction logic over Sandy Bridge, as did Nvidia’s Kepler and AMD’s GCN-based GPUs, which are designed for highly parallelised workloads in the GPGPU war with Nvidia and Intel’s Quick Sync. With OpenCL becoming a slightly more widely adopted standard, it’s important to address multi-threaded performance before it becomes a problem.
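For readers who haven’t met the term before: a branch predictor guesses which way a conditional jump will go before the result is actually known, so the core can keep working instead of waiting. The sketch below is the textbook two-bit saturating counter, a generic illustration only and not a model of the predictor AMD or Intel actually ship. The class and function names are my own.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter for a single branch (illustrative)."""

    def __init__(self):
        # States 0-1 predict "not taken", states 2-3 predict "taken".
        self.state = 2

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the real outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes):
    """Fraction of branch outcomes the predictor guessed correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch that is taken nine times and then falls through, repeated.
loop_pattern = ([True] * 9 + [False]) * 100
print(accuracy(loop_pattern))
```

On a regular loop like this one the predictor only misses the exit of each loop, which is why predictable code runs so much better than code full of coin-flip branches.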
Firstly, the Prediction Queue. In Bulldozer, its prediction engine was optimised for parallel workloads, but several things weren’t up to the same task – namely the software available and Windows’s hardware scheduler, which juggles tasks according to which processor cores are available or in a particular C-state. One core in the module works on the first thread in the L1 BTB cache, while the second will work through the dependent thread fed through the L2 BTB cache. This, tied together with the new hardware scheduler in Windows 8 (by default) and in Windows 7 (through a hotfix), aims to keep a particular workload attached to a single module during its execution.
There are also some new instruction set extensions that make an appearance here. Fused multiply-add (FMA) is a modern extension that’s required for more modern encryption schemes, with some uses in other programs and applications. It performs a multiplication and an addition (a × b + c) in one instruction and rounds the result only once. The variants differ in how many operands, the values an instruction works on, they take: FMA3 takes three, so the result overwrites one of the inputs, while FMA4 takes four, with a separate destination for the result. Both are supported by Piledriver, allowing more work to be packed into a single instruction. FMA4 was previously present in Bulldozer and FMA3 is planned to debut in Intel’s processor family with Haswell. In this regard, AMD has had a lead for almost a year, even if many apps aren’t taking advantage of this just yet.
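To show why fusing the two operations matters beyond saving an instruction, here is a small sketch that emulates a fused multiply-add in Python using exact fractions. It isn’t how the hardware does it; the function names and the input values are picked purely for illustration.

```python
from fractions import Fraction


def unfused(a, b, c):
    # Two separate operations: the product is rounded to a double,
    # then the sum is rounded again.
    return a * b + c


def fused(a, b, c):
    # Emulated fused multiply-add: compute a*b + c exactly,
    # then round to a double only once at the end.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Illustrative inputs whose exact product sits just below 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(unfused(a, b, c))  # the tiny part of the product is rounded away
print(fused(a, b, c))    # the single rounding step preserves it
```

In FMA4 terms this is d = a × b + c with four named operands; FMA3 does the same sum but writes the result over one of the three inputs.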
There’s also F16C, an instruction set extension that is currently supported by Ivy Bridge. It converts between half-precision and single-precision floating point values, which is a crucial building block required in programs compiled in Microsoft’s Visual Studio 2012. Bulldozer lacks support for F16C and thus wasn’t suited to compiling programs in Visual Studio. Piledriver changes that, bringing more options into the workstation arena for value-orientated rigs.
A stumbling block in Bulldozer was how quickly information would flow from the caches to the separate integer cores. AMD’s engineers discovered that information stored in the register file wouldn’t reach the scheduler on time, wasting valuable clock cycles waiting for the information to reach the cores. This was fixed by improving the time taken for information in the registers to reach the schedulers. The cores themselves still have two execution units and two address generators capable of MOV instructions – MOV instructions move information from the RAM and registers to the CPU for the execution units to utilise.
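For a feel of what a half-precision conversion involves, the sketch below does the same job in software using Python’s standard `struct` module, which understands the 16-bit IEEE 754 format through its `e` format code. The helper names are mine, and this says nothing about how the hardware instructions are implemented.

```python
import struct


def half_bits_to_float(bits):
    """Decode a 16-bit IEEE 754 half-precision pattern into a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_half_bits(value):
    """Round a float to half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


print(hex(float_to_half_bits(1.5)))
print(half_bits_to_float(0x3C00))

# Half precision keeps only 11 significant bits, so most values
# come back slightly different after a round trip.
print(half_bits_to_float(float_to_half_bits(0.1)))
```

The appeal of the 16-bit format is storage and bandwidth: values take half the space of single precision, at the cost of the precision loss shown in the last line.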
A large part of why Bulldozer’s performance was lackluster was the cache latency. L1 cache is usually the fastest integrated memory in the chip, with L2 and L3 falling behind in terms of latency but increasing in size. A lot of data gets stored in the L2 cache but its location needs a pointer to find that information quicker. The data Translation Lookaside Buffer (DTLB) doubles from 32 to 64 entries to improve the hit rate when looking for data in the L2 cache that is required by the CPU still working with info in the L1 cache. A few modern games hit the L2 cache really hard and that brought down performance with Bulldozer.
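Why would doubling a lookup buffer from 32 to 64 entries matter so much? A toy simulation makes the point. It assumes a fully associative buffer with least-recently-used replacement and a made-up workload that loops over 48 pages; the real hardware’s organisation and replacement policy may well differ, so treat the numbers as an illustration of the principle only.

```python
from collections import OrderedDict


def tlb_hit_rate(page_accesses, entries):
    """Hit rate of a fully associative, LRU-replaced TLB (a toy model)."""
    tlb = OrderedDict()
    hits = 0
    for page in page_accesses:
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)        # mark as most recently used
        else:
            if len(tlb) >= entries:
                tlb.popitem(last=False)  # evict the least recently used page
            tlb[page] = True
    return hits / len(page_accesses)


# Hypothetical workload: loop over 48 distinct pages, 100 times.
accesses = list(range(48)) * 100

print(tlb_hit_rate(accesses, 32))
print(tlb_hit_rate(accesses, 64))
```

When the working set is slightly too big for the buffer, every lookup misses; once it fits, nearly every lookup hits. Workloads that sat in that gap are the ones that stand to gain most from the larger buffer.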
Lastly, L2 cache is still shared by the two cores but is now a little faster on average thanks to an improved data prefetcher. L2 cache latency hasn’t changed from Bulldozer, however, meaning that in some workloads the lack of super-fast L2 cache will impact performance noticeably. But hopefully the other improvements will keep minimum performance gains above 5%. L3 cache size remains unchanged, although I’m not convinced that it plays as large a role as it used to in the past. Intel relies on speedy L2 cache but keeps their L3 cache under 10MB in size because it doesn’t play as big a role in desktop performance as it does in a server.
Rather curiously, Piledriver is still only able to execute eight threads per clock cycle when fully loaded. Bulldozer was the same, but Llano can chew through twelve threads at a time and Intel’s Sandy Bridge mows through sixteen threads per clock cycle. Higher cache latencies, a poor branch predictor, a limited address buffer for L1 and a shared floating point unit between cores in the same module were probably the reasons why Bulldozer suffered in single-threaded workloads. The latter part doesn’t change in Vishera, but there are other improvements to look forward to…like the price.
GETTING MORE FOR LESS? THAT SOUNDS TOO GOOD TO BE TRUE…
As previously mentioned, AMD’s positioning of Bulldozer caused a stir around the tech world because the price the various options commanded just wasn’t attractive. Piledriver launches with a bit more common sense attached, with the FX-8350 aiming for the Core i5-3570K which costs $36 more when you look only at the RRP. The FX-8320 takes the fight to the Core i5-3450, the middle-of-the-range FX-6300 attempts to usurp the i5-2300 (now replaced by the i5-3330) and the FX-4300 will try to hold its own against the Core i3-2120 (now replaced by the i3-3220). In a recent Tom’s Hardware matchup of the FX-4170 with the i3-3220, it was found that the chips were more or less on par, with both deserving a spot in a gaming or productivity-orientated rig.
What’s more exciting is the improved power efficiency that the Piledriver-based FX family (also codenamed Vishera) offers to consumers. I’ll show you those results later, but max TDP for the FX-6300 drops from 125W to 95W, with the FX-4300 staying at 95W and the FX-8320 and FX-8350 both staying at a maximum TDP of 125W. You’ll recall that the FX-8120 and FX-8150 hit 200W for power consumption when stressed out to the max.
The pin-outs on the Vishera FX family are made to match with socket AM3+. Very few AM3 boards will work with the new family and for those that do, a BIOS update is all that’s needed to provide support. For AM3+ owners a BIOS update is also applicable but for Gigabyte owners this might be a bit of a problem. The company recently updated their lineup with the UEFI BIOS upgrades and all supporting motherboards have the option to update from the ageing, but largely functional AMI BIOS to UEFI, a crucial component that plays a role in future OEM versions of Windows 8 and protected platforms that are designed for use in enterprises and tablets. For most users, the UEFI upgrades aren’t necessary but if you’re moving to Vishera it is required.
THAT’S ALL GOOD, BUT HOW DOES IT PERFORM?
Well, there are going to be a few problems finding comprehensive test scores online. Most sites received only the FX-8350 for review and little else. So for the FX-4300, FX-6300 and FX-8320, readers have to draw approximations for their preferred chips and I feel that’s really unfair. AMD needs to make an impression for all buyers in every price point that they target – for online reviews, this means that the company should give the reviewer chips from the entire lineup to show where the improvements are made for the targeted price points. Anandtech did, however, land all three chips for testing and I’ll be using some of their results for my analysis today.
The first stop is 7-Zip and immediately the performance enhancements are noticeable. 7-Zip with the Bulldozer family was a remarkably favourable bench for AMD and brought the FX-8150 just under the Core i7-2600K’s performance last year. This year the FX-8150 trails the i7-3770K by a slight margin, but the FX-8350 easily stretches its legs to beat Intel’s best. Take note that the FX-6300 almost matches the performance of Intel’s i7-920 and puts in the same showing as the i5-3570K. The FX-4300 likewise easily outpaces the Core i3-3220 based on Ivy Bridge. Wherever a benchmark favours integer performance and has threads with few dependencies, Piledriver puts in a great performance.
Recoding video is another thing that Vishera does well. Since it doesn’t feature an integrated GPU, OpenCL benchmarks won’t run as nicely because those are mostly single-threaded on the CPU and multi-threaded on the GPU’s workload. This is one area where AMD’s Trinity processors work very well (especially with the ability to chew through as many threads as they do). But the pattern that forms from these benchmarks shows things quite clearly: when the workloads are multi-threaded, the Vishera family generally performs above the level of equivalent Ivy Bridge processors with the exception of the i7-3770K.
With most multi-threaded benchmarks running really well, the only exception is the Visual Studio benchmark, compiling the installer for Mozilla’s Firefox browser. Program compilations aren’t heavily threaded, however, as several sections of any compilation are single-threaded. In this regard, the FX-8350 performs on the same level as the Core i5-2500K and i5-3470 – both of which are, appropriately, the similarly-priced options available from the Intel camp. Photoshop CS4 shows this clearly, pitting the Core i5-2500K just ahead of AMD’s best. Less than a second’s difference isn’t that big, but a lead of three seconds by the i7-3770K is a problem. With Photoshop CS6, which a few other review sites use, this lead is at most halved with the use of semi-threaded plugins and completely erased with fully threaded plugins and editing tools.
THAT’S AWESOME, BUT WHAT ABOUT MY GAMES?
Here’s a place where I feel that tests and benchmarks could do with some new methodology. For the most part, using lower resolutions to isolate CPU performance is great but hardly anyone plays with those kinds of settings. At 1080p with high settings, I’d expect most games to be GPU-limited, with little to no consequence for which processor you use at those settings (with the exception of games that take advantage of more than two physical cores). Anandtech’s gaming results, however, left a sour taste in my mouth. From Anand’s review:
“Our latest discrete GPU gaming tests use a GeForce GTX 680, while the older tests use the Radeon HD 5870. We’re focused on looking at differences between CPUs here so most of the numbers you see are CPU bound rather than pushing the GPU to the limits.”
That’s a little disingenuous, isn’t it? Not making a note of which tests are performed using the HD5870 or the GTX680 does make a difference and sways the test scores, as seen below:
The Skyrim and Diablo III benchmarks are clearly GPU-limited even at 1680 x 1050, where there’s little emphasis on the processor used. Those tests most probably use the GTX680, as there’s very little difference between the competing processors as you move up the ladder. At over 200 frames per second, benchmarking high-end processors at lower resolutions in gaming is becoming less of a real-world test and more of an academic exercise. It’s good to see wholesale improvements over the FX-8150, but turn your attention to the World of Warcraft and Starcraft II results. There’s very little reason why we’d see such a large gulf in scores unless they were performed using a weaker GPU – the Radeon HD5870 is a likely possibility.
Since WOW and Starcraft II are CPU-taxing titles (Starcraft more so) it feels a bit like cheating to me. It does highlight that the game prefers strong single-threaded performance but it would be a much fairer test if everything was on maximum at 1080p with the GTX680 – because that’s what gamers will be playing with. This is why I feel that isolating CPU performance in games at low resolutions and settings is a test we should be abandoning – what really matters these days is the ability to drive and properly feed graphics cards at the game’s highest and most demanding settings and this applies especially to those of you running Crossfire or SLI setups. What’s even more important is how timeously the information is fed to and used by the GPU, which is why Techreport’s frame latency testing methodology is drawing a lot of attention from the internet.
Looking at the Skyrim results from Techreport, we can see that there are parts of the benchmark where even Intel’s Core i5-3570K struggles for a few milliseconds here and there to keep the game running at 60 fps (a frame time of about 16.7ms; the 20ms mark on the graphs works out to 50 fps, for your reference). Often there are spikes over 30ms which suggest that the game even taxes the HD7950 a fair amount. If you had to stretch the two graphs to meet up at the end, you’d even see similarities in the spikes towards the end of the test. You would also see that there’s a particular part of the benchmark where the Core i5 struggles while the FX-8350 sails through. Techreport’s review also shows how the Core i7-3770K performed, for those of you interested.
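Frame rate and frame time are two views of the same number: divide 1000 by the frames per second and you get milliseconds per frame. A quick sketch of that arithmetic, plus the kind of "how many frames were slower than X" count that frame latency testing is built on, follows. The sample frame times are made up for illustration and are not Techreport’s data.

```python
def fps_to_frame_time_ms(fps):
    """Milliseconds available per frame at a given frame rate."""
    return 1000.0 / fps


def frame_time_ms_to_fps(frame_time_ms):
    """Frame rate that a given per-frame time corresponds to."""
    return 1000.0 / frame_time_ms


def share_of_frames_over(frame_times_ms, threshold_ms):
    """Fraction of frames that took longer than the threshold."""
    slow = [t for t in frame_times_ms if t > threshold_ms]
    return len(slow) / len(frame_times_ms)


print(round(fps_to_frame_time_ms(60), 1))  # budget per frame at 60 fps
print(frame_time_ms_to_fps(20))            # what a 20ms frame works out to

# Hypothetical run: mostly smooth, with two noticeable spikes.
sample = [14.0, 15.5, 16.0, 31.0, 15.0, 16.5, 34.5, 15.0]
print(share_of_frames_over(sample, 20.0))
```

An average fps figure would hide those two spikes completely, which is exactly why the per-frame view is worth the extra effort.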
What can we take away from here? The FX-8350 may be the poorer performer overall in terms of average fps, but its behaviour is predictable and there were very few instances where it dropped below 60 fps. The i5-3570K can’t say the same for itself. In fact, to satisfy your curiosity, the Core i7-3770K turns in almost the same performance as its cheaper, HT-less little brother, with spikes in the same regions on the graph.
Techreport’s Batman: Arkham City results are even more taxing on the entire system. Their benchmark involves zip-lining through the city at the height of the Gotham City skyline, with new areas of the level requiring loading while Batman shows Spiderman how it’s done. Where AMD’s chips are compared, the FX-8350 performs better than its older siblings throughout the entire test, posting frame latencies that even dip below 10ms. Overall, where Bulldozer had a tough time making its mark against the Phenom X6 1100T, Piledriver pulls ahead with ease. That’s not just a small bump in performance, that’s an entire generation gap.
Compared to the Core i5-3570K, the results are a little more than pleasing. It does show fewer spikes over 20ms during the benchmark, but overall their performances are remarkably similar. The Core i7-3770K isn’t that far off either, putting in only a slightly better performance than its sibling. However, the Piledriver chip spends most of its time at just under 20ms latency, suggesting that there are parts of the benchmark where better single-threaded performance is needed.
This is one of those results that are hard to quantify to Battlefield 3 players. Because the multi-player portion of the game is heavily CPU-dependent due to the sheer amount of jaw-dropping stuff that happens in an online, 64-player match, single-player scores aren’t really that interesting. Here it’s clear that the game is GPU-limited but you’ll notice that the FX-8350 turns in a better score overall with far fewer spikes above 20ms. I’m happy to tell you it’s an even better performance than the Core i7-3770K, but there’s not a lot of difference overall.
What is good news is that this game runs pretty well on any quad-core if you’re aiming for a 60 fps minimum average score. Even the A8-3850 does well in this test and barely spends any time struggling to stay stable paired up with a behemoth like the Radeon HD7950. With that in mind, anyone planning on going online with Battlefield 3 better have a quad-core chip inside there. A Core i3 chip with Hyper-threading would suffice as well, but just barely.
I’m a huge Crysis fan and to see this kind of performance is awesome. Compared to older chips, the FX-8350 turns in a better performance overall and this is one of those games where Bulldozer pulled ahead of the Phenom X6 chip in its early days. Crysis 2 appears to be well threaded when you compare it to other chips in the market, with performance in Bulldozer chips getting progressively better as you pile on more modules. The FX-8350 spends most of its time hovering around 60 fps, showing that at this level the game is GPU-limited.
Crysis 2 seems to prefer stronger single-threaded performance but the larger emphasis on multi-threading keeps the Piledriver chip in very close contention with the competition.
SO, WHAT HAPPENS IF I OVERCLOCK IT?
That depends. Most sites were able to overclock their review units of the FX-8350 above 4.5GHz with a relatively minor voltage increase of 10%. Gains in benchmarks and games were minimal, but they were there.
A lot of review sites didn’t go into much detail about their overclocks. Most found their limits to be around 4.6GHz, with only TechpowerUp’s review being notable for hitting 5GHz on all cores right off the bat. When overclocked, the FX-8350 easily overtakes the Core i7-3770K in Anandtech’s x264 recode on the second pass, nearly matching it in the first run. Hexus noted that Cinebench scores with their sample at 4.6GHz just about matched the i7’s performance, with gaming scores improving just a little bit in Batman: Arkham City. In a game that is more GPU-limited, an extra ten frames is nothing to sneeze at. But it’s not the prowess of the chip when pushed to the limits that impresses me: it’s the power draw.
POWER CONSUMPTION IMPROVES DRASTICALLY WITH THIS GENERATION
Those tweaks for performance were also efficiency tweaks, allowing AMD to slightly push up the clock speeds and generally get their top chip to perform better than its predecessor. If you quickly go back and have a look at Anandtech’s overclocked results, you’ll see how the FX-4300 and FX-6300 performed. Note in particular how the dual- and triple-module performers match up to the full monty – on average, the FX-8350 chews through double the work of the lowly quad-core and goes through its workload with 50% more aplomb than the FX-6300. Those kinds of performance gains are, again, predictable and generally better than the Intel competition. But in terms of power consumption, who’s the winner here?
Techreport’s test with a two-stage x264 recode shows the kind of power consumption you can expect from the chip under load. While the FX-8350 does match the power draw and the thermal limits of the FX-8150, it does so at a higher frequency and with better performance to boot. Compared to the Phenom X6 chip it is at the extreme end of the table, but arguably it’s no longer an issue for those of you looking for the best threaded performance for a more agreeable price. Techreport’s benchmark has two runs, the second of which focuses on image quality. That starts just before the 60 second mark, after which power consumption drops to be more in line with the competition.
A 50W difference compared to the competing Core i5-3570K and less than 25W against the more expensive i7-3820 is definitely an attractive number by any standard. But you may ask, how efficient is it really when you compare it to the previous generation? To do that, you have to overclock both chips to their limits and eschew AMD’s TDP limits.
Techpowerup managed to get both their FX samples to 5GHz successfully and then ran them through a x264 recode to see how the chips run their business under extreme stress. The results, I think you’ll agree, are telling. With over 50 Watts separating the new generation from the old, AMD has clearly done their homework and fixed a great many things for better performance, thermals and power consumption. That’s quite a big improvement to power consumption, a bigger gap than Intel managed even with a die shrink to the 22nm process for Ivy Bridge. With another year of engineering, we might see another 50W drop with the Steamroller architecture and a 95W TDP for AMD’s FX family.
Just take a look at the full system load when overclocked and running a CPU-intensive test! It’s only using 50W more at full blast with all cores at 5GHz with a voltage increase. That’s impressive.
IN THE END, DOES IT MATTER?
I’ve always been asked by friends and family and even colleagues what value AMD holds to gamers and power users. And for a while before the FX chips, the answer was easy – better multi-threaded performance if you pushed the chips in the right manner and generally cheaper prices than the competition. I saved a friend over R6000 by showing him that the Phenom X6 was a far better choice than him digging out of his pocket to fork out for the Intel Core i7 Extreme Edition, which was the only other six-core available to consumers. When I was working in retail, I set up no less than five Digital Audio Workstations (DAWs) for customers using AMD’s X6 chip and they all were tremendously pleased at the money they were saving. Those customers later poured the savings into SSDs and high-end audio cards. I never saw a cent of those savings that I gave them, but knowing that they were happy with what was traditionally perceived to be a poorer brand (no thanks to Intel and their Netburst failure) was enough.
With the arrival of Bulldozer, that benefit simply fizzled into “not much, stick with Intel”. Intel’s Core i7-920 was still kicking every other chip around even while on a dead-end socket and Sandy Bridge just set everything on fire like a Maliwan handgun from Borderlands. Seeing those kinds of improvements was satisfying, but inside I longed for the kind of response the green team used to be famous for – I based my personal rig on the venerable Athlon X3 445, after all. To this day, the last great chip that was produced for gamers was the Phenom II X2 550. Being generally unlockable for buyers to a triple or quad-core with stability at 3.4GHz with a full complement of L3 cache for R1500, it was the better choice for a lot of gamers and buyers who had a limited budget. With Bulldozer, the only chip worth considering was the FX-4100 because gaming performance was more or less the same as the high-end FX-8120.
With the Vishera family and the revised Piledriver architecture, there are several products in the AMD lineup that are very attractive to gamers, system builders and enthusiasts. The FX-6300 gets my thumbs-up for providing the best power consumption, a lower 95W TDP and generally better and predictable performance than the FX-4300. Both chips should be good for 5GHz overclocks as well, cementing enthusiast value even further. But it’s the FX-8320 that probably deserves a good deal of attention.
With stock speeds of 3.5GHz and a Turbo boost limit of 4GHz on all cores, a 125W TDP and a $169 price point, it’s the one that gamers should be going for. The money saved should go into a better CPU cooler, DDR3-1866 RAM or even a better GPU. Like the A10-5700 APU, it’s more or less in the middle of the high-end lineup but is the more compelling offer.
IS THIS A RETURN TO FORM FOR AMD?
Not quite. I mean, it’s an improvement over Bulldozer and for that AMD’s engineers and design team must be commended. It’s no mean feat to do as well as they did with a relatively new architecture. But where I feel improvements could be made is in socket AM3+ and the 9-series chipset. AM3+ has been with us for quite some time but doesn’t look and feel as modern as FM2. So far, FM2 offers native Thunderbolt connectivity, native SATA 6Gb/s ports, native USB 3.0 and occasionally even great Hybrid Crossfire support for those that go in that direction. PCI-Express 3.0 support is lacking, but given its budget aspirations this can be overlooked.
AM3+ is the premium market offering and lacks a lot of things Intel has had for a while now. There’s no PCI-Express 3.0 support, no Thunderbolt, and with Haswell’s new power switching profiles coming to us in future (and a subject of a future column of mine) there’s going to be a bigger gap in power consumption because of lag in desktop chips switching between profiles set by AMD’s Cool ‘n Quiet technology. UEFI support is there but Intel’s offering a Trusted Platform to manufacturers to prevent users booting a different OS on a Windows 8 rig – to my knowledge, AMD doesn’t support that.
In addition, prices are way too high. The 990FX chipset features 32 PCI-Express 2.0 lanes natively linked to the CPU, double the lane count that Intel’s Z77 has to work with from its 16 built-in PCI-Express 3.0 lanes. Most board partners have to use the PEX 8747 switching chip to get four-way Crossfire and SLI working, something that AMD, had they not quarreled with Nvidia years ago, would have been able to offer enthusiasts at a lower price point. When Intel’s Nehalem powerhouse landed and smashed records and piggy banks, AMD’s platform was the better choice because overall, to get the same value and features, it commanded an average of $50 to $100 less than competing Intel builds. These days, it’s hard enough to find a 970-series board in stock, much less a 990FX board at a decent price.
It’s here that I feel AMD can change things the most and it’s here that I feel the company needs to turn its attention before Steamroller hits. Too many motherboard companies are seeing decreased sales from the AMD camp because you can simply update your BIOS and drop in a new chip without changing much else. While that benefits consumers, I think it needs to change. With Steamroller only due for an early 2014 release, AMD needs to kick that out as early as possible and drop prices on all their AM3+ boards in the interim.
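Lane counts on their own don’t tell the whole story, because a PCI-Express 3.0 lane carries nearly twice what a 2.0 lane does. The back-of-the-envelope sketch below uses the published per-lane signalling rates and line encodings for the two generations and the lane counts quoted above; it ignores packet and protocol overhead, so real-world throughput will be lower on both sides.

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per PCIe generation.
PCIE = {
    "2.0": (5.0, 8 / 10),      # 8b/10b encoding
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
}


def bandwidth_gb_per_s(generation, lanes):
    """One-direction bandwidth in GB/s, ignoring protocol overhead."""
    rate_gt, efficiency = PCIE[generation]
    # GT/s times encoding efficiency gives usable Gbit/s; divide by 8 for GB/s.
    return rate_gt * efficiency / 8 * lanes


print(round(bandwidth_gb_per_s("2.0", 32), 2))  # 32 lanes of PCIe 2.0
print(round(bandwidth_gb_per_s("3.0", 16), 2))  # 16 lanes of PCIe 3.0
```

On raw bandwidth the two come out close to even, so the real advantage of the extra lanes is flexibility: more slots can run at full width at the same time without a switching chip.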