NVIDIA Pascal P100

NVIDIA yesterday revealed their next-generation graphics architecture, codenamed Pascal, at GTC 2016. Not only did they show off the final silicon for the graphics cards, they also showed off some really high-performance products for the server markets. The “GPU” you see here is NVIDIA’s Tesla P100 Accelerator, an enterprise-bound compute card that’s made for crunching numbers all day long. It’s the fastest, most power-efficient compute card that NVIDIA has ever made, and the architecture it is based on will be coming to consumers this year in the form of the GeForce graphics card family.

The Tesla P100 Accelerator is quite the behemoth in terms of specs. Designed to fit into a 3U server chassis using NVIDIA’s new interconnect that replaces PCI-Express (which I’ll get to in a minute), it’s easily the smallest high-performance server compute card that NVIDIA has ever made. NVIDIA positions it as a hardware solution to drive today’s deep learning algorithms and advanced AI programs. The Tesla P100 packs in 16GB of HBM v2 memory, which gives it an eye-popping 720GB/s of memory bandwidth. By comparison, AMD’s Fiji-based graphics cards, which feature HBM v1 memory, top out at 512GB/s.

In terms of hardware, the P100 is a massive leap over the previous generation of NVIDIA hardware. With 3584 CUDA cores running at a base clock speed of 1326MHz, it’s the fastest GPU NVIDIA has ever made, promising performance leaps over the previous generation of at least 70%. It’s difficult to quantify this in terms of video game performance because the Tesla P100 is not geared for gaming at all, but it chews through workloads that taxed previous Tesla offerings with ease. Quoting test results from a learning algorithm that NVIDIA was training, it took a Tesla M40 system with four GPUs over 25 hours to train an algorithm to identify objects in images, while on an eight-way Tesla P100 system that same task took less than two hours.
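As a rough sanity check on those training figures, the arithmetic can be sketched in a few lines of Python. The function names are my own, the inputs are the rounded numbers quoted above (so the results are lower bounds), and the per-GPU figure assumes training time scales linearly with GPU count, which real workloads rarely manage.

```python
def system_speedup(old_hours, new_hours):
    """How many times faster the new system finishes the same job."""
    return old_hours / new_hours


def per_gpu_speedup(old_hours, old_gpus, new_hours, new_gpus):
    """Speedup normalised by GPU count (assumes linear scaling, a simplification)."""
    old_gpu_hours = old_hours * old_gpus
    new_gpu_hours = new_hours * new_gpus
    return old_gpu_hours / new_gpu_hours


# Quoted figures: "over 25 hours" on 4x Tesla M40, "less than two hours" on 8x Tesla P100.
print(system_speedup(25, 2))         # 12.5
print(per_gpu_speedup(25, 4, 2, 8))  # 6.25
```

So the eight-way P100 box is at least 12.5 times faster as a system, and even after accounting for having twice as many GPUs, each P100 is doing very roughly six times the work of an M40 in this particular test.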

nvidia pascal architecture

I’m not going to get too much into the Pascal architecture right now, but it’s more of an extension of Maxwell than a complete redesign, although some things are drastically different. Where Maxwell divided itself up into more manageable chunks to optimise power efficiency, Pascal divides itself further to distribute workloads more evenly across the chip, allocating 64 CUDA cores to each SM (streaming multiprocessor) and keeping space open for 32 double-precision units that are capable of 64-bit floating point math. This means that NVIDIA can gate power consumption and heat output at a more granular level: when a workload doesn’t require the entire GPU, the unused SMs are put into an idle state. With Pascal, NVIDIA and AMD are more comparable, as both now divide their processors into 64-unit groups.

Pascal can scale from two SMs, which will likely be the Tegra variant that we’ll see in 2017, all the way to 60 SMs, which is the full-sized chip that will be used in a future Pascal-based product. The P100 Accelerator has four of these SMs disabled to increase the number of usable chips NVIDIA gets from the factory, although I doubt this will have a massive bearing on performance. In terms of die size, at 600mm² the Pascal-based P100 GPU is the largest GPU that NVIDIA has ever made on TSMC’s new 16-nanometer production process. It’s bigger than AMD’s Fiji GPU die found in the Radeon R9 Fury X (596mm²), and it’s only 1mm² smaller than the outgoing GM200 die found in the Tesla M40 and GeForce GTX 980 Ti.
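The SM counts line up neatly with the core count quoted earlier. Here is a minimal Python sketch of that arithmetic, using only the figures from this article (64 CUDA cores and 32 double-precision units per SM, 60 SMs on the full chip, four disabled on the P100). The constant and function names are purely illustrative.

```python
CORES_PER_SM = 64      # single-precision CUDA cores per SM
DP_UNITS_PER_SM = 32   # double-precision units per SM
FULL_CHIP_SMS = 60     # SMs on the full-sized Pascal chip
DISABLED_SMS = 4       # SMs switched off on the P100 Accelerator


def cuda_cores(sm_count, cores_per_sm=CORES_PER_SM):
    """Total CUDA cores for a chip with the given number of active SMs."""
    return sm_count * cores_per_sm


p100_sms = FULL_CHIP_SMS - DISABLED_SMS

print(p100_sms)                               # 56
print(cuda_cores(p100_sms))                   # 3584, matching the P100 spec
print(cuda_cores(p100_sms, DP_UNITS_PER_SM))  # 1792 double-precision units
print(cuda_cores(FULL_CHIP_SMS))              # 3840 on the full chip
print(cuda_cores(2))                          # 128 on a two-SM part
```

Disabling four SMs costs the P100 256 cores, or just under 7% of the full chip, which is why it shouldn’t hurt performance much.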

However, the maximum single-precision throughput of NVIDIA’s P100 chip found in the P100 Accelerator tops out at 10.6 TFLOPS, whereas AMD’s Radeon R9 Fury X currently sits at 8.6 TFLOPS. It will be interesting to see how close the two companies are in throughput once AMD’s Polaris architecture is launched on a 14-nanometer process.
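For anyone wondering where these TFLOPS numbers come from, peak single-precision throughput is usually estimated as cores × 2 × clock speed, since each core can retire one fused multiply-add (two floating-point operations) per clock. The Python sketch below applies that rule of thumb. The function names are mine, and the Fury X figures (4096 stream processors at 1050MHz) are that card’s published specs rather than anything quoted in this article.

```python
def peak_tflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TFLOPS (2 FLOPs per clock = one FMA)."""
    return cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12


def implied_clock_mhz(tflops, cores, flops_per_core_per_clock=2):
    """Clock speed in MHz needed to reach a given peak TFLOPS figure."""
    return tflops * 1e12 / (cores * flops_per_core_per_clock) / 1e6


# P100 at the 1326MHz base clock quoted above.
print(round(peak_tflops(3584, 1326), 2))     # 9.5

# Clock needed to hit the quoted 10.6 TFLOPS.
print(round(implied_clock_mhz(10.6, 3584)))  # 1479

# Radeon R9 Fury X (assumed specs: 4096 cores at 1050MHz).
print(round(peak_tflops(4096, 1050), 2))     # 8.6
```

At the base clock the P100 works out to roughly 9.5 TFLOPS, so the 10.6 TFLOPS headline figure implies a clock of around 1479MHz, which suggests it is quoted at a boost clock rather than base.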

NVIDIA DGX-1 Pascal P100 server

As for how the P100 Accelerator will be used, NVIDIA revealed that they would be entering the server market with another custom-made solution, this time one that only works with P100 units. It’s called the DGX-1, and NVIDIA dubs it a “datacenter in a box” because of how much this one server could be made to do. It boasts two passively cooled eight-core Intel Xeon processors based on the Haswell-E architecture (soon to be updated to Broadwell-E), up to eight P100 Accelerators pre-installed, 512GB of ECC registered DDR4-2133 memory, up to 8TB in SSD storage, dual 10 gigabit Ethernet ports, four InfiniBand optical ports, and a power supply capable of a sustained load of 3200 watts from the wall. And it’s entirely custom-made.

nvidia NVLink architecture

In fact, it’s so custom that NVIDIA had to create their own replacement for PCI-Express that wouldn’t bottleneck the GPUs in any way. It’s called NVLink, and it’s a bit like a ring network for GPUs, except that it allows bidirectional communication between any two GPUs in any direction, as well as from all eight GPUs to every other GPU at the same time. NVIDIA claims that the peak throughput of NVLink is 20GB/s in any single lane, and a maximum of four lanes can be open at any given time between the GPUs. That’s a theoretical peak of 160GB/s in bidirectional bandwidth between two P100 GPUs, a huge boost over a standard 16-lane PCI-Express 3.0 connection, which tops out at a meagre 15.75GB/s in each direction.
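The bandwidth figures are easy to reproduce. The Python sketch below assumes the 20GB/s NVLink figure is per lane, per direction (which is the reading that gives 160GB/s across four lanes), and uses the PCI-Express 3.0 signalling rate of 8GT/s per lane with 128b/130b encoding. The function names are illustrative only.

```python
def nvlink_gbs(lanes, per_lane_gbs=20.0, bidirectional=True):
    """Peak NVLink bandwidth in GB/s, assuming 20GB/s per lane per direction."""
    one_way = lanes * per_lane_gbs
    return one_way * 2 if bidirectional else one_way


def pcie3_gbs(lanes, bidirectional=False):
    """Peak PCI-Express 3.0 bandwidth in GB/s.

    Each lane signals at 8GT/s with 128b/130b encoding; divide by 8 for bytes.
    """
    per_lane = 8.0 * (128 / 130) / 8
    one_way = lanes * per_lane
    return one_way * 2 if bidirectional else one_way


print(nvlink_gbs(4))                                 # 160.0
print(round(pcie3_gbs(16), 2))                       # 15.75
print(round(pcie3_gbs(16, bidirectional=True), 2))   # 31.51

# Like-for-like comparison, one direction only.
print(round(nvlink_gbs(4, bidirectional=False) / pcie3_gbs(16), 1))  # 5.1
```

Note that the 160GB/s figure counts both directions while 15.75GB/s counts only one, so on a like-for-like basis four NVLink lanes are roughly five times faster than a x16 PCI-Express 3.0 slot, not ten.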

What’s interesting about this setup is that NVIDIA chose not to make Pascal an HSA-compliant design, which would have allowed them to build in uniform memory access and cache coherency so that all the VRAM held by the GPUs could exist in the same pool. The contents of the VRAM of one P100 are open to another to use and work off, but only four GPUs can ever be connected with each other at any given time, each using a single NVLink lane. It’s almost like uniform memory access thanks to the speed of the link, and it’s a simpler workaround than AMD’s plan to build a fully uniform memory access technology into their GPUs and CPUs. NVIDIA also has much more control over the future of NVLink than they do over the future of the PCI-Express standard, and thus they can improve it much more quickly to match their graphics hardware.

nvidia pascal DGX-Open

If you’re ever keen on buying one, NVIDIA’s price for the fully loaded DGX-1 is just as high as its computing capability, at $129,000 per 3U chassis. Compared to the two to four 3U-sized chassis that customers previously had to buy to get only half the performance of one DGX-1 unit, NVIDIA thinks that this is a fair price for its customers to pay, and it’ll probably sell many dozens to companies like Google, Facebook, and Microsoft that currently use deep learning networks. Eventually, NVIDIA wants the DGX-1 to inspire a whole range of similar servers from other market leaders like Dell and HP, and it intends to step out of the limelight once the market starts producing competition from its partners.

In the consumer markets, Pascal will probably not be as crazy as this implementation in the P100 Accelerator, and we’re likely to see a consumer rollout starting in the June/July timeframe, using GDDR5X memory instead of HBM v2 to simplify production, and a smaller die than the P100 flagship to bring down cost.