NVIDIA’s latest generation of graphics cards, the RTX 2000 series, has been crushing it in benchmarks and reviews, and it’s easily the company’s finest work to date, allowing them to hold on to the performance crown. As with any new hardware release, however, there are some teething issues. On Reddit, the NVIDIA forums, and as covered by several YouTube channels including Gamers Nexus, there is a rather large number of users experiencing early bugs in hardware and software. If you’re a new or prospective RTX 2000-series owner, you’re definitely going to want to read this.
NVIDIA’s subreddit shows the problem quite visibly. Searching for any RTX 2000-series card alongside “dead”, “issue”, “BSOD”, or “artifact” yields quite a number of results. You’ll run across threads like this survey trying to find a common thread in the reported issues, another reporting that underclocking RAM fixes some of the BSODs and app crashes, or this one suggesting that faulty power circuitry is the problem. Replacement cards aren’t immune either: some threads report that people have been through up to three RTX GPUs in the past month via the RMA process, trying to find a working unit. Searching for threads with “RMA” in the title shows people frustrated with the process, and it seems likely that NVIDIA and its partners are having a tough time getting stock in for RMA purposes because so many people are experiencing issues.
In a bid to pinpoint the issues themselves, Gamers Nexus reached out to RTX 2000-series owners just over a week ago to have dead or dying cards shipped to them, so that they could isolate the issues and find the cause. The first issue discovered was a BSOD that would trigger when the GPU was plugged into an older G-Sync display and specific games were run. A driver update is supposed to fix this. More serious is a reproducible bug where games would crash when the card has multiple monitors plugged into it. Unplugging the additional monitors got rid of the issue, and a driver update should fix this as well. Both issues point to problems with initialising the displays and completing the handshake that takes place when a display is plugged in over HDMI or DisplayPort.
Another weird one is a clock bug that is either firmware or hardware related. Steve Burke from Gamers Nexus went through the steps on video to reproduce the bug, where an RTX 2080 Ti was stuck at a 1350MHz base clock and would crash either the application or the operating system when even slightly overclocked to try to improve the card’s performance. Burke mentioned that the bug had been encountered mostly on reference PCB designs, and that re-flashing the BIOS did not fix the problem. Gamers Nexus will be doing a teardown shortly to try to find the source of the problem if it is hardware-related.
Why is this happening?
You’ve likely heard of the term “early adopter tax” being thrown around in conversations when it comes to issues with new hardware launches. This references one of two things: higher prices on launch as a result of pent-up demand, or major issues and failures in the first batches of shipping units. The latter is expressed in terms of something called the bathtub curve, and in the IT industry it’s what allows manufacturers to plan out their warranty coverage periods and see if the failure rate is higher than expected.
The bathtub curve is closely related to the Gompertz–Makeham law of mortality, which describes the human mortality rate as the sum of an age-independent component and a component that increases exponentially with age. In practice, mortality is also elevated in early childhood, drops to a low level, and then climbs steadily with age. The bathtub curve follows the same shape. At the beginning of a product’s lifecycle on the market, infant mortality is expected to be high as a result of manufacturing faults. As the infant mortality rate drops, the failure rate settles at a low and roughly constant level for most of the product’s useful life, and then rises again as units wear out from use.
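To make the shape concrete, here is a minimal sketch that models a bathtub-style failure rate as the sum of three terms: a falling early-failure term, a constant random-failure term, and a rising wear-out term. Using Weibull hazard functions for the first and last terms is a common textbook choice, not something taken from NVIDIA or Gamers Nexus, and every parameter below (the shapes, the scales, the constant rate, and time measured in months) is an illustrative assumption rather than real GPU failure data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1).

    shape < 1 gives a falling rate, shape == 1 a constant rate,
    and shape > 1 a rising rate. t must be greater than zero.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative failure rate per month at age t (in months).

    All numbers here are made-up assumptions chosen only to
    produce the bathtub shape.
    """
    early_failures = weibull_hazard(t, shape=0.5, scale=200.0)  # falls over time
    random_failures = 0.0005                                    # constant background rate
    wear_out = weibull_hazard(t, shape=5.0, scale=60.0)         # rises over time
    return early_failures + random_failures + wear_out


if __name__ == "__main__":
    for month in (0.1, 1, 6, 24, 48, 72):
        print(f"month {month:>5}: failure rate {bathtub_hazard(month):.4f}")
```

With these made-up parameters the rate starts high, bottoms out somewhere in the middle of the product’s life, and climbs again towards the end, which is the curve a manufacturer would compare its real return data against.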
In the hardware industry, manufacturers set aside time to detect early failures in their new products. Products are either tested individually or sampled at random from each batch, to make sure that there isn’t a high infant mortality rate when the product launches. Companies like Gigabyte pull about one GPU per 100 to assess the product’s quality and try to catch bad batches early on. Doing this encourages consumers to purchase the product at launch, because low failure rates create a perception of higher quality.
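As a rough illustration of why batch sampling can still let a bad batch through, the sketch below estimates the chance that a random sample contains at least one faulty card. It assumes each sampled card is independently faulty with the same probability, which is a simplification, and the batch size of 1,000 and the 5% defect rate are hypothetical numbers picked for the example; only the one-in-100 sampling ratio comes from the text above.

```python
def detection_probability(defect_rate, samples):
    """Chance that at least one of `samples` randomly pulled cards is faulty.

    Assumes each pulled card is faulty independently with probability
    `defect_rate` (a simplifying assumption for large batches).
    """
    return 1.0 - (1.0 - defect_rate) ** samples


if __name__ == "__main__":
    batch_size = 1000             # hypothetical batch size
    samples = batch_size // 100   # one card pulled per 100, as described above
    defect_rate = 0.05            # hypothetical share of faulty cards in the batch
    chance = detection_probability(defect_rate, samples)
    print(f"{samples} samples, {defect_rate:.0%} faulty: "
          f"{chance:.1%} chance of seeing at least one faulty card")
```

Under these assumptions, a sample that small misses the problem more often than it catches it, which is one way a bad batch could reach the sales channel despite quality checks.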
NVIDIA’s bathtub curve probably has a taller early-failure portion than usual, with more cards than expected reported to be failing in their first days and weeks, but this doesn’t detract from the cards’ performance. Instead, it’ll just make consumers a bit more wary of purchasing that RTX 2070, considering that they’re seeing headlines about the cards straight up dying in their first week. If you’re holding out on purchasing a new card, that’s perfectly fine. Waiting for the bad batches to be weeded out has never harmed anyone. On the other hand, if you’re buying a new RTX 2000-series card, you probably don’t have anything to worry about. The actual failure rate is likely to be quite low relative to how many cards NVIDIA has shipped so far, and it might just be a bad batch of GPUs that needs to be taken out of the sales channels because the issues weren’t caught sooner. You’ll have the warranty coverage to take care of that.
If you’re having issues with your brand new GPU and want to help with the investigation, send an email to dead2080ti (at) gamersnexus (dot) net. Even if you’re not going to send your card overseas to have it looked at, you can share details like the issues you’re experiencing, your setup, your Windows and GeForce driver versions, and your card’s serial number and model number. Gamers Nexus is trying to figure out if the issues are related to a bad batch of GPUs or another issue entirely, and even basic reports about the issues experienced would help to pinpoint what’s causing them.