Sunday, October 28, 2018

Headnode Windows-Nvidia GPU Nonsense

I recently got into light computer gaming for the second time in my life. My parents never let me have video games as a kid. I played the MMORPG Mu for about a year in middle school, but lost interest. I started playing Diablo 3 a few months ago...it's pretty fun. I use my Windows 10 Pro installation (separate SSD) in the headnode for the game. My headnode has a GTX Titan (original, superclocked), so it's perfectly capable of running Diablo 3 at the max framerate my screen can handle (60 FPS). And it was working fine, until one day I started getting the blue screen of death and/or crashes every few minutes.

At first, I thought the nvidia driver that a recent Windows update had installed might not be playing nice with Diablo 3. I installed the latest nvidia driver from the website, but that didn't help. I also tried the oldest one available on the website (388.31) after uninstalling the other, but that also didn't work. To make sure it wasn't just Diablo, I ran some stress tests, specifically userbenchmark and furmark. Both caused crashes. This meant it was either a driver problem or a hardware problem. Since a software problem was something I could control, I decided to try that first.

It turns out that not completely (and I mean completely) uninstalling and removing an old nvidia driver can cause crashes. So I downloaded the popular DDU (Display Driver Uninstaller). This program suggests booting into safe mode, so I did that, and ran it with the default options. This deleted the driver(s) I had attempted to install. On normal boot, the GPU was using the basic Windows display adapter according to the device manager. However, a few minutes after booting into normal Windows, Windows Update automatically installed an nvidia driver for it. Ah...maybe that's what's going on. It turns out removing the Windows Update driver and preventing its installation is a pain. Here's the process for it (Windows 10 Pro):
  1. Boot into safe mode
  2. Run DDU to delete nvidia drivers
  3. Boot into normal mode. (You can skip steps 1 and 2 if you have not tried to install any nvidia drivers yourself.) For me, Windows Update auto-installed its nvidia driver after a few minutes.
  4. Follow this link for "rolling back" a driver. In short, go to the device in the device manager, go to the driver tab, and click roll back. Note that nothing else in that link worked for me (uninstalling an update, blocking installation of an update via that troubleshooter tool).
  5. Follow this link for how to block Windows automatic driver installation for a particular device. To do this, you need to copy the hardware IDs from the GPU's device manager details tab, then add a "device installation restrictions" group policy (gpedit) for those hardware IDs. Windows may still download or try to update the nvidia drivers, but the block stops the install.
  6. While you were doing steps 4 and 5, Windows probably reinstalled its nvidia driver. You need to boot into safe mode again and run DDU. DDU has an option to prevent Windows from updating drivers, as well as an option to delete the C:\NVIDIA folder. Select those options.
  7. Reboot into normal mode
  8. Check the GPU in device manager: it should be using the basic windows display adapter driver. Wait about 10 minutes. If Windows does not install the nvidia driver automatically, then you're all set. If it does, then go back to step 4 and try again, maybe with some more googling. Mine did not auto-update after this. 
  9. Now install the driver and PhysX only. If you use 3D, then you need the 3D drivers. If you have a separate high performance audio card, then the audio driver might be useful to you. Otherwise, don't install those. Don't install GeForce Experience unless you want to stream/record. I used the oldest driver listed on the website (388.31) because my GPU is older.
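
A side note on step 5: the hardware IDs from the details tab are plain strings, and it can help to sanity check them before pasting them into the group policy. Below is a minimal Python sketch, not part of the original procedure, that splits an ID into its fields and collects the unique NVIDIA IDs across more than one card. The function names are made up for illustration, and apart from NVIDIA's PCI vendor ID (10DE) the ID values are placeholders rather than values copied from a real card.

```python
NVIDIA_VENDOR_ID = "10DE"  # PCI vendor ID assigned to NVIDIA


def parse_hardware_id(hwid):
    r"""Split 'PCI\VEN_10DE&DEV_...' into ('PCI', {'VEN': '10DE', 'DEV': ...})."""
    bus, sep, rest = hwid.strip().partition("\\")
    if not sep or not rest:
        raise ValueError(f"no bus prefix in hardware ID: {hwid!r}")
    fields = {}
    for part in rest.split("&"):
        key, sep, value = part.partition("_")
        if not sep:
            raise ValueError(f"unexpected field {part!r} in {hwid!r}")
        fields[key.upper()] = value.upper()
    return bus.upper(), fields


def ids_to_block(cards):
    """Return the de-duplicated NVIDIA hardware IDs across all cards, in order."""
    blocked = []
    for card_ids in cards:
        for hwid in card_ids:
            _, fields = parse_hardware_id(hwid)
            normalized = hwid.strip().upper()
            if fields.get("VEN") == NVIDIA_VENDOR_ID and normalized not in blocked:
                blocked.append(normalized)
    return blocked


# Placeholder IDs (hypothetical): two cards whose most specific IDs differ
# but which share a less specific vendor/device ID.
card_a = [r"PCI\VEN_10DE&DEV_1005&SUBSYS_11111111&REV_A1", r"PCI\VEN_10DE&DEV_1005"]
card_b = [r"PCI\VEN_10DE&DEV_1005&SUBSYS_22222222&REV_A1", r"PCI\VEN_10DE&DEV_1005"]

for hwid in ids_to_block([card_a, card_b]):
    print(hwid)
```

The printed list is what would go into the "device installation restrictions" policy by hand. The script itself does not touch the registry or the policy.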
At this point, try your GPU again with the stress test programs. If it works, then you're all set. However, mine still failed. I tried some of the other drivers, but none helped. This led me to think it was a hardware issue, possibly overheating. I did the following to underclock it: 
  1. Install MSI Afterburner
  2. Turn down clock speed, reduce max power to 90% or lower
  3. Change fan profile to hit full throttle earlier
  4. Save the profile, apply it (check mark), and click the button that launches MSI Afterburner at startup. This will apply the saved profile to the GPU every time you boot Windows.
Unfortunately, this didn't help either. At this point I tried my other GTX Titan, but it still caused crashes. Note that, when you switch GPUs, you need to let Windows install the basic display adapter driver or the nvidia installer won't recognize your GPU. After that, you need to add the new GPU's hardware IDs (every GPU has different hardware IDs) to the group policy from earlier to prevent Windows from installing its nvidia driver. Anyways, this led me to believe it wasn't the GPU or driver.

Sometime between when it worked and when it stopped working, I had switched the CPUs to the new v4 ES's and moved the GPU from slot 1 to slot 3 (both on CPU 1). I wondered if either of those could have something to do with it. I tried moving the GPU from slot 3 back up to slot 1. I repeated the instructions above for a clean driver (oldest) install, and did the underclock. This passed the stress test! Max GPU temps never got above 62C, so I could probably undo some of the underclock. My guess is that the ES (which is not a QS) in the CPU1 socket has some unstable PCIe lanes associated with slot 3, and those cause crashes under high loads. Interestingly, I had tried the FDR Infiniband HCA in slot 3, and it worked great, but it's only x8 instead of x16, so one or more of the other eight lanes are probably at fault. I'll have to keep that in mind if I ever want to use more than one GPU in this build. It's possible that the other ES (CPU2) has the same problem. So in summary, I probably had a combination of driver conflicts and unstable PCIe lanes which were causing crashes under high loads. Hopefully this guide will help future nvidia GPU owners diagnose crashes, BSODs, and other problems.
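
If you want to check peak temperature during a stress run without watching a monitoring window, one option is to log it and summarize afterwards. This is a minimal Python sketch, assuming `nvidia-smi` is available with your driver and reports `temperature.gpu` for your card (I did not verify this on the Titan). The function name and the sample log values are made up for illustration.

```python
import csv
import io

# Assumed logging command, run in a separate terminal during the stress test:
#   nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv,noheader,nounits -l 1 > gpu_log.csv


def max_gpu_temp(log_text):
    """Return (timestamp, temp_c) for the hottest sample, or None if no valid rows."""
    hottest = None
    for row in csv.reader(io.StringIO(log_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # blank or truncated line
        try:
            temp = int(row[1].strip())
        except ValueError:
            continue  # e.g. a field the card does not report
        if hottest is None or temp > hottest[1]:
            hottest = (row[0].strip(), temp)
    return hottest


# Made-up sample log, only to show the expected shape of the input.
sample_log = (
    "2018/10/28 12:00:01.000, 55\n"
    "2018/10/28 12:00:02.000, 62\n"
    "2018/10/28 12:00:03.000, 60\n"
)

print(max_gpu_temp(sample_log))
```

If the run crashes, the last line of the log also shows what the temperature was right before the crash, which helps separate overheating from other causes.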



To do: 
1. Switch from ntpd to chronyd
2. Add a second fan to each CPU cooler
3. Figure out how to deal with switching the heat extraction fans on and off so I don't have to open the cabinet door every time.
