Search This Blog

Sunday, December 16, 2018

3D printed jet engine

I 3D printed a jet engine. It's not a scale model of any particular engine, and the mixer is wacky, but it has all of the major components and was free, so I can't complain too much. It's definitely the most complicated thing I've printed to date.

It took a few months of off-and-on printing to finish all of the parts. Most parts had to be printed multiple times to come out nicely. Assembly took about 5 hours, much of which was spent trying to get the engine to turn smoothly without any rubbing. The final product is pretty nice.

I also printed a bunch of other things recently, and updated this post with them.

Friday, December 14, 2018

Internet Upgrade

We recently upgraded from BT 50Mbps internet to Virgin 105Mbps internet: twice the speed for exactly half the cost. Virgin's customer service is worse, and the wifi router built into their modem is much worse than BT's "Smarthub", but the internet itself is much faster. The Virgin quickstart kit is bullshit, by the way: unless the property has had Virgin internet recently, you'll need an engineer to come out to get everything working. I think you can select that you don't have the right ports at checkout on their website in order to schedule the engineer. Their phone support was useless.

Anyways, I purchased a TP-Link C1200 router to solve the poor range problem. You have to be careful buying routers: there are still many out there with 100Mbps ethernet ports. I'd consider "modern" to be 4x 1GbE ports and AC wifi, and the C1200 seemed to have the best range/cost ratio with those options. It works great. I later noticed that there was a cable hookup right behind the server cabinet, so I moved the modem and router to the shelf above my homelab.

Originally, the headnode was connected to the wifi via a usb wifi adapter. This worked ok, but obviously wasn't ideal.

Network diagram.
Now that the router is close to the homelab, it just makes sense to use a wired ethernet connection for internet. However, I'm out of ethernet ports on the headnode's motherboard. As a reminder: there are two ethernet ports on the motherboard, each with a static IP assigned and each connected to its own 8-port unmanaged switch. The two ports/switches are on different subnets. One switch connects to the management ports on all four SM compute nodes for IPMI (administration). The other switch connects to an ethernet port on each of the four compute nodes (for MPI, ssh, data transfer, etc.). This works well: the headnode can talk to the management controllers over the IPMI interface and ssh into the compute nodes' OSes over the intranet interface.

I'd like to replace the wifi link with an ethernet link. Since the IPMI network's usage is tiny, and I have extra ports on that switch, I realized I could connect the router to that switch without a performance hit. Doing that means the motherboard ethernet adapter connected to the IPMI switch has to carry two static IP addresses on two subnets: one for the original IPMI network and one for the router. It turns out this is fairly simple to do with nmtui: edit the connection you want, go to "Add address", enter the new IP address and subnet mask, set the router's internal network gateway (usually just the router's IP, which is usually XXX.XXX.XXX.1) as the "gateway" and "dns", make sure "never use this network for default route" is unchecked, then bring the interface down and up. You should now be able to reach both networks from one interface. The route command is useful for troubleshooting. I also set a static IP for the headnode in the router's administration page (accessed by typing the router's IP into a browser and entering the administrative password) so that the router's DHCP wouldn't try to double-assign the headnode's IP.
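For reference, the same second-address change can be sketched with nmcli (nmtui's command-line sibling). The interface name "eno1" and all of the addresses below are placeholders, not my actual values:

```shell
nmcli con show                                      # find the connection name
nmcli con mod eno1 +ipv4.addresses 192.168.0.50/24  # add the router-side IP
nmcli con mod eno1 ipv4.gateway 192.168.0.1         # router is the gateway...
nmcli con mod eno1 ipv4.dns 192.168.0.1             # ...and the DNS server
nmcli con down eno1 && nmcli con up eno1            # cycle the interface
ip route                                            # verify both subnets appear
```
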

I also: 1. edited my firewall and moved the ipmi interface to the public zone, 2. updated the ddns (router) and noip settings so I can access it remotely, 3. got all of the wireless devices on the new wifi network.
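The firewall change in step 1 can be done with firewall-cmd on CentOS; a sketch, assuming the IPMI-side interface is named "eno1" (substitute your own interface name):

```shell
firewall-cmd --permanent --zone=public --change-interface=eno1
firewall-cmd --reload
firewall-cmd --get-active-zones   # confirm the interface moved zones
```
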

Yay for more stable internet.

Sunday, November 18, 2018

Follow up: homelab thermal solution

In a few prior posts, I mentioned that I replaced the soundproof cabinet's exhaust fans with quieter ones and added a temperature controller for the fans' speed. I sized the new fans well: they're much quieter, and they can handle the steady state heat extraction when the cluster is running at full power. What I didn't account for, however, was heating of the room. The cluster is in a small office, and the office heats up after a few hours at full power; it's thermally equivalent to leaving a 1500W space heater on for hours or days. The reason that's a problem is that the inlet air becomes hotter, which makes the exhaust even hotter, which makes the inlet air hotter still, etc. The cabinet has a barrier between the outlet and inlet (both on the bottom of the cabinet) to minimize re-circulation, but if the whole room is hot, that doesn't matter. Leaving the door fully open and putting a large fan in the doorway seems to help some. It also offsets the house's gas heating needs, especially for the upper floor (where the office is). However, it's still getting too hot inside the room and cabinet. I don't have a great solution for this yet...

To do:
1. Fix room heating problem
2. Replace ntpd with chronyd

Friday, November 16, 2018

Automatic multi-threading with Python numpy

This came up while running my wife's Python codes on the cluster. It turns out that numpy vector operations are automatically parallelized if numpy is linked against certain libraries, e.g. openBLAS or MKL, during compilation. Those linear algebra libraries will automatically use the max number of available cores (or, if your processor has hyperthreading, 2x the number of physical cores) for matrix operations. While that might seem convenient, it actually made a lot of people unhappy because of the overhead involved in multithreading lots of tiny matrix operations. Fortunately, there is a way to control the max number of threads used, and some devs are working on dynamic control from within numpy.

I created the following basic test script. It generates two random matrices, then multiplies them together. The random number generation is a serial operation, but the dot product is parallelized by default.
import os
# these must be set before importing numpy:
os.environ["OMP_NUM_THREADS"] = '8'        # export OMP_NUM_THREADS=8
os.environ["OPENBLAS_NUM_THREADS"] = '8'   # export OPENBLAS_NUM_THREADS=8
os.environ["MKL_NUM_THREADS"] = '8'        # export MKL_NUM_THREADS=8
#os.environ["NUMEXPR_NUM_THREADS"] = '8'   # export NUMEXPR_NUM_THREADS=8

import numpy as np
import time

# test script:
start_time = time.time()
a = np.random.randn(5000, 50000)
b = np.random.randn(50000, 5000)
ran_time = time.time() - start_time
print("time to complete random matrix generation was %s seconds" % ran_time)
c =, b)  # this line should be multi-threaded
print("time to complete dot was %s seconds" % (time.time() - start_time - ran_time))
The lines under import os set environment variables. The one(s) you need to set depend on what your numpy is linked against, as shown by np.show_config(). Note that those must be set before importing numpy.

I ran some experiments on one of the compute nodes (dual e5-2690v2) using slurm execution. Software was anaconda 5.2, so anyone with a recent anaconda should have similar behavior. My np.show_config() returned information about MKL and openBLAS, so I think those are the relevant variables to set.
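To check what your own numpy build is linked against, show_config is all you need (output varies by installation; on anaconda you'll typically see MKL sections):

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration numpy was compiled against,
# e.g. MKL or OpenBLAS entries if those libraries were linked in.
np.show_config()
```

Whichever libraries appear here determine which *_NUM_THREADS variables will actually have an effect.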

Test 1: slurm cpus-per-task not set, ntasks=1, no thread limiting variables set.
Results: No multi-threading because slurm defaults to one cpu per task.

Test 2: slurm cpus-per-task=10, ntasks=1, no thread limiting variable set.
Results: dot used 10 threads (10.4s)

Test 3: slurm cpus-per-task=20, ntasks=1, no thread limiting variable set.
Results: dot used 20 threads (5.4s)

Test 4: slurm cpus-per-task=4, ntasks=1, no thread limiting variable set.
Results: dot used 4 threads (24.8s)

Test 5: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OMP_NUM_THREADS=4
Results: dot used 4 threads (24.8s)

Test 6: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OPENBLAS_NUM_THREADS=4
Results: dot used 10 threads (10.4s)

Test 7: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, MKL_NUM_THREADS=4
Results: dot used 4 threads (24.9s)

Test 8: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, VECLIB_MAXIMUM_THREADS=4
Results: dot used 10 threads (10.4s)

Test 9: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, NUMEXPR_NUM_THREADS=4
Results: dot used 10 threads (10.4s)

Test 10: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OMP_NUM_THREADS=8, OPENBLAS_NUM_THREADS=8, MKL_NUM_THREADS=8
Results: dot used 8 threads (12.5s)

As you can see above, setting either MKL_NUM_THREADS or OMP_NUM_THREADS limits the number of threads, while the openBLAS and other variables had no effect, so apparently openBLAS is not being used, at least for dot. Limiting the number of CPUs available through slurm also limits the number of threads.

For my wife's code, which she has to run on hundreds of different cases that can be run simultaneously, it looks like giving one full socket (10 cores) per case is optimal. The environment variables don't need to be set because the default behavior is to use all available cores (as limited by slurm). That's assuming this dot test is a good indicator, which it might not be because her code is far more complicated.
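For reference, the slurm submissions for these tests looked roughly like this sketch (the job name and script name are placeholders, not my actual files):

```shell
#!/bin/bash
#SBATCH --job-name=numpy-dot
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10    # one full socket; BLAS threads are capped at 10
#SBATCH --ntasks-per-socket=1

# Optionally cap BLAS threads below the slurm CPU limit:
# export MKL_NUM_THREADS=4
python test_dot.py
```
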

Anyways, I hope someone finds this useful.

Saturday, November 3, 2018

Homelab Cluster: Hardware Finally Done

The day has finally come: I'm happy with the homelab's hardware. *fireworks*

Final list of hardware:
1. Headnode: ASUS Z10PE-D8, 2x Xeon E5-2690V4 ES (14c @ 3GHz), 8x8GB 2Rx8 PC4-19200 ECC RDIMMs, 500GB Samsung 960 Evo NVMe (CentOS 7.5), 2x 3TB HDD in RAID1 (data), 480GB SSD (Windows), GTX Titan, CX354A FDR IB HCA.
2. Compute nodes: Supermicro 6027TR-HTR, which has 4x nodes: 2x E5-2690v2, 8x8GB dual rank PC3-14900R ECC RDIMMs, 120GB SSD (CentOS 7.5 compute node), CX354A FDR IB HCA.
3. Mellanox SX6005 FDR switch with Sonoff wifi power switch
4. 2x 8 port unmanaged 1Gbe switches, one for IPMI, one for intranet
5. Riello UPS: 3300VA, 2300W
6. APC NetShelter CX Soundproof Cabinet with custom, automatic heat extraction system

Here are some pictures:

The whole homelab + 3D printer. Its final position will be about 2 feet to the right. The plywood under it allows for easy rolling over the carpeted floor. My desk with the monitor is just to the right.

Front of cabinet. Looks clean and organized.

Back is a little bit of a mess, but it's the best I could come up with. All of the cables are too long, so I had to coil them.

Close up of heat extraction electronics. The controller board is mounted in its new 3D printed tray.

Mounted power strip for not-C13-plug things
It's currently cranking through TB's of astrophysics data for my wife. I'll be running CFD cases on it soon.

Possible future changes

Having just said that the homelab cluster is finished, it's time to list some possible future upgrades, because that's how this hobby goes...

1. Clean up the wiring a little more. It's kind of ugly in the back due to all of the coiled up wires. I'm not really sure how to make it neater without custom cables, though, and that definitely isn't worth the time/money involved to me. 

2. Rack rails/strips. Racking the server and switches might clean up the wiring inside slightly and make it look neater. The biggest problem is that I would lose the ability to pull the SM compute nodes out. They come out of the back, and I currently have to slide the server to the side and angle it so I can pull a node out through the back door. If the server chassis were racked, I couldn't do that, so I'd have to pull the whole chassis out in order to get to a node. Aside from making it look a little prettier, adding rack rails would be pretty pointless, so this probably won't happen.

3. AMD EPYC. The new AMD EPYC processors are awesome for CFD. Each has 8 channels of DDR4-2666 RAM = crazy high memory bandwidth = more tasks per CPU before hitting the memory bandwidth bottleneck. Looking at the OpenFOAM benchmarks, two dual-7301 servers with dual rank RAM (4 CPUs, 64 cores @ ~2.9GHz) should be faster than my entire cluster (10 CPUs, 108 cores @ ~3GHz), almost entirely thanks to memory bandwidth. Unfortunately, the economics don't make any sense. Building just one dual socket 7301 server/workstation would cost more than I spent on this whole cluster, even if the RAM, CPUs, and motherboard were all purchased used. Because it's new hardware, there aren't many used EPYCs or motherboards on the market yet. Also, DDR4 RAM is absurdly expensive, mostly due to price fixing/collusion between the only three RAM manufacturers in the world. Two dual socket EPYC servers would require 32x 8GB dual rank DDR4-2666 DIMMs, which at the cheapest (new) prices I could find would run about ~$3500...ouch. Again, since that's the latest speed of RAM, there isn't much pre-owned DDR4-2666 yet. I did an electricity price analysis to see if the upgrade would still make sense economically. Assuming it runs for half of the year, the current cluster would use 6100 kWh. At $0.22/kWh (England...), that's about $1350/year in electricity. I think two AMD EPYC servers would use about 900W, which is about $870/year in electricity, for a savings of ~$480/year. Even including selling off what I currently have, it'd take more years than it's worth for me to break even. So if/when this upgrade occurs, it will be in the future when prices come down. One really exciting prospect of the EPYC chips is that they allow overclocking. The famous overclocker der8auer overclocked dual 7601's to 4GHz (all cores) using a prototype Elmor Labs EVC v2 on a Supermicro H11DSI and an ASUS RS700a-e9. He had to use Novec submersion or dry ice for that, but he was able to get a fairly good overclock with just a water cooler. Overclocking doesn't make sense in a large cluster/data center environment where running costs (electricity, cooling, maintenance) dominate; power (and thus heat) scales roughly with frequency cubed, so it's cheaper for them to buy more servers and not overclock. But in a homelab/small cluster environment, where initial hardware cost is usually the dominating factor, overclocking makes a lot of sense, so this might be something I look into in a few years.

4. Add internal LED lights that come on when either the front or back doors are opened. These would probably be in the form of stick-on strips along the top front and top rear of the cabinet running off of the heat extraction system PSU. The only reason I haven't done this is that I doubt I'll be opening the cabinet much anymore now that everything is situated, heat extraction is automatic, and powering on equipment can all be done remotely.
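The electricity comparison in item 3 works out as follows (a quick sanity check using the numbers quoted there):

```python
rate = 0.22          # electricity cost, $/kWh (England)
hours = 8760 / 2     # assume the cluster runs half the year

current_kwh = 6100                         # current cluster's half-year usage
current_cost = current_kwh * rate          # ~ $1342/year

epyc_w = 900                               # estimated draw of two EPYC servers
epyc_cost = epyc_w * hours * rate / 1000   # ~ $867/year

print(round(current_cost - epyc_cost))     # ~ $475/year saved
```
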

Thursday, November 1, 2018

Review of Cheap Fan Temperature PWM Controllers

I purchased a couple more types of cheap (~$5) temperature fan controllers from eBay. There are about 5 or so different types. Since I reviewed one previously, and I now own three of the most common ones, I thought I'd do a review of all of them in one place.

First, a brief review of standard fan control methods, since there is a lot of confusing terminology out there regarding small DC fan control. 2-pin fans just have power and ground. These can be controlled either by varying the voltage linearly or by PWM'ing the power line. The former only works down to about half the rated voltage for most fans; below that they don't have enough power to start. The latter requires a PWM fan controller. 3-pin fans have an extra wire that outputs the tachometer reading. This is useful for measuring fan speed if the power source is constant, i.e. not PWM'd. If the power is PWM'd, then the sensor is too, which usually messes up its readings unless the PWM frequency is much greater than the RPM. 4-pin fans have power, ground, tach, and control wires. In addition to the two methods mentioned for 2-pin fans, these have a third option for control: instead of PWM'ing the power wire, a low voltage/low current PWM signal is sent to the control wire, and the fan's internal electronics handle the actual power PWM'ing. This has the added benefit of not screwing up the tach sensor readings, because the voltage on the power wire stays consistent. Unfortunately, finding a cheap controller for these fans is difficult. Noctua makes a ~$20 manual pot one, but that's the only one I could find. I'm reviewing the three most common cheap Chinese eBay ones here.

Fan Controller 1

This is the controller I reviewed in March. It can handle two fans of 2, 3 or 4 pins in 2-pin control mode. 12-24V input, max 4A output. It will automatically adjust the two fan outputs' duty cycle based on the reading from a temperature probe.

The control chip is a TC648 dedicated fan temperature controller. Unfortunately, the PWM switching frequency is about 30 Hz, and is audible/visible and annoying. If the switch is up, the potentiometer allows for tuning the turn-on temperature, which is nominally 30C, and always about 20C lower than the max temperature, nominally 50C. The pot is very sensitive. If the switch is down, supposedly the temperature set points are fixed at 30C and 50C, but I don't think they're accurate. At 12V, the acceleration is smooth, but very slow. At 24V, the acceleration in the control band is underdamped, so the fan speed oscillates wildly. Note that the 2A or 3A versions have one missing FET (like the one pictured), but the 4A one has all of them. Because this one pulses the power pin, there is a voltage drop across the board that results in the fan not operating at the same maximum RPM as if it was plugged directly into the power supply.

Conclusion: Not recommended.

Fan Controller 2
This controller only works with 12V 4-pin fans operating in 4-pin control mode. Each of the three fan headers has a max operating current of 3A. It should be possible to splice a wire from the 12V input line to the fan's 12V cable in order to get around the 3A limit. The controller automatically adjusts FAN1's PWM duty cycle (on the control pin, not the power pin) based on the temperature read by the short temperature probe. FAN2 and FAN3 are only controlled manually by the two potentiometers, with a minimum duty cycle of 10%. There is a stalled-fan warning beeper. There are 5 DIP switches: switch 1 selects the minimum duty cycle of FAN1, either 20% or 40%; switches 2 and 3 select one of four minimum/maximum temperature pairs (35C-45C, 40C-55C, 50C-70C, 60C-90C); switches 4 and 5 control the behavior of the stall alarm for FAN1 and FAN2. All fans are always on: there is no automatic shutoff. The chip is not marked, but it must be some sort of microcontroller. Come to think of it, a simple microcontroller is probably cheaper than a chip specifically designed for fan temperature control, because far more microcontrollers are produced than fan control chips.

The temperature control works fairly well: acceleration is smooth in the control band. The pot controlled fans are adjustable from about 10-100% duty cycle. It reaches the same max RPM as if the fan was directly connected to a 12V source. Board current consumption is very low, a few 10's of mA.

Conclusion: If you have 12V 4-pin fan(s), and one of those temperature ranges works for you, and especially if you need to manually set two other fans as well, then this is a good pwm fan temperature controller for you.

Fan Controller 3
This controller works with 12-60V 4-pin fans operating in 4-pin control mode. Each of the two fan headers has a max operating current of 3A. It should be possible to splice a wire from the power input line to the fan's power cable in order to get around the 3A limit. It automatically adjusts FAN1's PWM duty cycle (control pin, not power pin) based on the temperature read by the long (~1m) temperature probe 1, and does the same for FAN2 and temperature probe 2. In other words, it has two separate control zones, which is nice. The low and high temperature set points are settable from 0-60C and 10-70C respectively, in 1C increments. The interface is the best of the bunch: buttons for selecting modes and changing settings, and a 3-digit 8-segment display along with 4 indicator LEDs for displaying the current settings, temperatures, and fan RPMs. The temperature probes are long and potted in metal tubes. It has a stall alarm for both fan outputs. The fan minimum duty cycle is adjustable from 10-100% in increments of 1%, which can be used to manually control the fans if the minimum start temperature is set higher than ambient. This one is a bit larger than the other boards, and current consumption is about 40mA. Fans are always on: no auto-shutoff feature. The two chips near the top are shift registers (74HC595D) for the 8-segment displays and the LEDs. The chip near the capacitor is a buck converter (XL7005A) for powering the board, and the chip on the bottom left is a microcontroller (N76E003AT20).

It works great. Acceleration is silky smooth from the low temperature set point to the high temperature set point. It reaches the same max RPM as if the fan was directly connected to the power source. Acceleration is a little slow, likely because of the potted temperature probes taking a long time to heat up. It's faster than Fan Controller 1, though. I haven't tried this with voltages other than 12V input, but my guess is the behavior would be the same due to the buck converter.

Conclusion: This board is awesome. If you need to do temperature control, especially dual zone, of 4-pin fans of 12-60V, then this is the board for you. I will be using this in my homelab's cabinet to control the heat extraction fans. The only feature I wish it had was automatic stop/start of the fans so they wouldn't run when below the min temperature threshold.

Fan Controller 4
This controller is not a temperature controller, but a manual PWM controller. The PWM duty cycle (power is PWM'd, so this would be 2-pin fan mode) is controlled by the potentiometer. It has one output, and is supposedly rated for up to 60V and 20A, though considering how hot it gets with just a few amps, I'm not sure I'd want to push 20 through it. On 12V, as you turn the pot, the output is fairly smooth, but with 24V, only the first ~3% of the pot can be used to change RPM, the rest is full speed. The PWM frequency is high enough not to hear or notice, unlike Fan Controller 1, so that's good. It works, so if you just need manual fan control of 2 or 3 pin fans, especially high power ones, then this is a good choice for you.

Hopefully this review will help someone in the future choose a PWM fan temperature controller.

As I mentioned above, I will be using Fan Controller 3 moving forward. I had to create a new wiring harness.

I'm using FAN1's output to control all three fans. Because their total current is greater than 3A (the limit of a fan connector), I had to run bypass power and ground wires directly to the power supply. I pulled the power pin from the 4-pin fan connector that plugs into the controller to prevent current from being run through the connector. I also soldered small power wires for the controller to the fork terminals on the bypass wires. The three PWM control pins are wired together in the harness to the single blue wire connected to the controller, but only one of the yellow tachometer wires is, so the tach signals don't mix. I 3D printed some terminal covers for the power supply because it didn't come with any. I'm going to 3D print a green tray to hold the controller PCB and shield its back from shorting. I'll set the set points to 30C-40C, and the temperature probe will be taped to the top back of the cabinet. If that part of the cabinet gets to about 30C, then the heat extraction fans aren't moving enough air to prevent hot air from recirculating to the front of the servers, so they need to ramp up.

When I was testing Fan Controller 3, I noticed that the 10% duty cycle command only corresponded to 50% of the max fan RPM, while the 100% setting corresponded to max RPM. At first I thought something might be wrong with the controller, but measuring the average DC voltage of the PWM control pin showed that, at the 10% setting, it was about 12% of the voltage at the max duty cycle setting, so the controller was probably fine. Unfortunately, Mechatronics, the manufacturer of the fans I purchased, does not publish PWM vs. RPM data, but digging through their website I found that 50% is the minimum fan RPM, which matches what I observed. It's unfortunate that those fans don't allow for lower RPM operation. I measured operating current and found that at the lowest setting, the fan draws 19% of the power it does at max RPM, which is a savings of ~35W across all three fans. It'd be nice if the fan controller could turn the fans off entirely, but that'd only save an additional ~8W. Compared to leaving them running at full speed all of the time, I'm probably saving somewhere around $30/year (assuming the fans are at min throttle half of the year) by implementing fan temperature control.
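That ~$30/year figure can be sanity-checked in a few lines, assuming the ~40W full-speed total draw for all three fans mentioned in an earlier post:

```python
full_w = 40              # total draw of all three fans at full speed
min_frac = 0.19          # fraction of full power drawn at minimum RPM
rate = 0.22              # electricity cost, $/kWh
hours_at_min = 8760 / 2  # assume min throttle for half the year

saved_w = full_w * (1 - min_frac)  # ~32W saved while throttled down
annual_savings = saved_w * hours_at_min * rate / 1000
print(round(annual_savings))       # ~ $31/year
```
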

On a separate note, I tried installing a second fan on CPU 1's cooler in the headnode. The Cooler Master Hyper 212 Evos come with an extra pair of brackets for mounting a second 120mm fan on the other side of the heatsink. The 120mm fans and Y-splitters I bought were 3-pin instead of 4-pin, though, which means both fans ran at 100%. It'd have been better to buy another 4-pin fan and a 4-pin splitter cable so they could be throttled with load. I noticed that full speed was a lot faster than I had seen the fan spin before. I did a stress test with both fans installed, and the temps hovered below 50C. This made me think that maybe there was a BIOS setting for the fans, and there is: I switched the CPU fan mode to "high speed", took the second fan off of CPU 1's heatsink, and ran the stress test again. The temperature of both CPUs hovered around 59-60C, which is great: about 5-10C lower than before, with no large temperature difference between CPU 1 and CPU 2. So I don't need the second fans. Yay.

So, to do:
1. 3D print control board holder, install the new fan controller.
2. Replace ntpd with chronyd.

Sunday, October 28, 2018

Headnode Windows-Nvidia GPU Nonsense

I recently got into light computer gaming for the second time in my life. My parents never let me have video games as a kid. I played the MMORPG Mu for about a year in middle school, but lost interest. I started playing Diablo 3 a few months ago, and it's pretty fun. I use the Windows 10 Pro installation (separate SSD) in the headnode for the game. The headnode has a GTX Titan (original, superclocked), so it's perfectly capable of running Diablo 3 at the max framerate my screen can handle (60FPS). And it was working fine, until one day I started getting the blue screen of death and/or crashes every few minutes.

At first, I thought it might be the new windows update installed nvidia driver not playing nice with Diablo 3. I installed the latest nvidia driver from the website, but that didn't help. I also tried the oldest available on the website (388.31) after uninstalling the other, but that also didn't work. To make sure it wasn't just Diablo, I ran some stress tests, specifically userbenchmark and furmark. Both caused crashes. This meant it was either a driver problem or a hardware problem. Since I could control a software problem, I decided to try that first.

It turns out that not completely (and I mean completely) uninstalling an old nvidia driver can cause crashes. So I downloaded the popular DDU (Display Driver Uninstaller). This program suggests booting into safe mode, so I did that and ran it with the default options, which deleted the driver(s) I had attempted to install. On a normal boot, the GPU was using the basic Windows display adapter according to the device manager. However, a few minutes after booting into normal Windows, Windows Update automatically installed an nvidia driver for it. Ah...maybe that's what's going on. It turns out removing the Windows Update driver and preventing its installation is a pain. Here's the process (Windows 10 Pro):
  1. Boot into safe mode
  2. Run DDU to delete nvidia drivers
  3. You can skip the above two steps if you have not tried to install any nvidia drivers yourself. Boot into normal mode; Windows will auto-install its nvidia driver after a few minutes.
  4. Follow this link for "rolling back" a driver. In short, go to the device in the device manager, go to the drivers tab, and click rollback. Note that nothing else in that link worked for me (uninstalling an update, blocking installation of an update via that troubleshooter tool). 
  5. Follow this link for how to block windows automatic driver installation for a particular device. To do this, you need to copy the hardware IDs from the GPU's device manager details tab, then adding a "device installation restrictions" group policy (gpedit) for those hardware IDs. Windows may download or try to update the nvidia drivers now, but it can't because of this block. 
  6. While you were doing 4 and 5, windows probably reinstalled its nvidia driver. You need to boot into safe mode again, and run DDU. DDU has an option to prevent windows from updating drivers, as well as an option to delete the nvidia C:/ folder. Select those options.
  7. Reboot into normal mode
  8. Check the GPU in device manager: it should be using the basic windows display adapter driver. Wait about 10 minutes. If Windows does not install the nvidia driver automatically, then you're all set. If it does, then go back to step 4 and try again, maybe with some more googling. Mine did not auto-update after this. 
  9. Now install the driver and physx only. If you use 3D, then you need the 3D drivers. If you have a separate high performance audio card, then the audio driver might be useful to you. Otherwise, don't install those. Don't install geforce experience unless you want to stream/record. I used the oldest driver listed on the website (388.31) because my GPU is older.
At this point, try your GPU again with the stress test programs. If it works, then you're all set. However, mine still failed. I tried some of the other drivers, but none helped. This led me to think it was a hardware issue, possibly overheating. I did the following to underclock it: 
  1. Install MSI Afterburner
  2. Turn down clock speed, reduce max power to 90% or lower
  3. Change fan profile to hit full throttle earlier
  4. Save the profile, apply it (check mark), and click the button that launches MSI Afterburner at startup. This will apply the saved profile to the GPU every time you boot Windows. 
Unfortunately, this didn't help either. At this point I tried my other GTX Titan, but it still caused crashes. Note that when you switch GPUs, you need to let Windows install the basic adapter or the nvidia installer won't recognize your GPU. After that, you need to add the new GPU's hardware IDs (every GPU has different hardware IDs) to the group policy from earlier to prevent Windows from installing its nvidia driver. Anyways, this led me to believe it wasn't the GPU or the driver.

Sometime between when it worked and when it stopped working, I had switched the CPUs to the new v4 ES's and moved the GPU from slot 1 to slot 3 (both on CPU 1). I wondered if either of those could have something to do with it, so I moved the GPU from slot 3 back up to slot 1, repeated the instructions above for a clean (oldest) driver install, and applied the underclock. This passed the stress test! Max GPU temps never got above 62C, so I could probably undo some of the underclock. My guess is that the ES (which is not a QS) in the CPU1 socket has some unstable PCIe lanes associated with slot 3, which cause crashes under high load. Interestingly, I had tried the FDR Infiniband HCA in slot 3 and it worked great, but it's only x8 instead of x16, so one or more of the other lanes are probably at fault. I'll have to keep that in mind if I ever want to use more than one GPU in this build. It's possible that the other ES (CPU2) has the same problem. So, in summary, I probably had a combination of driver conflicts and unstable PCIe lanes causing crashes under high loads. Hopefully this guide will help future nvidia GPU owners diagnose crashes, BSODs, and other problems.

To do: 
1. Switch from ntpd to chronyd
2. Add a second fan to each CPU cooler
3. Figure out how to deal with switching the heat extraction fans on and off so I don't have to open the cabinet door every time.

$31 Filament Dryer? Heck yes

This is a post I made on the /r/3Dprinting subreddit a few months ago.

I started seeing the signs of moist PLA filament a few weeks after opening a spool, so I bought this food dehydrator on eBay: item: 182608105385. It comes with shelves that just rotate to lock/unlock, so they're super easy to remove, making it perfect for a filament dryer. It will hold two normal width 1kg filament spools, or one wide spool + one normal spool (total internal height ~15cm).

The best part? Take a close look at the PrintDry Dryer and compare it to the picture I posted and the one in the eBay description. They use the same dryer base! The only differences are the filament tray/cylinder sections and the "printdry" decal, and this one costs 1/3 to 1/4 as much.

I'm sure I'm not the first person to realize this, but I thought I'd share. I've seen food dehydrator conversions, but they usually require some modifications like cutting out shelves or printing custom cylinders to hold the filament spools. This just worked out of the box.

Monday, October 22, 2018

More thermal management

The headnode's CPU 1 sometimes shows temperatures about 6 C higher than CPU 2, despite the same reported power draw. I tried tightening the screws on CPU 1's cooler slightly, but I don't want to wrench them down due to the lack of a back plate. It seemed to help slightly, maybe 1-2 C. The temperatures aren't breaching 70 C, so I'm not too concerned. I moved the GPU down a slot to give more room for the CPU fans to intake air.

As a follow-on to this post, I purchased 3x new heat extraction fans. I couldn't get the 24V versions cheaply, so I bought 12V ones and a new 12V power supply for them. The fans I had in there before were louder than everything else with the cabinet open, which defeated the purpose of a soundproof cabinet. The new ones have the same total max flow rate, but lower pressure and total noise. I soldered on fan connectors, made a custom 3-way splitter, connected them up, reinstalled the fan bracket, and tried it out. MUCH quieter with the cabinet closed up now. Definitely quieter than the server and switch with open doors, so that's good. The flow rate isn't as high, so I'm guessing there is more pressure drop than what I measured with the water manometer. I have them connected directly to the power supply instead of through the PWM fan controller because I think they will need to operate at full throttle all of the time. Total power draw is about 40W, which is a small price to pay for a quieter server.

I did some stress testing to see how hot it would get in there. The server's system temp got to about 39 C with the doors closed, which is just 2 C higher than with them open. No thermal shutdowns, so I think that's a success. I got that annoying segfault error again, twice. It said the source was the headnode this time, instead of node005. I'm not sure whether it's actually a component going bad or some weird thing with the code. When it occurs is inconsistent, too.

I purchased and installed 2x new 140mm case fans in the headnode in some blank spots to help with heat extraction. I also purchased another one to replace the fan in the PSU because it was clicking. However, when I took the fan out and ran it separately, it no longer clicked. I think the fan cable had wiggled loose and was touching the fan blades while installed in the PSU; after I secured the cable and reinstalled the fan, the clicking was gone. The server is pretty quiet now, even when running full blast.

I also mounted the power strip on the side of the cabinet. I had tried various tapes before, but they all eventually failed. This time, I drilled and screwed in brass M3 threaded inserts, 3D printed some brackets I designed to hold the power strip, and screwed them on. After that, I cleaned up the rest of the wiring in and around the cabinet.

No more falling power strip

To do:
1. Replace ntp with chrony on all nodes (ntp works between nodes, but headnode won't sync)
2. Figure out how to deal with switching the heat extraction fans on and off so I don't have to open the cabinet door every time.

Sunday, October 21, 2018

More OpenFOAM benchmarks

Now that I have the FDR Infiniband system installed, it's time to run some more benchmarks. The first test I did was 40 cores, headnode + node002. This completed in ~47 s, which is about 1-2 s slower than with QDR. Not sure why it would be slower, but the number I recorded for QDR might just have been on the fast side of the spread from repeated tests. I then ran 100 cores (all nodes) and got ~14.2 s, which is a speedup of ~25% compared to QDR. What's interesting is that this represents a scaling efficiency > 1 (~1.3); in other words, the 20-core iter/s of the headnode plus 4x the 20-core iter/s of a compute node is less than the 100-core iter/s of all 5 nodes with FDR. I have no idea how that could be possible. With QDR, I got perfect scaling (100-core iter/s = headnode + 4x compute node iter/s), which is what made me think upgrading to FDR wouldn't actually do much, but it really did. Perhaps summing iter/s isn't the best way to calculate scaling efficiency? I'll need to look into this more. Anyways, I'm really happy with the performance boost from FDR. I tried switching the HCA to a CPU 1 slot to see if it would make a difference, but it didn't, so I moved it back to a CPU 2 slot (bottom, out of the way).
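Concretely, the scaling-efficiency calculation I'm describing looks like this (the iteration rates below are made-up placeholders to show the arithmetic, not my measured values):

```shell
# Scaling efficiency = cluster iter/s divided by the sum of per-node iter/s.
# All three rates here are illustrative assumptions.
awk 'BEGIN {
  headnode = 1.0    # 20-core iter/s on the headnode (assumed)
  compute  = 1.0    # 20-core iter/s on one compute node (assumed)
  cluster  = 6.5    # 100-core iter/s on all 5 nodes (assumed)
  printf "efficiency = %.2f\n", cluster / (headnode + 4 * compute)
}'
# prints "efficiency = 1.30"
```

Anything above 1.00 by this measure should be a red flag, which is why the FDR result warranted a closer look.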

In a post a few weeks ago, I mentioned I would be upgrading the headnode to 2x Intel Xeon E5-2690 v4 ES QHV5 (confirmed ES, not QS). Specs: 14c/28t, base frequency 2.4GHz, all core turbo 3.0 GHz, single core turbo 3.2 GHz. They're in perfect condition, which is rare for ES processors. The all core turbo is 3.0GHz, which is the same as the 10c/10t E5-4627 v3 QS I currently have. I replaced the Supermicro coolers and the E5-4627 v3's with the new procs and the Cooler Master Hyper 212 Evos.

I oriented the coolers so the airflow would be up. I'm planning on getting 2x more 140mmx25mm fans for the top heat extraction. I think I can wiggle the RAM out from under the coolers if I need to, which is convenient. This motherboard has the same issue that the SM X10DAi (which I finally sold thank goodness) had: the holes for the cooler are not drilled all the way through this motherboard, so you can't install the back plate. Instead, you have to screw the shorter standoffs into the threaded holes, then the CPU cooler bracket into those. Make sure not to over-tighten the CPU bracket screws because they are pulling up on the CPU plate, which is only attached to the surface of the motherboard PCB. If you tighten them too much, it could flex the plate enough to break it off the PCB.

Unfortunately, I completely forgot about clearing the CMOS, so I spent about an hour scratching my head over why the computer was acting funny and turbo boost wasn't working. Once I pulled the CMOS battery (behind the GPU, ugh) and cleared the CMOS, everything worked normally. Lesson learned: if turbo boost isn't working with an Intel CPU, try clearing the CMOS. After that, I went into the BIOS and fixed all the settings: network adapter PXE disabled (faster boot), all performance settings enabled, RAID reconfigured, boot options, etc.

After confirming turbo was working as it should, the next step was to run the OpenFOAM benchmark on 1, 2, 4, 8, 12, 16, 20, 24, and 28 cores. On 20 cores, the new CPUs are ~10% faster than the old ones. Since the core-GHz was the same, that means the improvement was mostly due to memory speed/bandwidth: the RAM can now operate at 2400 MHz instead of 2133 MHz, which is about 12.5% faster. On 28 cores, the new CPUs were ~15% faster than the old ones, but only ~6% (4.5 s) faster than on 20 cores despite the additional 40% core-GHz. This is due to the memory bottleneck I mentioned in a previous post, and was expected. CPU1 showed about 5-6 C higher temps than CPU2 under full load despite similar power draws...I'll try tightening CPU1's heatsink screws a tad.

Finally, a full 108 core run: 12.7 s, or about 7.8 iterations/s. That's about 34% improvement over the 100 core, QDR cluster. Wow!

UPDATE (3 days later): I decided to do 20, 40, 60, and 80 core benchmarks on just the compute nodes to try to track down this greater-than-perfect scaling (cluster iter/s divided by the sum of the per-node iter/s). The 20-core run took about 98 s, which is the same as before. The 40-core run took about 47 s, also the same as before, and about half the time with double the cores, which makes sense. But the 60 and 80 core runs took about 10 s, which didn't make any sense. The cases were completing, too; no segfaults or anything like that, which I have seen in the past cause early termination and unrealistically low runtimes. I then compared how I was running each of them, along with the log.simpleFoam output files, and figured out the problem. For 40 cores or fewer, I used the standard blockMesh, decomposePar, snappyHexMesh, simpleFoam run process. For more than 40 cores, I tried something a little more advanced. snappyHexMesh does not scale as well as the actual solver algorithms, so for large numbers of cores, it can be less efficient to run the mesher on the same number of cores as you plan on running the case on. So I meshed the case on the headnode, then ran reconstructParMesh, then decomposePar again with the number of cores I wanted to run on, then ran the case. What I didn't notice in the latter (n>40) cases were a few warnings near the top about polyMesh missing some patches and force coefficients not being calculable (or something like that), plus a bunch of 0's for coefficients in the solution time steps. The solver was solving fine, but it wasn't doing a big chunk of the work, so for the 100 and 108 core cases, the scaling efficiency appeared to be greater than 1. I fixed this and got 18.89 s for n=108, which corresponds to a scaling efficiency of 0.98. Not as incredible as what I was seeing earlier, but still very, very good.
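The two run processes I'm comparing can be sketched as shell command sequences. This is a sketch, not my exact scripts; the core counts are from the post, but the exact utility flags can vary between OpenFOAM versions:

```shell
#!/bin/bash
# n <= 40: mesh and solve on the same decomposition
blockMesh
decomposePar
mpirun -np 40 snappyHexMesh -overwrite -parallel
mpirun -np 40 simpleFoam -parallel

# n > 40: mesh on fewer cores, reconstruct, then re-decompose for the solve
blockMesh
decomposePar                      # decomposeParDict set up for the mesh run
mpirun -np 20 snappyHexMesh -overwrite -parallel
reconstructParMesh -constant      # merge the parallel mesh back together
decomposePar -force               # re-decompose with the solver core count
mpirun -np 108 simpleFoam -parallel
```

The second path is where the missing-patches warnings crept in after reconstruction, which is what produced the bogus fast runs.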

Updated comparison
Reading the benchmark thread again, there were a few tips for getting the last little bit of performance out of the system. Flotus1 suggests doing these things:
sync; echo 3 > /proc/sys/vm/drop_caches
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
The first clears the cache, and the second sets the OS's CPU scaling governor to performance (it defaults to powersave). I didn't notice any improvement from clearing the cache, but the performance command did shave about 1 s (~1%) off the headnode's benchmark. To make that permanent, I created a systemd service file called cpupower.service:
[Unit]
Description=sets CPUfreq to performance

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set --governor performance

[Install]
WantedBy=multi-user.target
Then systemctl daemon-reload, and systemctl enable cpupower.service. This will run the service at boot, setting the cpufreq scaling governor to performance. Flotus1 also suggested turning off memory interleaving across sockets, but I don't think my motherboard supports that because there were only options for channel and rank interleaving in the BIOS.

In other news:
I really need to learn my lesson about buying used PSUs...the fan in the Rosewill Gold 1000W PSU I bought is rattling. Sounds like a dying bearing. Ugh. This was the replacement for the used Superflower Platinum 1000W PSU that died on me a couple months ago. I'm going to replace the fan and hope nothing else breaks in it. Note to self: buy new PSUs with warranties.

To do:
1. Replace ntp with chrony on all nodes (ntp works between nodes, but headnode won't sync)
2. Install 2x new 140mm fans, replace 140mm fan in PSU.
3. Install new thermal management fans
4. Tighten CPU 1's heatsink screws
5. Move the GPU down a slot so CPU fans have more airflow room.

Friday, October 19, 2018

Infiniband and MPI benchmarking

I've done and documented a few benchmarks before now. One figured out whether there was a performance difference between two HCA firmware versions using the perftest package. I've also run some OpenFOAM benchmarks. However, now I want to do some MPI bandwidth and latency comparisons between different types of Infiniband networks, specifically: 1. single rail QDR, 2. dual rail QDR (using both ports), 3. single rail FDR, 4. dual rail FDR (FDR10). Here a "rail" refers to a communication pathway, so "dual rail" means that both ports of the dual port HCAs are connected. Physically, it means plugging an Infiniband cable into both ports, running to the same switch. OpenMPI by default will use all high-speed network connections for MPI, which means using dual rail should be a breeze. The goal is to see which type of network achieves the highest bandwidth. I realize that some of these have theoretical bandwidths higher than the available PCI 3.0 x8 bandwidth (7.88 GB/s), so it'll be interesting to see how close to that I can get. Speaking of theory, let's do some math first.

For PCI 2.0 (ConnectX-2), the speed is 5 GT/s per lane. The following equation then gives the max theoretical bandwidth of a PCI 2.0 x8 interface with 8/10 encoding: 
PCI_LANES(8)*PCI_SPEED(5)*PCI_ENCODING(0.8) = 32 Gb/s (4 GB/s)
So that is about the maximum I should expect from a PCI 2.0 x8 slot with 8/10 encoding. For PCI 3.0, the speed is 8 GT/s and the encoding changes to 128/130, which yields ~63 Gbit/s (7.88 GB/s) for an x8 slot. QDR Infiniband uses 8/10 encoding, while FDR and FDR10 use a more efficient 64/66 encoding (though still less efficient than 128/130). A PCI 3.0 x8 slot carrying 64/66-encoded traffic has a max theoretical bandwidth of ~62 Gbit/s (7.76 GB/s). However, there are some further inefficiencies, so I expect the actual upper bandwidth limit to be slightly lower. QDR Infiniband is 4x 10 Gbit/s links with 8/10 encoding, which yields a theoretical max bandwidth of 32 Gbit/s (4 GB/s). Thus, a single QDR link will saturate a PCI 2.0 x8 port. FDR10 Infiniband is 4x 10.3125 Gbit/s links with 64/66 encoding, which yields 40 Gbit/s (5 GB/s). FDR Infiniband is 4x 14.0625 Gbit/s links with 64/66 encoding, which yields 54.55 Gbit/s (~6.8 GB/s). Again, I expect some inefficiencies, so I doubt I'll hit those values. Since a single QDR link maxes out the PCI 2.0 x8 interface of the CX-2 HCA, I expect the dual rail QDR CX-2 case to not provide any additional bandwidth. The FDR10 and FDR cases use a PCI 3.0 x8 interface. A single FDR link should not saturate the PCI 3.0 x8 slot, but dual rail QDR, FDR10, and FDR should, so their theoretical max bandwidth is the ~7.8 GB/s of the slot.
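Putting all of those lane-count * symbol-rate * encoding products in one place, a quick awk sketch of the theoretical limits:

```shell
# Theoretical bandwidths from the math above: lanes/links * rate * encoding
awk 'BEGIN {
  printf "PCI 2.0 x8 (8/10):    %.1f Gbit/s\n", 8 * 5 * 0.8
  printf "PCI 3.0 x8 (128/130): %.1f Gbit/s\n", 8 * 8 * 128 / 130
  printf "QDR   (4x10):         %.1f Gbit/s\n", 4 * 10 * 0.8
  printf "FDR10 (4x10.3125):    %.1f Gbit/s\n", 4 * 10.3125 * 64 / 66
  printf "FDR   (4x14.0625):    %.2f Gbit/s\n", 4 * 14.0625 * 64 / 66
}'
# prints 32.0, 63.0, 32.0, 40.0, and 54.55 Gbit/s respectively
```

Divide by 8 to get the GB/s figures used in the rest of this post.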

The first thing to do is install a benchmark suite. The OSU Micro-Benchmarks is a popular MPI benchmark suite. Download the tarball, extract it, go into the extracted osu-micro-benchmarks folder, and run the following:

module load mpi/openmpi-3.1.0
./configure CC=mpicc CXX=mpicxx --prefix=$(pwd)
make install

The first line is only needed if you've set up OpenMPI as a module, like I did previously. If not, then you need to point CC and CXX to the location of mpicc and mpicxx respectively. That will install the benchmark scripts in the current directory under libexec/osu-micro-benchmarks/mpi/. When I didn't specify a prefix, it put the osu-micro-benchmarks folder in /usr/local/libexec. This needs to be done on the nodes you'll be running benchmarks on; I'm only going to do two-node benchmarks, so I only installed it on the headnode and one compute node. In order to keep slurm power save from shutting down the compute node, I modified the SuspendTime variable in slurm.conf on the headnode and ran scontrol reconfig. I then turned on the QDR Infiniband network and made sure that the link was up.
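That slurm.conf change amounts to something like the following. The config path is the common default and the -1 value (which disables node suspension entirely) is an assumption; I'm not recording the exact value I used:

```shell
# Disable slurm power save while benchmarking, then tell the running
# slurmctld to re-read the config. Path is the usual default location.
sudo sed -i 's/^SuspendTime=.*/SuspendTime=-1/' /etc/slurm/slurm.conf
sudo scontrol reconfig
```

Remember to set it back afterwards, or idle nodes will never power down.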

I navigated to the folder containing the scripts, in particular the pt2pt scripts. I used the following commands to run a bandwidth and latency test:
srun -n 2 --ntasks-per-node=1 ./osu_bw > ~/results_bw_QDR_single.txt
srun -n 2 --ntasks-per-node=1 ./osu_latency > ~/results_latency_QDR_single.txt
Those run a bandwidth and latency test at many different message sizes between the headnode and one of the compute nodes, recording the results to text files in my home directory. You can use mpirun instead of srun, but you have to specify a hostfile and make sure that the compute node's environment (PATH, LD_LIBRARY_PATH) includes mpirun. For the dual rail cases, you'll get an MPI warning about more than one active port with the same GID prefix. If the ports are connected to different physical IB networks, MPI will fail because different subnets must have different subnet IDs. Typically, when more than one port on a host is used, it's used as a redundant backup on a separate switch (subnet) in case a port goes down. However, I'm using them on the same subnet in order to increase available bandwidth, so I can safely ignore this warning, which also explains how to suppress it.

I ran the above for single and dual rail QDR (CX-2 cards) first. Then I put the new FDR CX-3 cards in and ran them again with the old Sun QDR switch. For the supermicro compute nodes, I had to enable 4G decoding in order to get the nodes to boot, though the headnode was fine without it. My guess is that the BAR space is larger for the firmware version on the CX-3 cards than the CX-2 cards, which is something I've run into before. Then I pulled the old switch out, installed the new FDR switch (SX6005), installed opensm, activated the opensm subnet manager (systemctl start opensm) because the SX6005 is unmanaged, and ran the single and dual rail FDR benchmarks again. Finally, I replaced the FDR cables with the QDR cables, which causes the link to become FDR10 (after a few minutes of "training", whatever that is). I then ran the benchmarks again. The end result was 16 text files of message size vs. bandwidth or latency. I wrote a little gnuplot script to make some plots of the results.

Examining the plateau region, it's clear that dual rail QDR (CX-2 HCAs) did not help, as expected. The max single rail CX-2 QDR bandwidth was about 3.4 GB/s, which is about 15% lower than the max theoretical of the slot and QDR (4 GB/s); these are those extra inefficiencies I mentioned. Single rail CX-3 QDR bandwidth was around 3.9 GB/s, which is only about 2.5% lower than the max theoretical QDR bandwidth. The majority of this efficiency improvement is likely due to the PCI 3.0 interface's encoding improvements. The dual rail CX-3 QDR bandwidth matched the single rail up to about 8k message sizes, then jumped up to about 5.8 GB/s. Since 3.9*2 = 7.8 GB/s, which is about the max theoretical bandwidth of a PCI 3.0 x8 slot, the PCI interface or code must have some inefficiencies (~22-26%) that are limiting performance to ~5.8-6.2 GB/s. In fact, the FDR10 and FDR dual rail runs had similar max measured bandwidths. The single rail FDR10 bandwidth was about 4.65 GB/s, which is about 7% less than max theoretical. The single rail FDR bandwidth was about 5.6 GB/s, which is about 18% less than max theoretical. Again, this is probably hitting some PCI interface or code inefficiencies. Doing echo connected > /sys/class/net/ib0/mode for ib0 and ib1 didn't seem to make a difference; that might only apply to IPoIB, though.

Latency shows negligible differences for single vs. dual rail for medium-large message sizes. I only lose about 3-4% of max bandwidth (~13% near 32k) with the single rail FDR vs. the dual rail options. I don't currently own enough FDR cables to do dual rail FDR, but since the performance improvement is so small, I don't plan on purchasing 5x more of these cables.

Since I'll be using the SX6005 switch from now on, I enabled opensm so it will start every time the headnode boots.

This guy did something similar back in 2013. He got slightly higher bandwidths, and the FDR latency was higher than QDR for some reason. He does mention at the end that openmpi tended to have more variable results than mpich.

I decided to try to track down why I was seeing inefficiencies of ~22-26% in some cases. The first thing to check is process affinity. I discussed this some in a previous post, but basically the way processes are distributed to resources can be controlled. Since these tests only have two tasks, one running on each node, and there are 10 cores per socket and 2 sockets per node, there are a total of 20 cores that each task could be running on. Often, a task is bounced around between those cores, which is good for a normal computer running many different tasks, but it's bad for a compute node that only runs one main job due to the inefficiencies involved in moving that task around. Thus, it's better to bind the task to a core. There is some minor performance dependence on which core in a socket the task is bound to, but there can be major performance differences depending on which socket that core is in. If the IB HCA is in a PCI slot connected to CPU2 (logical socket 1), but the task is assigned to a core in CPU1, then the task has to communicate through the QPI link between the CPUs, which hurts bandwidth and latency. For the E5-4627 v3, the QPI has two 8 GT/s links; crossing QPI adds latency and contention, so that could definitely be a bottleneck.

I looked in my motherboard manuals for the PCI-CPU connections. The supermicro compute nodes' only slot is connected to CPU1 (logical socket 0), and the ASUS headnode's HCA is in a slot connected to CPU2. But how do I know if core binding is on, and if so, what the bindings are? It turns out that it's hard to know with srun...there also isn't as much control over bindings and mapping in slurm. mpirun can output core bindings using the "--report-bindings" flag, but as I said earlier, I can't directly run mpirun without messing with the .bashrc/environment on the compute node. Instead of using the srun commands above, I wrote an sbatch script that calls mpirun.
First, the SBATCH parameters job, output, ntasks=2, ntasks-per-node=1, and nodelist=headnode,node002 are specified. These settings let slurm know that I'll need two nodes and that it should put one of the two tasks on each node. The script then runs "module load mpi/openmpi" to load the MPI module. The mpirun command is then as follows: mpirun -host headnode,node002 -np 2 -bind-to core -map-by ppr:1:node /path/to/osu_bw. It turns out that you don't need the -bind-to core or -map-by ppr:1:node flags; the results are the same without them. As long as task affinity is activated in slurm.conf, the default slurm behavior is to bind to core (and the ntasks-per-node sbatch parameter covers the map-by-node flag). Adding the --report-bindings flag revealed that mpirun placed a task on core 0 of socket 0 of the headnode and core 0 of socket 0 of the compute node. Interesting...perhaps some of the performance inefficiency is due to the fact that my headnode's HCA is in a CPU2 (socket 1) PCI slot.
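Assembled, the sbatch script described here looks roughly like this (the job name and output filename are placeholders of mine, and /path/to/osu_bw stands in for wherever the benchmark was installed):

```shell
#!/bin/bash
#SBATCH --job-name=osu_bw          # placeholder name
#SBATCH --output=osu_bw_%j.out     # placeholder output file
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --nodelist=headnode,node002

module load mpi/openmpi
# -bind-to core / -map-by ppr:1:node turned out to be redundant with
# slurm's task affinity, so they're omitted here
mpirun -host headnode,node002 -np 2 --report-bindings /path/to/osu_bw
```

Submit it with sbatch and the bindings show up in the output file.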

At this point, I had just replicated the srun command with mpirun, so why did I bother? Enter rankfiles. An mpirun rankfile allows you to specify exact task mapping at the node, socket, and core levels, something you can't do in slurm. So I did that:
rank 0=headnode slot=1:0
rank 1=node002 slot=0:0
From the mpirun man page, "the process’ "rank" refers to its rank in MPI_COMM_WORLD", so rank 0 is the first task and rank 1 is the second task. The first line of the file says to assign the first task to the headnode on slot (socket) 1, core 0. The original slot 0 is fine for the compute node since that is where its HCA's PCI lanes are connected. Hint: use cat /proc/cpuinfo to get the socket ("physical id") and core ("processor") logical numbers. I ran this for the single rail FDR case, and it made a big difference: ~6.3 GB/s. The inefficiency went from ~18% to ~7%! For dual rail FDR, bandwidth was about ~6.5 GB/s. Since dual rail FDR should be able to saturate the PCI 3.0 x8 slot, the max theoretical is about 7.8 GB/s, making the inefficiency about 17%...much better than 22-26%. I'd expect dual rail QDR and FDR10 bandwidth to be similar to the dual rail FDR. Latency improved, but not as much. The dual rail bandwidth is still only about 3% more than the single rail, so this doesn't change my conclusion that I don't need dual rail. However, if you have a company with a big QDR Infiniband installation with lots of extra switch ports, it would be cheaper to replace only the CX-2 HCAs with dual port CX-3 ones (assuming your hardware has PCI 3.0 slots; QDR already maxes out 2.0 x8) and double the number of QDR cables than to replace all of the HCAs, cables, and switches with FDR capable ones...AND you'd end up with slightly better performance. Another way to get around the PCI slot bandwidth limit, if you have enough slots, is to use multiple HCAs. For example, two single port FDR CX-3 HCAs in dual rail mode should be able to achieve ~6.3*2 = 12.6 GB/s, which is almost double what the single dual port FDR CX-3 HCA in dual rail mode could achieve. Cool stuff.

There's another way to get efficient process bindings with mpirun. I was able to achieve the exact same core bindings and performance as with the above rankfile by using the following command:
mpirun -host headnode,node002 -np 2 -bind-to core -map-by dist:span --report-bindings --mca rmaps_dist_device mlx4_0 /path/to/osu_bw
These flags tell mpi to bind processes to core and to map them by distance from the adapter mlx4_0, which is the name of the IB HCA (mine have the same name on all nodes). The nice thing here is that no rankfile was required.

I should note, though, that the original benchmarks involving crossing the QPI are probably a more realistic representation of max bandwidth and latency since all of the cores on a node will be used for real simulations. That's why I didn't bother to re-run all of the different cases.

To do:
1. Remove old QDR hardware.
2. Enable 4G decoding on other compute nodes and install FDR HCAs.
3. Plug everything back in/cable management
4. Run OpenFOAM benchmark with FDR.
5. Install new processors and coolers in headnode
6. Run OpenFOAM benchmark on headnode and cluster

Thursday, September 13, 2018

Infiniband Upgrade: FDR

In a previous post, I showed I had perfect performance scaling with QDR Infiniband. What this means is that the interconnect is no longer the performance bottleneck, so I didn't need anything faster. Thus, I upgraded to a faster FDR Infiniband system. ......shhh....

I purchased 5x Sun 7046442 rev. A3 HCAs. These are re-branded Mellanox CX354A-Q HCAs: dual port CX-3 cards (PCI 3.0, unlike the CX-2 cards, which were PCI 2.0). They're pretty cheap now; I got these for an average of about $28 each. You can reflash these with Mellanox stock firmware of the -F variety, which is the FDR speed version (see one of my previous infiniband posts on how to burn new firmware to these), so that's what I was planning to do. I also picked up 5 FDR rated cables for $18 each, and an EMC SX6005 "FDR 56Gb" switch (these are going for <$100 now, with the managed versions going for just over that).

The first thing I tested was all of the Sun HCAs' ports and the cables. To my surprise, "ibstat" showed full FDR 56 Gbit/s link up. I guess the Sun firmware (2.11.1280) supports FDR. Lucky! Now I don't need to reflash their firmware. All of the cards and cables just worked. 

Bench testing HCAs and cables
I didn't have such luck with the switch. Both of its PSUs arrived half dead: the switch would pulse on and off when plugged in, so I had to send it back, and they sent a replacement. The replacement worked, but the links would not negotiate anything faster than FDR10. ibportstate (lid) (port) is a good tool for checking what speeds should be available for your HCAs and switches (ibswitches gives the lid of the switch, and ibstat gives the lids of the HCAs). I tried forcing the port speed using ibportstate (lid) (port) espeed 31 and other things (see the opensm.conf file for details), but nothing worked. I then did some research. This is an interesting thread for the managed EMC switches...turns out you can burn MLNX-OS to them, overwriting the crappy EMC OS. That doesn't really apply to me, though, since the SX6005 is unmanaged, so I'm running OpenSM.

I installed MFT and read the MFT manual and the SX6005 manual. I found the LID of the switch using ibswitches. I then queried the switch using flint: flint -d lid-X query full. This showed slightly outdated firmware, as well as the PSID: EMC1260110026. Cross-referencing that with the Mellanox SX6005T (the FDR10 version) firmware download PSID, MT_1260110026, you can clearly see that mine is the FDR10 version. THAT's why the switch was auto-negotiating to FDR10 and not FDR. It turns out that you can update the firmware "inband", i.e. across an active infiniband connection. What's cooler: it's the exact same process as for the HCAs! HA! I'm in business. I downloaded the MSX6005F (not MSX6005T) firmware, PSID MT_1260110021, and followed my previous instructions with a slight modification to the burn step:
flint -d lid-X -i fw.bin -allow_psid_change burn
, where X is the lid of the switch. I rebooted the switch (pulled the plugs, waited a minute, plugged it back in, waited a few minutes), then queried the switch again, and it showed the new firmware and new PSID. I then checked ibstat, and BAM: 56 Gbit/s, full FDR. I posted this solution to the "beware of EMC switches" thread I linked earlier.

Another advantage of this switch over my current QDR switch is that this one only has 12 ports and is much smaller. It's also quieter, though that's like comparing a large jet engine to a small jet engine.

Now I just have to integrate all the new hardware into the cluster. 

Before I sell the QDR cables, I'm going to try running a dual rail setup (2 infiniband cables from each HCA) just to see what happens. Supposedly OpenMPI will automatically use both, which would be awesome because that'd max out the ~63 Gbit/s of PCI 3.0 x8 slot bandwidth. We'll see...

Wednesday, September 5, 2018

Supermicro X10DAi with big quiet fan coolers, new processors

Writing papers and studying for comprehensive exams has been eating most of my time recently, but I've done a few homelab-y things.

I've been trying to sell the X10DAi motherboard for a few months now to no avail. Super low demand for them for some reason. Anyways, it's a great motherboard, so I decided to use it to test some new (used) processors and CPU coolers.

I got a great deal on 2x E5-2690 V4 ES (might actually be QS). These are 14c/28t, base frequency 2.4GHz. They're in perfect condition, which is rare for ES processors. I haven't benchmarked them yet, but the all core turbo is probably about 3.0GHz, which is the same as the 10c/10t E5-4627 v3 QS. They can also use the 2400MHz memory I have at full speed. All in all, I should be getting a speed boost of somewhere between 10% (memory only) and 50% (memory and extra cores). This will make the headnode significantly faster than the compute nodes, which generally isn't useful, but if I allocate more tasks to it, then it should balance ok. There's also the possibility that the extra cores will end up being useless due to the memory bottleneck. With the E5-4627 v3's and the openfoam benchmark, going from 16 to 20 cores only improved performance by about 5%. The extra memory bandwidth will help this some, but I expect that the performance difference between 24 and 28 cores will be ~0.

Anyways, on to the coolers. The headnode workstation isn't really loud per se, but it isn't quiet either. This hasn't really mattered until now because the noise generated by the compute nodes + infiniband switch is on par with an R/C turbojet engine. Since the upgraded fans for the soundproof server cabinet are also loud, I haven't actually closed the doors on it yet. However, I'm planning to fix that soon, so I decided to try a quieter CPU cooling solution for the headnode. I looked into water coolers, but I only had space for two 140x140mm radiators, which means they wouldn't cool better than a good fan cooler. That, coupled with a price comparison, led me to fan coolers. It's a large case, and there's plenty of head space, but since it's a dual socket motherboard, I can't fit two gigantic fan coolers. I also wanted to be able to access the RAM with the coolers installed, which limited me to 120 or 140mm single fan coolers. I purchased two Cooler Master Hyper 212 Evos (brand new from eBay was cheaper than Amazon), which, at <$30 each, have an incredible performance-per-dollar ratio.

I installed one CPU and one CPU cooler in the X10DAi to test both out. When I turned it on, I got the dreaded boot error beeps, 5 to be precise, which means it can't detect either a keyboard or a graphics card. The X10DAi does not have onboard graphics, so it requires a graphics card, which I had installed. I found a bad DVI cable, but replacing it didn't fix the problem. After about an hour of head scratching, I realized that I hadn't cleared the CMOS, which is necessary when changing CPUs. I removed power, removed the CMOS battery, shorted the CMOS clear pins, put the CMOS battery back in, powered it on, and it booted right up. I then ran memtest86 on it, installed the second CPU and cooler, ran memtest86 again, blah blah blah...

Anyways, the Cooler Master Hyper 212 Evo fan coolers work great on this dual socket motherboard. Note: you have to deviate from the included instructions. The holes for the cooler are not drilled all the way through this motherboard, so you can't install the backplate. Instead, you have to screw the shorter standoffs into the threaded holes (same thread! that was lucky), then the CPU cooler bracket into those. Make sure not to over-tighten the CPU bracket screws, because they pull up on the CPU plate, which is only attached to the surface of the motherboard PCB. If you tighten them too much, you could flex the plate enough to break it off the PCB.

Short standoffs installed on socket 2. You can see a post from socket 1 in upper right.

Both installed and running!
In case you're interested: the coolers could also have been installed rotated 90 degrees. They're tall enough to clear the RAM, and I'm pretty sure tall enough to allow RAM access even when rotated. Since that board is roughly the same size, I think they'll work on the ASUS Z10PE-D8 as well, which is where they'll ultimately be installed.

Coming soon:

  • Quieter cabinet air extraction fans
  • FDR Infiniband (oooo, shiny)

Thursday, August 9, 2018

Slurm and power saving

My cluster, like most private clusters, is not in continuous use. To save power, I have been manually powering the compute nodes off when I don't need them and back on when I do. This is a pain. I was thinking I could write some sort of script that monitors Slurm's squeue, powers off idle nodes, and powers them back on when new jobs that require their resources are added.

Turns out, Slurm has a feature called power saving that does 80% of the work. I say 80% because you still have to write two scripts: one that powers off nodes (SuspendProgram) and one that powers them on (ResumeProgram). Slurm handles identifying which nodes to power off and when to call either program. Powering off is fairly simple: a sudo call to poweroff should work. Powering on is a little trickier. I should be able to do it with ipmitool, but before I do that, I need to set up the ipmi network.
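As a sketch, a minimal SuspendProgram could look something like the following. This is an illustration, not my exact script: the function name is made up, and it assumes passwordless ssh plus a NOPASSWD sudoers entry for poweroff on each compute node.

```shell
#!/bin/bash
# Minimal SuspendProgram sketch (hypothetical -- assumes passwordless ssh
# and a NOPASSWD sudoers entry for poweroff on each compute node).
# Slurm invokes this with a compact hostlist argument like "node[002-005]".
suspend_nodes() {
    # scontrol expands the compact hostlist into one hostname per line
    for host in $(scontrol show hostnames "$1"); do
        ssh "$host" sudo /sbin/poweroff &   # soft power-off, in parallel
    done
    wait
}

suspend_nodes "$1"
```

Slurm runs this as the slurm user, so whatever mechanism the script uses to power nodes off has to work without a password prompt.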

I wanted a ~16-24 port gigabit managed switch so that I could create two VLANs: one for the basic intranet communication (MPI, SSH, slurm, etc), and one for IPMI. However, I couldn't find one for less than about $40, and those were about 10 years old...eek. New 8 port unmanaged 1GbE switches are only about $15, so I just bought another one of those. The two unmanaged switches are fanless and less power hungry, which is nice, and I didn't really need any of the other functionality a managed switch provides. Time to hook it up:

Color coded cables :D

I have two ethernet ports on the headnode: one is connected to the intranet switch, and one to the ipmi switch. I made sure my firewall was configured correctly, then tried connecting to the ipmi web interface using node002's ipmi IP in a browser. This worked. For servers with IPMI, you can use ipmitool to perform a lot of management functions from a terminal, including powering the server on and off:
ipmitool -H IP -v -I lanplus -U user -P password chassis power on
ipmitool -H IP -v -I lanplus -U user -P password chassis power soft
where IP is the BMC's IP address or hostname, user is the IPMI username, and password is the IPMI password. Pretty cool, right?

I added the ipmi IP addresses to my /etc/hosts file with the hostnames as "ipminodeXXX". Then, in the slurm suspend and resume scripts, I created a small routine that prepends "ipmi" to the hostnames that slurm passes to the scripts; the result is used in the above ipmitool calls. The SuspendProgram and ResumeProgram are run by the slurm user, which does not have root privileges, so I also had to change the permissions to make those scripts executable by slurm.
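Sketched out, the resume side looks roughly like this. The ipmi_name helper mirrors the /etc/hosts naming above; the script path, function names, and IPMI credentials are placeholders, not my actual values.

```shell
#!/bin/bash
# ResumeProgram sketch (hypothetical credentials). Slurm passes a hostlist
# like "node[002-003]"; each intranet hostname is mapped to its BMC
# hostname by prepending "ipmi", matching the /etc/hosts entries.
ipmi_name() { echo "ipmi$1"; }

resume_nodes() {
    for host in $(scontrol show hostnames "$1"); do
        ipmitool -H "$(ipmi_name "$host")" -I lanplus \
                 -U user -P password chassis power on
    done
}

resume_nodes "$1"
```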

You could also do the power off with the sudo poweroff command. In order to be able to run poweroff without entering a password, you can edit the /etc/sudoers file on the compute nodes and add the following lines:
cluster ALL=NOPASSWD: /sbin/poweroff
slurm ALL=NOPASSWD: /sbin/poweroff
This allows the users cluster and slurm to use the sudo poweroff command without entering a password. From a security standpoint, this is probably ok because the worst root privilege thing someone who gains access to either user can do is power the system off. You'll have to use something like wake-on-lan or ipmitool to boot, though.

To set ResumeTimeout, you need to know how long a node takes to boot. For my compute nodes, it's about 100s, so I set ResumeTimeout to 120s. The other settings were fairly obvious. Make sure the paths to the scripts are absolute. I excluded the headnode because I don't want slurm to turn it off.
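For reference, the power-save portion of a slurm.conf along these lines would look roughly like the following (the script paths and node name are illustrative, not my exact config):

```
# slurm.conf power-save settings (illustrative paths)
SuspendProgram=/etc/slurm/slurmsuspend.sh
ResumeProgram=/etc/slurm/slurmresume.sh
SuspendTime=300          # power a node down after 5 idle minutes
ResumeTimeout=120        # mark a node down if it takes longer than this to boot
SuspendExcNodes=headnode # never power off the headnode
```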

Once I had everything set, I copied slurm.conf to all nodes, as well as the new hosts file. I also copied the suspend and resume scripts, though I don't think that was necessary, because I believe only slurmctld (which runs only on the headnode) deals with power saving. I then tried the scontrol reconfig command, but it didn't seem to register the change, so I ended up restarting the slurmd and slurmctld services on the headnode. After that, I saw a line about the power save module in the slurmctld log file.

I then waited around for 5 minutes and slurm successfully powered down the compute nodes! They are shown as "idle~" in sinfo, where the "~" means power save mode. I had the scripts output to a power_save.log file, and I can see an entry in there for which nodes were powered down. An entry is automatically placed in the slurmctld log stating how many nodes were put in power save mode, but not which ones. Then I started my mpi test script with ntasks=100. This caused slurm to call the ResumeProgram for all nodes, which booted them (sinfo shows "alloc#"), and then slurm allocated them and ran the job. Five minutes later, slurm shut the nodes down again. Perfect. One of the rare times untested scripts I wrote just worked.

Some final notes:
  • This isn't very efficient for short jobs, but it will work great for a lot of long jobs with indeterminate end times. 
  • Interestingly, node003 was the slowest to boot by quite a lot...I'll have to see if there is something slowing its boot down. Luckily, that's the one I happened to time for setting the ResumeTimeout. Slurm places nodes in the down state when they take longer than ResumeTimeout to boot. 
  • I had an issue with node005 once earlier today where on boot the OS didn't load correctly or something...lots of weird error messages. Hasn't happened since, so hopefully it was just a fluke.
The next morning, node002 and node005 were in the "down~" state. Checking the slurmctld log, it looks like node005 unexpectedly rebooted. A node's slurmd doesn't respond when it's off; slurmctld logs this as an error but knows the node is in power save mode, so it only sets nodes down if they randomly reboot. node005 did exactly that and turned on, so it was marked down. node002 failed to resume within the ResumeTimeout limit, so slurm marked it down too. Not sure why it took so long last night; I booted it this morning in less than two minutes. Since node005 was already on, and node002 was now booted, I ran the scontrol commands for resuming the nodes, which worked. Five minutes later they were put in power save mode and are now "idle~".

I then tested slurm again with the mpi script. It booted the idle~ nodes ("alloc#"), then allocated the job. node002 failed to boot within 120s again, but node003 and node005 took ~80s. It seems boot times are quite variable for some reason, so I changed ResumeTimeout to 180s. I tried resuming node002 once it had booted, and that worked, but then slurm wouldn't allocate it for some reason (stuck on "alloc#"), so it was put in the down state again. I had to scancel the job, run the scontrol resume for node002 manually again (now in state "idle~", which wasn't true...it should have been just idle), then restart slurmd on node002. That made all of the unallocated nodes "idle" according to sinfo.

Then I tried submitting the mpi test job again. It allocated all of the idle nodes, ran the job, and exited. I waited for the nodes to be shut down, ran the mpi test script again, and slurm resumed the nodes (~2 minutes), allocated them to the job, and ran it (~3 minutes). Running the job took longer than the usual second or so (it's just a hello world) because I think slurm was still doing some bookkeeping. This is why powering nodes down isn't very efficient for short jobs.

Process for fixing a node that randomly rebooted during a powersave:
  1. The node should be on if this happened, but in the "down~" state
  2. In scontrol: update NodeName=nodeXXX State=RESUME
  3. The node should now be allocatable
Process for fixing a node that failed to boot in time during a powersave resume:
  1. Make sure there aren't any jobs queued
  2. If the node is off, boot it manually and wait for it to boot
  3. scontrol: update NodeName=nodeXXX State=RESUME
  4. sinfo should now think the node is "idle~"
  5. ssh to the node and restart slurmd
  6. sinfo should now think the node is "idle"
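The scontrol/slurmd half of these steps can be wrapped in a small helper. This is only a sketch: the function name is made up, and it assumes systemd manages slurmd and that passwordless ssh/sudo are set up.

```shell
#!/bin/bash
# Sketch: clear a node stuck after a failed power-save resume (steps 3-5).
# Hypothetical helper -- assumes systemd-managed slurmd and passwordless
# ssh/sudo to the compute node.
recover_node() {
    local node="$1"
    scontrol update NodeName="$node" State=RESUME  # sinfo should show idle~
    ssh "$node" sudo systemctl restart slurmd      # sinfo should show idle
}
```

Run it as, e.g., recover_node node002 once the node has finished booting.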
I'll keep updating this as I run into problems with slurm power save.

Helpful trick for preventing jobs from being allocated without having to scancel them: in scontrol, run hold XXX, where XXX is the job number. To un-hold it, use release XXX.

If the above doesn't work, or if the node powers up and is stuck in state "CF", but the job fails to start and it gets requeued (and this repeats), then there's something wrong with your configuration. In my case, chronyd was not disabled on the compute nodes, which prevented ntpd from starting, which messed up the time sync. I fixed this and was able to use the above steps to get the nodes working again.

Oh, and make sure your IPMI node names/IPs correspond to your intranet node names/IPs. I had node002 and node005 crossed, so slurm would end up shutting down node002 when it meant to shutdown node005 and vice versa. Oops.

Update Oct 2018: I went on a long trip and wasn't planning on running anything, so I had everything turned off. After I came back, I tried booting the headnode and submitting jobs to see if slurm would just work. It didn't. I had to follow the "process for fixing a node that failed to boot" steps above. I also had to power cycle the network switch that the ipmi network is on...I guess it entered some sort of low power mode.

Monday, July 23, 2018

Making the Infiniband Switch Remote Power-able

I eventually want to be able to power the slave nodes and infiniband switch on and off from the headnode remotely. Once I set up DDNS (future post), I can then ssh into the headnode from anywhere and boot up or shut down the cluster. The slave nodes are manageable via IPMI, so nothing else needs to be done there. The infiniband switch is another story. It's technically a managed switch, but the previous owner locked it down (changed the default IP address, disabled serial com), so I have no way of accessing the management interface. Luckily it just worked, but I couldn't power it on or off remotely...until now.

I originally wanted something like a USB or ethernet connected power switch. You can buy network power switches, but they're usually over $100, the cheapest being ~$50 on eBay. I made a post on /r/homelab and some people suggested wifi power switches. I ended up purchasing a sonoff basic for about $7. The sonoff AC power switches are based on the ESP8266 wifi chip, which can be reflashed with custom firmware to work with general home automation systems (MQTT, etc.). The firmware I used was Tasmota, which is highly recommended; its wiki includes detailed instructions for performing the firmware flash. I didn't have a 3.3V programmer, but I did have an RPi 3 Model B. Unfortunately, I broke the SD card slot on it while messing with the super tight case that came with it. I managed to mostly re-solder the SD card slot back together, though it still needs a clothes pin to work. The NOOBS SD card that came with it was actually corrupted, so I had to reformat it and put NOOBS back on it. I got the Pi to boot and installed Raspbian. I then followed the instructions here for flashing the sonoff switch, except I had to look up the pinout for my Pi model. I also soldered a 5-pin header to the sonoff switch PCB, which made connecting the jumper cables more reliable.

Unfortunately, this didn't work for me. The Pi's 3.3V pin could not provide the current required to power the sonoff switch, so the Raspberry Pi shut down. I ordered a 3.3V FTDI programmer, which I'll use along with the Arduino IDE instructions to do the programming.

Now that I have the FTDI programmer, I tried flashing the firmware, this time following the Arduino IDE instructions here. I used the precompiled Arduino zip download in case I needed to delete it easily. I was able to flash the firmware ok, though it did throw an error about a missing "cxxabi_tweaks.h". Search for that file in the path given by the error; it's in a subfolder, and it needs to be copied up a directory into the other bits folder. Once there, it should compile fine. However, it would not connect to wifi no matter what I tried...it wouldn't even start the wifi manager access point. Some googling led to this thread. Turns out the ESP8266 Arduino core version 2.4.X is incompatible with the Tasmota firmware 6.1.1, which is what I was using. So I nuked my Arduino folder and started over with the v2.3.0 ESP8266 core (you can select the version in the board manager when installing it). I re-did the settings in my user_config.h (wifi settings, etc), then compiled and uploaded following the instructions. This worked right away...it didn't even need the wifi manager access point. My router gave it an IP, which I found by opening the Arduino serial monitor right after re-inserting the USB programmer's cable (repowering the sonoff) and setting the baud rate to 115200; it automatically displays the IP address the sonoff is using. If you don't want to use DHCP, you can set a static IP in the user_config.h file. Going to that IP address in a browser brings up the web interface for the device. From there, I can toggle the switch (which toggles the LED) and edit configurations. This mess took a few hours to figure out...what a pain.
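For what it's worth, the wifi settings in user_config.h look roughly like this in the Tasmota 6.x source. I'm writing the macro names from memory, so verify them against your copy; the values are placeholders.

```c
// Illustrative user_config.h wifi overrides for Tasmota 6.x (macro names
// from memory -- check your copy; the values here are placeholders).
#define STA_SSID1        "your-ssid"      // wifi network name
#define STA_PASS1        "your-password"  // wifi password
#define WIFI_IP_ADDRESS  "0.0.0.0"        // 0.0.0.0 = use DHCP; set a static IP otherwise
```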

After that, I logged into my router and told it to always give this device the IP address it had assigned (a DHCP reservation). I could also reconfigure user_config.h with a static IP address, but at the moment I don't think it's necessary.

There are multiple ways to command it. There's the web user interface, accessed by pointing a browser at the IP address assigned to the switch. There's also MQTT, which is necessary if you're using it with a home automation program; this doesn't seem too difficult to set up, but it's more capability than I need. I really only need to power the infiniband switch on and off remotely. Using curl HTTP requests, this is pretty simple to do from a terminal. I added the following lines to my .bashrc:
alias ibswitchon='curl http://X.X.X.X/cm?cmnd=Power%20On'
alias ibswitchoff='curl http://X.X.X.X/cm?cmnd=Power%20Off'
where X.X.X.X is the IP address of the sonoff switch. This allows me to easily bring the infiniband switch online and offline.
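If I recall the Tasmota HTTP API correctly, sending the Power command with no argument just reports the current state, so a status alias along the same lines should work too (an unverified assumption on my part):

```shell
# Query the current switch state; Tasmota should reply with JSON like
# {"POWER":"ON"}. X.X.X.X is the same sonoff IP placeholder as above.
alias ibswitchstatus='curl http://X.X.X.X/cm?cmnd=Power'
```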

I purchased a 0.5m C13-C14 server power extension cable from eBay for ~$3, cut it in half, wired the sonoff in between, and re-linked the ground wires (using an extra scrap of brown wire, since the sonoff basic has no earth terminal). Because the cable is pretty thick, the included screws for the strain relief clamps were not long enough, but luckily I had some 1/2" long ones of the same diameter that worked.

Test fit. Wire clamps not pictured
I tested it a few times. Seems to work well.