
Thursday, September 13, 2018

Infiniband Upgrade: FDR

In a previous post, I showed I had perfect performance scaling with QDR Infiniband. What this means is that the interconnect is no longer the performance bottleneck, so I didn't need anything faster. Thus, I upgraded to a faster FDR Infiniband system. ......shhh....

I purchased 5x Sun 7046442 rev. A3 HCAs. These are dual-port ConnectX-3 cards (PCIe 3.0, versus the ConnectX-2, which was PCIe 2.0), i.e. re-branded Mellanox CX354A-Q HCAs. They're pretty cheap now; I got these for an average of about $28 each. You can reflash these with Mellanox stock firmware of the -F variety, which is the FDR speed version (see one of my previous infiniband posts on how to burn new firmware to these), so that's what I was planning to do. I also picked up 5 FDR-rated cables for $18 each, and an EMC SX6005 "FDR 56Gb" switch (these are going for <$100 now, with the managed versions going for just over).

The first thing I tested was all of the Sun HCAs' ports and the cables. To my surprise, "ibstat" showed full FDR 56 Gbit/s link up. I guess the Sun firmware (2.11.1280) supports FDR. Lucky! Now I don't need to reflash their firmware. All of the cards and cables just worked. 

Bench testing HCAs and cables
I didn't have such luck with the switch. Both PSUs arrived half dead: the switch would pulse on and off when plugged in, so I had to send it back, and they sent a replacement. The replacement worked, but the links would not negotiate to anything faster than FDR10. ibportstate (lid) (port) is a good tool for checking what speeds should be available for your HCAs and switches (ibswitches gives the LID of the switch and ibstat gives the LIDs of the HCAs). I tried forcing the port speed using ibportstate (lid) (port) espeed 31 and other things (see the opensm.conf file for details), but nothing worked. I then did some research. This is an interesting thread for the managed EMC switches...turns out you can burn MLNX-OS to them, overwriting the crappy EMC OS. That doesn't really apply to me though, since the SX6005 is unmanaged, so I'm running OpenSM.
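For reference, this is roughly the check sequence (a sketch; the LID and port number are placeholders you get from the first two commands):

ibswitches                           # lists switches on the fabric and their LIDs
ibstat                               # local HCA state, LID, and active link rate
ibportstate <lid> <port>             # shows supported/enabled/active link speed and width
ibportstate <lid> <port> espeed 31   # my (failed) attempt to force the extended speed mask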

I installed MFT and read the MFT manual and the SX6005 manual. I found the LID of the switch using ibswitches. I then queried the switch using flint: flint -d lid-X query full. This showed a slightly outdated firmware, as well as the PSID: EMC1260110026. Cross-referencing that with the Mellanox SX6005T (the FDR10 version) firmware download PSID, MT_1260110026, it's clear this is the FDR10 version. THAT's why the switch was auto-negotiating to FDR10 and not FDR. Turns out you can update the firmware "in-band", i.e. across an active Infiniband connection. What's cooler: it's the exact same process as for the HCAs! HA! I'm in business. I downloaded the MSX6005F (not MSX6005T) firmware, PSID MT_1260110021, and followed my previous instructions with a slight modification to the burn step:
flint -d lid-X -i fw.bin -allow_psid_change burn
where X is the LID of the switch. I rebooted the switch (pulled the plugs, waited a minute, plugged it back in, waited a few minutes), then queried the switch again, and it showed the new firmware and new PSID. I then checked ibstat, and BAM: 56 Gbit/s, full FDR. I posted this solution to the "beware of EMC switches" thread I linked earlier.
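To summarize the whole in-band procedure (a condensed sketch of the steps above; lid-X is the switch LID from ibswitches, and fw.bin is whatever MSX6005F image you downloaded):

flint -d lid-X query full                          # note current firmware version and PSID
flint -d lid-X -i fw.bin -allow_psid_change burn   # burn the FDR image over the FDR10 PSID
# power cycle the switch, then verify:
flint -d lid-X query full                          # should show the new firmware and MT_1260110021
ibstat                                             # links should now come up at 56 Gbit/s (FDR)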

Another advantage of this switch over my current QDR switch is that this one only has 12 ports and is much smaller. It's also quieter, though that's like comparing a large jet engine to a small jet engine.

Now I just have to integrate all the new hardware into the cluster. 

Before I sell the QDR cables, I'm going to try running a dual rail setup (2 infiniband cables from each HCA) just to see what happens. Supposedly OpenMPI will automatically use both, which would be awesome because two rails (2 x 40 Gbit/s raw) would saturate the PCIe 3.0 x8 slot bandwidth. We'll see...
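If automatic multi-rail doesn't kick in, my fallback plan is to point the openib BTL at both ports explicitly, something like this (untested sketch; the mlx4_0:1,mlx4_0:2 device:port names and ./my_mpi_app are assumptions, not something I've verified on this hardware yet):

ompi_info --all | grep btl_openib_if_include     # confirm this OpenMPI build has the parameter
mpirun -np 40 --mca btl openib,self,vader \
       --mca btl_openib_if_include mlx4_0:1,mlx4_0:2 ./my_mpi_app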

Wednesday, September 5, 2018

Supermicro X10DAi with big quiet fan coolers, new processors

Writing papers and studying for comprehensive exams has been eating most of my time recently, but I've done a few homelab-y things.

I've been trying to sell the X10DAi motherboard for a few months now to no avail. Super low demand for them for some reason. Anyways, it's a great motherboard, so I decided to use it to test some new (used) processors and CPU coolers.

I got a great deal on 2x E5-2690 V4 ES (might actually be QS). These are 14c/28t, base frequency 2.4GHz. They're in perfect condition, which is rare for ES processors. I haven't benchmarked them yet, but the all core turbo is probably about 3.0GHz, which is the same as the 10c/10t E5-4627 v3 QS. They can also use the 2400MHz memory I have at full speed. All in all, I should be getting a speed boost of somewhere between 10% (memory only) and 50% (memory and extra cores). This will make the headnode significantly faster than the compute nodes, which generally isn't useful, but if I allocate more tasks to it, then it should balance ok. There's also the possibility that the extra cores will end up being useless due to the memory bottleneck. With the E5-4627 v3's and the openfoam benchmark, going from 16 to 20 cores only improved performance by about 5%. The extra memory bandwidth will help this some, but I expect that the performance difference between 24 and 28 cores will be ~0.

Anyways, on to the coolers. The headnode workstation isn't really loud per se, but it isn't quiet either. This hasn't really mattered until now because the noise generated by the compute nodes + infiniband switch is on par with an R/C turbojet engine. Since the upgraded fans for the soundproof server cabinet are also loud, I haven't actually closed the doors on it yet. However, I'm planning to fix that soon, so I decided to try a quieter CPU cooling solution for the headnode. I looked into water coolers, but I only had space for two 140x140mm radiators, which means they wouldn't cool any better than a good fan cooler. That, coupled with a price comparison, led me to fan coolers. It's a large case, and there's plenty of head space, but since it's a dual socket motherboard, I can't fit two gigantic fan coolers. I also wanted to be able to access the RAM with the coolers installed, which limited me to 120 or 140mm single fan coolers. I purchased two Cooler Master Hyper 212 Evo's (brand new from eBay was cheaper than Amazon), which, at <$30 each, have an incredible $/performance ratio.

I installed one CPU and one CPU cooler in the X10DAi to test both out. When I turned it on, I got the dreaded boot error beeps, 5 to be precise, which means it can't detect either a keyboard or a graphics card. The X10DAi does not have onboard graphics, so it requires a graphics card, which I had installed. I figured out I had a bad DVI cable, but replacing it didn't fix the problem. After about an hour of head scratching, I realized that I hadn't cleared the CMOS, which is necessary when changing CPUs. I removed power, removed the CMOS battery, shorted the CMOS clear pins, put the CMOS battery back in, powered it on, and it booted right up. I then ran memtest86 on it, installed the second CPU and cooler, then ran memtest86 again. blah blah blah...

Anyways, the Cooler Master Hyper 212 Evo fan coolers work great on this dual socket motherboard. Note: you have to modify the instructions. The holes for the cooler are not drilled all the way through this motherboard, so you can't install the back plate. Instead, you have to screw the shorter standoffs into the threaded holes (same thread! that was lucky), then the CPU cooler bracket into those. Make sure not to over-tighten the CPU bracket screws because they are pulling up on the CPU plate, which is only attached to the surface of the motherboard PCB. If you tighten them too much, it could flex the plate enough to break it off the PCB.

Short standoffs installed on socket 2. You can see a post from socket 1 in upper right.

Both installed and running!
In case you're interested: the coolers could have been installed rotated 90 degrees. They're tall enough to clear the RAM, and I'm pretty sure tall enough to allow RAM access even when rotated 90 degrees. Since the ASUS Z10PE-D8 is roughly the same size, I think they'll work on that board as well, which is where they'll ultimately be installed.



Coming soon:

  • Quieter cabinet air extraction fans
  • FDR Infiniband (oooo, shiny)


Thursday, August 9, 2018

Slurm and power saving

My cluster, like most private clusters, is not continuously used. In order to save power, I have been manually powering the compute nodes off when I don't need them and back on when I do. This is a pain. I was thinking I could write some sort of script that monitors slurm's squeue, powers off idle nodes, and then powers them back on when new jobs that need the resources get added.

Turns out, Slurm has a feature called power saving that does 80% of the work. I say 80% because you still have to write two scripts, one that powers off nodes (SuspendProgram) and one that powers on nodes (ResumeProgram). Slurm handles identifying which nodes to power off and when to call either program. The power off is fairly simple: a sudo call to poweroff should work. Powering them on is a little trickier. I should be able to do it with ipmitool, but before I do that, I need to set up the IPMI network.

I wanted to get a ~16-24 port gigabit network switch. I could then create two VLANs, one for the basic intranet communication stuff (MPI, SSH, slurm, etc.), and one for IPMI. However, I couldn't find one for less than about $40, and those were about 10 years old...eek. New 8-port unmanaged 1GbE switches are only about $15, so I just bought another one of those. The two unmanaged switches are fanless and less power hungry, which is nice, and I really didn't need any other functionality that a managed switch provides. Time to hook it up:

Color coded cables :D

I have two ethernet ports on the headnode: one is connected to the intranet switch, and one to the ipmi switch. I made sure my firewall was configured correctly, then tried connecting to the ipmi web interface using node002's ipmi IP and a browser. This worked. For servers with IPMI, you can use ipmitool to do a lot of management functions via a terminal, including powering on and off the server:
ipmitool -H IP -v -I lanplus -U user -P password chassis power on
ipmitool -H IP -v -I lanplus -U user -P password chassis power soft
where IP is the IPMI IP address or hostname, user is the IPMI username, and password is the IPMI password. Pretty cool, right?

I added the ipmi IP addresses to my /etc/hosts file with the hostnames as "ipminodeXXX". Then in the slurm suspend and resume scripts, I created a small routine that adds "ipmi" to the front of the hostnames that slurm passes to the scripts. This is used in the above ipmitool calls. The SuspendProgram and ResumeProgram are run by the slurm user, which does not have root privileges, so I also had to change the permissions to make those scripts executable by slurm.
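For reference, the resume script is basically just a loop over the hostlist slurm hands it (a simplified sketch of what mine does; the script name, user, and password are placeholders):

#!/bin/bash
# slurmresume.sh -- slurm calls this as ResumeProgram with a hostlist like "node[002-005]"
for host in $(scontrol show hostnames "$1"); do
    ipmitool -H "ipmi${host}" -v -I lanplus -U user -P password chassis power on
done

The suspend script is the same idea, just ending in chassis power soft (or an ssh to the node followed by sudo poweroff, as described below).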

You could also do the power off with the sudo poweroff command. In order to be able to run poweroff without entering a password, you can edit the /etc/sudoers file on the compute nodes and add the following lines:
cluster ALL=NOPASSWD: /sbin/poweroff
slurm ALL=NOPASSWD: /sbin/poweroff
This allows the users cluster and slurm to use the sudo poweroff command without entering a password. From a security standpoint, this is probably ok because the worst root privilege thing someone who gains access to either user can do is power the system off. You'll have to use something like wake-on-lan or ipmitool to boot, though.

To set ResumeTimeout, you need to know the time it takes for a node to boot. For my compute nodes, it's about 100s, so I set ResumeTimeout to 120s. The other settings were fairly obvious. Make sure the paths to the scripts are absolute. I excluded the headnode because I don't want slurm to turn it off.
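For reference, the relevant slurm.conf lines look roughly like this (a sketch; the script paths and the headnode name node001 are placeholders for my setup, and SuspendTime=300 matches the ~5 minute idle window I mention below):

# power saving settings
SuspendProgram=/home/cluster/slurm/slurmsuspend.sh
ResumeProgram=/home/cluster/slurm/slurmresume.sh
SuspendTime=300
SuspendTimeout=60
ResumeTimeout=120
SuspendExcNodes=node001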

Once I had everything set, I copied the slurm.conf to all nodes, as well as the new hosts file. I also copied the suspend and resume scripts, but I don't think that was necessary because I think only slurmctld (which is only on headnode) deals with power saving. I then tried the scontrol reconfig command, but it didn't seem to register the change, so I ended up having to restart the slurmd and slurmctld services on the head node. Then I saw something about the power save module in the slurmctld log file. I then waited around for 5 minutes and slurm successfully powered down the compute nodes! They are classified as "idle~" in sinfo, where the "~" means power save mode. I had it output to a power_save.log file, and I can see an entry in there for which nodes were powered down. An entry is automatically placed in the slurmctld log stating how many were put in power save mode, but not which ones. Then I started my mpi test script with ntasks=100. This caused slurm to call the slurmresume program for all nodes, which booted all of the nodes (sinfo shows "alloc#"), and then slurm allocated them and ran the job. Then five minutes later, slurm shut the nodes down again. Perfect. One of the rare times untested scripts I wrote just worked.

Some final notes:
  • This isn't very efficient for short jobs, but it will work great for a lot of long jobs with indeterminate end times. 
  • Interestingly, node003 was the slowest to boot by quite a lot...I'll have to see if there is something slowing its boot down. Luckily, that's the one I happened to time for setting the ResumeTimeout. Slurm places nodes in the down state when they take longer than ResumeTimeout to boot. 
  • I had an issue with node005 once earlier today where on boot the OS didn't load correctly or something...lots of weird error messages. It hasn't happened since, so hopefully it was just a fluke.

Updates:
The next morning, node002 and node005 were in the "down~" state. Checking the slurmctld log, it looks like node005 unexpectedly rebooted. The nodes' slurmd doesn't respond when they're off, and slurmctld logs this as an error, but it knows they're in power save mode, so it doesn't set them down unless they randomly reboot. node005 did this and turned on, so it was marked down. node002 failed to resume within the ResumeTimeout limit, so slurm marked it down as well. Not sure why it took so long last night; I booted it this morning in less than two minutes. Since node005 was already on, and node002 was now booted, I ran the scontrol commands for resuming the nodes, which worked. Then 5 minutes later they were put in power save mode and are now "idle~".

I then tested slurm again with the mpi script. It booted the idle~ nodes ("alloc#"), then allocated the job. node002 failed to boot within 120s again, but node003 and node005 took ~80s. It seems boot times are quite variable for some reason, so I changed the ResumeTimeout to 180s. I tried resuming node002 once it had booted, and that worked, but then slurm wouldn't allocate it for some reason (stuck on "alloc#"), so it was put in the down state again. I had to scancel the job, run scontrol resume on node002 manually again (it then showed state "idle~", which wasn't true...it should have been just "idle"), then restart slurmd on node002. That made all of the unallocated nodes "idle" according to sinfo.

Then I tried submitting the mpi test job again. It allocated all of the idle nodes, ran the job, and exited. I then waited for the nodes to be shut down, ran the mpi test script again, slurm resumed the nodes (~2 minutes), allocated them to the job, and ran the job (~3 minutes). Running the job took longer than a second or so (it's just a hello world, usually takes a second or less) because I think slurm was still doing some bookkeeping. This is why it's not very efficient to power save the nodes for short jobs.

Process for fixing a node that randomly rebooted during a powersave:
  1.  Node should be on if this happened, but in "down~" state
  2. scontrol: update NodeName=nodeXXX State=RESUME
  3. The node should now be allocated
Process for fixing a node that failed to boot in time during a powersave resume:
  1. Make sure there aren't any jobs queued
  2. If the node is off, boot it manually and wait for it to boot
  3. scontrol: update NodeName=nodeXXX State=RESUME
  4. sinfo should now think the node is "idle~"
  5. ssh to the node and restart slurmd
  6. sinfo should now think the node is "idle"
 I'll keep updating this as I run into problems with slurm power save.

Helpful trick for preventing jobs from being allocated without having to scancel them: go into scontrol: hold XXX, where XXX is the job number. Then to un-hold it, use release XXX.
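For example (the job ID is just a placeholder):

scontrol hold 1234      # job 1234 stays queued but won't be started
scontrol release 1234   # allow it to be scheduled again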

If the above recovery steps don't work, or if the node powers up and is stuck in state "CF" but the job fails to start and gets requeued (and this repeats), then there's something wrong with your configuration. In my case, chronyd was not disabled on the compute nodes, which prevented ntpd from starting, which messed up the time sync. I fixed this and was able to use the above steps to get the nodes working again.
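The fix on each compute node was along these lines (standard systemd commands on a CentOS-type install; adjust for your distro):

systemctl stop chronyd && systemctl disable chronyd   # stop chrony from fighting ntpd
systemctl enable ntpd && systemctl start ntpd         # let ntpd handle time sync again
systemctl status ntpd                                 # confirm it actually came up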

Oh, and make sure your IPMI node names/IPs correspond to your intranet node names/IPs. I had node002 and node005 crossed, so slurm would end up shutting down node002 when it meant to shutdown node005 and vice versa. Oops.

Monday, July 23, 2018

Making the Infiniband Switch Remote Power-able

I eventually want to be able to power the slave nodes and the infiniband switch on and off from the headnode remotely. Once I set up DDNS (future post), I can then ssh into the headnode from anywhere and boot up or shut down the cluster. The slave nodes are manageable via IPMI, so nothing else needs to be done there. The Infiniband switch is another story. It's technically a managed switch. However, the previous owner locked it down (changed the default IP address, disabled serial com), so I have no way of accessing the management interface. Luckily it just worked, but I couldn't power it on or off remotely...until now.

I originally wanted something like a USB or ethernet connected power switch. You can buy network power switches, but they're usually over $100, the cheapest being ~$50 on eBay. I made a post on /r/homelab and some people suggested wifi power switches. I ended up purchasing a Sonoff Basic for about $7. The sonoff AC power switches are based on the ESP8266 wifi chip, which can be reflashed with custom firmware to work with general home automation stuff, such as MQTT or hass.io. This (Tasmota) firmware is what I used and is highly recommended. That wiki includes detailed instructions for performing the firmware flash. I didn't have a 3.3V programmer, but I did have a Raspberry Pi 3 Model B. Unfortunately, I broke the SD card slot on it while messing with the super tight case that came with it. I managed to mostly re-solder the SD card slot back together, though it still needs a clothes pin to work. The NOOBS SD card that came with it was actually corrupted, so I had to reformat it and put NOOBS back on it. I got the Pi to boot and installed Raspbian. I then followed the instructions here for flashing the sonoff switch, except I had to look up the pinout for my Pi model. I also soldered a 5-pin header to the sonoff switch PCB, which made connecting the jumper cables more reliable.


Unfortunately, this didn't work for me. The Pi's 3.3V pin could not provide the current required to power the sonoff switch, so the Raspberry Pi shut down. I ordered a 3.3V FTDI programmer, which I'll use along with the Arduino IDE instructions to do the programming.

Now that I have the FTDI programmer, I tried flashing the firmware. This time I followed the Arduino IDE instructions here. I used the precompiled Arduino zip download in case I needed to delete it easily. I was able to flash the firmware ok, though it did throw an error about a missing "cxxabi_tweaks.h". Search for that file in the path given by the error...it's in a subfolder. It needs to be copied up a directory into the other "bits" folder; once there, it should compile fine. However, it would not connect to wifi no matter what I tried...it wouldn't even start the wifi manager access point.

Some googling led to this thread. Turns out the ESP8266 core version 2.4.X is incompatible with the Tasmota firmware 6.1.1, which is what I was using. So I nuked my Arduino folder and started over with the v2.3.0 ESP8266 board (select the version in the board manager when installing the ESP8266 core). I re-did the stuff in my user_config.h (wifi settings, etc.), then compiled and uploaded following the instructions. This worked right away...didn't even need the wifi manager access point. My router gave it an IP, which I found by opening the Arduino serial monitor right after re-inserting the USB programmer's cable (re-powering the sonoff) and setting the baud rate to 115200. It automatically displays the IP address the sonoff is using. If you don't want to use DHCP, you can set a static IP in the user_config.h file. Going to that IP address in a browser brings up the web interface for the device. From there, I can toggle the switch (toggles the LED) and edit configurations. This mess took a few hours to figure out...what a pain.

After that, I logged into my router and told it to always assign this device the same IP address. I could also reconfigure the user_config.h with a static IP address, but at the moment, I don't think it's necessary.

There are multiple ways to command it. There's the web user interface, which is accessed by pointing a browser at the IP address assigned to the switch. There's also MQTT, which is necessary if you're using it with a home automation program. This doesn't seem too difficult to set up, but it's more capability than I need. I really only need to remotely turn the infiniband switch that this sonoff is controlling on and off. Using curl http requests, it's pretty simple to do this via a terminal. I added the following lines to my .bashrc:
alias ibswitchon='curl http://X.X.X.X/cm?cmnd=Power%20On'
alias ibswitchoff='curl http://X.X.X.X/cm?cmnd=Power%20Off'
where X.X.X.X is the IP address of the sonoff switch. This allows me to easily bring the infiniband switch on and offline.
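Tasmota answers plain status queries over the same web API, which is handy for checking the state before powering the cluster up (same placeholder IP; the JSON in the comment is roughly what it returns):

curl http://X.X.X.X/cm?cmnd=Power            # query current state, returns e.g. {"POWER":"ON"}
curl http://X.X.X.X/cm?cmnd=Power%20Toggle   # flip the relay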

I purchased a 0.5m C13-C14 server power extension cable from eBay for ~$3, cut it in half, attached the switch between, and re-linked the ground wires (using an extra scrap of brown wire). Because the cable is pretty thick, the included screws for the wire tension relief clamps were not long enough, but luckily I had some 1/2" long ones of the same diameter that worked.

Test fit. Wire clamps not pictured
I tested it a few times. Seems to work well.

Wednesday, July 18, 2018

Using the Cluster for Astrophysics-y stuff

I spent months building this awesome cluster, and the first person to fully use it is my wife, haha. Since she put up with the mess for so long, I think this is fair.

I don't have the physics background to understand her research, but she's processing a ton of astronomical data, and the algorithms required are very CPU intensive. Her scripts are in python, each case needs 8 cores, and she has a few hundred cases. Time to queue up the cluster...

I installed anaconda on the headnode in /opt so it'd be available system wide. I created an environment module for this anaconda version, then she created the virtual environment(s) she wanted. Then, instead of installing it separately on each slave node, I just tar'd up the whole anaconda directory, copied it to /opt on each node, and untar'd it. I also copied the environment modulefile. If I ever make massive changes, it will be faster/easier to re-clone all of the drives again, but for single program installs, it's faster to scp/ssh into each node and make modifications. I'll look into something like pssh in the future to make this sort of thing easier. Anyways, now that anaconda was installed, and her virtual environments were available on all nodes, it was time to get slurm working.
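(Before moving on to slurm: in case it's useful, the copy step was basically this kind of loop. A rough sketch; the anaconda directory name, tarball name, and modulefile path are placeholders for however your install and environment modules are laid out.)

cd /opt && tar czf /tmp/anaconda.tar.gz anaconda
for node in node002 node003 node004 node005; do
    scp /tmp/anaconda.tar.gz ${node}:/tmp/
    ssh ${node} "tar xzf /tmp/anaconda.tar.gz -C /opt && rm /tmp/anaconda.tar.gz"
    ssh ${node} "mkdir -p /opt/modulefiles/python"
    scp /opt/modulefiles/python/anaconda-5.2 ${node}:/opt/modulefiles/python/
done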

Turns out it's super simple to call a python script from an sbatch script. You only need three lines after the #SBATCH setting lines:
module load python/anaconda-5.2
source activate myenv
python <your_script>.py
However, when I tried running this for multiple cases, it would only assign one job per node, even though I had ntasks=1 and cpus-per-task=8. Theoretically, it should assign two jobs per node since each node has 20 cores (cores=cpus to slurm if hyperthreading is off). I had set SelectType and SelectTypeParameters correctly in the slurm.conf, but it turns out that you also have to add the OverSubscribe=YES parameter to the partition definition in slurm.conf, or slurm defaults to one job per node. This allowed two of these jobs to be scheduled per node. I updated the slurm instructions in the software guide part 3 to cover this.
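Putting it all together, each of her cases gets submitted with a script that looks roughly like this (a sketch; the job name, environment name, and script name are placeholders):

#!/bin/bash
#SBATCH --job-name=astro_case_001
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

module load python/anaconda-5.2
source activate myenv
python my_case.py

With OverSubscribe=YES on the partition, slurm happily packs two of these onto each 20-core node.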

I couldn't find the answer to this online, but it turns out that it's perfectly fine to activate the same anaconda virtual environment more than once per node as long as you do it in separate terminal sessions (separate jobs for slurm) and are not making modifications to the environment during the runs. This made life a lot easier because it meant that we didn't have to try to track which environments were being used at the time of job launching.

Friday, July 13, 2018

Cluster: thermal testing

I cleaned up some of the wiring and made everything neat in the cabinet.

The next step was to check my modified heat extraction system to see if it could handle the thermal load. The cabinet was originally designed for 800W max, but I'm pushing ~1300W inside the cabinet (~1600W including the external desktop). I taped my multimeter's thermocouple to the top rear of the cabinet interior and ran the wires out the wire passage of the back door. I then ran the all-node OpenFOAM benchmark case for ~10 minutes. At full fan speed, the temperature leveled out at ~34-35C, which is good. The more important temperature measurement is the inlet to the server, particularly the node furthest from the air intake slit, which is located on the left side of the cabinet just behind the front door. Node005 is in the top right of the 4-node 2U SM server, so it should have the hottest inlet temperature. I re-ran the case with the thermocouple taped to the top of the server near the front right. Temps never got above 31C at full throttle, so that's good. The heat extraction system is adequate.

The only problem is the noise. The fans are way louder than the server hardware in the cabinet, i.e. if I turn off the extraction fans, I can barely hear any noise from the cabinet, but when I turn them on, it's super loud. The fans didn't seem this loud when they weren't mounted, so maybe something is resonating. I'll need to mess with it some.

The first time I ran the extended benchmark case for the thermal tests, I got a segfault in node005. The next time I ran it, it didn't happen. I've run all my memory through memtest (all passed), so I'm not sure what happened. I'll have to watch for segfaults.

So, to do:
  1. Fix fan noise problem
  2. Fix RAID1 data storage drive
  3. Compile guide
Update: I 3D printed a mount for the fan controller that replaces the fan blanking plate in the middle slot (I only used 2 of the 3 fans). I added a provision for the aquarium tube I used to create a water manometer for the Phi testing, which gives me a static port in the fan duct just past the fans. This should give me the static pressure generated by the fan(s), which I can use to calculate the flow rate from the published pressure vs. flow rate curve.

For the first test, I removed one of the fans and put the blanking plate on. The bracket doesn't perfectly seal in the air, but it's pretty good. I then connected the fan directly to the 24V power supply and measured the static pressure...the manometer measured 0. Odd. I was certain that the pressure drop was so large that the fans were almost stalling, but apparently that's not the case. I added the second fan and repeated the test. The water rose ~0.25mm, so the total pressure difference was maybe 0.5mm of water. Either the pressure measurement is wrong, or the fans are actually operating close to their full flow rate. I'm not really sure. I'd expect a fairly significant deltaP in the exit duct due to the sound baffles, so I was expecting a fairly high static pressure measurement, but apparently that's not the case. Unfortunately, there aren't a lot of options for fans that are significantly quieter with a similar flow rate but lower pressure. There are some lower flow rate ones, but I can just throttle these back and achieve similar noise. Not sure where to go from here.

Update 2: I found some lower flow rate ones that are significantly quieter. If I use 3, and assuming that the pressure drop really is that low in the passages, it should have about the same total flow rate as the two loud fans, but be 8 decibels quieter for one option and 14 decibels quieter for another option. Unfortunately, the latter fans are harder to get in the UK (more $$). I can get the former fans from China for fairly cheap. Regardless, this is going to take a few weeks to fix.

Thursday, July 12, 2018

Completed Cluster! Benchmarks

Finally...after months of working on this, the full 5-node cluster with infiniband works. I ran some more of the motorBike OpenFOAM benchmarks.

  • Headnode only, n=20: 1.12 iter/s
  • Compute node only, n=20: 1.015 ips
  • head+node002, n=40, 1Gbe: 1.75 ips
  • head+node002, n=40, QDR infiniband: 2.18 ips
  • all 5, n=100, 1Gbe: 1.56 ips
  • all 5, n=100, QDR infiniband: 5.24 ips
You can see that the 1Gb ethernet link is definitely the bottleneck. In fact, it's so restrictive that using all 5 nodes actually hurts performance. My guess is that performance with the 1GbE link peaks at about 3 nodes. The QDR Infiniband link is a different story entirely. It shows perfect scaling (the sum of the headnode + X compute node ips) up to 5 nodes, and it'd probably continue to show excellent scaling up to many more, particularly for larger meshes.
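(For reference, "perfect" for the 5-node Infiniband case would be roughly 1.12 + 4 x 1.015 ≈ 5.18 ips, which the measured 5.24 ips essentially matches.)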



Feels good man...



Still have some stuff to do:
  1. Clean up the wiring
  2. Get everything situated in the soundproof cabinet
  3. Fix the heat extraction system if it isn't sufficient
  4. Fix the RAID1 data array in the headnode so it stops failing
  5. Compile these blog posts into step-by-step guides
  6. Use the cluster