Search This Blog

Monday, July 23, 2018

Making the Infiniband Switch Remote Power-able

I eventually want to be able to power on and off the slave nodes and infiniband switch from the headnode remotely. Once I setup DDNS (future post), I can then ssh into the headnode from anywhere and boot up or shutdown the cluster. The slave nodes are manageable via IPMI, so nothing else needs to be done there. The Infiniband switch is another story. It's technically a managed switch. However, the previous owner locked it down (changed default IP address, disabled serial com), so I have no way of accessing the management interface. Luckily it just worked, but I couldn't power it on or off remotely...until now.

I originally wanted something like a USB or ethernet connected power switch. You can buy network power switches, but they're usually over $100, the cheapest being ~$50 on eBay. I made a post on /r/homelab and some people suggested wifi power switches. I ended up purchasing a sonoff basic for about $7. The sonoff AC power switches are based on the ESP8266 wifi chip, which can be reflashed custom firmware to work with general home automation stuff, such as MQTT or This firmware is what I used and is highly recommended. That wiki includes detailed instructions for performing the firmware flash. I didn't have a 3.3v programmer, but I did have a rpi3 model b. Unfortunately, I broke the SD card slot on it while messing with the super tight case that came with it. I managed to mostly re-solder the SD card slot back together, though it still needs a clothes pin to work. The noobs SD card that came with it was actually corrupted, so I had to reformat it and put noobs back on it. I got the pi to boot and installed raspbian OS. I then followed the instructions here for flashing the sonoff switch, except I had to look up the pinout for my pi model. I also soldered a 5pin header to the sonoff switch pcb, which made connecting the jumper cables more reliable.

Unfortunately, this didn't work for me. The 3.3V pin (GPIO1) could not provide the required current to power the sonoff switch, so the raspberry pi shutdown. I ordered a 3.3V FTDI programmer, which I'll use along with the Arduino ide instructions to do the programming.

Now that I have the FTDI programmer, I tried flashing the firmware. This time I followed the Arduino IDE instructions here. I used the precompiled arduino zip download in case I needed to delete it easily. I was able to flash the firmware ok, though it does throw an error about missing "cxxabi_tweaks.h". Search for that file in the path given by the's in a subfolder. This needs to be copied up a directory into the other bits folder; once there, it should compile fine. However, it would not connect to wifi no matter what I tried...wouldn't even start the wifi manager access point. Some googling led to this thread. Turns out the EPS8266 version 2.4.X is incompatible with the Tasmota firmware 6.1.1, which is what I was using. So nuked my arduino folder and started over with the v2.3.0 EPS8266 board (select version in board manager when installing eps8266). I re-did the stuff in my user_config.h (wifi settings, etc), then compiled and uploaded following the instructions. This worked right away...didn't even need the wifi manager access point. My router gave it an IP, which I found by opening the arduino serial monitor right after re-inserting the USB programmer USB cable (repowering the sonoff) and setting baud rate to 115200. It should automatically display the IP address the sonoff is using. If you don't want to use DHCP, you can set a static IP in the user_config.h file. Going to that IP address in browser brings up the web interface for the device. From there, I can toggle the switch (toggles the LED) and edit configurations. This mess took a few hours to figure out...what a pain.

After that, I logged into my router and told it to always use the IP address it assigned it for this device. I could also reconfigure the user_config.h with a static IP address, but at the moment, I don't think it's necessary.

There are multiple ways to command it. There's the web user interface, which is accessed by using a browser with the IP address assigned to the switch. There's also MQTT, which is necessary if you're using it with a home automation program. This doesn't seem too difficult to set up, but it's more capability than I need. I really only need to be able to turn the infiniband switch that this switch will be commanding on and off remotely. Using curl http requests, it's pretty simple to do this via a terminal. I added the following lines to my .bashrc:
alias ibswitchon='curl http://X.X.X.X/cm?cmnd=Power%20On'
alias ibswitchoff='curl http://X.X.X.X/cm?cmnd=Power%20Off'
, where X.X.X.X is the IP address of the sonoff switch. This allows me to easily bring the infiniband switch on and offline.

I purchased a 0.5m C13-C14 server power extension cable from eBay for ~$3, cut it in half, attached the switch between, and re-linked the ground wires (using an extra scrap of brown wire). Because the cable is pretty thick, the included screws for the wire tension relief clamps were not long enough, but luckily I had some 1/2" long ones of the same diameter that worked.

Test fit. Wire clamps not pictured
I tested it a few times. Seems to work well.

Wednesday, July 18, 2018

Using the Cluster for Astrophysics-y stuff

I spend months building this awesome cluster, and the first person to fully use it is my wife, haha. Since she put up with the mess for so long, I think this is fair.

I don't have the physics background to understand her research, but she's processing a ton of astronomical data, and the algorithms require are very cpu intensive. Her scripts are in python, each case needs 8 cores, and she has a few hundred cases. Time to queue the cluster...

I installed anaconda on the headnode in /opt so it'd be available system wide. I created an environment module for this anaconda version, then she created the virtual environment(s) she wanted. Then instead of installing it on each slave node, I just tar'd, copied the whole anaconda directory to /opt on each node, and untar'd. I also copied the environment modulefile. If I ever make massive changes, it will be faster/easier to re-clone all of the drives again, but for single program installs, it's faster to scp/ssh into each node and make modifications. I'll look into something like pssh in the future to make this sort of thing easier. Anyways, now that anaconda was installed, and her virtual environments available on all nodes, it was time to get slurm working.

Turns out it's super simple to call a python script from an sbatch script. You only need three lines after the #SBATCH setting lines:
module load python/anaconda-5.2
source activate myenv
python .py
However, when I tried running this for multiple cases, it would only assign one job per node, even though I had ntasks=1 and cpus-per-task=8. Theoretically, it should assign two jobs per node since each node has 20 cores (cores=cpus to slurm if hyperthreading is off). I had set SelectType and SelectParameters correctly in the slurm.conf, but it turns out that you have to add the OverSubscribe=YES parameter to the partition definition in slurm.conf, or slurm defaults to 1 job per node. This allowed for scheduling two of these jobs per node. I updated the slurm instructions in the software guide part 3 to cover this.

I couldn't find the answer to this online, but it turns out that it's perfectly fine to activate the same anaconda virtual environment more than once per node as long as you do it in separate terminal sessions (separate jobs for slurm) and are not making modifications to the environment during the runs. This made life a lot easier because it meant that we didn't have to try to track which environments were being used at the time of job launching.

Friday, July 13, 2018

Cluster: thermal testing

I cleaned up some of the wiring and made everything neat in the cabinet.

The next step was to check my modified heat extraction system to see if it could handle the thermal load. The cabinet was originally designed for 800W max, but I'm pushing ~1300W inside the cabinet (~1600W including external desktop). I taped my multimeter's thermocouple to the top, back inside of the cabinet and ran the wires out the wire passage of the back door. I then ran the all-node benchmark openfoam case for ~10 minutes. At full fan speed, the temperature leveled out at ~34-35C, which is good. The more important temperature measurement is the inlet to the server, particularly the node furthest from the air intake slit, which is located on the left side of the cabinet just behind the front door. Node005 is the top right of the 4 node 2U SM server, so it should have the hottest inlet temperature. I re-ran the case with the thermocouple taped to the top of the server near the front right. Temps never got above 31C at full throttle, so that's good. The heat extraction system is adequate.

The only problem is the noise. The fans are way louder than the server hardware in the cabinet, i.e. if I turn off the extraction fans, I can barely hear any noise from the cabinet, but when I turn them on, it's super loud. The fans didn't seem this loud when they weren't mounted, so maybe something is resonating. I'll need to mess with it some.

The first time I ran the extended benchmark case for the thermal tests, I got a segfault in node005. The next time I ran it, it didn't happen. I've run all my memory through memtest (all passed), so I'm not sure what happened. I'll have to watch for segfaults.

So, to do:
  1. Fix fan noise problem
  2. Fix RAID1 data storage drive
  3. Compile guide
Update: I 3D printed a mount for the fan controller that replaces the fan blanking plate in the middle slot (I only used 2 out of 3 fans). I added a provision for the aquarium tube I used to create a water manometer for the Phi testing, which gives me a static port in the fan duct just past the fans. This should give me the static pressure generated by the fan(s), which I can use to calculate the flow rate from the published pressure vs. flow rate curve. For the first test, I removed one of the fans and put the blanking plate on. The bracket doesn't perfectly seal in the air, but it's pretty good. I then connected the fan directly to the 24V power supply and measured the static pressure...the manometer measured 0. Odd. I was certain that the pressure drop was so large that the fans were almost stalling, but apparently that's not the case. I added the second fan and repeated the test. The water rose ~0.25mm, so total pressure differences was maybe 0.5mm. Either the pressure measurement is wrong, or the fans are actually operating at close to full flow rate. I'm not really sure. I'd expect some fairly significant deltaP in the exit duct due to the sound baffles, so I was expecting a fairly high static pressure measurement, but apparently that's not the case. Unfortunately, there aren't a lot of options for significantly quieter, similar flow rate, but lower pressure fans. There are some lower flow rate ones, but I can just throttle these back and achieve similar noise. Not sure where to go from here.

Update 2: I found some lower flow rate ones that are significantly quieter. If I use 3, and assuming that the pressure drop really is that low in the passages, it should have about the same total flow rate as the two loud fans, but be 8 decibels quieter for one option and 14 decibels quieter for another option. Unfortunately, the latter fans are harder to get in the UK (more $$). I can get the former fans from China for fairly cheap. Regardless, this is going to take a few weeks to fix.

Thursday, July 12, 2018

Completed Cluster! Benchmarks

Finally...after months of working on this, the full 5 node cluster with infiniband works. I ran some more of the motorBike Openfoam benchmarks.

  • Headnode only, n=20: 1.12 iter/s
  • Compute node only, n=20: 1.015 ips
  • head+node002, n=40, 1Gbe: 1.75 ips
  • head+node002, n=40, QDR infiniband: 2.18 ips
  • all 5, n=100, 1Gbe: 1.56 ips
  • all 5, n=100, QDR infiniband: 5.24 ips
You can see that the 1Gb ethernet link is definitely the bottleneck. In fact, it's so restrictive that using 5 nodes or more actually hurts performance. My guess is that the maximum performance with the 1Gbe link is probably about 3 nodes. The QDR Infiniband link is a different story entirely. It shows perfect scaling (sum of the headnode + X compute node ips) up to 5 nodes, and it'd probably continue to show excellent scaling up to many more, particularly for larger meshes.

Feels good man...

Update (3 moths later): The n=100 result is not realistic. Coincidentally, a (corrected method) n=108 case with FDR Infiniband ended up with almost the same iter/s (5.29), so just imagine the caption replaced. See this post for an explanation.

Still have some stuff to do:
  1. Clean up the wiring
  2. Get everything situated in the soundproof cabinet
  3. Fix the heat extraction system if it isn't sufficient
  4. Fix the RAID1 data array in the headnode so it stops failing
  5. Compile these blog posts into step-by-step guides
  6. Use the cluster

Wednesday, July 11, 2018

Compute node drive cloning, take 2

The previous two posts detailed the mess and resolution to my drive cloning woes. This post will be about the actual cloning process.

I created a bootable USB of clonezilla live. I then installed that, and a second SSD into the slave node, and booted to clonezilla. Clonezilla has some nice instructions for drive cloning. This time it just worked. I then shutdown the node. To test the new drive, I removed the original SSD and the clonezilla usb, but left the new SSD in, then booted the node. No problems, worked just like the original! Until I put it in another node...then it failed to boot. Turns out that, because these are UEFI nodes and I didn't install CentOS on any but the first, they don't have an entry in the NVRAM to find the bootloader. So I had to create a new boot entry in the bios pointing to EFI boot loader. This worked. If you don't have the option to create a new boot entry in the bios, then you have to do it with efibootmgr via a rescue usb.

Some modifications need to be made to the new drive to convert it to a different node.
  1. Insert the drive into another node, say node003
  2. Create a new entry in the boot menu point to the EFI bootloader
  3. Boot node003 using the new option
  4. Change hostname to node003
  5. Change intranet and ipmi IPs
  6. Optional: I followed these instructions to reinstall grub2: yum reinstall grub2-efi shim. My thinking is that this might create the correct nvram entry. Probably good to do this if have inconsistent boot problems.
  7. Reboot and see if it boots correctly
Everytime a node is brought online or offline, the cluster hostfile needs the appropriate line uncommented or commented, and the slurm.conf needs to be modified and propagated to all nodes. After modifying a slurm.conf, I think you have to restart the slurmctld and slurmd on the headnode, and slurmd on the slave nodes. You might also have to bring the nodes back up. Supposedly "scontrol reconfigure" also works, but I haven't tried it. 

Weirdly, and I believe unrelated, the /data RAID1 volume on the headnode failed again. I gotta figure out why that keeps failing.

When booting the cluster, the headnode must be booted first, then all of the slave nodes can be booted. Otherwise NFS fails to mount on the slave nodes. Then the nodes need to be brought up for slurm (done on the headnode). 

Recreating the compute node drive

As mentioned in the last post, something as simple as not having uniform drives across nodes can cause a huge mess. In this case, I have to reinstall everything on the slave node SSD (the smallest one this time) before I can continue.

I'm following my software guide while doing this, which currently consists of 3 parts. I'm also updating/cleaning them up as I go. If you want to read about the various screw ups I had during this process, then keep reading. If not, skip to the "Final Steps" below.

I installed CentOS, but with manual partitioning, ext4, and no LVM this time. I also used a much smaller home directory, leaving a lot of free space on the drive, which should make copying it easier. I then did the update, reboot, installed all of the packages I thought I'd need, renamed it to node002, and rebooted again. At this point, I thought I'd try something that worked for me on the headnode. When I switched from a SSD to NVMe on the headnode (after switching motherboards), I didn't want to have to resintall everything again. After doing the OS and package installations, I copied over all of the directories I modified: root and cluster home's, /etc, /usr/, and /opt. In order to include hidden files, you want to do something like "cp /mnt/usb1/opt/. /opt/" (the dot is critical). This actually worked pretty well for openmpi and openfoam. However, this happened before parts 2 and 3, which required a ton of system level settings. My hope here is that I can do these large copies from my backup usb drive, test openmpi and openfoam, then carefully go through the guides finishing the settings without having to do all of the re-installation work.

Copying the home directories, opt, and etc went fine, but copying /usr caused CentOS to crash and no longer boot. Looking back at my notes, it seems that I only copied usr/local before. Unfortunately, some of the modifications occurred outside of /usr/local, so this might idea might still require quite a lot of work. I redid all of the steps (again), but just copied /usr/local (which turns out just contains the cmake install) instead of /usr this time. I killed the firewall on the slave node, did the network setup, commented out the nfs line on the copied fstab (since I haven't hooked it up to the intranet switch again yet), and rebooted. It was right around this time I realized that you can't copy the fstab file because it contains the unique disk identifiers. Oops. Probably not a good idea to copy all of /etc then. Starting over AGAIN.

Final steps:
  1. Install OS and packages as in software guide part 1
  2. Copy the cluster home, root home, /opt, and /usr/local (with overwrite and hidden files) directories from the backup drive
  3. reboot
  4. Test openmpi and openfoam on the one node
  5. Copy /etc/hosts file from backup drive
  6. Do network settings as in software guide part 2 (follow until further notice)
  7. Disable firewall
  8. Connect headnode and do NFS setup
  9. Test mpi over ethernet
  10. Copy the rdma.conf
  11. Reboot and setup infiniband
  12. Test mpi over infiniband
  13. Now on to software guide 3
  14. Copy over the module files
  15. test environment modules
  16. make sure ntpd is working on headnode
  17. setup ntp on slave node, can copy npt.conf
  18. Do all of the slurm stuff from scratch
  19. Test mpi and openfoam 
If you exclude all the initial errors I made, this process took about 5 hours. The biggest hangup was an odd ssh error that was resolved by doing "ssh-add", something I didn't have to do before (added it to the guide).

Some good news: it's a little faster with the E5-2690v2's and slightly better RAM. n=40 over ethernet took 57.2s, over infiniband took 45.87s. That's about 10% faster :)

Tuesday, July 10, 2018

Attempts to clone compute node drive - fail

I put the headnode back together after the Windows 10 fiasco. It seems that every time I have to remove the RAID array (that I'm planning to use for storage) that I have to recreate it when I put it back in. The RAID controller recognized it, but CentOS did not, and I had to boot into rescue mode and comment out the mount line in fstab for it in order to boot CentOS. I then tried booting into the bios->lsi megaraid controller and recreating the drive from scratch, but that didn't seem to help either. Opening the CentOS disk utility, I noticed that the name of the virtual drive had changed from md0 to dm-0. I formatted that as ext4 and mounted it, which seemed to work ok. I then uncommented the fstab line, changed the name to dm-0, rebooted, and that seemed to work. The drive auto-mounted to /data like it was supposed to. Looking back on it, the raid array virtual drive might have been fine, but somehow got renamed to dm-0, so I may have just been able to rename it in the fstab file. That's good news for when I have data stored on it and have to remove it and put it back in.

Another annoying thing: I still have a ghost boot loader for windows on the NVMe...I've deleted it twice, but it seems to still be there somehow. I also can't seem to change the boot order regardless of how I order it in the is always first. 

A few more hardware changes for the slave nodes. I had 4 different sets of RAM. I purchased and sold some, so now I have 24x of one type and 8x of another. My goal is to eventually have uniform (32x) RAM in all of the compute nodes.

Software wise, I left off last time with a working Slurm installation. I got it working with OpenMPI and PMI2, but I couldn't get OpenMPI's internal PMIx working with Slurm. I filed a bug report about this, but it's a very low priority since I don't have a support contract, so it will likely never be looked at. Another problem, for which I did not file a bug report, is that slurmd on the slave nodes does not seem to be honoring the srun port range setting in the conf file. This caused me to have to whitelist the entire private subnet instead of being able to open certain ports. I went back and closed the ports I had opened for slurm. 

Now that all of the software is finalized, it's time to clone the slave node drive 3x times. This could be avoided with PXE diskless booting, but that looks like it will be a huge pain to setup, and it will take up a lot of RAM if I ever decide to put a commercial CFD program on this cluster. Unfortunately, cloning the drive ended up being a huge pain, too. I have 3 different types of 120 or 128GB SSDs, and I installed everything on the largest. This is bad because now I can't use "dd" to clone the drive to the smaller drives. I tried clonezilla because it has an auto-resize advanced setting, but it failed (I doubt it ever works). If you don't have identical drives for your slave nodes, install everything on the smallest before cloning. I updated the software part 1 instructions with this information. What's even worse: the default CentOS 7 file system is XFS, which is not shrinkable, so I can't just shrink the home partition and logical volume. *slams head on desk repeatedly*. I'm beginning to expect shit like this to happen.

Note, none of the following ended up working. If you're in a similar situation, you're better off just starting from scratch on the smaller drive.

I really only need to shrink the home partition, which doesn't have much on it because I'm mounting the headnode's home folder via NFS. If you're in this situation, then you must do the following:

  1. attach another drive (can be a large USB)
  2. copy all of the /home files to it
  3. lvs should show the logical volumes on your drive
  4. umount /home
  5. lvchange -an /dev/centos/home
  6. lvremove /dev/centos/home
  7. lvs should now not show the "home" logical volume
  8. Create the new home logical volume in the centos volume group. Use a size that results in a total disk size a few G smaller than the smallest disk you have.: lvcreate -L 40G -n home centos
  9. Create the xfs (or whatever you want) for the new home logical volume. 
  10. mount /dev/centos/home /home
  11. Copy the files back from the other drive to the /home directory

The next step was to resize the physical volume. However, the free space ended up in the middle of the volume (between home and root), which meant I couldn't resize it. I also couldn't move the root part of the physical volume using pvmove because you can't overlap a volume move with itself. Useful link. If I had an extra 50 gb (size of root) of space, I could move the root part to that, then move it again to take up the current free space + some of the old root space, leaving the free space at the end of the volume, but I can't do that because I don't have the space. So, plan C: remove and recreate the root lv as above (this will move free space to end), then shrink the physical volume. Unfortunately, this requires using a liveCD to boot because you can't unmount root while booted. So I created a liveusb using dd and the live KDE image of CentOS. I then booted the node with that, accessed a terminal as root. Repeated the above steps for root, except don't mount it yet. Then:

  1. pvs -v --segments /dev/sda2 (this should now show all of the free space at the end, last line)
  2. pvresize --setphysicalvolumesize 102G /dev/sda2 (the size should be smaller than the available space on the smallest ssd you have, but make sure only cutting into free space)
  3. If the above completed successfully, run step 1 again, and you should see less free space at the end
  4. vgs and pvs should show smaller volume sizes now
The plan was then to shrink the sda2 partition, mount root somewhere, mount the usb drive I saved everything from root on, and copy everything back (note: don't need stuff in sys, tmp, or proc). However, I couldn't figure out how to shrink sda2. I tried booting to the drive, which sort of worked, but the whole permission structure of the filesystem is fucked, probably from the cp's. So yeah...going to have to reinstall EVERYTHING just because I didn't install it on the smaller drive, so I couldn't clone it. Damn this sucks.