Friday, July 13, 2018

Cluster: thermal testing

I cleaned up some of the wiring and made everything neat in the cabinet.

The next step was to check my modified heat extraction system to see if it could handle the thermal load. The cabinet was originally designed for 800W max, but I'm pushing ~1300W inside the cabinet (~1600W including external desktop). I taped my multimeter's thermocouple to the top, back inside of the cabinet and ran the wires out the wire passage of the back door. I then ran the all-node benchmark openfoam case for ~10 minutes. At full fan speed, the temperature leveled out at ~34-35C, which is good. The more important temperature measurement is the inlet to the server, particularly the node furthest from the air intake slit, which is located on the left side of the cabinet just behind the front door. Node005 is the top right of the 4 node 2U SM server, so it should have the hottest inlet temperature. I re-ran the case with the thermocouple taped to the top of the server near the front right. Temps never got above 31C at full throttle, so that's good. The heat extraction system is adequate.
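As a rough cross-check on a heat load like this, the airflow needed to carry the heat away can be estimated from Q = P / (rho * cp * dT). This is only a sketch: the air properties are assumed room-temperature values, and the allowable temperature rises in the loop are hypothetical, since the post does not give the ambient temperature.

```python
# Assumed air properties near room temperature (not from the post).
AIR_DENSITY = 1.2         # kg/m^3
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K)
M3S_TO_CFM = 2118.88      # 1 m^3/s expressed in cubic feet per minute


def required_airflow_m3s(heat_w: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to remove heat_w with a delta_t_k air temperature rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_w / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_k)


if __name__ == "__main__":
    # 1300 W is the in-cabinet load from the post; the dT values are hypothetical.
    for delta_t in (5, 10, 15):
        flow = required_airflow_m3s(1300, delta_t)
        print(f"dT={delta_t:>2} K: {flow:.3f} m^3/s ({flow * M3S_TO_CFM:.0f} CFM)")
```

The smaller the temperature rise you want across the cabinet, the more air the extraction fans have to move, which is why the fan choice and the noise are tied together.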

The only problem is the noise. The fans are way louder than the server hardware in the cabinet, i.e. if I turn off the extraction fans, I can barely hear any noise from the cabinet, but when I turn them on, it's super loud. The fans didn't seem this loud when they weren't mounted, so maybe something is resonating. I'll need to mess with it some.

The first time I ran the extended benchmark case for the thermal tests, I got a segfault in node005. The next time I ran it, it didn't happen. I've run all my memory through memtest (all passed), so I'm not sure what happened. I'll have to watch for segfaults.

So, to do:
  1. Fix fan noise problem
  2. Fix RAID1 data storage drive
  3. Compile guide
Update: I 3D printed a mount for the fan controller that replaces the fan blanking plate in the middle slot (I only used 2 out of 3 fans). I added a provision for the aquarium tube I used to create a water manometer for the Phi testing, which gives me a static port in the fan duct just past the fans. This should give me the static pressure generated by the fan(s), which I can use to estimate the flow rate from the published pressure vs. flow rate curve. For the first test, I removed one of the fans and put the blanking plate on. The bracket doesn't perfectly seal in the air, but it's pretty good. I then connected the fan directly to the 24V power supply and measured the static pressure...the manometer measured 0. Odd. I was certain that the pressure drop was so large that the fans were almost stalling, but apparently that's not the case. I added the second fan and repeated the test. The water rose ~0.25mm, so the total height difference was maybe 0.5mm. Either the pressure measurement is wrong, or the fans are actually operating at close to full flow rate. I'm not really sure. I'd expected a fairly significant deltaP in the exit duct due to the sound baffles, and therefore a fairly high static pressure reading, but the manometer says otherwise. Unfortunately, there aren't a lot of options for fans that are significantly quieter with a similar flow rate but lower pressure. There are some lower flow rate ones, but I can just throttle these back and achieve similar noise. Not sure where to go from here.
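To compare a water manometer reading with a fan's published pressure vs. flow curve, the height difference has to be converted to a pressure using dP = rho * g * h. A minimal sketch, assuming fresh water at room temperature and standard gravity (neither is stated in the post):

```python
WATER_DENSITY = 1000.0  # kg/m^3, assumed (fresh water near room temperature)
GRAVITY = 9.80665       # m/s^2, standard gravity


def water_column_to_pa(height_mm: float) -> float:
    """Convert a water manometer height difference in mm to pressure in Pa."""
    return WATER_DENSITY * GRAVITY * height_mm / 1000.0


if __name__ == "__main__":
    # 0.5 mm is the total height difference estimated in the post.
    print(f"{water_column_to_pa(0.5):.2f} Pa")
```

A 0.5mm difference works out to roughly 5 Pa. A fraction of a millimetre is also close to what can be read by eye on a simple water tube, so the reading itself carries a lot of uncertainty.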

Update 2: I found some lower flow rate ones that are significantly quieter. If I use 3, and assuming that the pressure drop really is that low in the passages, it should have about the same total flow rate as the two loud fans, but be 8 decibels quieter for one option and 14 decibels quieter for another option. Unfortunately, the latter fans are harder to get in the UK (more $$). I can get the former fans from China for fairly cheap. Regardless, this is going to take a few weeks to fix.
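When comparing fan options, note that sound levels don't add linearly: N identical, uncorrelated sources are 10*log10(N) dB louder than one. The sketch below combines per-fan ratings into a total. The 60 dB and 50 dB per-fan figures are hypothetical placeholders, not the ratings of the fans discussed here, and the post doesn't say whether its 8 and 14 decibel figures are per fan or for the whole set.

```python
import math


def combined_level_db(levels_db):
    """Total level of several uncorrelated sources, each given in dB."""
    return 10 * math.log10(sum(10 ** (level / 10) for level in levels_db))


def identical_fans_db(per_fan_db, count):
    """Total level of `count` identical fans rated at per_fan_db each."""
    return combined_level_db([per_fan_db] * count)


if __name__ == "__main__":
    loud = identical_fans_db(60, 2)   # hypothetical: two loud fans at 60 dB each
    quiet = identical_fans_db(50, 3)  # hypothetical: three quiet fans at 50 dB each
    print(f"2 loud: {loud:.1f} dB, 3 quiet: {quiet:.1f} dB, difference {loud - quiet:.1f} dB")
```

Going from 2 fans to 3 costs about 1.8 dB of the per-fan improvement, so the net gain is a bit smaller than the difference in the datasheet ratings.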

Thursday, July 12, 2018

Completed Cluster! Benchmarks

Finally...after months of working on this, the full 5 node cluster with infiniband works. I ran some more of the motorBike Openfoam benchmarks.

  • Headnode only, n=20: 1.12 ips (iterations per second)
  • Compute node only, n=20: 1.015 ips
  • head+node002, n=40, 1Gbe: 1.75 ips
  • head+node002, n=40, QDR infiniband: 2.18 ips
  • all 5, n=100, 1Gbe: 1.56 ips
  • all 5, n=100, QDR infiniband: 5.24 ips
You can see that the 1Gb ethernet link is definitely the bottleneck. In fact, it's so restrictive that running on all 5 nodes is actually slower than running on 2 (1.56 vs. 1.75 ips). My guess is that performance with the 1Gbe link peaks at about 3 nodes. The QDR Infiniband link is a different story entirely. It shows essentially perfect scaling (the measured ips matches the sum of the headnode + X compute node ips) up to 5 nodes, and it'd probably continue to show excellent scaling up to many more, particularly for larger meshes.
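The scaling claim can be checked with a few lines of arithmetic. This sketch takes "ideal" to mean the headnode rate plus the single compute node rate for every additional node, which is the definition used above; the numbers are the benchmark results from the list.

```python
HEAD_IPS = 1.12      # headnode only, n=20
COMPUTE_IPS = 1.015  # one compute node only, n=20

# (number of nodes, interconnect) -> measured iterations per second
MEASURED = {
    (2, "1GbE"): 1.75,
    (2, "QDR IB"): 2.18,
    (5, "1GbE"): 1.56,
    (5, "QDR IB"): 5.24,
}


def ideal_ips(nodes: int) -> float:
    """Headnode plus (nodes - 1) compute nodes, each at its single-node rate."""
    return HEAD_IPS + (nodes - 1) * COMPUTE_IPS


def efficiency(nodes: int, link: str) -> float:
    """Measured throughput as a fraction of the ideal sum."""
    return MEASURED[(nodes, link)] / ideal_ips(nodes)


if __name__ == "__main__":
    for (nodes, link), ips in sorted(MEASURED.items()):
        print(f"{nodes} nodes, {link}: {ips} ips, ideal {ideal_ips(nodes):.3f}, "
              f"efficiency {efficiency(nodes, link):.0%}")
```

With these numbers the infiniband runs land at roughly 100% of ideal (slightly above, which is probably just run-to-run noise), while 1Gbe drops from about 82% on 2 nodes to about 30% on 5.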

Feels good man...

Still have some stuff to do:
  1. Clean up the wiring
  2. Get everything situated in the soundproof cabinet
  3. Fix the heat extraction system if it isn't sufficient
  4. Fix the RAID1 data array in the headnode so it stops failing
  5. Compile these blog posts into step-by-step guides
  6. Use the cluster

Wednesday, July 11, 2018

Compute node drive cloning, take 2

The previous two posts detailed the mess and resolution to my drive cloning woes. This post will be about the actual cloning process.

I created a bootable USB of clonezilla live. I then installed that and a second SSD into the slave node, and booted to clonezilla. Clonezilla has some nice instructions for drive cloning. This time it just worked. I then shut down the node. To test the new drive, I removed the original SSD and the clonezilla USB, but left the new SSD in, then booted the node. No problems, worked just like the original! Until I put it in another node...then it failed to boot. Turns out that, because these are UEFI nodes and I didn't install CentOS on any but the first, they don't have an entry in the NVRAM to find the bootloader. So I had to create a new boot entry in the BIOS pointing to the EFI bootloader. This worked. If you don't have the option to create a new boot entry in the BIOS, then you have to do it with efibootmgr via a rescue USB.

Some modifications need to be made to the new drive to convert it to a different node.
  1. Insert the drive into another node, say node003
  2. Create a new entry in the boot menu pointing to the EFI bootloader
  3. Boot node003 using the new option
  4. Change hostname to node003
  5. Change intranet and ipmi IPs
  6. Optional: I followed these instructions to reinstall grub2: yum reinstall grub2-efi shim. My thinking is that this might create the correct NVRAM entry. Probably good to do this if you have inconsistent boot problems.
  7. Reboot and see if it boots correctly
Every time a node is brought online or offline, the cluster hostfile needs the appropriate line uncommented or commented, and the slurm.conf needs to be modified and propagated to all nodes. After modifying slurm.conf, I think you have to restart slurmctld and slurmd on the headnode, and slurmd on the slave nodes. You might also have to bring the nodes back up. Supposedly "scontrol reconfigure" also works, but I haven't tried it.
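The hostfile half of that chore is easy to script. Below is a minimal sketch that comments or uncomments one node's line. It assumes an OpenMPI-style hostfile with one "nodename slots=N" entry per line; the file layout and the function name are illustrative, not taken from my actual setup. It would also mangle a free-text comment that happens to start with the node name, and it does nothing about slurm.conf.

```python
import re


def set_node_enabled(hostfile_text: str, node: str, enabled: bool) -> str:
    """Return hostfile_text with the line for `node` uncommented or commented."""
    # Match the node name at the start of a line, optionally behind a '#'.
    pattern = re.compile(r"^\s*#?\s*" + re.escape(node) + r"(\s|$)")
    out = []
    for line in hostfile_text.splitlines():
        if pattern.match(line):
            bare = line.lstrip().lstrip("#").lstrip()
            line = bare if enabled else "#" + bare
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical hostfile contents for illustration.
    example = "headnode slots=20\n#node003 slots=20\nnode004 slots=20\n"
    print(set_node_enabled(example, "node003", True), end="")
```

The function works on text rather than a path so it can be tried out safely before pointing it at the real hostfile.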

Weirdly, and I believe unrelated, the /data RAID1 volume on the headnode failed again. I gotta figure out why that keeps failing.

When booting the cluster, the headnode must be booted first, then all of the slave nodes can be booted. Otherwise NFS fails to mount on the slave nodes. Then the nodes need to be brought up for slurm (done on the headnode). 

Recreating the compute node drive

As mentioned in the last post, something as simple as not having uniform drives across nodes can cause a huge mess. In this case, I have to reinstall everything on the slave node SSD (the smallest one this time) before I can continue.

I'm following my software guide while doing this, which currently consists of 3 parts. I'm also updating/cleaning them up as I go. If you want to read about the various screw ups I had during this process, then keep reading. If not, skip to the "Final Steps" below.

I installed CentOS, but with manual partitioning, ext4, and no LVM this time. I also used a much smaller home directory, leaving a lot of free space on the drive, which should make copying it easier. I then did the update, rebooted, installed all of the packages I thought I'd need, renamed it to node002, and rebooted again. At this point, I thought I'd try something that worked for me on the headnode. When I switched from a SSD to NVMe on the headnode (after switching motherboards), I didn't want to have to reinstall everything again. After doing the OS and package installations, I copied over all of the directories I had modified: the root and cluster home directories, /etc, /usr, and /opt. In order to include hidden files, you want to do something like "cp -a /mnt/usb1/opt/. /opt/" (the dot is critical; -a copies recursively and preserves ownership and permissions). This actually worked pretty well for openmpi and openfoam. However, that was before parts 2 and 3, which required a ton of system level settings. My hope here is that I can do these large copies from my backup usb drive, test openmpi and openfoam, then carefully go through the guides finishing the settings without having to do all of the re-installation work.

Copying the home directories, opt, and etc went fine, but copying /usr caused CentOS to crash and no longer boot. Looking back at my notes, it seems that I only copied /usr/local before. Unfortunately, some of the modifications occurred outside of /usr/local, so this idea might still require quite a lot of work. I redid all of the steps (again), but just copied /usr/local (which turns out to contain only the cmake install) instead of /usr this time. I killed the firewall on the slave node, did the network setup, commented out the nfs line on the copied fstab (since I haven't hooked it up to the intranet switch again yet), and rebooted. It was right around this time I realized that you can't copy the fstab file because it contains the unique disk identifiers. Oops. Probably not a good idea to copy all of /etc then. Starting over AGAIN.

Final steps:
  1. Install OS and packages as in software guide part 1
  2. Copy the cluster home, root home, /opt, and /usr/local (with overwrite and hidden files) directories from the backup drive
  3. reboot
  4. Test openmpi and openfoam on the one node
  5. Copy /etc/hosts file from backup drive
  6. Do network settings as in software guide part 2 (follow until further notice)
  7. Disable firewall
  8. Connect headnode and do NFS setup
  9. Test mpi over ethernet
  10. Copy the rdma.conf
  11. Reboot and setup infiniband
  12. Test mpi over infiniband
  13. Now on to software guide 3
  14. Copy over the module files
  15. test environment modules
  16. make sure ntpd is working on headnode
  17. setup ntp on slave node, can copy ntp.conf
  18. Do all of the slurm stuff from scratch
  19. Test mpi and openfoam 
If you exclude all the initial errors I made, this process took about 5 hours. The biggest hangup was an odd ssh error that was resolved by doing "ssh-add", something I didn't have to do before (added it to the guide).

Some good news: it's a little faster with the E5-2690v2's and slightly better RAM. n=40 over ethernet took 57.2s, over infiniband took 45.87s. That's about 10% faster :)

Tuesday, July 10, 2018

Attempts to clone compute node drive - fail

I put the headnode back together after the Windows 10 fiasco. It seems that every time I have to remove the RAID array (that I'm planning to use for storage) that I have to recreate it when I put it back in. The RAID controller recognized it, but CentOS did not, and I had to boot into rescue mode and comment out the mount line in fstab for it in order to boot CentOS. I then tried booting into the bios->lsi megaraid controller and recreating the drive from scratch, but that didn't seem to help either. Opening the CentOS disk utility, I noticed that the name of the virtual drive had changed from md0 to dm-0. I formatted that as ext4 and mounted it, which seemed to work ok. I then uncommented the fstab line, changed the name to dm-0, rebooted, and that seemed to work. The drive auto-mounted to /data like it was supposed to. Looking back on it, the raid array virtual drive might have been fine, but somehow got renamed to dm-0, so I may have just been able to rename it in the fstab file. That's good news for when I have data stored on it and have to remove it and put it back in.

Another annoying thing: I still have a ghost boot loader for windows on the NVMe...I've deleted it twice, but it seems to still be there somehow. I also can't seem to change the boot order: regardless of how I order it in the BIOS, it is always first.

A few more hardware changes for the slave nodes. I had 4 different sets of RAM. I purchased and sold some, so now I have 24x of one type and 8x of another. My goal is to eventually have uniform (32x) RAM in all of the compute nodes.

Software wise, I left off last time with a working Slurm installation. I got it working with OpenMPI and PMI2, but I couldn't get OpenMPI's internal PMIx working with Slurm. I filed a bug report about this, but it's a very low priority since I don't have a support contract, so it will likely never be looked at. Another problem, for which I did not file a bug report, is that slurmd on the slave nodes does not seem to be honoring the srun port range setting in the conf file. This caused me to have to whitelist the entire private subnet instead of being able to open certain ports. I went back and closed the ports I had opened for slurm. 

Now that all of the software is finalized, it's time to clone the slave node drive 3 times. This could be avoided with PXE diskless booting, but that looks like it will be a huge pain to set up, and it will take up a lot of RAM if I ever decide to put a commercial CFD program on this cluster. Unfortunately, cloning the drive ended up being a huge pain, too. I have 3 different types of 120 or 128GB SSDs, and I installed everything on the largest. This is bad because now I can't use "dd" to clone the drive to the smaller drives. I tried clonezilla because it has an auto-resize advanced setting, but it failed (I doubt it ever works). If you don't have identical drives for your slave nodes, install everything on the smallest before cloning. I updated the software part 1 instructions with this information. What's even worse: the default CentOS 7 file system is XFS, which is not shrinkable, so I can't just shrink the home partition and logical volume. *slams head on desk repeatedly*. I'm beginning to expect shit like this to happen.

Note, none of the following ended up working. If you're in a similar situation, you're better off just starting from scratch on the smaller drive.

I really only need to shrink the home partition, which doesn't have much on it because I'm mounting the headnode's home folder via NFS. If you're in this situation, then you must do the following:

  1. attach another drive (can be a large USB)
  2. copy all of the /home files to it
  3. lvs should show the logical volumes on your drive
  4. umount /home
  5. lvchange -an /dev/centos/home
  6. lvremove /dev/centos/home
  7. lvs should now not show the "home" logical volume
  8. Create the new home logical volume in the centos volume group. Use a size that results in a total disk size a few G smaller than the smallest disk you have: lvcreate -L 40G -n home centos
  9. Create the xfs (or whatever you want) filesystem on the new home logical volume, e.g. mkfs.xfs /dev/centos/home
  10. mount /dev/centos/home /home
  11. Copy the files back from the other drive to the /home directory
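One thing to watch when picking sizes in the steps above: SSDs are sold in decimal GB, while LVM treats "G" in a size argument as GiB (powers of two), so a "120GB" drive holds noticeably fewer than 120G. A minimal sketch of the budget check follows. The 2 GiB margin and the 1.2 GiB allowance for the boot partitions are assumptions for illustration, and real drives rarely match their nominal size exactly, so check the actual byte count before relying on it.

```python
GIB = 2 ** 30


def gb_to_gib(size_gb: float) -> float:
    """Convert a marketed drive size in decimal GB to GiB."""
    return size_gb * 1e9 / GIB


def fits(planned_gib: float, disk_gb: float, margin_gib: float = 2.0) -> bool:
    """True if the planned layout plus a safety margin fits on the given drive."""
    return planned_gib + margin_gib <= gb_to_gib(disk_gb)


if __name__ == "__main__":
    for nominal in (120, 128):
        print(f"{nominal} GB drive = {gb_to_gib(nominal):.1f} GiB")
    # Hypothetical layout: a 102 GiB physical volume plus ~1.2 GiB of boot partitions.
    print("fits on 120 GB drive:", fits(102 + 1.2, 120))
```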

The next step was to resize the physical volume. However, the free space ended up in the middle of the volume (between home and root), which meant I couldn't resize it. I also couldn't move the root part of the physical volume using pvmove because you can't overlap a volume move with itself. If I had an extra 50 GB (the size of root) of space, I could move the root part to that, then move it again to take up the current free space + some of the old root space, leaving the free space at the end of the volume, but I can't do that because I don't have the space. So, plan C: remove and recreate the root lv as above (this will move the free space to the end), then shrink the physical volume. Unfortunately, this requires booting from a live CD because you can't unmount root while booted from it. So I created a live USB using dd and the live KDE image of CentOS. I then booted the node with that and opened a terminal as root. I repeated the above steps for root, except without mounting it yet. Then:

  1. pvs -v --segments /dev/sda2 (this should now show all of the free space at the end, last line)
  2. pvresize --setphysicalvolumesize 102G /dev/sda2 (the size should be smaller than the available space on the smallest ssd you have, but make sure you're only cutting into free space)
  3. If the above completed successfully, run step 1 again, and you should see less free space at the end
  4. vgs and pvs should show smaller volume sizes now
The plan was then to shrink the sda2 partition, mount root somewhere, mount the usb drive I saved everything from root on, and copy everything back (note: don't need stuff in sys, tmp, or proc). However, I couldn't figure out how to shrink sda2. I tried booting to the drive, which sort of worked, but the whole permission structure of the filesystem is fucked, probably from the cp's. So yeah...going to have to reinstall EVERYTHING just because I didn't install it on the smaller drive, so I couldn't clone it. Damn this sucks.

Friday, June 22, 2018

More Windows 10 troubles

After the CentOS/Phi fiasco a few days ago, I reassembled my computer only to find out that the windows drive wouldn't boot anymore. In a previous post, I talked about the trouble I had installing Windows 10. This time it loads the Windows logo and just hangs.

So I created a Windows 10 installation USB, booted to it, and launched auto repair, which of course failed. I then used the command prompt to examine the volumes and partitions on the disk. Turns out the EFI partition was missing, so I recreated it following these instructions, which basically involved shrinking the C partition, creating a new system efi partition, and then using bcdboot to write the new boot files. When I rebooted, there were 3 entries in the boot menu. One just hangs at the logo like before (probably the original), and two cause the blue Recovery screen to pop up, which of course didn't work. *Sigh...I think I'm cursed. So back to the installation usb, which now appears to not work.

That's the tricky part with OS recovery. When is it more time efficient just to start over? I think I've wasted more time trying to recover it than it would take to reinstall.

I ended up recreating the installation media and reinstalling.

Lesson learned:
  1. Do not install windows with any other drives present
  2. If you didn't follow lesson one, take out the other drives, wipe the drive in another computer with another program, and then reinstall windows because eventually it will mess up even if you think you fixed it.
  3. Windows 10 will probably still fail again, so backup your files (luckily didn't lose anything other than time)
  4. I didn't actually try this, but some people have better luck using USB 2.0 ports than USB 3.0 ports. Try that.
Update (next day): It failed to boot again. I haven't even reinstalled the other drives yet. The best part: the windows installation media won't boot now either, even after re-creating it. The drive's windows randomly booted into recovery mode once, and I tried fixing it again, but none of that helped of course. Tried doing a slow format of the usb drive and recreating the recovery media again. I then tried booting to UEFI shell, which somehow caused windows to boot.

Update (a few weeks later): I left it on for a few weeks because I didn't feel like dealing with this shit. Somehow, it magically fixed itself though. Shutdowns and boots now work fine, even after putting the CentOS NVMe and RAID array back in. 

I really think this showcases the decline of the Windows OS. From googling around, these sorts of errors are incredibly common. Granted most Linux distros can give you this much trouble, but they're free...

Wednesday, June 20, 2018

A horrible waste of time

Incredibly frustrating day. I got a Xeon Phi back from a buyer who bought it and had no idea what it was or how to use it. eBay has fairly awful seller protection: if the buyer wants to return something without paying for shipping, all they have to do is select "item wasn't as described", and the seller is instantly slammed with the shipping costs. It could come back in pieces, and they still have to issue a refund. Luckily, the Phi seemed fine, but I needed to test it. 12+ hours of hell later...

Turns out that the Intel MPSS that built fine on CentOS 7.4 does not build at all on CentOS 7.5. Someone posted a patch on the Intel forums, but I tried doing the source code modifications and it didn't work. There goes 1.5 hours. Intel MPSS comes with RPMs for CentOS 7.3, so I thought I'd try that. I then tried creating a bootable CentOS 7.3 drive and the real awfulness started. I've made about 10 bootable CentOS drives over the past year and have never had this much trouble. All you have to do is pop a usb drive in and use the dd command to copy the iso over. I usually had to do it twice due to some sort of bug, but fine. This time I could not get the installer to run on my workstation no matter what. I tried UEFI, with and without basic graphics mode, and legacy with and without basic graphics mode, and two different graphics cards. The screen always blanked out immediately after selecting the install option. I tried two different usb drives and probably wrote the iso about 8 times. I tried overwriting the first X MB with zeros using dd. Nope, doesn't help. I tried booting the drive with my laptop: no problem, the installer starts right up. What the fuck. So I took everything nonessential out of my workstation and tried again: nothing. Nuclear time. I used my laptop's windows diskpart to delete the partition on one of the usb drives, clean it, then do a slow fat32 format to make sure everything is wiped. While that was running, I pulled the cmos battery out of the desktop, put it back in, and reset the bios settings to defaults. Since the workstation is out of commission (because I pulled the nvme drive and the raid array (which is now fucked) so it won't boot without me booting into rescue mode and editing fstab), I downloaded the CentOS 7.3 iso to my laptop's windows (25 minutes later...), checked the sha1sum (windows has a built in utility called certUtil), and used rufus to put it on the now cleaned and formatted usb drive (which took an hour).
I tested it with my laptop, no problems, then tried the work station again, but it still fails to load the installer. There must be something seriously wrong with my work station. I tried reinstalling the BIOS, didn't help. Tried re-writing the drive again, nope. Tried using a known working ubuntu drive, first two times caused boot to hang, third time after selecting install caused the workstation to reset, and fourth time finally started the install correctly...ok. Try centos drive again. Nope, failed. Tried the minimal iso...still failed. I'm going to try CentOS 7.4 DVD, which I have successfully used on this workstation before. If that doesn't work, I'm not sure what else to do. I've tried everything I can think of.

Maybe it's some sort of bios incompatibility. The bios I'm using is from Jan. 2018, while CentOS 7.3 was released in Dec. 2016. 7.4 is from Sept. 2017. I downloaded the 7.4 DVD iso and used rufus to put it on the drive. And it actually booted the installer! Holy shit. It must be some sort of bios incompatibility. I've never heard of such a thing, but that's the only thing it could be. However, the drive failed the installer self test, so I re-did the rufus dd thing again. I've had to do it twice before, so this wasn't surprising. However, that failed. I think it has to be done with linux dd twice to work right. So I booted into my laptop's ubuntu and dd'd the iso to the drive twice (there goes another 40 minutes). And this worked. No problems installing.
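Since the written drive failed the installer self test more than once, it may be quicker to check the write directly: hash the iso, then hash the same number of bytes read back from the usb device, and compare. This is a minimal sketch, not something I used at the time. The function names are made up for illustration, reading a raw device normally needs root, and a read straight after writing can be served from the page cache, so unplugging and re-inserting the drive first is the safer test.

```python
import hashlib
import os


def sha1_prefix(path: str, length: int, chunk_size: int = 1 << 20) -> str:
    """SHA-1 of the first `length` bytes of a file or block device."""
    digest = hashlib.sha1()
    remaining = length
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:
                break  # source is shorter than expected; the hashes will differ
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def image_matches_device(iso_path: str, device_path: str) -> bool:
    """True if the device starts with exactly the bytes of the iso."""
    iso_size = os.path.getsize(iso_path)
    return sha1_prefix(iso_path, iso_size) == sha1_prefix(device_path, iso_size)
```

Usage would be something like image_matches_device("centos.iso", "/dev/sdX"), where both names are placeholders. Only the first iso-sized chunk of the device is hashed because the drive is larger than the image.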

There is nothing in the google-verse about a (modern) bios and linux operating system version being incompatible that I can find (apart from something really stupid, like an ARM OS on an x64 architecture). But that's what happened here.

Lessons learned:

  • CentOS 7.3 is NOT compatible with the ASUS Z10PE-D8 BIOS version 3501. It probably isn't with the latest (3703) either. I didn't try any others, but it wouldn't surprise me if other os/bios combinations are not compatible. OS/Bios combinations close in date probably have the best chance of working. 
  • Don't use RUFUS on windows for centos installers

Also, eBay was nice enough to refund the return shipping I had to give the buyer after I explained the situation. Good customer support.