Rocket Science: June 2018

Friday, June 22, 2018

More Windows 10 troubles

After the CentOS/Phi fiasco a few days ago, I reassembled my computer only to find out that the windows drive wouldn't boot anymore. In a previous post, I talked about the trouble I had installing Windows 10. This time it loads the Windows logo and just hangs.

So I created a Windows 10 installation USB, booted to it, and launched automatic repair, which of course failed. I then used the command prompt to examine the volumes and partitions on the disk. Turns out the EFI partition was missing, so I recreated it following these instructions, which basically involved shrinking the C partition, creating a new EFI system partition, and then using bcdboot to write the new boot files. When I rebooted, there were 3 entries in the boot menu. One just hung at the logo like before (probably the original), and two caused the blue Recovery screen to pop up, which of course didn't work. Sigh...I think I'm cursed. So back to the installation USB, which now appeared to not work.

That's the tricky part with OS recovery. When is it more time efficient just to start over? I think I've wasted more time trying to recover it than it would take to reinstall.

I ended up recreating the installation media and reinstalling.

Lessons learned:

Do not install windows with any other drives present
If you didn't follow lesson one, take out the other drives, wipe the drive in another computer with another program, and then reinstall windows because eventually it will mess up even if you think you fixed it.
Windows 10 will probably still fail again, so back up your files (luckily I didn't lose anything other than time)
I didn't actually try this, but some people have better luck using USB 2.0 ports than USB 3.0 ports. Try that.

Update (next day): It failed to boot again. I haven't even reinstalled the other drives yet. The best part: the windows installation media won't boot now either, even after re-creating it. The drive's windows randomly booted into recovery mode once, and I tried fixing it again, but none of that helped of course. Tried doing a slow format of the usb drive and recreating the recovery media again. I then tried booting to UEFI shell, which somehow caused windows to boot.

Update (a few weeks later): I left it on for a few weeks because I didn't feel like dealing with this shit. Somehow, it magically fixed itself though. Shutdowns and boots now work fine, even after putting the CentOS NVMe and RAID array back in.

I really think this showcases the decline of the Windows OS. From googling around, these sorts of errors are incredibly common. Granted most Linux distros can give you this much trouble, but they're free...

Wednesday, June 20, 2018

A horrible waste of time

Incredibly frustrating day. I got a Xeon Phi back from a buyer who bought it and had no idea what it was or how to use it. eBay has fairly awful seller protection: if the buyer wants to return something without paying for shipping, all they have to do is select "item wasn't as described", and the seller is instantly slammed with the shipping costs. It could come back in pieces, and they still have to issue a refund. Luckily, the Phi seemed fine, but I needed to test it. 12+ hours of hell later...

Turns out that the Intel MPSS that built fine on CentOS 7.4 does not build at all on CentOS 7.5. Someone posted a patch on the Intel forums, but I tried doing the source code modifications and it didn't work. There goes 1.5 hours. Intel MPSS comes with RPMs for CentOS 7.3, so I thought I'd try that. I then tried creating a bootable CentOS 7.3 drive and the real awfulness started. I've made about 10 bootable CentOS drives over the past year and have never had this much trouble. All you have to do is pop a usb drive in and use the dd command to copy the iso over. I usually had to do it twice due to some sort of bug, but fine. This time I could not get the installer to run on my workstation no matter what. I tried UEFI, with and without basic graphics mode, and legacy with and without basic graphics mode, and two different graphics cards. The screen always blanked out immediately after selecting the install option. I tried two different usb drives and probably wrote the iso about 8 times. I tried overwriting the first X MB with zeros using dd. Nope, doesn't help. I tried booting the drive with my laptop: no problem, the installer starts right up. What the fuck. So I took everything nonessential out of my workstation and tried again: nothing. Nuclear time. I used my laptop's windows diskpart to delete the partition on one of the usb drives, clean it, then do a slow fat32 format to make sure everything is wiped. While that was running, I pulled the cmos battery out of the desktop, put it back in, and reset the bios settings to defaults. Since the workstation is out of commission (because I pulled the nvme drive and the raid array (which is now fucked) so it won't boot without me booting into rescue mode and editing fstab), I downloaded the CentOS 7.3 iso to my laptop's windows (25 minutes later...), checked the sha1sum (windows has a built in utility called certUtil), and used rufus to put it on the now cleaned and formatted usb drive (which took an hour).
I tested it with my laptop, no problems, then tried the workstation again, but it still failed to load the installer. There must be something seriously wrong with my workstation. I tried reinstalling the BIOS, didn't help. Tried re-writing the drive again, nope. Tried using a known working ubuntu drive, first two times caused boot to hang, third time after selecting install caused the workstation to reset, and fourth time finally started the install correctly...ok. Try centos drive again. Nope, failed. Tried the minimal iso...still failed. I'm going to try CentOS 7.4 DVD, which I have successfully used on this workstation before. If that doesn't work, I'm not sure what else to do. I've tried everything I can think of.

Maybe it's some sort of bios incompatibility. The bios I'm using is from Jan. 2018, while CentOS 7.3 was released in Dec. 2016. 7.4 is from Sept. 2017. I downloaded the 7.4 DVD iso and used rufus to put it on the drive. And it actually booted the installer! Holy shit. It must be some sort of bios incompatibility. I've never heard of such a thing, but that's the only thing it could be. However, the drive failed the installer self test, so I re-did the rufus dd thing again. I've had to do it twice before, so this wasn't surprising. However, that failed. I think it has to be done with linux dd twice to work right. So I booted into my laptop's ubuntu and dd'd the iso to the drive twice (there goes another 40 minutes). And this worked. No problems installing.
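A minimal sketch of the Linux-side write step with a byte-for-byte check afterwards, which is one way to catch a bad write before rebooting into the installer. The helper name write_and_verify, the iso name, and the device path are placeholders, and it assumes GNU coreutils (dd, stat, cmp) as found on Ubuntu and CentOS.

```shell
#!/usr/bin/env bash
# Sketch: write an image to a target and verify the copy byte-for-byte.
# write_and_verify is a hypothetical helper name, not a standard tool.
write_and_verify() {
    local image="$1" target="$2"
    local size
    size=$(stat -c %s "$image") || return 1
    # conv=fsync flushes the data to the target before dd exits.
    # conv=notrunc only matters when the target is a regular file (it is left at its full size).
    dd if="$image" of="$target" bs=4M conv=fsync,notrunc status=none || return 1
    # Compare only the first $size bytes: a USB stick is usually larger than the iso.
    # Note: the read-back may be served from the page cache, so re-plugging the
    # drive before comparing is a stricter check.
    if cmp -n "$size" "$image" "$target"; then
        echo "verified: $target matches $image"
    else
        echo "MISMATCH: rewrite $target" >&2
        return 1
    fi
}

# Example (destructive; double-check the device name with lsblk first):
# write_and_verify centos.iso /dev/sdX
```

If the comparison fails, writing the image a second time and re-running the check is cheaper than finding out from the installer's self test.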

There is nothing in the google-verse about a (modern) bios and linux operating system version being incompatible that I can find (apart from something really stupid, like an ARM OS on an x64 architecture). But that's what happened here.

Lessons learned:

CentOS 7.3 is NOT compatible with the ASUS Z10PE-D8 BIOS version 3501. It probably isn't with the latest (3703) either. I didn't try any others, but it wouldn't surprise me if other os/bios combinations are not compatible. OS/Bios combinations close in date probably have the best chance of working.
Don't use RUFUS on windows for centos installers

Also, eBay was nice enough to refund the return shipping I had to give the buyer after I explained the situation. Good customer support.

Saturday, June 16, 2018

More cluster hardware changes

I got a good deal on 4x E5-2690 v2's, so I decided to switch out the E5-2690's for them and run some tests. With 20 cores, the motorbike benchmark ended up about 2s faster than with the E5-2667 v2's on 16 cores. On 16 cores, the E5-2690 v2's were about 1s slower. That's well within the repeatability margin. On one core, the E5-2667 v2 was about 7% faster, which makes sense because its single core turbo frequency is about 10% higher. The reason the E5-2690 v2's aren't faster despite having a higher core*GHz (33 vs. 28.8) is the memory bottleneck. For these nodes and this benchmark, time improvement is on the order of 10's of seconds for 8 cores and up and only seconds for 16 cores and up. The memory bandwidth is fully saturated, so additional cores simply don't help. This is why processors with many (>16), but slower cores are not recommended for CFD. On non-memory bottleneck workloads (pretty much everything not CFD), the E5-2690 v2's should be faster due to the additional cores, so I decided to go with those. The seller had more of them, so I got 4x more at the same price and replaced the E5-2667 v2's. Now my cluster has 100 cores at 3-3.3GHz. Nice.

Saturday, June 9, 2018

Experience installing Windows 10

I needed Windows to run some programs. I've set up dual boot Windows - Ubuntu on both my (Win 8.1, Ubuntu 16) and (Win 10, Ubuntu 14 and 18) laptops before, but I decided I didn't want to clutter the NVMe drive. I purchased a separate SSD, installed it, downloaded the Win 10 installation media to a USB, and installed it using "custom installation". I made sure to select the SSD drive. Windows seemed to install fine. I then rebooted. The windows boot manager appeared on the NVMe. Oh shit. I booted into CentOS, which luckily still worked fine. I did some searching in directories and googling and discovered something...incredibly...stupid. Windows 10 does not necessarily install its bootloader on the drive you select! It just installs it on the first drive in the boot order. What the f&*#! Why, just why... There's an easy fix luckily. In windows, open an admin powershell and run "bcdboot C:\Windows /s C:". This adds the boot files to the C: drive. Then shutdown, and your BIOS should detect the bootloader on the correct disk. To get rid of the bootloader on my linux drive (the NVMe), I just deleted the Microsoft folder in the /boot/efi/EFI directory. Oddly, the BIOS still thinks it exists, but trying to boot with it doesn't do anything. Booting with the windows boot manager on the correct drive now works, though.

Lesson learned: Unplug all drives if installing windows on a new/separate drive.

Another problem I ran into is that Windows wouldn't shut down. Clicking the power button and selecting shutdown just logged me out and blanked the screen. I had to disable Fastboot under power button settings->advanced settings to get the computer to shut down.

To activate it, I purchased a Windows 10 Pro product key from eBay. They're only a few dollars, so I thought it would be worth a try. The first seller I purchased from strung me along with non-working keys for a couple days before I filed for a refund (eBay money back guarantee). The second seller had a better feedback score (99.8). I received the key within 1 minute via email and it worked. I don't think it's transferable, but for 5 dollars, I could buy many many of these keys for the price of an actual Windows 10 Pro disc. I think the key is to buy from a seller that has a high feedback rating and seems to be selling a lot of them.

Another issue I ran into recently with Win 10 on another laptop: you can't create recovery media using the recovery menu option. Supposedly it works right after installing windows 10, but after a bunch of updates, it always fails. There's a huge thread about this on the Microsoft forum. The only way around this is to create new installation media, which contains recovery options.

It's shit like this that reminds me why I prefer Linux now. Windows is supposed to "just work", but it can be just as much of a pain in the ass to deal with as Linux, but without the ability to fix it.

Wednesday, June 6, 2018

Wanhao i3 V2.1: more upgrades and maintenance

I did some maintenance on and upgrades to the Wanhao i3 V2.1 today.

Some cheap Chinese printers suffer from under-spec'd power connectors on their main boards. Over time, these can heat up, char, and eventually fail, causing fire. While none of my connectors have shown any sign of discoloration or impending failure after 75 days of printing time (6.8 km of filament), I decided to do a common preventative upgrade. Since the heated bed is the primary power consuming device, it makes sense to offload the heated bed's power. Since the heated bed is PWM'd to achieve different average power, you can add a MOSFET between the signal, power, and heated bed power wires, where the signal comes from the original heated bed connections on the controller board, power comes directly from the DC power supply, and the heated bed power wires are the original. Before doing this, I printed this thing, which is a nice holder for the MOSFET board. That gets installed on two standoffs under the Melzi, and the MOSFET board screws to it.

Upper right: MOSFET board installed

I used ~10AWG wire for the connection from the PSU, but it was pretty unwieldy. I'd suggest using 14AWG wire. Anyways, instead of just jamming the wires under the terminals, I soldered forked spade connectors to the ends and heat shrunk them. I did the same for the heated bed wires.

While in the safety mood, I installed a smoke detector on the ceiling of the room with the printer.

The wires to the hot end heater cartridge come crimped from the factory. I found it annoying to have to work on the hot end with it attached to the carriage, so I replaced the crimps with micro-Deans (micro T) connectors. These are rated up to 10A, but the Chinese knock-offs are probably not able to handle that much current. Since the heater cartridge only uses ~40 W / 12 V = ~3.3 A, it should be fine.

I also re-built the hot-end again. In a previous post, I detailed all the things I tried to get back to how it would print previously (no blobs, no initial under extrusion, etc.). All of that failed, and I ended up applying software band-aids, but the last physical thing I tried was spacing the nozzle off the heated block and using the original PTFE tube, which was too short. This time I rebuilt it with a new nozzle, hex edge flush against the heated block and using a proper-length PTFE tube. I made sure everything was tight to prevent leaks, then put the whole hot end back together.

I then oiled the bearings and greased the z-axis threaded rods.

I also printed a new Diii Cooler, this time out of PETG, which has a higher temperature limit than PLA. Before doing that, I printed a temperature test tower, with increments of 5 degrees from 250-200C (+10 from markings on tower). These are great for finding the optimal print temperature for a filament.

Settings were 0.2mm layer height, 40 mm/s (20 mm/s outer wall), 0.1 extra prime, 0.15 coast, 0.4mm nozzle, 0.5mm line width, 1mm walls, 10% infill, 100% fan after 1st layer, 70C bed temp, etc. I forgot to use a brim, so the little side bridge part popped off, and I had to pause the print to tape it down. Ignore the lower 3 bridges because of that. I started to hear some skips at 225C, and I had to stop the print at 205C. The best combination of strength and quality seemed to be about 245C. I could probably print at 230-240C if I cared more about quality than strength. Using less fan would probably let me print at a lower temperature, too, but since the DiiiCooler has huge overhangs that can't be supported, I need the fan. I then printed an XYZ calibration cube using those settings, 245C, and a layer height of 0.25mm.

It came out great. The first layer was smashed too much, so I lowered the bed slightly. Small warping at the downward facing points of the X and Y, but that was almost always present with PLA. Good layer adhesion. Might be slightly overextruding because the X and Y were ~20.2mm instead of 20. It popped right off the bed, so I increased the bed temp to 75C. Interestingly, the PLA DiiiCooler has had no problem with the high extruder temps...no signs of warping or melting. The PID controller isn't as steady at the higher temps, ~-2/+1C instead of +/-1C. But I figured that was good enough. Then I printed the DiiiCooler:

Sporting some CA on the side

The extruder was skipping on the long path inner walls (40 mm/s), so I turned speed down to ~65% to stop the skipping. I guess flow rate is more limited with PETG than for PLA. The slower speed throws off the extra prime and coast settings, i.e. you need less of both. This caused the blobs and inconsistent extrusion near the right side of the fan opening, which is where I set the Z change location. I think the layer thickness was too high...only two layers on top and bottom, and it was clearly not air tight on the top (CA to the rescue). If I were to print it again, I'd use 0.2 mm layer height or lower and go slightly faster (~30 mm/s).

Summary of PETG-specific settings:

Same speed settings as PLA
Use PLA prime and coast values
Only use fan if printing overhangs/bridges, otherwise don't need it
Strong temp: 245C (maybe lower if 0 fan?)
Good quality temp: 235C
75C bed
Max flow: 40 mm/s @ 0.2 mm layer height, 0.5mm line width. For long continuous extrusions, that will probably cause skips, so turn down to ~30 mm/s. Needs about 25 mm/s max for 0.25 mm layer heights. When reducing speed, reduce prime and coast approximately proportionally.
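The speed limits in that last item are easier to compare as volumetric flow, i.e. speed x layer height x line width. A small sketch of that arithmetic, using awk for the floating point math; flow_rate is a made-up helper name and the inputs are just the numbers from the list.

```shell
# Volumetric flow in mm^3/s = speed (mm/s) * layer height (mm) * line width (mm).
# flow_rate is a hypothetical helper name.
flow_rate() {
    awk -v s="$1" -v h="$2" -v w="$3" 'BEGIN { printf "%.3f\n", s * h * w }'
}

flow_rate 40 0.2 0.5    # nominal max at 0.2 mm layers
flow_rate 30 0.2 0.5    # reduced speed for long continuous extrusions
flow_rate 25 0.25 0.5   # suggested max at 0.25 mm layers
```

The two reduced settings both land near 3 mm^3/s, which suggests the practical limit with this hot end and PETG is a volumetric one rather than a speed one.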

I'm using PLA next, so I'll need to heat the extruder to ~225C, remove the PETG, insert the PLA, and extrude a bunch to get all of the PETG out so it doesn't clog.

One of the last things I need to do to this printer is replace the worn X-axis belt and install the belt tensioner. However, the belt is tight and still functional, so I'm going to wait on it to wear out more before doing that.

Cluster Software, Part 3

Environment Modules

Environment modules are a very convenient way to manage your environment, particularly when you have multiple conflicting packages, e.g. two versions of gcc or different MPIs. I'll be following that guide with a few modifications. Luckily, CentOS 7.5 has the "environment-modules" v3.2.10 package, so there's no need to compile from source. When that is installed with yum, the directory corresponding to the "/usr/local/Modules/default/" directory in that link is "/usr/share/Modules". It's automatically available to all users, so there's no need to link the sh init script.

I created a directory "mpi" under /usr/share/Modules, then copied the "modules" module file to it, renaming to "openmpi-3.1.0". I used it as a template to create the openmpi module file. I followed the above guide, as well as these guides: 1, 2. After saving, I commented out the openmpi-specific additions to my .bashrc files, rebooted, and checked to make sure they weren't in my path. Then I tried "module help mpi/openmpi-3.1.0" and "module whatis mpi/openmpi-3.1.0" just to make sure module sees the module. Then I loaded it "module load mpi/openmpi-3.1.0", and checked the PATH and LD_LIBRARY_PATH to make sure they were modified correctly. Then I unloaded it to make sure that the environment was reset.
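For illustration, a minimal modulefile along these lines can be generated from the shell. This is a sketch under assumptions, not a copy of my file: /opt/openmpi-3.1.0 is a placeholder install prefix, and MODULE_ROOT is a hypothetical variable pointing at a scratch directory rather than the system one.

```shell
# Sketch: generate a modulefile for Open MPI 3.1.0.
# ASSUMPTIONS: /opt/openmpi-3.1.0 is a placeholder install prefix, and
# MODULE_ROOT is a hypothetical variable naming a directory of modulefiles.
MODULE_ROOT="${MODULE_ROOT:-$HOME/modulefiles}"
mkdir -p "$MODULE_ROOT/mpi"

# The quoted 'EOF' keeps the shell from expanding $prefix; Tcl expands it at load time.
cat > "$MODULE_ROOT/mpi/openmpi-3.1.0" <<'EOF'
#%Module1.0
proc ModulesHelp { } {
    puts stderr "Adds Open MPI 3.1.0 to PATH and LD_LIBRARY_PATH"
}
module-whatis "Open MPI 3.1.0"

set prefix /opt/openmpi-3.1.0
prepend-path PATH            $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib
EOF
```

With a scratch directory like this, "module use $MODULE_ROOT" adds it to the search path, after which the same help/whatis/load/unload checks apply.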

The first link shows an example for GCC. If I ever compile a non-system one (in say, /opt) I can create a module file for it and load the module, which will prepend the location of gcc to the PATH, so it will be the version used. Unloading the module will undo the environment changes.

I created another module file for OpenFOAM v1712 to replace the alias I've been using. This was quite a bit trickier. Module files are written in Tcl, which does not allow for executing bash commands like "source", so I couldn't just source the openfoam bashrc. Luckily, I found this link, the last post in which had a great idea. Save your environment before and after running the openfoam alias, find the differences (with "diff"), pipe that into sed, and then clean up the result. That sed command worked pretty well. I pulled the PATH and LD_LIBRARY_PATH lines out and changed them to prepend-path commands and just prepended the differences. This is to prevent setenv from overwriting those environment variables. Note that diff just copies the lines that have differences, not the differences inside lines, so I had to manually remove what was in PATH and LD_LIBRARY_PATH in the pre-source-bashrc environment. I also had to clean up a couple other setenv lines, in particular any that had "=" signs in them got messed up by sed. I added a "prereq" for "mpi/openmpi-3.1.0", so that if I try to load this openfoam module before openmpi-3.1.0, it throws an error.
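The before/after idea can be sketched roughly like this. It is not the sed command from that forum post; it swaps diff for comm on sorted snapshots, and env_to_modulefile is a made-up helper name. PATH-like variables still come out as setenv lines and need to be converted to prepend-path by hand, as described above.

```shell
#!/usr/bin/env bash
# Sketch: turn the environment changes made by a sourced script into modulefile lines.
# env_to_modulefile is a hypothetical helper name.
# Limitations: multi-line values and values containing quotes or "$" need manual cleanup.
env_to_modulefile() {
    local script="$1" before after
    before=$(mktemp)
    after=$(mktemp)
    env | LC_ALL=C sort > "$before"
    # Source in a subshell so the calling shell stays clean.
    ( source "$script" >/dev/null 2>&1; env | LC_ALL=C sort ) > "$after"
    # comm -13 keeps lines that appear only in the "after" snapshot (new or changed variables).
    # The sed pattern splits on the first "=" only, so values containing "=" stay intact.
    LC_ALL=C comm -13 "$before" "$after" |
        sed -nE 's/^([A-Za-z_][A-Za-z0-9_]*)=(.*)$/setenv \1 "\2"/p'
    rm -f "$before" "$after"
}

# Example (path is a placeholder for wherever the OpenFOAM bashrc lives):
# env_to_modulefile /path/to/OpenFOAM/etc/bashrc > openfoam.setenv.txt
```

Treat the output as a starting point to paste into the module file and edit, not as a finished module.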

I repeated the above for the slave nodes, though I just copied over the module files. The openfoam one needed some modifications because ParaView is not installed on the slave nodes, but MESA and the VTK libraries are.

In summary, the above environment modules will allow me to have multiple versions of different software installed simultaneously without messing up my environment. These module files will also be extremely useful in a job scheduler.

Synchronize System Clocks

It's important to have a synchronized system clock for clusters. This can be done easily using NTP. On the headnode:

yum install ntp
systemctl enable ntpd
systemctl start ntpd

After a few minutes, "ntpq -pn" should return a list of IP addresses, one with a * next to it. This means it's working.
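If you want to script that check (for example in a node health script), a minimal sketch is to look for a line starting with "*" in the ntpq output. The helper name ntp_synced is made up; it reads from stdin so it can be tried on saved output as well.

```shell
# Sketch: succeed if ntpd has selected a sync source.
# In "ntpq -pn" output, the peer chosen for synchronization is prefixed with "*".
# ntp_synced is a hypothetical helper name; it reads ntpq output on stdin.
ntp_synced() {
    grep -q '^\*'
}

# Usage on a node running ntpd:
# if ntpq -pn | ntp_synced; then echo "synced"; else echo "not synced yet"; fi
```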

The configuration file is located at /etc/ntp.conf. I just left the default servers. I think you used to need to select servers on your continent manually, however now they have auto-selecting servers. There are some settings that do need to be modified. This command guide is a good resource, as is this page. For the headnode, comment out "restrict default nomodify notrap nopeer noquery" and add "restrict default ignore", which means, "by default, ignore everything". Then add exceptions. The localhost is already an exception, so you don't need to modify that. Add this line, "restrict 192.168.2.0 mask 255.255.255.0 nomodify notrap", which allows all machines on that subnet to query the ntp server.
Later update: I think I remember it working originally, but a few months later, I actually had to comment out the "restrict default ignore" and uncomment the original line or it wouldn't sync to any time servers. Not sure what I changed to make that not work anymore.

While it seemed to be working without doing the following, I believe the service does need to be allowed through the firewall.

firewall-cmd --zone=home --add-service=ntp --permanent
firewall-cmd --add-service=ntp --permanent
firewall-cmd --reload
firewall-cmd --zone=home --list-all
firewall-cmd --zone=public --list-all

"ntp" should now be in the list of allowed services.

There is a lot of conflicting information about what to do if you don't have internet access or if you lose it. The confusion stems from a recent update where the "undisciplined clock" was superseded by an "orphan mode". The "undisciplined clock" method is easier to implement and basically says "use the internal clock of this computer if no clock sources better than it (lower than stratum N) are available". Just add these lines to the conf file of the ntp server (headnode):

server 127.127.1.0
fudge 127.127.1.0 stratum 10

Orphan mode is a little more complicated. You have to define a mesh of peers and clients in the conf files on all machines. It's advantageous when you have multiple nodes that can act as a ntp server, e.g. multiple headnodes, but I don't think it really helps if it's a simple one headnode-multiple client cluster. If your ntp server is completely isolated from the internet or any "real" clocks, then you should lower the stratum of your undisciplined clock to 1, but really, you should look into the orphan mode, which can take advantage of multiple internal clocks to keep better time.

After installing NTP on the slave nodes, in their /etc/ntp.conf, comment out "restrict default nomodify notrap nopeer noquery" and add "restrict default ignore". Comment out all of the servers, then add "server 192.168.2.1 iburst prefer", where the IP address is the IP address of the headnode. The "prefer" is necessary because NTP tries to reject any sources it thinks are not trustworthy, and since the headnode (if it has a poor or no internet connection) is not necessarily a good NTP server, then it rejects it; "prefer" prevents that. Don't need to worry about firewall settings because the firewall was turned off. Since the slave nodes are connected via LAN to the ntp server (the headnode), they should never need their internal clock, so you don't need to add the server/fudge lines.

Once done making changes to ntp.conf, save it, then restart the ntpd service, wait ~30 min, and check again with "ntpq -pn".

Useful guides: 1,2,3

My internet connection is very poor due to a combination of usb wifi adapter and poor drivers. Because NTP requires continuous stable polling of the time servers, it often fails for me, reverting to the internal clock. If it's failing for you, it could be that, or another common problem is ISPs blocking port 123 traffic.

Note: As of RHEL/CentOS 7, ntpd has been replaced with chronyd, which is supposedly faster/better. I didn't know about this when I started, so I might switch to it in the future. For now, I disabled the chronyd service so that ntpd will start on boot. Link. You must stop and disable chronyd, or it will start on reboot and prevent ntpd from starting.

SLURM

SLURM is a job scheduler aimed at HPC clusters. It makes scheduling and running jobs easy once it is set up. First, we need to install some prerequisites.

Good links: 1,2,3,4,5,6,7,8,9,10

Optional: Install MariaDB for logging and accounting. This was already installed on my headnode. Don't need this on the slave nodes. I probably won't utilize this since I'll be the only user.

Make sure all UIDs and GIDs are the same for each user across all nodes. "id (username)". I set up a user "cluster" earlier, however we also must create "munge" and "slurm" users. I don't entirely understand why, and I don't think you log into them, but they run the munge and slurm daemons. Do the following:

export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=992
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm

Use whatever uid/gid you need to (those might already be taken), but make sure they are consistent on all nodes, i.e. the slurm user has the same uid and gid on all nodes. Now MUNGE needs to be installed. It's in package form in the epel-release repository.

yum install epel-release
yum install munge munge-libs munge-devel -y

I checked the permissions of directories and files listed in the munge installation guide. /etc/munge and /var/log/munge were 0700, /var/run/munge was 0755, but /var/lib/munge needed to be changed to 0711. Those also need to be owned by munge, not root. Do the install and these changes on all nodes.
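Checking these by hand on every node gets tedious, so here is a small sketch that compares a directory's mode against the expected value. check_mode is a made-up helper name, and "stat -c" is the GNU coreutils form that CentOS ships.

```shell
# Sketch: compare a path's permission bits against an expected octal mode.
# check_mode is a hypothetical helper name; stat -c is the GNU coreutils syntax.
check_mode() {
    local path="$1" want="$2" have
    have=$(stat -c '%a' "$path") || return 1
    if [ "$have" = "$want" ]; then
        echo "OK       $path ($have)"
    else
        echo "MISMATCH $path (have $have, want $want)"
        return 1
    fi
}

# Usage (as root) with the modes listed above:
# check_mode /etc/munge     700
# check_mode /var/log/munge 700
# check_mode /var/lib/munge 711
# check_mode /var/run/munge 755
```

Ownership can be checked the same way with "stat -c '%U:%G' (path)", which should print munge:munge for these directories.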

Now you have to create a key for munge, and make sure munge owns it:

/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

The dd part is optional, but makes it more random. This key needs to be copied to all of the slave nodes now. SSH to all nodes and check all munge related directory and file permissions. Now start and enable the munge service on all nodes:

systemctl start munge
systemctl enable munge

Then run the tests in the installation guide:

munge -n
munge -n | unmunge
munge -n | ssh nodeXXX unmunge
remunge

If no errors, then munge is working correctly. Some more prerequisites for slurm:

yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel man2html libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker

Most of those were already on my headnode, but not on the slave nodes. Useful link for downloading slurm. Once slurm is downloaded, do the following:

export VER=17.11.7
rpmbuild -ta slurm-$VER.tar.bz2

Use whichever version of slurm you downloaded. If you built as root, the rpms will be located in /root/rpmbuild/RPMS/x86_64. The rpms can be copied to the slave nodes by placing them in a directory within the NFS shared directory. In my case, this was /home/cluster/slurm. Since I want all nodes to be compute nodes, I installed (yum install) all of the slurm rpms on all nodes except for slurm-slurmdbd and slurm-slurmctld, which were only installed on the headnode because they're for database and controller (respectively) functionality. Different versions of slurm have different rpms. For example, the previous slurm version will have a slurm-munge rpm. What each package is is not documented well. Here's a list of what was built on my system:

slurm-17.11.7-1.el7.x86_64.rpm
slurm-contribs-17.11.7-1.el7.x86_64.rpm
slurm-devel-17.11.7-1.el7.x86_64.rpm
slurm-example-configs-17.11.7-1.el7.x86_64.rpm
slurm-libpmi-17.11.7-1.el7.x86_64.rpm
slurm-openlava-17.11.7-1.el7.x86_64.rpm
slurm-pam_slurm-17.11.7-1.el7.x86_64.rpm
slurm-perlapi-17.11.7-1.el7.x86_64.rpm
slurm-slurmctld-17.11.7-1.el7.x86_64.rpm
slurm-slurmd-17.11.7-1.el7.x86_64.rpm
slurm-slurmdbd-17.11.7-1.el7.x86_64.rpm
slurm-torque-17.11.7-1.el7.x86_64.rpm
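Since the set of rpms changes between versions, one way to avoid typing the list is to filter it. A minimal sketch that lists everything except the controller and database packages; compute_node_rpms is a made-up helper name.

```shell
# Sketch: list the Slurm RPMs to install on a compute-only node, i.e. everything
# except the controller (slurmctld) and database (slurmdbd) packages.
# compute_node_rpms is a hypothetical helper name.
compute_node_rpms() {
    local dir="$1" rpm
    for rpm in "$dir"/slurm-*.rpm; do
        [ -e "$rpm" ] || continue                    # glob matched nothing
        case "$(basename "$rpm")" in
            slurm-slurmctld-*|slurm-slurmdbd-*) ;;   # headnode only
            *) echo "$rpm" ;;
        esac
    done
}

# Usage on a slave node (as root), with the NFS path from above:
# yum install $(compute_node_rpms /home/cluster/slurm)
```

On the headnode, a plain "yum install /home/cluster/slurm/slurm-*.rpm" picks up everything, including slurmctld and slurmdbd.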

Go here to make a slurm configuration file. There's also a link to a more advanced one. Copy that to a file slurm.conf in /etc/slurm on the headnode.

From the Slurm OpenMPI page, "Starting with Open MPI version 3.1, PMIx version 2 is natively supported. To launch Open MPI application using PMIx version 2 the '--mpi=pmix_v2' option must be specified on the srun command line or 'MpiDefault=pmix_v2' configured in slurm.conf." So I changed MpiDefault to "pmix_v2" in the slurm.conf file. However, I couldn't get pmix working, so I changed it to "MpiDefault=pmi2", which worked eventually. See the troubleshooting section below for more details. This guide and link show how to use an external pmix installation. I'm honestly not sure what the advantages/disadvantages are.

I also added a line to the slurm.conf file to specify the port ranges for srun (look for the srunportrange section). The number of ports that must be available is dependent on the number of srun's. Since I'm limited to 100 cores, N=200 should be double what I'll ever need, so I'll go with 13 open ports. "SrunPortRange=60001-60013". This ended up being ignored by the slave node's slurmd (probably a bug, see troubleshooting below), and so I ended up whitelisting a subnet in the firewall on the headnode, meaning that these port restrictions are kind of pointless.

I now know that it's better to name all nodes consistently, e.g. nodeXXX. My headnode is named "headnode", which means I have to use a list of comma separated names for "NodeName" in the slurm.conf file, instead of short notation, e.g. node[001-005].

The ProctrackType needs to be changed to "proctrack/pgid" unless you want to set up cgroups. Setting up cgroups is recommended.

Slurm defaults to one job per node. This is fine for CFD; the jobs are usually so intensive that they use a whole node (or more than one). But for smaller jobs, it's often advantageous to run more than one job per node. In order to subscribe more than one job per node, you have to change a few settings in the slurm.conf file. First, you need to set "SelectType=select/cons_res" and "SelectTypeParameters=CR_Core", where CR_Core means cores are the resource being shared, but this could be something else like memory. You also must add in the partition definition "OverSubscribe=YES:X", where X is the number of jobs that can share a node. I set this to 2. Helpful links: 1,2,3. In those links, it suggests changing the schedule type, setting memory limits, etc. This is pretty much a requirement for large multi-user clusters, but for a small homelab cluster, these settings don't really matter because you will generally know how much memory your jobs use.

Once you're done editing that file, copy it to /etc/slurm on all of the slave nodes.
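For reference, the oversubscription settings described earlier look roughly like this when collected in one place. This sketch writes them to a scratch file for review rather than touching /etc/slurm/slurm.conf; the partition name "main" and the node list are placeholders, not my actual configuration.

```shell
# Sketch: the oversubscription settings, written to a scratch file for review
# before merging into /etc/slurm/slurm.conf by hand.
# ASSUMPTIONS: the partition name "main" and the node names are placeholders.
SCRATCH="${TMPDIR:-/tmp}/slurm-oversubscribe.conf"
cat > "$SCRATCH" <<'EOF'
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PartitionName=main Nodes=headnode,node001,node002 Default=YES State=UP OverSubscribe=YES:2
EOF
```

After merging and copying the file out, the slurm daemons on all nodes need a restart for the new select type to take effect.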

Slurm uses various files for logging, saving states, etc. You have to set these and their permissions up manually. This link (near bottom) has a list of all of these files and required permissions. I had to do the following on the headnode, text in () are comments:

touch /var/run/slurmctld.pid
chown slurm: /var/run/slurmctld.pid
touch /var/run/slurmd.pid
mkdir /var/spool/slurmd (must be writable by root, default directory permissions are 755, which is read/write/execute by root, read/execute by others)
mkdir /var/spool/slurm.state
chown slurm: /var/spool/slurm.state (must be writable by slurm)
mkdir /var/log/slurmctld
touch /var/log/slurmctld/slurmctld.log
chown -R slurm: /var/log/slurmctld (must be writable by slurm)
touch /var/log/slurmd.log (must be writable by root)

And the following on the slave nodes:

touch /var/run/slurmd.pid
mkdir /var/spool/slurmd (must be writable by root, default directory permissions are 755, which is read/write/execute by root, read/execute by others)
touch /var/log/slurmd.log (must be writable by root)
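If you have more than a couple of slave nodes, the steps above could be wrapped in a small function so they can be repeated safely. This is only a sketch: the function name and the "root prefix" argument are my own invention so that the example can be tried in a scratch directory first. On a real node you would run it as root with an empty prefix.

```shell
#!/bin/sh
# Sketch: create the slurmd files listed above for a compute node.
# The first argument is a made-up "root prefix": pass "" to act on the real
# filesystem (as root), or a scratch directory for a harmless trial run.
setup_slurmd_files() {
    root="$1"
    # -p makes this safe to re-run if the directories already exist
    mkdir -p "$root/var/run" "$root/var/log" "$root/var/spool/slurmd"
    touch "$root/var/run/slurmd.pid" "$root/var/log/slurmd.log"
    # spool directory: read/write/execute for root, read/execute for others
    chmod 755 "$root/var/spool/slurmd"
}

# Trial run into a scratch directory
scratch="$(mktemp -d)"
setup_slurmd_files "$scratch"
ls -ld "$scratch/var/spool/slurmd"
```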

You need additional slurm.conf settings and files for databases, etc.

"slurmd -C" should return information about the node you ran it on. If not, there are configuration errors.

Opening ports for Slurm is tricky. Originally, you could not have a firewall operating on any of the compute nodes because srun-to-task communication used random ports. Now they allow you to specify the port range for srun in the slurm.conf file (see above), which means you can have a firewall operating, which is useful when your headnode is also a compute node. I opened the following tcp ports on the headnode: 6817 (slurmctld), 6818 (slurmd), 6819 (slurmdbd), 60001-60013 (srun).

firewall-cmd --permanent --zone=home --add-port=6817-6819/tcp
firewall-cmd --permanent --zone=home --add-port=60001-60013/tcp
firewall-cmd --reload

At this point, I still couldn't use the firewall on the headnode (see Troubleshooting below). I had to white-list the whole private subnet: firewall-cmd --permanent --zone=home --add-rich-rule='rule family="ipv4" source address="192.168.2.0/24" accept'. Then reload the firewall. firewall-cmd --zone=home --list-rich-rules should now show that rule. This is not ideal, but it does work. I later closed all of the port holes because I plan on keeping it this way.

On all nodes, do "systemctl daemon-reload" (see long troubleshooting paragraph below). On the slave nodes: "systemctl start slurmd". On the headnode, start slurmd, slurmctld, and optionally slurmdbd if you have database stuff set up. Also enable those services if you want them to start at boot (recommended).

If you were messing with one of the slave nodes and took it offline while the headnode was still online, then the state of the node according to slurm will be "down". You can check this on headnode with "sinfo" and (as root) "scontrol" (enters scontrol menu) "show node nodeXXX". If it is "DOWN", then (as root) in scontrol, do "update NodeName=nodeXXX State=RESUME". Check the state of the node again: it should say "idle". If yes, then you're good to go.
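The same fix can be done without entering the scontrol menu, since scontrol also accepts the update as command-line arguments. The sketch below assumes the "sinfo -h -N -o '%N %t'" format (one "name state" pair per line); the function name is made up. It only prints the commands so you can look them over before running them as root.

```shell
#!/bin/sh
# Sketch: read "name state" pairs on stdin and print a resume command
# for every node whose state starts with "down" (e.g. "down" or "down*").
print_resume_cmds() {
    while read -r name state; do
        case $state in
            down*) printf 'scontrol update NodeName=%s State=RESUME\n' "$name" ;;
        esac
    done
}

# Demo with canned input; on a live cluster use:
#   sinfo -h -N -o '%N %t' | print_resume_cmds
printf 'node002 down\nnode003 idle\n' | print_resume_cmds
# prints: scontrol update NodeName=node002 State=RESUME
```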

If you modify the slurm.conf file, you can update the changes by 1. copying it to all slave nodes (scp), then 2. running "scontrol reconfigure" on the headnode. If this didn't seem to work, you'll have to restart slurmd on all nodes (and slurmctld on headnode).
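A rough sketch of that two-step update as a script is below. The node list, the root@ login, and the DRY_RUN switch are assumptions for illustration. With DRY_RUN=1 (the default here) it only prints what it would do.

```shell
#!/bin/sh
# Sketch: push slurm.conf to the slave nodes, then tell slurmctld to re-read it.
# NODES and the root@ login are placeholders; DRY_RUN is a made-up safety switch.
NODES="${NODES:-node002}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

push_slurm_conf() {
    for node in $NODES; do
        run scp /etc/slurm/slurm.conf "root@$node:/etc/slurm/slurm.conf" || return 1
    done
    run scontrol reconfigure
}

push_slurm_conf
```

Set DRY_RUN=0 once the printed commands look right.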

You will have to uninstall and install openmpi now if you did not originally use the configure options: "--with-slurm --with-pmi=/usr".

Troubleshooting

The above makes it seem straightforward, but many days of troubleshooting went into creating those instructions. Originally, slurm failed for me. slurmctld received a terminate process command for some unknown reason, and neither slurmd nor slurmctld would stay active.

I spent about 10 hours trying to figure this out. I turned off selinux and the firewall on both the headnode and node002. I turned debug up to debug5 for both slurmctld and slurmd in the conf file. I also took the headnode out of the compute node list so it's just one controller node and one compute node (node002).

I ruled out a network problem by running systemctl start slurmctld.service on the headnode and then systemctl start slurmd.service on the slave node. Then I used bash's builtin tcp capabilities to try to talk to the headnode from node002 on the slurmctld listening port and talk to node002 from the headnode on the slurmd listening port, e.g. cat < /dev/tcp/192.168.2.2/6818 or something like that. This worked (connection is refused on non-listening ports), which meant that they could talk to each other. The log files indicated that, yes, slurmd and slurmctld were talking, but slurmctld was receiving a terminate command and shutting down. The slurmd log seemed to just stop, and systemctl said it failed to start, but there was a slurmd process still running and listening on the correct port (checked with netstat).

This led me to think that it had nothing to do with the network and that systemd might be killing slurmctld because it was taking too long to start. I added a "TimeoutSec=240" property to the /usr/lib/systemd/system/slurmctld.service file. I know you're supposed to copy the file to /etc/systemd/system/ and edit it there, or add a separate conf file there for it, but whatever. I then did "systemctl daemon-reload" and tried to see if the property was set with "systemctl show slurmctld.service -p TimeoutSec", but it came back blank, while other property settings that were there before were returned, so I figured that it didn't set. And maybe it didn't, so I rebooted both nodes (after modifying the slurmd.service file the same way and doing systemctl daemon-reload). Now both services started fine. "scontrol show nodes" showed node002's information.

I stopped both, went back and deleted the TimeoutSec lines from their systemd files, reloaded the daemon again, and started both again. No problems. I guess executing systemctl daemon-reload made them work.
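Two of the steps from this troubleshooting session, sketched as reusable pieces. The function names and the timeout.conf file name are made up. The port check needs bash and the coreutils timeout command. The drop-in is the "separate conf file" approach, written here into a scratch directory rather than /etc/systemd/system so it can be tried safely.

```shell
#!/bin/sh
# Sketch only; function names and "timeout.conf" are hypothetical.

# 1. TCP reachability test using bash's /dev/tcp (needs bash and coreutils timeout).
#    Returns 0 if something is listening on host $1, port $2.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
# Example: port_open 192.168.2.2 6818 && echo "slurmd port reachable"

# 2. Write a systemd drop-in instead of editing the packaged unit file.
#    Pass /etc/systemd/system on a real node, then run "systemctl daemon-reload".
write_timeout_dropin() {
    unit_dir="$1/slurmctld.service.d"
    mkdir -p "$unit_dir"
    printf '[Service]\nTimeoutStartSec=240\n' > "$unit_dir/timeout.conf"
}

# Trial run into a scratch directory
dropin_root="$(mktemp -d)"
write_timeout_dropin "$dropin_root"
cat "$dropin_root/slurmctld.service.d/timeout.conf"
```

As far as I know, systemctl show reports the timeouts under the names TimeoutStartUSec and TimeoutStopUSec, so "systemctl show slurmctld.service -p TimeoutStartUSec" is probably the query to use when checking whether a timeout setting took effect.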

I then attempted to run a test with the mpi hello world script and srun. This failed because it said that OpenMPI was not compiled with PMI support. This is extremely confusing. I think there are three options: force OpenMPI to compile its internal PMIx v2.1 and then point Slurm to that directory during the slurm rpm build; for older versions of openmpi, build slurm first, then point openmpi to Slurm's PMI; or build a separate external PMIx and point both slurm and openmpi at it. I did not explicitly tell openmpi to compile --with-pmix or --with-pmi (or pmix2 or pmi2), so it may not have done either. I did use the "--with-slurm" option, but I couldn't find any documentation saying what that does. I then tried creating a simple sbatch script which includes a module load line for openmpi in case srun doesn't propagate the environment, but I got the same error. The script works fine if I call mpirun directly. Thus, OpenMPI was not built with PMI support, or Slurm was not built with pmix support, or they were built with different PMI(x) versions, so none of the above options work. Quote from error:

The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.

Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.

Please configure as appropriate and try again.

These links might also be helpful. 1,2

So that's that. Looking at the output of ompi_info, it looks like openmpi was built with pmix2 and pmi, so first, I'm going to try rebuilding slurm with the pmix option pointed at OpenMPI's pmix directory, and not install the slurm libpmi rpm (which I'm 75% certain contains pmi and pmi2). If that doesn't work, I'll try recompiling openmpi with the pmix internal configuration option explicitly stated. If that doesn't work, I might try pointing openmpi at slurm's pmi2 instead. And if that doesn't work, stick with calling mpirun directly.

Updates:

I tried rebuilding slurm with pmix support and pointing it at the openmpi internal pmix, but the slurm build log kept saying it couldn't find the pmix installation. I used the "--define '_with_pmix --with-pmix=/opt/openmpi-3.1.0'" rpmbuild option (see here) with various subfolders of openmpi-3.1.0, but none of it worked.

The srun --mpi=list command shows pmi2, none, and openmpi. pmi2 doesn't work because I didn't build openmpi pointing at slurm's pmi2 install (see above error message). The openmpi option doesn't work and is not documented anywhere I can find. Anyways, on to option 2: using pmi2 instead of pmix.

I installed slurm first (see above) with no special options. I then uninstalled openmpi (see FAQ 6 here), and reconfigured with the following options: "--prefix=/opt/openmpi-3.1.0 --with-verbs --with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib64". Note that if you use the official instructions found pretty much everywhere (example), they say to do "--with-pmi-libdir=/usr/", which doesn't work. Looking at the output of configure closely, it says it couldn't find EITHER pmi/pmi2.h OR libpmi/libpmi2. If you look slightly above that error, you'll see it finds the headers fine, but it can't find libpmi/libpmi2, which exist in the /usr/lib64 directory. Unfortunately, trying to install openmpi with that configuration throws an error about not being able to find a pmi.h file in a pmix directory. The people in this thread found another workaround. I tried configuring with "--prefix=/opt/openmpi-3.1.0 --with-verbs --with-slurm --with-pmi=/usr", and it found the files again. This time installation seemed to work. I then reinstalled openmpi on the slave node the same way (don't forget to install the slurm-devel package first...). I made sure the required slurm files were present and cleaned out, reloaded the systemctl daemons, then started slurmd on the slave node and slurmctld on the headnode. I then tried the sbatch mpirun script again to make sure openmpi was working, then tried with srun again, this time with the --mpi=pmi2 option. This worked! YES! I updated the openmpi build instructions in the previous post.
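For reference, the sbatch srun test was along the lines of the sketch below. The module name, task count, output file name, and the ./mpi_hello binary are placeholders, not my exact script. The snippet just writes the job file so it can be inspected before submitting.

```shell
#!/bin/sh
# Sketch: generate a small MPI test job like the one described above.
# The module name, task count, and ./mpi_hello binary are placeholders.
JOB_SCRIPT="${JOB_SCRIPT:-./mpi_hello.sbatch}"

cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi_hello
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --output=mpi_hello_%j.out

# Load Open MPI in case the environment is not propagated (module name is a guess)
module load openmpi

# Direct launch through Slurm's PMI2 plugin
srun --mpi=pmi2 ./mpi_hello
EOF

echo "Submit with: sbatch $JOB_SCRIPT"
```

Swapping the srun line for "mpirun ./mpi_hello" gives the mpirun variant, which is a handy way to check that Open MPI itself works before blaming the Slurm PMI setup.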

I added the MpiDefault=pmi2 option to the slurm.conf file so I wouldn't have to pass the --mpi=pmi2 option every time I used srun. I stopped the slurm services on both nodes, copied the slurm.conf to the slave node, restarted the slurm services, and ran the sbatch test script again to make sure it worked.

Now that it's working on a slave node, time to add the headnode as a compute node. I stopped the slurm services on both nodes, modified the slurm.conf file to specify the headnode as a compute node (similar to the slave node(s) setup), copied the slurm.conf to the slave node, and restarted the slurm services. I also started slurmd on the headnode. I reran the test sbatch srun script, this time on both the headnode and slave node. This worked.

Now that all that's working, time to add the firewall on the headnode back in. I uncommented the line in slurm.conf that specifies srun's tcp ports. I made sure that these were in the list of open ports of the firewall, along with the other slurm communication ports. I first stopped slurmd and slurmctld, then started firewalld, then slurmctld and slurmd. Then reran the sbatch srun script. This did not work. Checking slurmd.log on the slave node showed that srun was trying to communicate on random ports to the headnode despite "SrunPortRange=60001-60013" in the slurm.conf file. It looks like the srunportrange parameter is not being honored on the slave node, though it is on the headnode according to the slurmctld and slurmd logs. It's not clear why. The sbatch script works fine if it's just launching on the headnode. Rebooting the node didn't help. Tried clearing out the state files, also didn't work. I may file a bug report for this. While not ideal, I managed to get around this by white-listing the whole private subnet: firewall-cmd --permanent --zone=home --add-rich-rule='rule family="ipv4" source address="192.168.2.0/24" accept'. Then reload the firewall. firewall-cmd --zone=home --list-rich-rules should now show that rule. This worked. If I end up keeping it this way, I will close all of the port holes. I did a test with sbatch and openfoam, and it worked. Yay.

Tuesday, June 5, 2018

Wireless usb CentOS Fail

I have an Edimax usb wifi adapter (EW-7811Un 802.11n Wireless Adapter [Realtek RTL8188CUS], uses the rtl8192cu driver) for my headnode (because the router is too far from it to run a network cable easily). It works most of the time, but will randomly cut out. Every time it does that, I have to either bring the interface down and up again, or unplug it and plug it back in, which is annoying. I wrote a script that monitors the wifi with ping and brings the interface down and up whenever it stops working, but it's an ugly fix. It sounds like a power saving issue, but the network manager power save feature was already off according to iw dev wlan0 get power_save. If yours isn't already off, you can turn it off with iw dev wlan0 set power_save off, but it won't be persistent across reboots. I went ahead and followed these instructions anyways. I created a wifi-powersave-off.conf file under /etc/NetworkManager/conf.d with the following in it:

# File to be placed under /etc/NetworkManager/conf.d
[connection]
# Values are 0 (use default), 1 (ignore/don't touch), 2 (disable) or 3 (enable).
wifi.powersave = 2

This didn't seem to do anything, as expected. The next thing I tried was a kernel setting. These particular usb wifi adapters were pretty popular for the raspberry pi a few years ago, and they had the exact problem I've been having. I followed these instructions (link) to create a 8192cu.conf file in /etc/modprobe.d/ with the following lines:

# prevent power down of wireless when idle
options 8192cu rtw_power_mgnt=0 rtw_enusbss=0

I then unloaded and loaded the driver using rmmod rtl8192cu and modprobe rtl8192cu. Besides that, Google and I seem to be completely out of ideas. Update (a few months later): this usb adapter finally died on me. I may have gotten a bad one, which could account for the above problems.
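The ping watchdog mentioned earlier was roughly like the sketch below. This is a reconstruction, not the exact script: the gateway address and the 30-second interval are placeholders, and it has to run as root to bounce the interface.

```shell
#!/bin/sh
# Sketch of a ping watchdog; IFACE matches the post, GATEWAY and the
# 30-second interval are placeholders.
IFACE="${IFACE:-wlan0}"
GATEWAY="${GATEWAY:-192.168.1.1}"

# Succeeds if one ping over IFACE gets a reply within 2 seconds.
link_ok() {
    ping -c 1 -W 2 -I "$IFACE" "$GATEWAY" >/dev/null 2>&1
}

# Bring the interface down and back up (needs root).
bounce_iface() {
    ip link set dev "$IFACE" down
    sleep 2
    ip link set dev "$IFACE" up
}

# One check: bounce the interface only when the ping fails.
check_once() {
    if link_ok; then
        return 0
    fi
    echo "$(date): $IFACE not responding, restarting it"
    bounce_iface
}

# Loop forever; start it with e.g.  watch_wifi &
watch_wifi() {
    while :; do
        check_once
        sleep 30
    done
}
```

Nothing runs until watch_wifi is called, so the file can be sourced and check_once tried by hand first.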

I purchased a TP-Link TL-WN722N wifi usb adapter because a lot of people said it works well with Linux. It works without doing anything in Ubuntu 18.04, but it does not work well with CentOS. lsusb sees a device, but there aren't any compatible drivers installed. This particular one has a "V2" next to the FCC ID, but a "V3" in the serial number box. I tried downloading both the V3 and the V2 linux drivers from TP-Link and compiling them, but they won't compile due to what looks like various syntax or coding errors. CentOS 7.5's kernel version is within the supported range, so that's not the problem. I tried a bunch of fixes from Google, including some sort of kernel driver firmware htc thing and an elrepo kmod package, but neither worked because I think they're only for the V1 version. I ended up giving up on it.

There are very few USB wifi adapters that will actually work with CentOS. Aside from the partially working Edimax one and thinkpenguin ones, I haven't found any others.

I also tried a cheapo usb wifi adapter based on the MT7601u. There is a linux driver, but again, I couldn't get it to compile on CentOS7.5. I'm probably doing something wrong, though. I don't understand toolchains and linux source and all of that stuff, though I do have gcc and the kernel source/headers installed.

I also tried a TP-Link TL-WN823N v3. Many of the reviews on Amazon said it would work with various linux distros. No luck. lsusb sees a device, but there are no pre-installed drivers. I downloaded the January 2018 linux drivers, and managed to compile and install them after following the instructions and editing the makefile. However, CentOS thought it was a usb-ethernet device, not a usb-wifi device. No matter what I tried to do with network scripts, I couldn't get it to come up as a wifi adapter, so I couldn't select the network and enter the password, so it didn't work. UGH

I caved and purchased a Think Penguin Wireless-N usb wifi adapter. They're about $27 on US eBay. It just works with CentOS 7. However, I had to use nmtui to delete the old wifi connections before plugging it in, or no wifi adapters would work. Probably some sort of conflict from trying to connect to the same wifi network with different hardware. The only downside is that it doesn't work with windows, so if I boot my windows drive, I have to switch USB adapters. Can't win 'em all I guess....
