
Wednesday, June 6, 2018

Cluster Software, Part 3

 Environment Modules

Environment modules are a very convenient way to manage your environment, particularly when you have multiple conflicting packages, e.g., two versions of gcc or different MPIs. I'll be following that guide with a few modifications. Luckily, CentOS 7.5 has the "environment-modules" v3.2.10 package, so there's no need to compile from source. When that is installed with yum, the directory corresponding to the "/usr/local/Modules/default/" directory in that link is "/usr/share/Modules". It's automatically available to all users, so there's no need to link the sh init script.

I created a directory "mpi" under /usr/share/Modules, then copied the "modules" module file to it, renaming it to "openmpi-3.1.0". I used it as a template to create the openmpi module file, following the above guide as well as these guides: 1, 2. After saving, I commented out the openmpi-specific additions to my .bashrc files, rebooted, and checked to make sure they weren't in my path. Then I tried "module help mpi/openmpi-3.1.0" and "module whatis mpi/openmpi-3.1.0" just to make sure module could see the module file. Then I loaded it with "module load mpi/openmpi-3.1.0" and checked PATH and LD_LIBRARY_PATH to make sure they were modified correctly. Finally, I unloaded it to make sure the environment was reset.
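For reference, a minimal sketch of what the openmpi-3.1.0 module file boils down to (the /opt/openmpi-3.1.0 prefix matches how I built OpenMPI; the MANPATH and "conflict mpi" lines are extras that may or may not be needed on your system):

  #%Module1.0
  ## mpi/openmpi-3.1.0 module file (sketch)
  proc ModulesHelp { } {
      puts stderr "Adds OpenMPI 3.1.0 (installed in /opt/openmpi-3.1.0) to the environment."
  }
  module-whatis "OpenMPI 3.1.0"
  # prevent loading two mpi/* modules at once
  conflict mpi
  set prefix /opt/openmpi-3.1.0
  prepend-path PATH            $prefix/bin
  prepend-path LD_LIBRARY_PATH $prefix/lib
  prepend-path MANPATH         $prefix/share/man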

The first link shows an example for GCC. If I ever compile a non-system version (in, say, /opt), I can create a module file for it and load the module, which will prepend the location of that gcc to PATH so it becomes the version used. Unloading the module undoes the environment changes.

I created another module file for OpenFOAM v1712 to replace the alias I've been using. This was quite a bit trickier. Module files use the Tcl language, which doesn't allow executing bash commands like "source", so I couldn't just source the OpenFOAM bashrc. Luckily, I found this link, the last post in which had a great idea: save your environment before and after running the OpenFOAM alias, find the differences (with "diff"), pipe that into sed, and then clean up the result. That sed command worked pretty well. I pulled the PATH and LD_LIBRARY_PATH lines out and changed them to prepend-path commands, prepending only the differences; this prevents setenv from overwriting those environment variables entirely. Note that diff just copies the lines that differ, not the differences inside lines, so I had to manually remove whatever was already in PATH and LD_LIBRARY_PATH before sourcing the bashrc. I also had to clean up a couple of other setenv lines; in particular, any that had "=" signs in their values got mangled by sed. Finally, I added a "prereq" on "mpi/openmpi-3.1.0", so that trying to load the OpenFOAM module before openmpi-3.1.0 throws an error.
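The capture step looked roughly like this (the OpenFOAM bashrc path is a placeholder, and this is only a rough stand-in for the sed command from that forum post; the output still needs the hand cleanup described above):

  env | sort > /tmp/env.before
  source /opt/OpenFOAM-v1712/etc/bashrc        # placeholder path to the OpenFOAM v1712 bashrc
  env | sort > /tmp/env.after
  # keep only the added/changed lines and turn "VAR=value" into "setenv VAR value"
  diff /tmp/env.before /tmp/env.after | grep '^> ' | sed -e 's/^> /setenv /' -e 's/=/ /' > openfoam-v1712.draft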

I repeated the above for the slave nodes, though I just copied over the module files. The openfoam one needed some modifications because ParaView is not installed on the slave nodes, but MESA and the VTK libraries are.

In summary, the above environment modules will allow me to have multiple versions of different software installed simultaneously without messing up my environment. These module files will also be extremely useful in a job scheduler.

Synchronize System Clocks

It's important to have a synchronized system clock for clusters. This can be done easily using NTP. On the headnode:
  1. yum install ntp
  2. systemctl enable ntpd
  3. systemctl start ntpd
After a few minutes, "ntpq -pn" should return a list of IP addresses, one with a * next to it. This means it's working.

The configuration file is located at /etc/ntp.conf. I just left the default servers; I think you used to need to pick servers on your continent manually, but the default pool servers now handle that automatically. There are some settings that do need to be modified, though. This command guide is a good resource, as is this page. On the headnode, comment out "restrict default nomodify notrap nopeer noquery" and add "restrict default ignore", which means "by default, ignore everything", then add exceptions. The localhost is already an exception, so you don't need to modify that. Add the line "restrict 192.168.2.0 mask 255.255.255.0 nomodify notrap", which allows all machines on that subnet to query the NTP server.
 Later update: I think I remember it working originally, but a few months later I actually had to comment out the "restrict default ignore" line and uncomment the original one, or it wouldn't sync to any time servers. I'm not sure what changed to break it.
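Putting that together, the relevant part of the headnode's /etc/ntp.conf looked roughly like this (the stock pool server and localhost lines are left as shipped):

  # default restrict line, commented out as described above
  #restrict default nomodify notrap nopeer noquery
  restrict default ignore        # had to revert this later -- see the update above
  # allow machines on the cluster subnet to query this server
  restrict 192.168.2.0 mask 255.255.255.0 nomodify notrap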

While it seemed to be working without doing the following, I believe the service does need to be allowed through the firewall.
  1. firewall-cmd --zone=home --add-service=ntp --permanent
  2. firewall-cmd --add-service=ntp --permanent
  3. firewall-cmd --reload
  4. firewall-cmd --zone=home --list-all
  5. firewall-cmd --zone=public --list-all
"ntp" should now be in the list of allowed services.

There is a lot of conflicting information about what to do if you don't have internet access or if you lose it. The confusion stems from a relatively recent change in which the "undisciplined clock" was superseded by "orphan mode". The undisciplined-clock method is easier to implement and basically says "use this computer's internal clock if no better clock sources (ones at a lower stratum number) are available". Just add these lines to the conf file of the NTP server (headnode):
  • server 127.127.1.0
  • fudge 127.127.1.0 stratum 10
Orphan mode is a little more complicated: you have to define a mesh of peers and clients in the conf files on all machines. It's advantageous when you have multiple nodes that can act as an NTP server, e.g. multiple headnodes, but I don't think it really helps in a simple one-headnode, multiple-client cluster. If your NTP server is completely isolated from the internet or any "real" clocks, then you should lower the stratum of your undisciplined clock to 1, but really, you should look into orphan mode, which can take advantage of multiple internal clocks to keep better time.
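For completeness, orphan mode is turned on with ntpd's "tos orphan" directive; a rough sketch (which I did not use) for a cluster with two server-capable nodes would be something like the following on each of them, with the peer line pointing at the other server:

  # /etc/ntp.conf on each server-capable node (sketch only)
  tos orphan 10                 # act as a stratum-10 source if all real servers are lost
  peer 192.168.2.250            # hypothetical address of the other server-capable node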

After installing NTP on the slave nodes, in their /etc/ntp.conf, comment out "restrict default nomodify notrap nopeer noquery" and add "restrict default ignore". Comment out all of the servers, then add "server 192.168.2.1 iburst prefer", where the IP address is that of the headnode. The "prefer" is necessary because NTP tries to reject any sources it considers untrustworthy, and the headnode (with a poor or nonexistent internet connection) is not necessarily a good NTP server; "prefer" stops it from being rejected. There's no need to worry about firewall settings here because the firewall is turned off on the slave nodes. Since the slaves are connected via LAN to the NTP server (the headnode), they should never need their internal clocks, so you don't need to add the server/fudge lines.
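So a slave node's /etc/ntp.conf ends up looking roughly like this:

  #restrict default nomodify notrap nopeer noquery
  restrict default ignore
  # stock pool servers commented out; sync only to the headnode
  #server 0.centos.pool.ntp.org iburst
  server 192.168.2.1 iburst prefer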

Once you're done making changes to ntp.conf, save it, restart the ntpd service, wait ~30 minutes, and check again with "ntpq -pn".

Useful guides: 1,2,3

My internet connection is very poor due to a combination of a USB WiFi adapter and poor drivers. Because NTP requires continuous, stable polling of the time servers, it often fails for me and reverts to the internal clock. If it's failing for you, that could be why; another common problem is ISPs blocking port 123 traffic.

Note: As of RHEL/CentOS 7, ntpd has been replaced by chronyd as the default time service, which is supposedly faster/better. I didn't know about this when I started, so I might switch to it in the future. For now, I stopped and disabled the chronyd service so that ntpd will start on boot. Link. You must stop and disable chronyd, or it will start on reboot and prevent ntpd from starting.
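Concretely, that was just:

  systemctl stop chronyd
  systemctl disable chronyd
  systemctl enable ntpd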

SLURM

SLURM is a job scheduler aimed at HPC clusters. It makes scheduling and running jobs easy once it is set up. First, we need to install some prerequisites.

Good links: 1,2,3,4,5,6,7,8,9,10

Optional: Install MariaDB for logging and accounting. This was already installed on my headnode. Don't need this on the slave nodes. I probably won't utilize this since I'll be the only user.

Make sure all UIDs and GIDs are the same for each user across all nodes ("id username" shows them). I set up a user "cluster" earlier, but we must also create "munge" and "slurm" users. I don't entirely understand why, and I don't think you log into them, but they run the munge and slurm daemons. Do the following:
  • export MUNGEUSER=991
  • groupadd -g $MUNGEUSER munge
  • useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
  • export SLURMUSER=992
  • groupadd -g $SLURMUSER slurm
  • useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
Use whatever uid/gid you need to (those might already be taken), but make sure they are consistent on all nodes, i.e. the slurm user has the same uid and gid on every node. Now MUNGE needs to be installed; it's available in package form from the epel-release repository.
  • yum install epel-release
  • yum install munge munge-libs munge-devel -y
I checked the permissions of the directories and files listed in the munge installation guide. /etc/munge and /var/log/munge were 0700, /var/run/munge was 0755, but /var/lib/munge needed to be changed to 0711. These directories also need to be owned by munge, not root. Do the install and these changes on all nodes.
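In shell terms, the ownership/permission fixes amounted to roughly the following (adjust to whatever the package actually created on your system):

  chown -R munge: /etc/munge /var/log/munge /var/lib/munge /var/run/munge
  chmod 0700 /etc/munge /var/log/munge
  chmod 0711 /var/lib/munge
  chmod 0755 /var/run/munge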

Now you have to create a key for munge, and make sure munge owns it:
  • /usr/sbin/create-munge-key -r
  • dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
  • chown munge: /etc/munge/munge.key
  • chmod 400 /etc/munge/munge.key
The dd part is optional, but makes the key more random. This key now needs to be copied to all of the slave nodes. SSH to each node and check all munge-related directory and file permissions. Then start and enable the munge service on all nodes:
  • systemctl start munge
  • systemctl enable munge
Then run the tests in the installation guide:
  • munge -n
  • munge -n | unmunge
  • munge -n | ssh nodeXXX unmunge
  • remunge
If no errors, then munge is working correctly. Some more prerequisites for slurm:
  • yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel man2html libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
Most of those were already on my headnode, but not on the slave nodes. Useful link for downloading slurm. Once slurm is downloaded, do the following:
  • export VER=17.11.7
  • rpmbuild -ta slurm-$VER.tar.bz2
Use whichever version of slurm you downloaded. If you built as root, the rpms will be located in /root/rpmbuild/RPMS/x86_64. The rpms can be copied to the slave nodes by placing them in a directory within the NFS-shared directory; in my case, this was /home/cluster/slurm. Since I want all nodes to be compute nodes, I installed (yum install) all of the slurm rpms on all nodes except for slurm-slurmdbd and slurm-slurmctld, which were only installed on the headnode because they provide the database and controller functionality, respectively (see the sketch after the list below). Different versions of slurm build different rpms; for example, the previous slurm version will have a slurm-munge rpm. What each package contains is not documented well. Here's the list of what was built on my system:
  • slurm-17.11.7-1.el7.x86_64.rpm
  • slurm-contribs-17.11.7-1.el7.x86_64.rpm
  • slurm-devel-17.11.7-1.el7.x86_64.rpm
  • slurm-example-configs-17.11.7-1.el7.x86_64.rpm
  • slurm-libpmi-17.11.7-1.el7.x86_64.rpm
  • slurm-openlava-17.11.7-1.el7.x86_64.rpm
  • slurm-pam_slurm-17.11.7-1.el7.x86_64.rpm
  • slurm-perlapi-17.11.7-1.el7.x86_64.rpm
  • slurm-slurmctld-17.11.7-1.el7.x86_64.rpm
  • slurm-slurmd-17.11.7-1.el7.x86_64.rpm
  • slurm-slurmdbd-17.11.7-1.el7.x86_64.rpm
  • slurm-torque-17.11.7-1.el7.x86_64.rpm
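As mentioned above, a sketch of the install on a slave node from the NFS share (the headnode additionally gets the slurmctld and slurmdbd rpms):

  cd /home/cluster/slurm
  # install everything except the controller and database daemons
  yum install $(ls slurm-*.rpm | grep -v -e slurmctld -e slurmdbd)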
Go here to make a slurm configuration file. There's also a link to a more advanced one. Copy that to a file slurm.conf in /etc/slurm on the headnode.

From the Slurm OpenMPI page: "Starting with Open MPI version 3.1, PMIx version 2 is natively supported. To launch Open MPI application using PMIx version 2 the '--mpi=pmix_v2' option must be specified on the srun command line or 'MpiDefault=pmix_v2' configured in slurm.conf." So I set MpiDefault to "pmix_v2" in the slurm.conf file. However, I couldn't get pmix working, so I changed it to "MpiDefault=pmi2", which eventually worked; see the troubleshooting section below for more details. This guide and link show how to use an external pmix installation. I'm honestly not sure what the advantages/disadvantages are.

I also added a line to the slurm.conf file to specify the port range for srun (look for the SrunPortRange section). The number of ports that must be available depends on the number of simultaneous sruns. Since I'm limited to 100 cores, N=200 should be double what I'll ever need, so I went with 13 open ports: "SrunPortRange=60001-60013". This ended up being ignored by the slave node's slurmd (probably a bug, see troubleshooting below), so I ended up whitelisting a subnet in the firewall on the headnode, which makes these port restrictions somewhat pointless.

I now know that it's better to name all nodes consistently, e.g. nodeXXX. My headnode is named "headnode", which means I have to use a comma-separated list of names for "NodeName" in the slurm.conf file instead of the short notation, e.g. node[001-005].

The ProctrackType needs to be changed to proctrack/pgid unless you want to set up cgroups. Setting up cgroups is recommended.

Slurm defaults to one job per node. This is fine for CFD; the jobs are usually so intensive that they use a whole node (or more than one whole node). But for smaller jobs, it's often advantageous to run more than one job per node. To allow more than one job per node, you have to change a few settings in the slurm.conf file. First, set "SelectType=select/cons_res" and "SelectTypeParameters=CR_Core", where CR_Core means cores are the resource being shared (this could be something else, like memory). You also must add "OverSubscribe=YES:X" to the partition definition, where X is the number of jobs that can share a node; I set this to 2. Helpful links: 1,2,3. Those links suggest changing the scheduler type, setting memory limits, etc. That is pretty much a requirement for large multi-user clusters, but for a small homelab cluster these settings don't really matter because you generally know how much memory your jobs use.

Once you're done editing that file, copy it to /etc/slurm on all of the nodes.

Slurm uses various files for logging, saving state, etc. You have to set these files and their permissions up manually. This link (near the bottom) has a list of all of these files and their required permissions. I had to do the following on the headnode; text in parentheses is a comment:
touch /var/run/slurmctld.pid
chown slurm: /var/run/slurmctld.pid
touch /var/run/slurmd.pid
mkdir /var/spool/slurmd (must be writable by root, default permissions are 644, which is read/write by root, others read)
mkdir /var/spool/slurm.state
chown slurm: /var/spool/slurm.state (must be writable by slurm)
mkdir /var/log/slurmctld
touch /var/log/slurmctld/slurmctld.log
chown -R slurm: /var/log/slurmctld (must be writable by slurm)
touch /var/log/slurmd.log (must be writable by root)

And the following on the slave nodes:
touch /var/run/slurmd.pid
mkdir /var/spool/slurmd (must be writable by root, default permissions are 644, which is read/write by root, others read)
touch /var/log/slurmd.log (must be writable by root)
You need additional slurm.conf settings and files for databases, etc.

"slurmd -C" should return information about the node you ran it on. If not, there are configuration errors.

Opening ports for Slurm is tricky. Originally, you had to run no firewall at all on the compute nodes because srun-to-task communication used random ports. Now they allow you to specify the port range for srun in the slurm.conf file (see above), which means you can have a firewall operating; this is useful when your headnode is also a compute node. I opened the following TCP ports on the headnode: 6817 (slurmctld), 6818 (slurmd), 6819 (slurmdbd?), 60001-60013 (srun).
  • firewall-cmd --permanent --zone=home --add-port=6817-6819/tcp
  • firewall-cmd --permanent --zone=home --add-port=60001-60013/tcp
  • firewall-cmd --reload
At this point, I still couldn't use the firewall on the headnode (see Troubleshooting below). I had to whitelist the whole private subnet: firewall-cmd --permanent --zone=home --add-rich-rule='rule family="ipv4" source address="192.168.2.0/24" accept'. Then reload the firewall; firewall-cmd --zone=home --list-rich-rules should now show that rule. This is not ideal, but it does work. I later closed all of the port holes because I plan on keeping it this way.

On all nodes, do "systemctl daemon-reload" (see long troubleshooting paragraph below). On the slave nodes: "systemctl start slurmd". On the headnode, start slurmd, slurmctld, and optionally slurmdbd if you have database stuff setup. Also enable those services if you want them to start at boot (recommended).
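Concretely, on the headnode that was roughly:

  systemctl daemon-reload
  systemctl start slurmctld slurmd
  systemctl enable slurmctld slurmd
  # on each slave node: systemctl daemon-reload, then start/enable slurmd only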

If you were messing with one of the slave nodes and took it offline while the headnode was still online, then the state of that node according to slurm will be "down". You can check this on the headnode with "sinfo" and, as root, "scontrol" (which enters the scontrol menu) then "show node nodeXXX". If it is "DOWN", then (as root) in scontrol, run "update NodeName=nodeXXX State=RESUME". Check the state of the node again: it should say "idle". If yes, then you're good to go.

If you modify the slurm.conf file, you can propagate the changes by 1. copying it to all slave nodes (scp), then 2. running "scontrol reconfigure" on the headnode. If that doesn't seem to work, restart slurmd on all nodes (and slurmctld on the headnode).
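A quick sketch of that update loop (slave hostnames are placeholders):

  # push the updated config to the slaves, then tell slurmctld to re-read it
  for node in node002 node003; do
      scp /etc/slurm/slurm.conf root@${node}:/etc/slurm/
  done
  scontrol reconfigure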

You will have to uninstall and reinstall OpenMPI now if you did not originally configure it with the options "--with-slurm --with-pmi=/usr".


Troubleshooting

The above makes it seem straightforward, but many days of troubleshooting went into creating those instructions. Originally, slurm failed for me: slurmctld received a terminate command for some unknown reason, and neither slurmd nor slurmctld would stay active.

I spent about 10 hours trying to figure this out. I turned off selinux and the firewall on both the headnode and node002, turned debug up to debug5 for both slurmctld and slurmd in the conf file, and took the headnode out of the compute node list so it was just one controller node and one compute node (node002). I ruled out a network problem by running systemctl start slurmctld.service on the headnode and then systemctl start slurmd.service on the slave node. Then I used bash's built-in TCP capabilities to talk to the headnode from node002 on the slurmctld listening port and to node002 from the headnode on the slurmd listening port, e.g. cat < /dev/tcp/192.168.2.2/6818 or something like that. This worked (the connection is refused on non-listening ports), which meant the nodes could talk to each other.

The log files indicated that, yes, slurmd and slurmctld were talking, but slurmctld was receiving a terminate command and shutting down. The slurmd log seemed to just stop, and systemctl said it failed to start, but there was a slurmd process still running and listening on the correct port (checked with netstat). This led me to think that it had nothing to do with the network and that systemctl might be killing slurmctld because it was taking too long to start.

I added a "TimeoutSec=240" property to the /usr/lib/systemd/system/slurmctld.service file. I know you're supposed to copy the file to /etc/systemd/system/ and edit it there, or add a separate conf file there for it, but whatever. I then ran "systemctl daemon-reload" and tried to check whether the property was set with "systemctl show slurmctld.service -p TimeoutSec", but it came back blank while other pre-existing properties were returned, so I figured it didn't take. And maybe it didn't, but after modifying the slurmd.service file the same way, running systemctl daemon-reload, and rebooting both nodes, both services started fine. "scontrol show nodes" showed node002's information. I stopped both services, deleted the TimeoutSec lines from their systemd files, reloaded the daemon again, and started both again: no problems. I guess running systemctl daemon-reload is what made them work.
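For the record, the "separate conf file" way of doing that (which I should have used instead of editing the unit under /usr/lib) is a systemd drop-in override, something like:

  systemctl edit slurmctld          # creates /etc/systemd/system/slurmctld.service.d/override.conf
  # put these two lines in the override file:
  #   [Service]
  #   TimeoutSec=240
  systemctl daemon-reload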

I then attempted to run a test with the MPI hello world program and srun. This failed, saying that OpenMPI was not compiled with PMI support. This is extremely confusing. I think there are three options: force OpenMPI to compile its internal PMIx v2.1 and then point Slurm to that directory during slurm rpm building; or, for older versions of OpenMPI, build slurm first and then point OpenMPI at Slurm's PMI; or build a separate external PMIx and point both slurm and OpenMPI at it. I did not explicitly tell OpenMPI to compile --with-pmix or --with-pmi (or pmix2 or pmi2), so it may not have done either, though I did use the "--with-slurm" option, for which there is no documentation saying what it does. I then tried creating a simple sbatch script that includes a module load line for openmpi in case srun doesn't propagate the environment, but I got the same error. The script works fine if I call mpirun directly. Thus, either OpenMPI was not built with PMI support, or Slurm was not built with pmix support, or they were built with different PMI(x) versions, so none of the above options work out of the box. Quote from the error:
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot execute. There are several options for building PMI support under SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.

Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.

Please configure as appropriate and try again.
These links might also be helpful. 1,2

So that's that. Looking at the output of ompi_info, it looks like openmpi was built with pmix2 and pmi, so first, I'm going to try rebuilding slurm with the pmix option pointed at OpenMPI's pmix directory, and not install the slurm libpmi rpm (which I'm 75% certain contains pmi and pmi2). If that doesn't work, I'll try recompiling openmpi with the pmix internal configuration option explicitly stated. If that doesn't work, I might try pointing openmpi at slurm's pmi2 instead. And if that doesn't work, stick with calling mpirun directly.

Updates:

I tried rebuilding slurm with pmix support and pointing it at the openmpi internal pmix, but the slurm build log kept saying it couldn't find the pmix installation. I used the "--define '_with_pmix --with-pmix=/opt/openmpi-3.1.0'" rpmbuild option (see here) with various subfolders of openmpi-3.1.0, but none of it worked.

The srun --mpi=list command shows pmi2, none, and openmpi. pmi2 doesn't work because I didn't build openmpi pointing at slurm's pmi2 install (see above error message). The openmpi option doesn't work and is not documented anywhere I can find. Anyways, on to option 2: using pmi2 instead of pmix.

I installed slurm first (see above) with no special options. I then uninstalled openmpi (see FAQ 6 here) and reconfigured with the following options: "--prefix=/opt/openmpi-3.1.0 --with-verbs --with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib64". Note that if you use the official instructions found pretty much everywhere (example), they say to use "--with-pmi-libdir=/usr/", which doesn't work. Looking at the configure output closely, it says it couldn't find EITHER pmi.h/pmi2.h OR libpmi/libpmi2. If you look slightly above that error, you'll see it finds the headers fine, but it can't find libpmi/libpmi2, which exist in the /usr/lib64 directory. Unfortunately, trying to install openmpi with that configuration throws an error about not being able to find a pmi.h file in a pmix directory. The people in this thread found another workaround. I tried configuring with "--prefix=/opt/openmpi-3.1.0 --with-verbs --with-slurm --with-pmi=/usr", and it found the files again. This time the installation seemed to work. I then reinstalled openmpi on the slave node the same way (don't forget to install the slurm-devel package first...). I made sure the required slurm files were present and cleaned out, reloaded the systemctl daemons, then started slurmd on the slave node and slurmctld on the headnode. I then tried the sbatch mpirun script again to make sure openmpi was working, then srun again, this time with the --mpi=pmi2 option. This worked! YES! I updated the openmpi build instructions in the previous post.
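For reference, the rebuild sequence that finally worked was roughly the following (the OpenMPI source directory is wherever you unpacked it; uninstall the old build first per the OpenMPI FAQ):

  cd ~/openmpi-3.1.0                # wherever the OpenMPI source was unpacked
  ./configure --prefix=/opt/openmpi-3.1.0 --with-verbs --with-slurm --with-pmi=/usr
  make -j 4
  make install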

I added the MpiDefault=pmi2 option to the slurm.conf file so I wouldn't have to pass the --mpi=pmi2 option every time I used srun. I stopped the slurm services on both nodes, copied the slurm.conf to the slave node, restarted the slurm services, and ran the sbatch test script again to make sure it worked.
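The test script itself was nothing fancy; a sketch of what it looked like (the hello-world binary name and core counts are placeholders):

  #!/bin/bash
  #SBATCH --job-name=mpi_hello
  #SBATCH --nodes=1
  #SBATCH --ntasks=8
  #SBATCH --output=mpi_hello_%j.log

  # load the module set up earlier in case srun doesn't propagate the environment
  module load mpi/openmpi-3.1.0

  # --mpi=pmi2 can be dropped once MpiDefault=pmi2 is set in slurm.conf
  srun --mpi=pmi2 ./mpi_hello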

Now that it's working on a slave node, it's time to add the headnode as a compute node. I stopped the slurm services on both nodes, modified the slurm.conf file to specify the headnode as a compute node (similar to the slave node setup), copied the slurm.conf to the slave node, and restarted the slurm services. I also started slurmd on the headnode. I reran the test sbatch srun script, this time on both the headnode and the slave node. This worked.

Now that all of that is working, it's time to add the firewall on the headnode back in. I uncommented the line in slurm.conf that specifies srun's TCP ports and made sure these were in the firewall's list of open ports, along with the other slurm communication ports. I first stopped slurmd and slurmctld, then started firewalld, then slurmctld and slurmd, and reran the sbatch srun script. This did not work. Checking slurmd.log on the slave node showed that srun was trying to communicate with the headnode on random ports despite "SrunPortRange=60001-60013" being in the slurm.conf file. It looks like the SrunPortRange parameter is not being honored on the slave node, though it is on the headnode according to the slurmctld and slurmd logs; it's not clear why. The sbatch script works fine if it's just launching on the headnode. Rebooting the node didn't help, and clearing out the state files didn't either. I may file a bug report for this. While not ideal, I managed to get around this by whitelisting the whole private subnet: firewall-cmd --permanent --zone=home --add-rich-rule='rule family="ipv4" source address="192.168.2.0/24" accept'. Then reload the firewall; firewall-cmd --zone=home --list-rich-rules should now show that rule. This worked. If I end up keeping it this way, I will close all of the port holes. I did a test with sbatch and OpenFOAM, and it worked. Yay.

