
Monday, May 28, 2018

Clustering the Cluster and Software Part 2

This post will cover setting up the cluster networks and getting it working with OpenMPI.

First though, I have a major hardware change for the headnode. I managed to snag an Asus Z10PE-D8 with dual E5 v4 support (manufactured after Oct 2015) for a great price. It also has a PCIe 3.0 x4 M.2 slot for NVMe drives and seven x16 slots, which are major improvements. Since I already benchmarked the one E5-2667v3 QS I have, I installed the two E5-4627v3 QS processors in the new motherboard straightaway. Oddly, the latest BIOS (3703) for this motherboard made the system unstable, so I downgraded to the previous version. Not sure why. Maybe the "increase performance" entry in the release notes is some sort of default overclock that destabilizes my system? I don't know.

First, I was able to boot the new workstation from the old SSD. However, attempts to clone the old SSD to the NVMe drive with Clonezilla failed, so I had to install CentOS from scratch again. I then booted into the newly installed OS, mounted the old SSD, and copied over any directories I had modified, e.g. /opt, /usr/local, etc. Then I unmounted the old SSD, shut down, and removed it. Everything worked great.

Now that I have an operational headnode, there are some things that need to be done to make it work in a cluster.

Headnode Cluster Setup

RAID

For my headnode, the OS is installed on an NVMe drive, but that's not big enough to store many CFD cases. For long-term storage, I had two 3 TB HDDs that I put in RAID1. My old desktop did not have an onboard RAID controller, so I had to use mdadm. The Z10PE-D8 has a hardware RAID controller, but according to the manual it only works with Windows. It also has a software "LSI MegaRAID" controller. To use this, connect at least two HDDs to SATA ports. Make sure they are all on either the SATA or the sSATA ports, not split across both. If you need more drives than one controller has ports, you can instead use mdadm to create a software RAID that spans both SATA controllers (a minimal mdadm sketch is at the end of this section). Anyway, since I'm just using two drives in RAID1, I put them both on SATA ports.

Next, go into the BIOS, enable RAID in the PCH settings menu, then save changes and reset. The motherboard manual is incorrect after this point: there is no prompt to press "Ctrl+M" to start the LSI MegaRAID utility. Instead, it seems to have been moved inside the BIOS. Boot into the BIOS again and look for the "LSI MegaRAID ..." entry in the Advanced menu. The menus and explanations in there are fairly obvious: select the RAID level, select the drives you want to put in the RAID, name it, then create the RAID. Save changes and reset.

Boot into the OS. "lsblk" should show the drives under the same virtual volume; in my case, I had sda1/md127 and sdb1/md127. I created a directory "data" at root and mounted /dev/md127 to /data. Running "df -Th" shows /dev/md127 with a size of 2.7TB and filesystem ext4 mounted at /data. Success! To make the mount persistent across reboots, edit fstab:
  1. vi /etc/fstab 
  2. add line: /dev/md127 /data ext4 defaults 0 0
You also need to rebuild the initial ramdisk image (initramfs). Create a backup first, then rebuild:
  1. cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
  2. ll /boot/initramfs-$(uname -r).img*
  3. dracut -f
Try rebooting. If /data is still mounted afterwards, it worked. It's probably also a good idea to chown the /data folder to the cluster user to make copying stuff to it easier.
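
If you do go the mdadm route mentioned above (e.g. to span both SATA controllers, or on a board with no usable RAID option), a minimal sketch looks roughly like this. I didn't need this on the Z10PE-D8, and /dev/sdX and /dev/sdY are placeholders for your actual drives, so adjust accordingly:
# create a two-drive RAID1 array (this wipes the drives)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX /dev/sdY
# put a filesystem on it and mount it
mkfs.ext4 /dev/md0
mkdir /data
mount /dev/md0 /data
# record the array so it assembles at boot, then rebuild the initramfs
mdadm --detail --scan >> /etc/mdadm.conf
dracut -f
# check the array status
cat /proc/mdstat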

Hostfile

Edit the /etc/hosts file to look like this:
127.0.0.1 localhost .... (etc)
::1....
192.168.2.1 headnode
192.168.2.2 node002
192.168.2.3 node003
192.168.2.4 node004
192.168.2.5 node005
 
In addition to the above file, which is for network names, you need an MPI hostfile located somewhere easy to access, like the home directory. The name isn't important; you pass it to mpirun with the "-hostfile" option.
headnode slots=20
node002 slots=16
node003 slots=16
...etc...
 

Network and Intranet

Because of where this cluster is located, I do not have access to an internet ethernet port. I thus use a USB wifi dongle for internet. One of the headnode's LAN ports is connected to an 8 port unmanaged switch, along with one port from each of the slave nodes. This is going to be the gigabit intranet for SSH and MPI communications. In the future, I will create another network for IPMI using the other LAN port, but I don't have enough ports in the switch for it at the moment. I used the Network Manager GUI to do the following:
  1. Under default Ethernet connection, uncheck start automatically.  
  2. Also turn off both IPv4 and IPv6. This essentially disables the Ethernet profile. Alternatively, you could delete it.
  3. Create a new profile named intranet with IP 192.168.2.1, 255.255.255.0 subnet mask, 0.0.0.0 gateway, blank dns, connect automatically, and check "use this connection only for resources on its network"
  4. Create another profile named ipmi with IP 192.168.0.1, 255.255.255.0 subnet mask, 0.0.0.0 gateway, blank dns, do not check connect automatically, and check "use this connection only for resources on its network". This will be used later when I get more switch ports (another switch or a bigger switch) for talking to the slave nodes' management ports.
I left the wifi settings at their defaults. My internet router assigns static IPs in the 192.168.1.X range, which is why the cluster networks skip from 192.168.0.X to 192.168.2.X. This setup lets me use the internet and the intranet at the same time. I believe it should also be possible to use the other LAN port for internet at the same time if you turn IPv4 back on, but I have not tested this.
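
For reference, roughly the same configuration can be done from the command line with nmcli instead of the GUI. This is only a sketch; the interface names eno1 and eno2 are placeholders for whatever your LAN ports are actually called:
  • nmcli connection add type ethernet con-name intranet ifname eno1 ipv4.method manual ipv4.addresses 192.168.2.1/24 ipv4.never-default yes connection.autoconnect yes
  • nmcli connection add type ethernet con-name ipmi ifname eno2 ipv4.method manual ipv4.addresses 192.168.0.1/24 ipv4.never-default yes connection.autoconnect no
  • nmcli connection up intranet
The ipv4.never-default setting corresponds to the "use this connection only for resources on its network" checkbox in the GUI.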

Firewalls

Because the headnode is connected to the internet, the firewall needs to stay active. However, the firewall will block MPI traffic. You have to assign ports for OpenMPI to use and then open those ports in the firewall. I'm not really sure how many ports to use...5 for both seemed to work ok.
  1. Create or edit the file ~/.openmpi/mca-params.conf. Each setting is added as a line of the form "parameter = value" (see the example file after this list).
  2. Set the btl parameter "btl_tcp_port_min_v4" to some high port, e.g. 12341, and set "btl_tcp_port_range_v4" to the number of ports to use, which sometimes needs to be greater than the number of processes you plan on running. If the range is 5, that makes the ports 12341-12345.
  3. Set the oob parameter "oob_tcp_dynamic_ipv4_ports" to a range of ports. A small range works for this, like 5.
  4. Add the line btl_tcp_if_include = 192.168.2.0/24 (change this to your intranet subnet). This restricts MPI TCP communication to the intranet. An example of why this matters: say you have two network interfaces and you don't want MPI using one of them (e.g. my ipmi interface), yet you want that interface to stay up all the time. If MPI sees that the interface is up, it will try to use it, even if there isn't a route between the nodes on that interface, and MPI will fail. This parameter prevents that error.
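Putting that together, the ~/.openmpi/mca-params.conf file ends up looking roughly like this. The exact port numbers are arbitrary, but every port in both ranges has to be opened in the firewall in the next step:
# keep MPI TCP traffic on the intranet subnet
btl_tcp_if_include = 192.168.2.0/24
# BTL (data) ports
btl_tcp_port_min_v4 = 12341
btl_tcp_port_range_v4 = 5
# OOB (runtime) ports
oob_tcp_dynamic_ipv4_ports = 12346-12350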
Those ports then need to be opened in the firewall. Last time I did this, I was able to add the intranet and ipmi profiles to the home zone of the firewall, which worked because the profiles were treated as "logical devices". However, I am now unable to repeat this and have to add the names of the interfaces instead, i.e. the Ethernet adapter port names. Not sure what is different.
  1. sudo firewall-cmd --permanent --zone=home --change-interface=eth0
  2. Then add all of the MPI TCP ports to the home zone's open ports: sudo firewall-cmd --permanent --zone=home --add-port=12341-12345/tcp
  3. Reload the firewall so the permanent changes take effect: sudo firewall-cmd --reload
If any Ethernet port associated with those profiles is ever connected to the internet, those ports will need to be closed. It's also possible to define MPI as a custom firewalld service instead of opening the ports individually.

NFS

Make sure to do the above firewall setup before doing NFS setup.

The nfs-utils package should already be installed. If not, install that now. The user (cluster) home folder will be shared over the intranet link with the slave nodes. The directory must be completely owned by the user:
  • chmod -R 755 /home/cluster
  • chown -R cluster:cluster /home/cluster
The following services need to be enabled and started:
  • systemctl enable rpcbind
  • systemctl enable nfs-server
  • systemctl enable nfs-lock
  • systemctl enable nfs-idmap
  • systemctl start rpcbind
  • systemctl start nfs-server
  • systemctl start nfs-lock
  • systemctl start nfs-idmap
Next, edit the file /etc/exports, and add the following line to it. The "*" means "all IP addresses can mount this", which is fine for an intranet.
  • /home/cluster *(rw,sync,no_root_squash,no_all_squash,no_subtree_check)
Then do the following:
  1. systemctl restart nfs-server
  2. firewall-cmd --permanent --zone=home --add-service=nfs
  3. firewall-cmd --permanent --zone=home --add-service=mountd
  4. firewall-cmd --permanent --zone=home --add-service=rpc-bind
  5. firewall-cmd --reload
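As a quick sanity check of the export, these can be run on the headnode (and "showmount -e headnode" should also work from a slave node later, once it can reach the NFS services):
  • exportfs -v
  • showmount -e headnode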

SSH

You need to set up passwordless SSH so the nodes can communicate. Make sure SSH is installed on all nodes. As the user (cluster), run "ssh-keygen" and don't set a password (just hit enter). This creates an SSH key that is automatically shared to all nodes because the home directory is shared. Next, as cluster on the headnode, run "ssh-copy-id localhost". This copies the headnode's public key to the authorized_keys file, which is then shared to all of the slave nodes via NFS. After setting up the slave nodes (see below), it should be possible to SSH to all nodes, e.g. "ssh node002". MPI communication between nodes should also be possible.

Troubleshooting: make sure that the permissions are correct for the /home/cluster directory, the /home/cluster/.ssh directory, and all of the files in .ssh (see below). If those are all correct and it still doesn't work, try deleting all of the files in the .ssh folder and redoing the above steps. If that doesn't work and you're getting something like "signing failed, agent refused operation", try running "ssh-add" on the headnode as cluster. I had to do the latter during one iteration of this cluster, but not during another...no idea why.
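
For reference, the permission combination that usually works is roughly the following (assuming the default id_rsa key name); treat it as a starting point rather than gospel:
  • chmod 755 /home/cluster
  • chmod 700 /home/cluster/.ssh
  • chmod 600 /home/cluster/.ssh/authorized_keys
  • chmod 600 /home/cluster/.ssh/id_rsa
  • chmod 644 /home/cluster/.ssh/id_rsa.pub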

Slave Nodes

The plan is to set up one of the nodes first and get everything working with it, then clone the drive to three more SSDs for the other nodes. The only things that should have to be changed after cloning are the hostname and network settings.

Hostfile

Edit (or copy from headnode) the /etc/hosts file to look like this:
127.0.0.1 localhost .... (etc)
::1....
192.168.2.1 headnode
192.168.2.2 node002
192.168.2.3 node003
192.168.2.4 node004
192.168.2.5 node005
 

Network and Intranet

The slave nodes will never be connected to the internet. They will only be connected to the intranet. Since there is no GUI, the settings have to be made with "nmtui" or manually using the /etc/sysconfig/network-scripts/ifcfg-ens2fX files, where X=0,1 for the two LAN ports. Change the IP settings to correspond to each node, then change ONBOOT to yes. The interface can be turned on from a terminal with ifup ens2f1.
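
As an example, the intranet ifcfg file for node002 might look roughly like this (interface name and IP adjusted per node; this is a sketch rather than a complete drop-in file, since the installer adds a few more keys such as UUID):
TYPE=Ethernet
BOOTPROTO=none
NAME=ens2f0
DEVICE=ens2f0
ONBOOT=yes
IPADDR=192.168.2.2
PREFIX=24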

Firewall

Kill the firewall on the slave nodes: systemctl stop firewalld, systemctl disable firewalld

NFS

Make sure to do the above firewall setup before doing NFS setup.

The nfs-utils package should already be installed. If not, install that now. Then run the following command to mount the NFS shared directory: sudo mount headnode:/home/cluster /home/cluster

To make this permanent, edit the /etc/fstab file and add the following line: headnode:/home/cluster /home/cluster nfs defaults 0 0

Repeat the above for all slave nodes. Test the NFS connection by creating a file in the home directory and check to see if it has propagated to all nodes. The following command may need to be run as root on all slave nodes to set SELinux policy:  "setsebool -P use_nfs_home_dirs 1". I don't remember the reason for this, but I think I had to do this to get it working before, though this time it seemed to work without setting that.
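
A quick propagation test, assuming SSH to node002 is already working:
  • (on the headnode, as cluster) touch ~/nfs_test
  • (on node002, as cluster) ls -l /home/cluster/nfs_test
If the file shows up on the slave node, the share is working; delete it afterwards with rm ~/nfs_test.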


Testing

I received a SELinux alert about rpc on the headnode when I established the NFS connection on the slave node. I followed the instructions provided for changing the policy to allow this.

Now that everything is set up, you should be able to run the mpi_hello_world script across multiple nodes. Note: if not all slave nodes are hooked up yet, you must comment out their lines in the MPI hostfile with a "#". mpirun first tries to establish a connection to all nodes, so if one of the nodes in the list is offline, it throws an error.
  • mpirun -np 36 -mca btl ^openib -hostfile ~/cluster_hostfile ./mpi_hello_world
This should work with no errors.
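
If you need the test program itself, the usual MPI hello-world example compiles with the OpenMPI wrapper compiler; assuming the source file is named mpi_hello_world.c (a placeholder name, not something from this setup):
  • mpicc mpi_hello_world.c -o mpi_hello_world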

OpenFOAM will not work across nodes yet, though, because the environment is not copied to the remote processes. The proper fix is SLURM, which I'll need eventually anyway since I will likely be using multiple environments in the future. As a quick fix, I changed the slave node's .bashrc to source the OpenFOAM bashrc directly instead of setting it up as an alias. I then ran the previous motorBike tutorial benchmark on 36 processes. This required changing the ./Allmesh script to call mpirun directly, since the "runParallel" OpenFOAM function does not use hostfiles. I commented out the runParallel line and added these two:
  • nProcs=$(getNumberOfProcessors system/decomposeParDict)
  • mpirun -np 36 -hostfile /home/cluster/cluster_hostfile snappyHexMesh -parallel -overwrite > log.snappyHexMesh 2>&1
Actually, running snappyHexMesh across 36 processes on a mesh this small is probably counterproductive, but it made a good test. IIRC, decomposePar can be run with a smaller number of partitions, snappyHexMesh can be run on those, and then the mesh can be re-decomposed with more partitions for the actual case (a rough sketch of that workflow is at the end of this section). Running the case was done with the following command:
  • mpirun -np 36 -hostfile /home/cluster/cluster_hostfile simpleFoam -parallel > log.simpleFoam 2>&1
This completed in 64.24 s, or 1.56 iter/s. This is fast, but it's about 21% less than the sum of the two individual nodes' max iter/s. Part of the reason is that the slave node is about 20% slower than the headnode on a core-by-core basis, so the headnode's processes must wait for the slave node's processes to finish. The primary reason, though, is likely communication delay caused by the gigabit network. Typically, 60-100k cells per process is close to optimal (it can be more or less depending on the solver, network, and many other things). 2M/20 is 100k, and 2M/36 is 55k, so it makes sense that the speedup is not perfect. Faster networks allow lower cell counts per process while maintaining speedup scaling.
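
For reference, the re-decomposition workflow mentioned above would go roughly like this; I haven't re-verified every step, so treat it as a sketch:
  • set numberOfSubdomains in system/decomposeParDict to the smaller count and run decomposePar
  • mpirun -np <small count> -hostfile /home/cluster/cluster_hostfile snappyHexMesh -parallel -overwrite
  • reconstructParMesh -constant
  • set numberOfSubdomains to the full count and run decomposePar -force
  • run the solver in parallel as before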

Infiniband

I've had a few posts discussing this, but to summarize: I have a working QDR Infiniband system based on used Sun Infiniband cards (actually rebranded Mellanox MHQH29B) and a Sun switch. I've tested it with various performance tests, and it achieves the expected QDR performance. Now I need to make it work with MPI and OpenFOAM.

First, RDMA communications require locked (pinned) memory, and if you use a lot of memory (like when running CFD), you will probably hit the default system limits. You need to add an rdma.conf file to /etc/security/limits.d with the following contents:
# configuration for rdma tuning
* soft memlock unlimited
* hard memlock unlimited
# rdma tuning end
That file needs to be copied to the slave nodes:
  • (as root) scp /etc/security/limits.d/rdma.conf root@node002:/etc/security/limits.d/rdma.conf
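With more than one slave node up, a small loop saves some typing (assuming root SSH access to the nodes):
  • for n in node002 node003 node004 node005; do scp /etc/security/limits.d/rdma.conf root@$n:/etc/security/limits.d/; done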
Now reboot all nodes. Supposedly logging out and logging back in will work, and there's probably a clever way to reload that limit, but I don't know it. "ulimit -l" should give "unlimited". Now to run the mpi_hello_world script again, but this time without the mca parameter excluding openib:
  • mpirun -n 36 -hostfile ~/cluster_hostfile ./mpi_hello_world
That should run without errors. To positively confirm that Infiniband is being used, you can also change the "^openib" in the earlier command to "openib,self", which forces the openib BTL (the self BTL has to stay enabled so a rank can talk to itself).

Now to try that OpenFOAM benchmark again. Everything is the same as before (see "Testing" section above), except now the log file shouldn't have any warnings about not using verbs devices, and it should be faster. Run time was 51.21 s (1.95 iter/s), which is approximately 20% faster. In fact, it's only about 1% slower than the sum of the two nodes' individual max iter/s. This means scaling is almost perfect, which is excellent. 

Summary

I now have OpenFOAM working over Infiniband with a headnode and one slave node. There are still more things to do though:
  1. Setup environment module for openmpi
  2. Get SLURM working
  3. Copy the slave node's SSD three times and install the copies in the other slave nodes
  4. Modify node003-node005's hostname and network IP addresses.
  5. Test MPI over intranet and Infiniband with all nodes
I think I'm going to skip trying to get disk-less boot working. It's very complicated. Some of the software I eventually want to install on the slave nodes is rather heavy, and I don't want the OS and installation taking up a big chunk of the RAM.



Update: A few days after setting all this up (the server was powered off in the meantime), I hit a weird error. SSH started asking for a password, and I couldn't access the /home/cluster directory on the slave node; I got an NFS "Stale file handle" error. I fixed this by doing the following as root on the slave node:
  1. umount -f /home/cluster
  2. mount -t nfs headnode:/home/cluster /home/cluster
It seems to work now, even after rebooting both headnode and slave node. Not sure why it happens.
