Wednesday, July 11, 2018

Recreating the compute node drive

As mentioned in the last post, something as simple as not having uniform drives across nodes can cause a huge mess. In this case, I have to reinstall everything on the slave node SSD (the smallest one this time) before I can continue.

I'm following my software guide while doing this, which currently consists of three parts. I'm also updating and cleaning up those parts as I go. If you want to read about the various screw-ups I had during this process, then keep reading. If not, skip to the "Final steps" below.

I installed CentOS, but with manual partitioning, ext4, and no LVM this time. I also used a much smaller home directory, leaving a lot of free space on the drive, which should make copying it easier. I then did the update, rebooted, installed all of the packages I thought I'd need, renamed the machine to node002, and rebooted again. At this point, I thought I'd try something that worked for me on the headnode. When I switched from an SSD to NVMe on the headnode (after switching motherboards), I didn't want to have to reinstall everything again. After doing the OS and package installations, I copied over all of the directories I had modified: the root and cluster home directories, /etc, /usr, and /opt. In order to include hidden files, you want to do something like "cp -r /mnt/usb1/opt/. /opt/" (the dot is critical, and cp needs -r to copy a directory). This actually worked pretty well for openmpi and openfoam. However, this happened before parts 2 and 3, which required a ton of system-level settings. My hope here is that I can do these large copies from my backup USB drive, test openmpi and openfoam, then carefully go through the guides and finish the settings without having to do all of the re-installation work.
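To see why the trailing dot matters, here is a minimal sketch that runs entirely in throwaway temp directories rather than on the real /opt. The directory and file names are made up for the demonstration. It uses "cp -a", which recurses like -r and also preserves permissions and timestamps; that choice is an assumption about what you'd want for a restore, not something from my original notes.

```sh
#!/bin/sh
# Sketch only: stand-in directories under a temp dir, not the real backup
# mount (/mnt/usb1/opt) or the real /opt.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

src="$work/backup/opt"   # plays the role of /mnt/usb1/opt
dst="$work/opt"          # plays the role of /opt
mkdir -p "$src/sub" "$dst"
echo "visible" > "$src/sub/file.txt"
echo "hidden"  > "$src/.hidden_rc"   # hypothetical top-level dotfile

# "src/." means "the contents of src, dotfiles included".
# A glob like "src/*" would skip .hidden_rc, because the shell does not
# expand * to names that start with a dot.
cp -a "$src/." "$dst/"

# List everything that arrived, including dotfiles.
ls -A "$dst"
```

Both the hidden file and the subdirectory should show up in the listing. On the real system the same command form would be run as root against the backup mount.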

Copying the home directories, /opt, and /etc went fine, but copying /usr caused CentOS to crash and no longer boot. Looking back at my notes, it seems that I only copied /usr/local before. Unfortunately, some of the modifications occurred outside of /usr/local, so this idea might still require quite a lot of work. I redid all of the steps (again), but just copied /usr/local (which, it turns out, just contains the cmake install) instead of /usr this time. I killed the firewall on the slave node, did the network setup, commented out the NFS line in the copied fstab (since I haven't hooked it up to the intranet switch again yet), and rebooted. It was right around this time I realized that you can't copy the fstab file, because it contains the unique disk identifiers (UUIDs) of the drive it was written for. Oops. Probably not a good idea to copy all of /etc then. Starting over AGAIN.
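One way to avoid the fstab problem is to keep the fstab that the fresh install generated and pull over only the NFS entry from the backup. The sketch below shows the idea on two stand-in files in a temp directory. The UUIDs, the "headnode" hostname, and the export path are all made up for illustration, and this is not a step from my guide.

```sh
#!/bin/sh
# Sketch only: operates on stand-in files, never on the real /etc/fstab.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Stand-in for the fstab on the backup drive (old disk's UUIDs, made up).
cat > "$work/backup_fstab" <<'EOF'
UUID=1111aaaa-0000-0000-0000-000000000001 /      ext4 defaults 1 1
UUID=1111aaaa-0000-0000-0000-000000000002 /home  ext4 defaults 1 2
headnode:/home/cluster /home/cluster nfs defaults 0 0
EOF

# Stand-in for the fstab written by the new install (new disk's UUIDs, made up).
cat > "$work/new_fstab" <<'EOF'
UUID=2222bbbb-0000-0000-0000-000000000001 /      ext4 defaults 1 1
UUID=2222bbbb-0000-0000-0000-000000000002 /home  ext4 defaults 1 2
EOF

# Append only uncommented lines whose filesystem type (field 3) is nfs or
# nfs4. The old UUID lines stay behind in the backup copy.
awk '$1 !~ /^#/ && ($3 == "nfs" || $3 == "nfs4")' "$work/backup_fstab" >> "$work/new_fstab"

cat "$work/new_fstab"
```

On a real node, running blkid shows the UUIDs the new partitions actually have, which is a quick check that the fstab in place matches the drive before rebooting.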

Final steps:
  1. Install the OS and packages as in software guide part 1
  2. Copy the cluster home, root home, /opt, and /usr/local directories from the backup drive (overwriting, and including hidden files)
  3. Reboot
  4. Test openmpi and openfoam on the one node
  5. Copy the /etc/hosts file from the backup drive
  6. Do the network settings as in software guide part 2 (follow until further notice)
  7. Disable the firewall
  8. Connect the headnode and do the NFS setup
  9. Test MPI over ethernet
  10. Copy the rdma.conf
  11. Reboot and set up infiniband
  12. Test MPI over infiniband
  13. Now on to software guide part 3
  14. Copy over the module files
  15. Test environment modules
  16. Make sure ntpd is working on the headnode
  17. Set up NTP on the slave node (can copy ntp.conf)
  18. Do all of the slurm stuff from scratch
  19. Test MPI and openfoam
If you exclude all the initial errors I made, this process took about 5 hours. The biggest hang-up was an odd ssh error that was resolved by running "ssh-add", something I didn't have to do before (I've added it to the guide).

Some good news: it's a little faster with the E5-2690 v2s and slightly better RAM. With n=40, the run took 57.2 s over ethernet and 45.87 s over infiniband. That's about 10% faster :)
