
Wednesday, July 11, 2018

Compute node drive cloning, take 2

The previous two posts detailed my drive cloning woes and their resolution. This post is about the actual cloning process.

I created a bootable USB of Clonezilla Live. I plugged that, along with a second SSD, into the slave node and booted into Clonezilla. Clonezilla has some nice instructions for drive cloning, and this time it just worked. I then shut down the node. To test the new drive, I removed the original SSD and the Clonezilla USB, left the new SSD in, and booted the node. No problems, it worked just like the original! Until I put it in another node...then it failed to boot. It turns out that, because these are UEFI nodes and I didn't install CentOS on any but the first, they don't have an entry in NVRAM telling the firmware where to find the bootloader. So I had to create a new boot entry in the BIOS pointing to the EFI bootloader. This worked. If the BIOS doesn't give you the option to create a new boot entry, you have to do it with efibootmgr from a rescue USB.
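If you do end up going the efibootmgr route, something along these lines should work from a rescue USB (the disk, partition number, and loader path below are assumptions; point them at the actual EFI system partition and bootloader for your install):

    efibootmgr -v                                                              # list existing boot entries
    efibootmgr -c -d /dev/sda -p 1 -L "CentOS" -l '\EFI\centos\shimx64.efi'    # create a new entry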

Some modifications need to be made to the new drive to convert it to a different node.
  1. Insert the drive into another node, say node003
  2. Create a new entry in the boot menu pointing to the EFI bootloader
  3. Boot node003 using the new option
  4. Change the hostname to node003
  5. Change the intranet and IPMI IP addresses (example commands after this list)
  6. Optional: I followed these instructions to reinstall grub2: yum reinstall grub2-efi shim. My thinking is that this might create the correct NVRAM entry. Probably good to do this if you have inconsistent boot problems.
  7. Reboot and see if it boots correctly
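For steps 4 and 5, the commands are roughly the following on CentOS 7 (the interface name and IP addresses are placeholders for whatever node003 actually uses):

    hostnamectl set-hostname node003
    # intranet IP: edit the ifcfg file for the internal interface
    vi /etc/sysconfig/network-scripts/ifcfg-eno1
    # IPMI IP, if you set it from the OS rather than the BMC/BIOS screens
    ipmitool lan set 1 ipaddr 10.0.0.103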
Every time a node is brought online or offline, the appropriate line in the cluster hostfile needs to be uncommented or commented, and slurm.conf needs to be modified and propagated to all nodes. After modifying slurm.conf, I think you have to restart slurmctld and slurmd on the headnode, and slurmd on the slave nodes (a sketch of that sequence is below). You might also have to bring the nodes back up. Supposedly "scontrol reconfigure" also works, but I haven't tried it.
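A rough sketch of that restart sequence, assuming the Slurm services are managed by systemd:

    # on the headnode, after propagating the updated slurm.conf
    systemctl restart slurmctld
    systemctl restart slurmd
    # on each slave node
    systemctl restart slurmd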


Weirdly, and I believe unrelated, the /data RAID1 volume on the headnode failed again. I gotta figure out why that keeps failing.
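When I get around to that, these are the usual first checks (assuming the array is /dev/md0; adjust to the actual md device):

    cat /proc/mdstat                        # quick view of array state
    mdadm --detail /dev/md0                 # which member dropped out, and when
    journalctl -k | grep -i -e md -e ata    # kernel messages around the failure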

When booting the cluster, the headnode must be booted first; only then can the slave nodes be booted, otherwise NFS fails to mount on them. Then the nodes need to be brought back up in Slurm (done on the headnode).
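For the record, that last step looks something like this (node003 is just a stand-in for whichever node you're bringing back):

    # on a slave node, confirm the NFS export actually mounted
    mount | grep nfs
    # on the headnode, return the node to service and check the partition
    scontrol update nodename=node003 state=resume
    sinfo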
