
Thursday, August 9, 2018

Slurm and power saving

My cluster, like most private clusters, is not in continuous use. To save power, I have been manually powering the compute nodes off when I don't need them and back on when I do. This is a pain. I was thinking I could write some sort of script that monitors slurm's squeue for idle nodes and powers them off, then powers them back on when new jobs that need the resources get added.

Turns out, Slurm has a feature called power saving that does 80% of the work. I say 80% because you still have to write two scripts: one that powers nodes off (SuspendProgram) and one that powers them on (ResumeProgram). Slurm handles identifying which nodes to power off and when to call either program. The power off is fairly simple: a sudo call to poweroff should work. Powering nodes back on is a little trickier. I should be able to do it with ipmitool, but before I do that, I need to set up the IPMI network.

I wanted to get a ~16-24 port gigabit managed switch. I could then create two VLANs, one for the basic intranet communication stuff (MPI, SSH, slurm, etc.) and one for IPMI. However, I couldn't find one for less than about $40, and those were about 10 years old...eek. New 8 port unmanaged 1GbE switches are only about $15, so I just bought another one of those. The two unmanaged switches are fanless and less power hungry, which is nice, and I really didn't need any of the other functionality a managed switch provides. Time to hook it up:

Color coded cables :D

I have two ethernet ports on the headnode: one is connected to the intranet switch, and one to the ipmi switch. I made sure my firewall was configured correctly, then tried connecting to the ipmi web interface using node002's ipmi IP and a browser. This worked. For servers with IPMI, you can use ipmitool to do a lot of management functions via a terminal, including powering on and off the server:
ipmitool -H IP -v -I lanplus -U user -P password chassis power on
ipmitool -H IP -v -I lanplus -U user -P password chassis power soft
Here IP is the IPMI IP address or hostname, user is the IPMI username, and password is the IPMI password. Pretty cool, right?
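For a quick sanity check before wiring any of this into slurm, you can also query the current power state (same placeholders as above):
ipmitool -H IP -I lanplus -U user -P password chassis power status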

I added the ipmi IP addresses to my /etc/hosts file with the hostnames as "ipminodeXXX". Then in the slurm suspend and resume scripts, I created a small routine that adds "ipmi" to the front of the hostnames that slurm passes to the scripts. This is used in the above ipmitool calls. The SuspendProgram and ResumeProgram are run by the slurm user, which does not have root privileges, so I also had to change the permissions to make those scripts executable by slurm.
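For anyone curious, here's a minimal sketch of roughly what the resume and suspend scripts look like. Treat it as a template, not my exact files: the IPMI username/password, the script names, and the log path are placeholders for whatever you use (only the "ipmi" hostname prefix and the power_save.log idea come straight from what I described above). Slurm passes the node list to these scripts in its compact form (e.g. node[002-005]), so the scripts expand it with scontrol show hostnames first.

slurmresume (ResumeProgram):
#!/bin/bash
# slurmctld passes a hostlist like "node[002-005]"; expand it and power each node on over IPMI
LOG=/var/log/slurm/power_save.log
for host in $(scontrol show hostnames "$1"); do
    echo "$(date) resuming $host" >> "$LOG"
    ipmitool -H "ipmi${host}" -I lanplus -U user -P password chassis power on >> "$LOG" 2>&1
done

slurmsuspend (SuspendProgram):
#!/bin/bash
# same idea, but ask the BMC for a soft power off
LOG=/var/log/slurm/power_save.log
for host in $(scontrol show hostnames "$1"); do
    echo "$(date) suspending $host" >> "$LOG"
    ipmitool -H "ipmi${host}" -I lanplus -U user -P password chassis power soft >> "$LOG" 2>&1
done

The slurm user also needs write access to wherever that log file lives, in addition to execute permission on the scripts.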

You could also do the power off with the sudo poweroff command. In order to be able to run poweroff without entering a password, you can edit the /etc/sudoers file on the compute nodes and add the following lines:
cluster ALL=NOPASSWD: /sbin/poweroff
slurm ALL=NOPASSWD: /sbin/poweroff
This allows the users cluster and slurm to use the sudo poweroff command without entering a password. From a security standpoint, this is probably ok, because the worst root-privileged thing someone who gains access to either user can do is power the system off. You'll have to use something like wake-on-lan or ipmitool to boot it back up, though.
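If you go the sudo poweroff route instead of the IPMI soft-off, the suspend script just needs to ssh into each node as one of those users. A rough sketch (this assumes passwordless ssh keys are already set up for the slurm user on the compute nodes, which I haven't covered here):
#!/bin/bash
# alternative SuspendProgram: shut nodes down over ssh instead of IPMI
for host in $(scontrol show hostnames "$1"); do
    ssh -o BatchMode=yes "$host" sudo /sbin/poweroff &
done
wait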

To set ResumeTimeout, you need to know the time it takes for a node to boot. For my compute nodes, it's about 100s, so I set ResumeTimeout to 120s. The other settings were fairly obvious. Make sure the paths to the scripts are absolute. I excluded the headnode because I don't want slurm to turn it off.
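For reference, the power saving block in slurm.conf ends up looking roughly like this; the script paths and the head node's name are placeholders for your own setup, and the times match what I describe in this post:
# power saving settings
SuspendProgram=/usr/local/slurm/slurmsuspend
ResumeProgram=/usr/local/slurm/slurmresume
# idle time before a node is powered down (5 minutes)
SuspendTime=300
# how long slurm waits for a node to boot before marking it down
ResumeTimeout=120
# never power down the head node
SuspendExcNodes=headnode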

Once I had everything set, I copied the slurm.conf to all nodes, as well as the new hosts file. I also copied the suspend and resume scripts, but I don't think that was necessary, because I believe only slurmctld (which runs only on the headnode) deals with power saving. I then tried the scontrol reconfig command, but it didn't seem to register the change, so I ended up restarting the slurmd and slurmctld services on the head node. After that I saw a line about the power save module in the slurmctld log file.

I then waited around for 5 minutes and slurm successfully powered down the compute nodes! They show up as "idle~" in sinfo, where the "~" means power save mode. I have the scripts log to a power_save.log file, and I can see an entry in there for which nodes were powered down. An entry is also placed automatically in the slurmctld log stating how many nodes were put in power save mode, but not which ones. Then I started my mpi test script with ntasks=100. This caused slurm to call the slurmresume program for all nodes, which booted all of them (sinfo shows "alloc#"), and then slurm allocated them and ran the job. Five minutes later, slurm shut the nodes down again. Perfect. One of the rare times untested scripts I wrote just worked.
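In case it's useful, the commands involved in that last bit were just the standard service restarts plus sinfo to watch the node states (assuming the usual systemd unit names):
# on the head node
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
# watch node states: "idle~" = powered down, "alloc#" = powering up for a job
sinfo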

Some final notes:
  • This isn't very efficient for short jobs, but it will work great for a lot of long jobs with indeterminate end times. 
  • Interestingly, node003 was the slowest to boot by quite a lot...I'll have to see if there is something slowing its boot down. Luckily, that's the one I happened to time for setting the ResumeTimeout. Slurm places nodes in the down state when they take longer than ResumeTimeout to boot. 
  • I had an issue with node005 once earlier today where on boot the OS didn't load correctly or something...lots of weird error messages. It hasn't happened since, so hopefully it was just a fluke.

Updates:
The next morning, node002 and node005 were in the "down~" state. Checking the slurmctld log, it looks like node005 unexpectedly rebooted. A node's slurmd doesn't respond while the node is off; slurmctld logs this as an error, but it knows the node is in power save mode, so it doesn't mark it down unless the node randomly reboots. node005 did exactly that and came back on, so slurm marked it down. node002, on the other hand, failed to resume within the ResumeTimeout limit, so slurm marked it down too. I'm not sure why it took so long last night; I booted it this morning in less than two minutes. Since node005 was already on and node002 was now booted, I ran the scontrol commands to resume both nodes, which worked. Five minutes later they were put back in power save mode and are now "idle~".

I then tested slurm again with the mpi script. It booted the idle~ nodes ("alloc#"), then allocated the job. node002 failed to boot within 120s again, while node003 and node005 took ~80s. Boot times seem to be quite variable for some reason, so I changed ResumeTimeout to 180s. I tried resuming node002 once it had booted, and that worked, but then slurm wouldn't allocate it for some reason (it was stuck on "alloc#"), so it was put in the down state again. I had to scancel the job, run scontrol resume for node002 manually again (it then showed "idle~", which wasn't accurate...it should have been just idle), then restart slurmd on node002. That made all of the unallocated nodes show up as "idle" in sinfo.

Then I tried submitting the mpi test job again. It allocated all of the idle nodes, ran the job, and exited. I waited for the nodes to be shut down, ran the mpi test script again, and slurm resumed the nodes (~2 minutes), allocated them to the job, and ran the job (~3 minutes). Running the job took longer than the usual second or so (it's just a hello world) because I think slurm was still doing some bookkeeping. This is why power saving isn't very efficient for short jobs.

Process for fixing a node that randomly rebooted during a powersave (both procedures are summarized as plain commands after the lists):
  1. The node should be on if this happened, but in the "down~" state
  2. scontrol: update NodeName=nodeXXX State=RESUME
  3. The node should now be allocated
Process for fixing a node that failed to boot in time during a powersave resume:
  1. Make sure there aren't any jobs queued
  2. If the node is off, boot it manually and wait for it to boot
  3. scontrol: update NodeName=nodeXXX State=RESUME
  4. sinfo should now think the node is "idle~"
  5. ssh to the node and restart slurmd
  6. sinfo should now think the node is "idle"
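As plain commands, those two procedures boil down to something like this (node002 is just an example, and the IPMI credentials are placeholders as before):
# node randomly rebooted during power save (shows "down~" but is actually on):
scontrol update NodeName=node002 State=RESUME

# node failed to boot within ResumeTimeout:
ipmitool -H ipminode002 -I lanplus -U user -P password chassis power on   # only if it's still off
scontrol update NodeName=node002 State=RESUME
ssh node002 sudo systemctl restart slurmd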
 I'll keep updating this as I run into problems with slurm power save.

Helpful trick for preventing jobs from being allocated without having to scancel them: go into scontrol: hold XXX, where XXX is the job number. Then to un-hold it, use release XXX.
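That looks like this, either inside the interactive scontrol shell or as one-shot commands (1234 is a made-up job number):
scontrol hold 1234
scontrol release 1234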

If the above doesn't work, or if the node powers up and is stuck in state "CF", but the job fails to start and it gets requeued (and this repeats), then there's something wrong with your configuration. In my case, chronyd was not disabled on the compute nodes, which prevented ntpd from starting, which messed up the time sync. I fixed this and was able to use the above steps to get the nodes working again.
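For anyone hitting the same chrony/ntp clash, the fix on my compute nodes amounted to something like this (assuming systemd and the stock service names):
sudo systemctl disable --now chronyd
sudo systemctl enable --now ntpd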

Oh, and make sure your IPMI node names/IPs correspond to your intranet node names/IPs. I had node002 and node005 crossed, so slurm would end up shutting down node002 when it meant to shutdown node005 and vice versa. Oops.

Update Oct 2018: I went on a long trip and wasn't planning on running anything, so I had everything turned off. After I came back, I tried booting the headnode up and submitting jobs to see if slurm would just work. It didn't. I had to follow the "process for fixing a node that failed to boot" steps above. I also had to power cycle the network switch that the ipmi network is on...I guess it entered some sort of low power mode.