Wednesday, July 18, 2018

Using the Cluster for Astrophysics-y stuff

I spent months building this awesome cluster, and the first person to fully use it is my partner, haha. Since they put up with the mess for so long, I think that's fair.

I don't have the physics background to understand their research, but they're processing a ton of astronomical data, and the algorithms involved are very CPU-intensive. Their scripts are in Python, each case needs 8 cores, and they have a few hundred cases. Time to queue the cluster...

I installed Anaconda on the headnode in /opt so it'd be available system-wide. I created an environment module for this Anaconda version, then created the virtual environment(s) they wanted. Then, instead of installing Anaconda separately on each slave node, I just tar'd up the whole Anaconda directory, copied it to /opt on each node, and untar'd it there. I also copied over the environment modulefile. If I ever make massive changes, it will be faster/easier to just re-clone all of the drives, but for single program installs, it's faster to scp/ssh into each node and make the modifications. I'll look into something like pssh in the future to make this sort of thing easier. Anyway, now that Anaconda was installed and the virtual environments were available on all nodes, it was time to get Slurm working.
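
For reference, the node-to-node copy boils down to something like the sketch below. The node names, the Anaconda directory name, and the modulefile path are placeholders for my setup, and it assumes passwordless ssh from the headnode to each node:

# on the headnode: bundle up the Anaconda install
cd /opt
tar czf /tmp/anaconda-5.2.tar.gz anaconda-5.2

# push it to each slave node, unpack into /opt, and copy the modulefile
# (node names and the modulefile location are placeholders)
for node in node01 node02 node03; do
    scp /tmp/anaconda-5.2.tar.gz ${node}:/tmp/
    ssh ${node} "tar xzf /tmp/anaconda-5.2.tar.gz -C /opt && rm /tmp/anaconda-5.2.tar.gz"
    ssh ${node} "mkdir -p /etc/modulefiles/python"
    scp /etc/modulefiles/python/anaconda-5.2 ${node}:/etc/modulefiles/python/
done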

Turns out it's super simple to call a python script from an sbatch script. You only need three lines after the #SBATCH setting lines:
module load python/anaconda-5.2
source activate myenv
python <your_script>.py
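
Put together, one of these job scripts ends up looking roughly like this (the job name, partition name, and script name are just placeholders for illustration):

#!/bin/bash
#SBATCH --job-name=case_001        # placeholder name for one of the cases
#SBATCH --partition=compute        # placeholder partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --output=case_001.out

module load python/anaconda-5.2
source activate myenv
python case_001.py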
However, when I tried running this for multiple cases, Slurm would only assign one job per node, even though I had ntasks=1 and cpus-per-task=8. In theory, it should assign two of these jobs per node, since each node has 20 cores (cores = CPUs to Slurm when hyperthreading is off). I had set SelectType and SelectTypeParameters correctly in slurm.conf, but it turns out you also have to add OverSubscribe=YES to the partition definition in slurm.conf, or Slurm defaults to one job per node. With that set, Slurm scheduled two of these jobs per node. I updated the Slurm instructions in the software guide part 3 to cover this.
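
For anyone hitting the same wall, the relevant slurm.conf lines ended up looking roughly like the sketch below. The node and partition names are placeholders, and I'm assuming the usual consumable-resources setup (select/cons_res with CR_Core) for per-core scheduling; your values may differ:

SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=node[01-03] CPUs=20 State=UNKNOWN
PartitionName=compute Nodes=node[01-03] Default=YES OverSubscribe=YES MaxTime=INFINITE State=UP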

I couldn't find the answer to this online, but it turns out it's perfectly fine to activate the same Anaconda virtual environment more than once per node, as long as you do it in separate terminal sessions (separate jobs, as far as Slurm is concerned) and aren't modifying the environment during the runs. This made life a lot easier, because it meant we didn't have to track which environments were already in use when launching jobs.
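
Since every case is its own job activating the same environment, kicking off the whole batch is just a loop over sbatch from the headnode. A rough sketch, with a made-up directory layout and script name:

# one case per directory, each with its own copy of the job script;
# Slurm queues them all and packs two 8-core jobs onto each 20-core node
for case_dir in /home/partner/cases/*/; do
    ( cd "${case_dir}" && sbatch run_case.sbatch )
done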
