
Friday, November 16, 2018

Automatic multi-threading with Python numpy

This came up while running my wife's Python code on the cluster. It turns out that numpy vector operations are automatically parallelized if numpy is linked against certain libraries, e.g. openBLAS or MKL, during compilation. By default, those linear algebra libraries use all available cores (or, if your processor has hyper-threading, 2x the number of physical cores) for matrix operations. While that might seem convenient, it actually made a lot of people unhappy because of the overhead involved in multi-threading lots of tiny matrix operations. Fortunately, there is a way to control the max number of threads used, and some devs are working on a way to control it dynamically via numpy.
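One way to get that kind of dynamic control is the separate threadpoolctl package (not part of numpy itself), which can cap the BLAS threads for a single block of code at runtime. Here is a minimal sketch, assuming both numpy and threadpoolctl are installed. The helper names dot_with_thread_limit and have_packages are mine, made up for illustration.

```python
import importlib.util


def have_packages():
    """Return True if both numpy and threadpoolctl can be imported."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("numpy", "threadpoolctl")
    )


def dot_with_thread_limit(a, b, max_threads):
    """Compute np.dot(a, b) with the BLAS thread pool capped at max_threads.

    The cap only applies inside the 'with' block; the previous setting is
    restored afterwards.
    """
    import numpy as np
    from threadpoolctl import threadpool_limits

    with threadpool_limits(limits=max_threads, user_api="blas"):
        return np.dot(a, b)


if have_packages():
    import numpy as np

    a = np.random.randn(200, 300)
    b = np.random.randn(300, 200)
    c = dot_with_thread_limit(a, b, max_threads=2)
    print("result shape:", c.shape)
else:
    print("numpy and/or threadpoolctl not installed; skipping demo")
```

Unlike the environment variables, this does not have to happen before numpy is imported, so it could be used to turn threading down only around the many tiny operations.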

I created the following basic test script. It generates two random matrices, then multiplies them together. The random number generation is a serial operation, but the dot product is parallelized by default.
import os
#must set these before importing numpy (shell equivalent shown in each comment):
os.environ["OMP_NUM_THREADS"] = '8' # export OMP_NUM_THREADS=8
os.environ["OPENBLAS_NUM_THREADS"] = '8' # export OPENBLAS_NUM_THREADS=8
os.environ["MKL_NUM_THREADS"] = '8' # export MKL_NUM_THREADS=8
#os.environ["VECLIB_MAXIMUM_THREADS"] = '4' # export VECLIB_MAXIMUM_THREADS=4
#os.environ["NUMEXPR_NUM_THREADS"] = '4' # export NUMEXPR_NUM_THREADS=4

import numpy as np
import time

#np.__config__.show() #looks like I have MKL and blas
np.show_config()

#test script:
start_time=time.time()
a = np.random.randn(5000, 50000) #each matrix needs about 2 GB of RAM
b = np.random.randn(50000, 5000)
ran_time=time.time()-start_time
print("time to complete random matrix generation was %s seconds" % ran_time)
dot_start=time.time() #separate timer so the print above is not counted
np.dot(a, b) #this line should be multi-threaded
print("time to complete dot was %s seconds" % (time.time() - dot_start))
The lines under import os set environment variables. The one(s) you need to set depend on what your numpy is linked against, as shown by np.show_config(). Note that those must be set before importing numpy.
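If you are not sure which library your numpy uses, one option is to set all three variables at once, before numpy gets imported. The sketch below does that with a small helper. The name limit_blas_threads and the choice of variables are my own assumptions for illustration, not anything numpy provides.

```python
import os
import sys

# Variables that had an effect, or could depending on the linked library.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(n_threads):
    """Set the thread-limit variables to n_threads.

    Returns True if numpy had not been imported yet (so the limit should be
    picked up), and False if numpy was already loaded (too late to matter).
    """
    in_time = "numpy" not in sys.modules
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    return in_time


if not limit_blas_threads(4):
    print("warning: numpy was already imported; limits may be ignored")

# import numpy as np  # import numpy only after the limits are set
for name in THREAD_ENV_VARS:
    print(name, "=", os.environ[name])
```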

I ran some experiments on one of the compute nodes (dual Xeon E5-2690 v2, so 2 sockets with 10 cores each) using slurm execution. Software was Anaconda 5.2, so anyone with a recent Anaconda should see similar behavior. My np.show_config() returned information about MKL and openBLAS, so I think those are the relevant variables to set.

Test 1: slurm cpus-per-task not set, ntasks=1, no thread limiting variables set.
Results: No multi-threading because slurm defaults to one cpu per task.

Test 2: slurm cpus-per-task=10, ntasks=1, no thread limiting variable set.
Results: dot used 10 threads (10.4s)

Test 3: slurm cpus-per-task=20, ntasks=1, no thread limiting variable set.
Results: dot used 20 threads (5.4s)

Test 4: slurm cpus-per-task=4, ntasks=1, no thread limiting variable set.
Results: dot used 4 threads (24.8s)

Test 5: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OMP_NUM_THREADS=4
Results: dot used 4 threads (24.8s)

Test 6: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OPENBLAS_NUM_THREADS=4
Results: dot used 10 threads (10.4s)

Test 7: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, MKL_NUM_THREADS=4
Results: dot used 4 threads (24.9s)

Test 8: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, VECLIB_MAXIMUM_THREADS=4
Results: dot used 10 threads (10.4s)

Test 9: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, NUMEXPR_NUM_THREADS=4
Results: dot used 10 threads (10.4s)

Test 10: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OMP_NUM_THREADS=8, OPENBLAS_NUM_THREADS=8, MKL_NUM_THREADS=8
Results: dot used 8 threads (12.5s)

As you can see above, setting either MKL_NUM_THREADS or OMP_NUM_THREADS limits the number of threads. OPENBLAS_NUM_THREADS had no effect, so apparently openBLAS is not being used, at least for dot, and VECLIB_MAXIMUM_THREADS and NUMEXPR_NUM_THREADS had no effect either. Limiting the number of cpus available through slurm also limits the number of threads.
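To check how many cpus a job actually got, a script can ask the operating system directly. This is a rough sketch under a few assumptions: os.sched_getaffinity only exists on Linux, SLURM_CPUS_PER_TASK is only set when cpus-per-task was requested, and whether slurm really restricts the process to those cpus depends on how the cluster is configured. The function name available_cpus is made up for illustration.

```python
import os


def available_cpus():
    """Return the number of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects CPU affinity restrictions placed on the process.
        return len(os.sched_getaffinity(0))
    # Other platforms: fall back to the total CPU count.
    return os.cpu_count() or 1


slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK", "not set")
print("SLURM_CPUS_PER_TASK:", slurm_cpus)
print("cpus usable by this process:", available_cpus())
```

Printing this at the top of a job makes it easy to compare against the thread count seen during the dot product.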

For my wife's code, which she has to run on hundreds of different cases that can run simultaneously, it looks like giving one full socket (10 cores) per case is optimal. The environment variables don't need to be set because the default behavior is to use all available cores (limited by slurm). That assumes np.dot is a good indicator, which it might not be because her code is far more complicated.
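As a quick sanity check on the scaling, the timings from Tests 2, 3, 4 and 10 can be turned into core-seconds and a parallel efficiency relative to the 4-thread run. This is just arithmetic on the numbers reported above; the function names are mine.

```python
# threads -> seconds for np.dot, taken from Tests 4, 10, 2 and 3 above
timings = {4: 24.8, 8: 12.5, 10: 10.4, 20: 5.4}

BASE_THREADS = 4
BASE_TIME = timings[BASE_THREADS]


def core_seconds(threads, seconds):
    """Total cpu time consumed: threads multiplied by wall-clock seconds."""
    return threads * seconds


def efficiency(threads, seconds):
    """Measured speedup over the 4-thread run divided by the ideal speedup."""
    speedup = BASE_TIME / seconds
    ideal = threads / BASE_THREADS
    return speedup / ideal


for threads, seconds in sorted(timings.items()):
    print(
        "%2d threads: %5.1f s, %6.1f core-seconds, efficiency %.2f"
        % (threads, seconds, core_seconds(threads, seconds), efficiency(threads, seconds))
    )
```

The core-seconds only grow from about 99 at 4 threads to 108 at 20 threads, so at least for np.dot the scaling is close to linear and using a full socket per case costs little in efficiency.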

Anyways, I hope someone finds this useful.
