I created the following basic test script. It generates two random matrices, then multiplies them together. The random number generation is a serial operation, but the dot product is parallelized by default.
The lines just below import os set environment variables. The one(s) you need to set depend on what your numpy is linked against, as shown by np.show_config(). Note that they must be set before numpy is imported.

import os
#must set these before loading numpy:
os.environ["OMP_NUM_THREADS"] = '8' # export OMP_NUM_THREADS=8
os.environ["OPENBLAS_NUM_THREADS"] = '8' # export OPENBLAS_NUM_THREADS=8
os.environ["MKL_NUM_THREADS"] = '8' # export MKL_NUM_THREADS=8
#os.environ["VECLIB_MAXIMUM_THREADS"] = '4' # export VECLIB_MAXIMUM_THREADS=4
#os.environ["NUMEXPR_NUM_THREADS"] = '4' # export NUMEXPR_NUM_THREADS=4
import numpy as np
import time
#np.__config__.show() #looks like I have MKL and blas
np.show_config()
start_time = time.time()
#test script:
a = np.random.randn(5000, 50000)
b = np.random.randn(50000, 5000)
ran_time = time.time() - start_time
print("time to complete random matrix generation was %s seconds" % ran_time)
np.dot(a, b) #this line should be multi-threaded
print("time to complete dot was %s seconds" % (time.time() - start_time - ran_time))
I ran some experiments on one of the compute nodes (dual E5-2690 v2) under Slurm. The software was Anaconda 5.2, so anyone with a recent Anaconda should see similar behavior. My np.show_config() returned information about MKL and OpenBLAS, so I think those are the relevant variables to set.
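(Aside: if you want to confirm from inside the Python process how many CPUs Slurm actually handed the task, something like the sketch below works. It assumes Linux and that the cluster pins tasks to cores via the affinity/cgroup plugins; SLURM_CPUS_PER_TASK is only exported when --cpus-per-task was given.)

import os

# CPUs this process is actually allowed to run on (Slurm sets the affinity mask)
print("CPUs in affinity mask: %s" % len(os.sched_getaffinity(0)))

# only present when --cpus-per-task was specified for the job
print("SLURM_CPUS_PER_TASK: %s" % os.environ.get("SLURM_CPUS_PER_TASK", "not set"))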
Test 1: slurm cpus-per-task not set, ntasks=1, no thread-limiting variables set.
Results: No multi-threading, because slurm defaults to one cpu per task.
Test 2: slurm cpus-per-task=10, ntasks=1, no thread-limiting variables set.
Results: dot used 10 threads (10.4s)
Test 3: slurm cpus-per-task=20, ntasks=1, no thread-limiting variables set.
Results: dot used 20 threads (5.4s)
Test 4: slurm cpus-per-task=4, ntasks=1, no thread-limiting variables set.
Results: dot used 4 threads (24.8s)
Test 5: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OMP_NUM_THREADS=4
Results: dot used 4 threads (24.8s)
Test 6: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OPENBLAS_NUM_THREADS=4
Results: dot used 10 threads (10.4s)
Test 7: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, MKL_NUM_THREADS=4
Results: dot used 4 threads (24.9s)
Test 8: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, VECLIB_MAXIMUM_THREADS=4
Results: dot used 10 threads (10.4s)
Test 9: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, NUMEXPR_NUM_THREADS=4
Results: dot used 10 threads (10.4s)
Test 10: slurm cpus-per-task=10, ntasks=1, ntasks-per-socket=1, OMP_NUM_THREADS=8, OPENBLAS_NUM_THREADS=8, MKL_NUM_THREADS=8
Results: dot used 8 threads (12.5s)
As you can see above, setting either MKL_NUM_THREADS or OMP_NUM_THREADS limits the number of threads, while OPENBLAS_NUM_THREADS has no effect, so apparently OpenBLAS is not being used here, at least for dot. Limiting the number of CPUs available to the job also limits the number of threads.
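On a related note, if you ever need to change the thread count after numpy has already been imported (the environment variables only work when set beforehand), the threadpoolctl package can do it at runtime. A minimal sketch, assuming you've installed threadpoolctl separately (it isn't part of Anaconda 5.2):

from threadpoolctl import threadpool_info, threadpool_limits
import numpy as np

# list the BLAS/OpenMP pools numpy is linked against and their current thread counts
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])

a = np.random.randn(2000, 2000)
b = np.random.randn(2000, 2000)

# temporarily cap every BLAS library (MKL, OpenBLAS, ...) at 4 threads for this block
with threadpool_limits(limits=4, user_api="blas"):
    c = np.dot(a, b)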
For this code, which has to run on hundreds of different cases that can run simultaneously, it looks like giving each case one full socket (10 cores) is optimal. The environment variables don't need to be set, because the default behavior is to use all the cores Slurm makes available. That assumes np.dot is a good indicator, which it might not be, since her code is far more complicated.
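If you would rather pin the thread count explicitly instead of relying on the defaults, one option is to copy the Slurm allocation into the thread-limiting variables at the top of the script. This is just a sketch; it falls back to the affinity mask when SLURM_CPUS_PER_TASK isn't exported (i.e. when --cpus-per-task wasn't given):

import os

# use the Slurm allocation if present, otherwise every CPU this process can run on
n = os.environ.get("SLURM_CPUS_PER_TASK") or str(len(os.sched_getaffinity(0)))
os.environ["OMP_NUM_THREADS"] = n
os.environ["MKL_NUM_THREADS"] = n
os.environ["OPENBLAS_NUM_THREADS"] = n

import numpy as np  # import only after the variables are set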
Anyway, I hope someone finds this useful.