I've done and documented a few benchmarks before now: one determined whether there was a performance difference between two HCA firmware versions using the perftest package, and I've also run some OpenFOAM benchmarks. Now I want to do some MPI bandwidth and latency comparisons between different types of Infiniband networks, specifically:
1. Single rail QDR
2. Dual rail QDR (both ports connected)
3. Single rail FDR
4. Dual rail FDR (FDR10)
Here a "rail" refers to a communication pathway, so "dual rail" means that both ports of the dual port HCAs are connected; physically, that means plugging an Infiniband cable into both ports, with both running to the same switch. By default, OpenMPI will use all available high-speed network connections for MPI, so using dual rail should be a breeze. The goal is to see which type of network achieves the highest bandwidth. I realize that some of these have theoretical performance higher than the available PCI 3.0 x8 bandwidth (7.88 GB/s), so it'll be interesting to see how close to that limit I can get. Speaking of theory, let's do some math first.
For PCI 2.0 (ConnectX-2), the speed is 5 GT/s per lane. The following equation then gives the max theoretical bandwidth of a PCI 2.0 x8 interface with 8/10 encoding:
PCI_LANES(8)*PCI_SPEED(5)*PCI_ENCODING(0.8) = 32 Gb/s (4 GB/s)
So that is about the maximum I should expect from a PCI 2.0 x8 slot with 8/10 encoding. For PCI 3.0, the speed is 8 GT/s per lane and the encoding changes to 128/130, which yields ~63 Gbit/s (7.88 GB/s) for an x8 slot. On the Infiniband side, QDR uses 8/10 encoding, while FDR10 and FDR use the more efficient 64/66 encoding (still less efficient than PCI 3.0's 128/130). For comparison, a PCI 3.0 x8 slot using 64/66 encoding would have a max theoretical bandwidth of ~62 Gbit/s (7.76 GB/s). There are additional inefficiencies beyond encoding, though, so I expect the actual upper bandwidth limit to be slightly lower.
QDR Infiniband is 4x 10 Gbit/s links with 8/10 encoding, which yields a theoretical max bandwidth of 32 Gbit/s (4 GB/s); a single QDR link will therefore saturate a PCI 2.0 x8 slot. FDR10 Infiniband is 4x 10.3125 Gbit/s links with 64/66 encoding, which yields 40 Gbit/s (5 GB/s). FDR Infiniband is 4x 14.0625 Gbit/s links with 64/66 encoding, which yields 54.55 Gbit/s (~6.8 GB/s). Again, I expect some inefficiencies, so I doubt I'll hit those values.
Since a single QDR link maxes out the PCI 2.0 x8 interface of the CX-2 HCA, I expect the dual rail QDR CX-2 case to provide no additional bandwidth. The FDR10 and FDR cases will use a PCI 3.0 x8 interface. A single FDR link should not saturate the PCI 3.0 x8 slot, but dual rail QDR, FDR10, and FDR should, so their theoretical max bandwidth is the ~7.8 GB/s of the slot.
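As a sanity check on the arithmetic, here's a quick back-of-the-envelope script (a minimal sketch using bc; the lane counts, lane rates, and encodings are just the ones quoted above):
# Max theoretical bandwidth = lanes * lane rate * encoding efficiency
bw() {
    gbit=$(echo "scale=2; $2" | bc)
    gbyte=$(echo "scale=2; ($2)/8" | bc)
    echo "$1: $gbit Gbit/s ($gbyte GB/s)"
}
bw "PCI 2.0 x8 (8/10)"        "8*5*8/10"
bw "PCI 3.0 x8 (128/130)"     "8*8*128/130"
bw "QDR, 4x10 (8/10)"         "4*10*8/10"
bw "FDR10, 4x10.3125 (64/66)" "4*10.3125*64/66"
bw "FDR, 4x14.0625 (64/66)"   "4*14.0625*64/66"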
The first thing to do is install a benchmark suite. The OSU MicroBenchmarks are a popular choice for MPI. Download the tarball, extract it, go into the extracted osu-micro-benchmarks folder, and run the following:
module load mpi/openmpi-3.1.0
./configure CC=mpicc CXX=mpicxx --prefix=$(pwd)
make
make install
The first line is only needed if you've set up OpenMPI as a module, like I did previously. If not, then you need to point CC and CXX to the locations of mpicc and mpicxx, respectively. That will install the benchmark executables in the current directory under libexec/osu-micro-benchmarks/mpi/. If you don't specify a prefix, the osu-micro-benchmarks folder ends up in /usr/local/libexec (it did for me). This needs to be done on every node you'll be running benchmarks on; since I'm only doing two-node benchmarks, I only installed it on the headnode and one compute node. To keep slurm's power save from shutting down the compute node, I modified the SuspendTime variable in slurm.conf on the headnode and ran scontrol reconfig. I then turned on the QDR Infiniband network and made sure that the link was up.
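For reference, that tweak plus a link check looked something like this (a sketch: I'm assuming slurm.conf lives in /etc/slurm/, that SuspendTime=-1 disables suspension on your slurm version, and that ibstat from infiniband-diags is installed):
# In /etc/slurm/slurm.conf on the headnode, disable node suspension:
#   SuspendTime=-1
scontrol reconfig   # push the updated config to the running daemons
ibstat              # the HCA port should show State: Active, Physical state: LinkUp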
I navigated to the folder containing the scripts, in particular the pt2pt scripts. I used the following commands to run a bandwidth and latency test:
srun -n 2 --ntasks-per-node=1 ./osu_bw > ~/results_bw_QDR_single.txt
srun -n 2 --ntasks-per-node=1 ./osu_latency > ~/results_latency_QDR_single.txt
Those run a bandwidth and a latency test at many different message sizes between the headnode and one of the compute nodes, recording the results to text files in my home directory. You can use mpirun instead of srun, but then you have to specify a hostfile and make sure that the compute node's environment (PATH, LD_LIBRARY_PATH) includes mpirun. For the dual rail cases, you'll get an MPI warning about more than one active port having the same GID prefix. If the ports are connected to different physical IB networks, MPI will fail, because different subnets must have different subnet IDs. Typically, when more than one port on a host is used, the second port is a redundant backup on a separate switch (subnet) in case the first port goes down. However, I'm using both ports on the same subnet in order to increase the available bandwidth, so I can safely ignore this warning, which also explains how to suppress it.
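If you do go the mpirun route, the dual rail runs look roughly like this (a sketch: the hostfile contents and output file name just follow the pattern above, and, if I recall correctly, btl_openib_warn_default_gid_prefix is the MCA parameter the warning text suggests for suppressing it):
# hostfile contains one line per node, e.g.:
#   headnode slots=1
#   node002 slots=1
mpirun -np 2 --hostfile ~/hostfile \
    --mca btl_openib_warn_default_gid_prefix 0 \
    ./osu_bw > ~/results_bw_QDR_dual.txt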
I ran the above for single and dual rail QDR (CX-2 cards) first. Then I put the new FDR CX-3 cards in and ran them again with the old Sun QDR switch. For the Supermicro compute nodes, I had to enable 4G decoding in the BIOS to get the nodes to boot, though the headnode was fine without it. My guess is that the BAR space is larger for the firmware version on the CX-3 cards than on the CX-2 cards, which is something I've run into before. Then I pulled the old switch out, installed the new FDR switch (SX6005), installed opensm, and activated the opensm subnet manager (systemctl start opensm) because the SX6005 is unmanaged, then ran the single and dual rail FDR benchmarks. Finally, I replaced the FDR cables with the QDR cables, which caused the links to come up as FDR10 (after a few minutes of "training", whatever that is), and ran the benchmarks again. The end result was 16 text files of message size vs. bandwidth or latency. I wrote a little gnuplot script to make some plots of the results.
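The gnuplot script was nothing fancy; here's a minimal sketch of the bandwidth plot (the data file names follow the naming pattern above, and the OSU output is two columns, message size in bytes and bandwidth in MB/s, with '#' header lines that gnuplot skips automatically):
gnuplot <<'EOF'
set terminal pngcairo size 900,600   # or "png" if pngcairo isn't built in
set output "bandwidth.png"
set logscale x 2
set xlabel "Message size (bytes)"
set ylabel "Bandwidth (MB/s)"
set key top left
plot "results_bw_QDR_single.txt" using 1:2 with linespoints title "QDR single rail", \
     "results_bw_QDR_dual.txt"   using 1:2 with linespoints title "QDR dual rail", \
     "results_bw_FDR_single.txt" using 1:2 with linespoints title "FDR single rail", \
     "results_bw_FDR_dual.txt"   using 1:2 with linespoints title "FDR dual rail"
EOF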
Examining the plateau region, it's clear that dual rail QDR with the CX-2 HCAs did not help, as expected. The max single rail CX-2 QDR bandwidth was about 3.4 GB/s, about 15% lower than the shared max theoretical of the slot and QDR (4 GB/s); those are the extra inefficiencies I mentioned. Single rail CX-3 QDR bandwidth was around 3.9 GB/s, only about 2.5% below the max theoretical QDR bandwidth; the majority of this improvement is likely due to the PCI 3.0 interface's efficiency gains. The dual rail CX-3 QDR bandwidth matched the single rail up to about 8k message sizes, then jumped up to about 5.8 GB/s. Since 3.9*2 = 7.8, which is about the max theoretical bandwidth of a PCI 3.0 x8 slot, the PCI interface or code must have some inefficiencies (~22-26%) that limit performance to ~5.8-6.2 GB/s. In fact, the FDR10 and FDR dual rail runs had similar max measured bandwidths. The single rail FDR10 bandwidth was about 4.65 GB/s, about 7% less than max theoretical. The single rail FDR bandwidth was about 5.6 GB/s, about 18% less than max theoretical. Again, this is probably hitting the same PCI interface or code inefficiencies. Running echo connected > /sys/class/net/ib0/mode for ib0 and ib1 didn't seem to make a difference; that setting might only apply to IPoIB, though.
Latency shows negligible differences between single and dual rail for medium-to-large message sizes. I only lose about 3-4% of max bandwidth (~13% near 32k) with single rail FDR vs. the dual rail options. I don't currently own enough FDR cables to run dual rail FDR across the cluster, and since the performance improvement is so small, I don't plan on purchasing 5x more of them.
Since I'll be using the SX6005 switch from now on, I enabled opensm so it will start every time the headnode boots.
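With systemd, that's a one-liner (assuming the service is named opensm, as above):
systemctl enable opensm   # start the subnet manager automatically at boot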
This guy did something similar back in 2013. He got slightly higher bandwidths, and the FDR latency was higher than QDR for some reason. He does mention at the end that openmpi tended to have more variable results than mpich.
I decided to try to track down why I was seeing inefficiencies of ~22-26% in some cases. The first thing to check is process affinity. I discussed this some in a previous post, but basically the way processes are distributed to resources can be controlled. These tests only have two tasks, one running on each node, and with 10 cores per socket and 2 sockets per node, there are 20 cores that each task could be running on. Often a task gets bounced around between those cores, which is fine for a normal computer running many different tasks, but bad for a compute node running one main job, due to the overhead of moving the task around. Thus, it's better to bind the task to a core. Which core within a socket the task is bound to makes only a minor difference, but which socket that core sits in can make a major one. If the IB HCA is in a PCI slot connected to CPU2 (logical socket 1) but the task is assigned to a core in CPU1, then the task has to communicate through the QPI link between the CPUs, which hurts bandwidth and latency. For the E5-4627 v3, QPI consists of two 8 GT/s links; crossing it adds latency and overhead that could definitely be a bottleneck. I looked in my motherboard manuals for the PCI slot-to-CPU connections: the Supermicro compute nodes' only slot is connected to CPU1 (logical socket 0), and the ASUS headnode's HCA is in a slot connected to CPU2.
But how do I know if core binding is on, and if so, what are the bindings? It turns out that it's hard to know with srun...and there isn't as much control over bindings and mapping in slurm. mpirun can output core bindings using the "--report-bindings" flag, but as I said earlier, I can't directly run mpirun without messing with the .bashrc/environment on the compute node. So instead of using the srun commands above, I wrote an sbatch script that calls mpirun (a sketch of it follows below). First, the SBATCH parameters for the job name, output file, ntasks=2, ntasks-per-node=1, and nodelist=headnode,node002 are specified. These settings let slurm know that I'll need two nodes, with one of the two tasks on each node. Then the script runs "module load mpi/openmpi", which loads the MPI module. The mpirun command is then as follows:
mpirun -host headnode,node002 -np 2 -bind-to core -map-by ppr:1:node /path/to/osu_bw
It turns out that you don't need the -bind-to core or -map-by ppr:1:node flags; the results are the same without them. As long as task affinity is activated in slurm.conf, the default slurm behavior is to bind to core (and the ntasks-per-node sbatch parameter covers the map-by-node flag). Adding the --report-bindings flag revealed that mpirun placed a task on core 0 of socket 0 of the headnode and core 0 of socket 0 of the compute node. Interesting...perhaps some of the performance inefficiency is due to the fact that my headnode's HCA is in a CPU2 (socket 1) PCI slot.
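Putting that together, the sbatch script looked roughly like this (a sketch; the job name and output file name are illustrative, and the module line should match whatever your OpenMPI module is called):
#!/bin/bash
#SBATCH --job-name=osu_bw          # illustrative job name
#SBATCH --output=osu_bw_%j.out     # illustrative output file
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --nodelist=headnode,node002

module load mpi/openmpi            # the OpenMPI module set up previously

mpirun -host headnode,node002 -np 2 -bind-to core -map-by ppr:1:node \
    --report-bindings /path/to/osu_bw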
At this point, I had replicated the srun command with sbatch...so why did I bother? Enter rankfiles. An mpirun rankfile allows you to specify exact task mapping at the node, socket, and core levels, something you can't do in slurm. So I did that:
rank 0=headnode slot=1:0
rank 1=node002 slot=0:0
From the mpirun man page, "the process' 'rank' refers to its rank in MPI_COMM_WORLD", so rank 0 is the first task and rank 1 is the second task. The first line of the file says to assign the first task to the headnode on slot (socket) 1, core 0. The original slot 0 is fine for the compute node, since that is where its HCA's PCI lanes are connected. Hint: use cat /proc/cpuinfo to get the socket ("physical id") and core ("processor") numbers.
I ran this for the single rail FDR case, and it made a big difference: ~6.3 GB/s. The inefficiency went from ~18% to ~7%! For dual rail FDR, bandwidth was about 6.5 GB/s; since dual rail FDR should be able to saturate the PCI 3.0 x8 slot, the max theoretical is about 7.8 GB/s, making the inefficiency about 17%...much better than 22-26%. I'd expect dual rail QDR and FDR10 bandwidths to be similar to dual rail FDR. Latency improved too, but not as much. The dual rail bandwidth is still only about 3% more than the single rail, so this doesn't change my conclusion that I don't need dual rail. However, if you have a company with a big QDR Infiniband installation and lots of extra switch ports, it would be cheaper to replace only the CX-2 HCAs with dual port CX-3 ones (assuming your hardware has PCI 3.0 slots; QDR already maxes out 2.0 x8) and double the number of QDR cables than to replace all of the HCAs, cables, and switches with FDR-capable ones...AND you'd end up with slightly better performance. Another way to get around the PCI slot bandwidth limit, if you have enough slots, is to use multiple HCAs. For example, two single port FDR CX-3 HCAs, each in its own slot and providing one rail each, should be able to achieve ~6.3*2 = 12.6 GB/s, which is almost double what the single dual port FDR CX-3 HCA in dual rail mode could achieve. Cool stuff.
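The rankfile gets passed to mpirun with the -rf (--rankfile) option, so the command in the sbatch script becomes something like:
mpirun -host headnode,node002 -np 2 -rf ~/rankfile \
    --report-bindings /path/to/osu_bw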
There's another way to get efficient process bindings with mpirun. I was able to achieve the exact same core bindings and performance as with the above rankfile by using the following command:
mpirun -host headnode,node002 -np 2 -bind-to core -map-by dist:span --report-bindings --mca rmaps_dist_device mlx4_0 /path/to/osu_bw
These flags tell MPI to bind processes to cores and to map them by distance from the adapter mlx4_0, which is the name of the IB HCA (mine have the same name on all nodes). The nice thing here is that no rankfile is required.
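If you're not sure what the device is called on your system, the InfiniBand utilities will list it (ibv_devices comes from libibverbs-utils, ibstat from infiniband-diags):
ibv_devices   # lists verbs device names, e.g. mlx4_0
ibstat        # shows each HCA's ports, link state, and rate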
I should note, though, that the original benchmarks involving crossing the QPI are probably a more realistic representation of max bandwidth and latency since all of the cores on a node will be used for real simulations. That's why I didn't bother to re-run all of the different cases.
To do:
1. Remove old QDR hardware.
2. Enable 4G decoding on other compute nodes and install FDR HCAs.
3. Plug everything back in and tidy up the cable management.
4. Run OpenFOAM benchmark with FDR.
5. Install new processors and coolers in the headnode.
6. Run OpenFOAM benchmark on the headnode and cluster.