
Sunday, October 21, 2018

More OpenFOAM benchmarks

Now that I have the FDR Infiniband system installed, it's time to run some more benchmarks. The first test I did was 40 cores, headnode + node002. This completed in ~47 s, which is about 1-2 s slower than with QDR. Not sure why it would be slower, but the QDR number I recorded might just have been on the fast side of the spread from repeating tests. I then ran 100 cores (all nodes) and got ~14.2 s, which is a speed-up of ~25% compared to QDR. What's interesting is that this represents a scaling efficiency > 1 (~1.3)... in other words, the 20-core iter/s of the headnode plus 4x the 20-core iter/s of a compute node is less than the 100-core iter/s of all 5 nodes with FDR. I have no idea how that could be possible. With QDR, I got perfect scaling (100-core iter/s = headnode + 4x compute node iter/s), which is what made me think upgrading to FDR wouldn't actually do much, but it really did. Perhaps summing iter/s isn't the best way to calculate scaling efficiency? I'll need to look into this more. Anyways, I'm really happy with the performance boost from FDR. I tried switching the HCA to a CPU 1 slot to see if it would make a difference, but it didn't, so I moved it back to a CPU 2 slot (bottom, out of the way).
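To spell out what I mean by scaling efficiency, here's the back-of-the-envelope check I was doing (the numbers below are just placeholders to show the arithmetic, not my measured iter/s):

# placeholder numbers, just to show the math
head_ips=1.0       # 20-core iterations/s, headnode by itself
node_ips=1.0       # 20-core iterations/s, one compute node by itself
cluster_ips=5.5    # 100-core iterations/s, all 5 nodes over Infiniband
echo "scale=2; $cluster_ips / ($head_ips + 4*$node_ips)" | bc    # efficiency; anything over 1.0 shouldn't be possible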

In a post a few weeks ago, I mentioned I would be upgrading the headnode to 2x Intel Xeon E5-2690 v4 ES QHV5 (confirmed ES, not QS). Specs: 14c/28t, 2.4 GHz base frequency, 3.0 GHz all-core turbo, 3.2 GHz single-core turbo. They're in perfect condition, which is rare for ES processors. The 3.0 GHz all-core turbo is the same as the 10c/10t E5-4627 v3 QS chips I currently have. I replaced the E5-4627 v3s and the Supermicro coolers with the new processors and Cooler Master Hyper 212 EVOs.


I oriented the coolers so the airflow would be up. I'm planning on getting two more 140mm x 25mm fans for top heat extraction. I think I can wiggle the RAM out from under the coolers if I need to, which is convenient. This motherboard has the same issue the SM X10DAi (which I finally sold, thank goodness) had: the cooler mounting holes are not drilled all the way through the motherboard, so you can't install the backplate. Instead, you have to screw the shorter standoffs into the threaded holes, then the CPU cooler bracket into those. Make sure not to over-tighten the CPU bracket screws, because they pull up on the CPU plate, which is only attached to the surface of the motherboard PCB. If you tighten them too much, it could flex the plate enough to break it off the PCB.

Unfortunately, I completely forgot about clearing the CMOS, so I spent about an hour scratching my head about why the computer was acting funny and turbo boost wasn't working. Once I pulled the CMOS battery (behind the GPU, ugh) and cleared the CMOS, everything worked normally. Lesson learned: if turbo boost isn't working with an Intel CPU, try clearing the CMOS. After that, I went into the BIOS and fixed all the settings: network adapter PXE boot disabled (faster boot), all performance settings enabled, RAID reconfigured, boot options, etc.

After confirming turbo was working as it should, the next step was to run the OpenFOAM benchmark on 1, 2, 4, 8, 12, 16, 20, 24, and 28 cores. On 20 cores, the new CPUs are ~10% faster than the old ones. Since the core-GHz is the same, that means the improvement is mostly due to memory speed/bandwidth: the RAM can now operate at 2400 MHz instead of 2133 MHz, which is about 12.5% faster. On 28 cores, the new CPUs were ~15% faster than the old ones, but only ~6% (4.5 s) faster than their own 20-core run despite the additional 40% core-GHz. This is due to the memory bottleneck I mentioned in a previous post, and was expected. CPU1 showed about 5-6 °C higher temps than CPU2 under full load despite similar power draws... I'll try tightening CPU1's heatsink screws a tad.
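As an aside, tabulating all these runs is easy to script: grab the final ExecutionTime from each solver log and convert it to iterations/s. This assumes the standard 100-iteration benchmark case and a log.simpleFoam.<n> naming scheme, so adjust to taste:

for f in log.simpleFoam.*; do
    t=$(grep ExecutionTime $f | tail -1 | awk '{print $3}')    # last "ExecutionTime = X s" line in the log
    echo "$f: ${t} s, $(echo "scale=2; 100/$t" | bc) iter/s"
done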

Finally, a full 108 core run: 12.7 s, or about 7.8 iterations/s. That's about a 34% improvement over the 100-core QDR cluster. Wow!

UPDATE (3 days later): I decided to do a 20-, 40-, 60-, and 80-core benchmark on just the compute nodes to try to track down this greater-than-perfect scaling (cluster iter/s divided by the sum of the individual nodes' iter/s) thing. The 20-core run took about 98 s, which is the same as before. The 40-core run took about 47 s, which is also the same as before, and about half the time with double the cores, which makes sense. But the 60- and 80-core runs took about 10 s, which didn't make any sense. The cases were completing, too, no segfaults or anything like that, which I have seen in the past cause early termination and unrealistically low runtimes. I then compared how I was running each of them, along with the log.simpleFoam output files, and figured out the problem. For 40 cores or fewer, I used the standard blockMesh, decomposePar, snappyHexMesh, simpleFoam run process. For more than 40 cores, I tried something a little more advanced. snappyHexMesh does not scale as well as the actual solver algorithms, so for large numbers of cores, it can be less efficient to run the mesher on the same number of cores as you plan on running the case on. So I meshed the case on the headnode, then ran reconstructParMesh, then decomposePar again with the number of cores I wanted to run, then ran the case (roughly the sequence sketched below). What I didn't notice in the latter (n > 40) cases were a few warnings near the top about polyMesh missing some patches and force coefficients not being calculated (or something like that), and a bunch of 0's for the coefficients in the solution time steps. The solver was solving fine, but it wasn't doing a big chunk of the work, so for the 100 and 108 core cases the scaling appeared to be greater than 1. I fixed this and got 18.89 s for n=108, which corresponds to a scaling efficiency of 0.98. Not as incredible as what I was seeing earlier, but still very, very good.
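For reference, the n > 40 workflow looked roughly like this. This is a sketch from memory rather than the full benchmark script; the mesher core count, log names, and the "machines" hostfile are just illustrative, and it assumes a scotch decomposition so only numberOfSubdomains needs editing:

sed -i "s/^numberOfSubdomains.*/numberOfSubdomains 20;/" system/decomposeParDict
blockMesh > log.blockMesh 2>&1
decomposePar > log.decomposePar.mesh 2>&1
mpirun -np 20 snappyHexMesh -overwrite -parallel > log.snappyHexMesh 2>&1
reconstructParMesh -constant > log.reconstructParMesh 2>&1
rm -rf processor*                         # throw away the meshing decomposition
sed -i "s/^numberOfSubdomains.*/numberOfSubdomains 108;/" system/decomposeParDict
decomposePar > log.decomposePar.run 2>&1
mpirun -np 108 --hostfile machines simpleFoam -parallel > log.simpleFoam 2>&1
grep -i warn log.simpleFoam               # the check I skipped: missing-patch warnings mean the force coefficients are garbage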

Updated comparison
Reading the benchmark thread again, I found a few tips for getting the last little bit of performance out of the system. Flotus1 suggests doing these two things:
sync; echo 3 > /proc/sys/vm/drop_caches
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
The first clears the page cache, and the second sets the OS's CPU frequency scaling governor to performance (it defaults to powersave). I didn't notice any improvement from clearing the cache, but the performance governor did shave about 1 s (~1%) off the headnode's benchmark. To make that permanent, I created a systemd service unit called cpupower.service:
[Unit] 
Description=sets CPUfreq to performance  
[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set --governor performance 
[Install]
WantedBy=multi-user.target
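
Then enable it (as root); the last line is just a quick spot check on cpu0 to confirm the governor actually changed:

systemctl daemon-reload
systemctl enable cpupower.service
systemctl start cpupower.service          # or just reboot; the tee command above already set it for this session
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # should print "performance"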
That enables the service at boot, so the cpufreq scaling governor gets set to performance every time the machine comes up. Flotus1 also suggested turning off memory interleaving across sockets, but I don't think my motherboard does that, because there were only options for channel and rank interleaving in the BIOS.

In other news:
I really need to learn my lesson about buying used PSUs... the fan in the Rosewill Gold 1000W PSU I bought is rattling. Sounds like a dying phase. Ugh. This was the replacement for the used Superflower Platinum 1000W PSU that died on me a couple months ago. I'm going to replace the fan and hope nothing else breaks in it. Note to self: buy new PSUs with warranties.

To do:
1. Replace ntp with chrony on all nodes (ntp works between nodes, but headnode won't sync)
2. Install 2x new 140mm fans, replace 140mm fan in PSU.
3. Install new thermal management fans
4. Tighten CPU 1's heatsink screws
5. Move the GPU down a slot so CPU fans have more airflow room.

