Rocket Science: Building a small HPC cluster, part 1

It's amazing how detrimental small apartment living has been to me as a maker. I can't emphasize enough how important the lab spaces and tools available to me at MIT were, all within walking distance. FIT has a nice shop and labs, but they close at 5 and I had to drive 20 minutes to get there. Living in England has been even harder in this respect. Everything is compact, and the small apartment we had in FL looks large compared to what we have now. I also don't have access to any machine shops. There are a few makerspaces in a city about 20 minutes away, but they don't have any serious machine tools.

I could whine some more, but I won't. I ended up adapting my interests.

I built a 15 machine ethernet windows cluster for running STAR-CCM+, a commercial CFD program, for the numerical part of my MS research. I borrowed all the computers in the lab for that, haha. Since my research is now primarily computational, I naturally began learning more and more about computers, server hardware/software, and networking. A few months of researching, and I decided I wanted to build a small cluster. I jumped and I had no idea how far down the hole went...

As part of the preliminary research into the world of server architecture, I asked myself, "Which processors have the best $/performance ratio for CFD?" Turns out this is an incredibly complicated question to answer. Up until this point, I didn't really have a choice which systems I ran on. MSFC used the Ames Pleiades supercomputer, which at the time had a mix of sandy bridge, ivy bridge, and haswell nodes (and the haswells were always taken). At KSC, I could only really use the much neglected "america" mini-cluster, relegated to serving interns and low priority projects...I think it might have had dual 8 core ivy bridge processors, though I'm not sure which ones. At FIT, I had the random mix of AMD Opteron, nehalem Xeon, and random i7's that had been in our lab for generations of grad students. This was the first time I had a choice, and I wanted to make the "right" one.

I knew I wanted real server hardware, meaning AMD server CPUs or Intel Xeons. Increased reliability and ECC are the two main reasons. However, I decided to keep Intel i5/i7/i9 processors in the running just for the sake of comparison.

The "$" and "performance" parts of the magic ratio can be looked at separately. Cost includes initial purchasing costs, and costs of ownership, which for me includes maintenance and home electricity costs, but for a company might include service contracts and rack space rental. I didn't want a system so old that the parts were likely to fail or have it be more of a space heater than a server. Overall though, the "$" is fairly straightforward to estimate.

"Performance", particularly for CFD, is much much harder to estimate, though it's fairly simple to define. A system which can complete more iterations in the same amount of time of the exact same case with the same program compared to another system has the superior performance. The relative performance between programs and solvers used may vary with server architecture, but most of the finite volume Navier-Stokes CFD programs will have the same trends, i.e. what will be faster for one program will be faster for another program. Processors determine the whole server system, since the systems are built around the processors. Thus, I focused on processor selection, in particular, two specific aspects: core*GHz and memory bandwidth.

Now, you're probably wondering why I didn't just look at one of the popular CPU benchmarks, e.g. Cinebench, and use those numbers for my selection process. It turns out these benchmarks are pretty much useless for CFD. The algorithms used in CFD have very different processor utilization than most programs. First, they will use 100% of a core 100% of the time. This makes hyperthreading at best useless, and at worst, detrimental to performance. Luckily, that can be turned off. Second, above some number of cores, they will use 100% of the memory bandwidth almost 100% of the time. In fact, CFD core scaling is usually memory bandwidth limited. In other words, if you add more cores (even within the same CPU) without increasing memory bandwidth, the performance plateaus and stops increasing. The core count at which this happens depends on many, many factors. Third, accurate CFD simulations require (almost exclusively) double precision floating point operations. To put it simply, these bog down cores. Modern CFD programs can take advantage of modern advanced instruction sets, like AVX and AVX2, but these don't exist in older CPU architectures.
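To make the plateau idea concrete, here is a toy sketch (not a real performance model). It assumes throughput is capped by whichever runs out first: compute (cores*GHz) or memory bandwidth. The function name and the `bw_per_core_ghz` demand figure are made up for illustration, and the 51.2 GB/s value assumes four channels of PC3-12800 per socket.

```python
def relative_throughput(cores, ghz, mem_bw_gbs, bw_per_core_ghz=2.0):
    """Toy model: throughput scales with cores*GHz until memory bandwidth saturates.

    bw_per_core_ghz is a hypothetical placeholder for how many GB/s one
    core-GHz of CFD work demands. It is NOT a measured value.
    """
    compute_bound = cores * ghz                      # what the cores could deliver
    bandwidth_bound = mem_bw_gbs / bw_per_core_ghz   # what the memory can feed
    return min(compute_bound, bandwidth_bound)


if __name__ == "__main__":
    # Assumed: 4 channels x 12.8 GB/s (PC3-12800) = 51.2 GB/s per socket.
    for cores in (2, 4, 8, 12, 16):
        print(cores, relative_throughput(cores, ghz=3.0, mem_bw_gbs=51.2))
```

With these placeholder numbers the curve goes flat somewhere between 8 and 12 cores. Where the knee really sits depends on the solver, the mesh, and the cache hierarchy, which is exactly why real benchmarks are needed.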

I needed to set a starting point for my search. I decided that anything older than sandy bridge (c. 2011) was too old to be worth it. This wasn't completely arbitrary, as AVX was first introduced with sandy bridge, and these processors see a significant inherent speed up over nehalem and older processors.

I also decided that I needed a model, one that I could input the specifications of a CPU, and the relative performance would come out. Unfortunately, as hinted at above, I can't simply go based on number of cores*GHz/core (or FLOPs). Newer generations of processors are inherently faster, and not just due to memory bandwidth. I needed to determine the relative influences of core*GHz, memory bandwidth, and processor generation on overall performance. And I needed real benchmarks to do this.

There are some CFD related benchmarks. There are a few random ones for random cases scattered throughout the google-verse (I'm pretty sure I found all of them up to a few months ago). These aren't particularly useful; to do a relevant comparison between processors requires a consistent benchmark. I stumbled upon the specfp benchmarks, and 3-4 of them are actually simple CFD codes. Compare those to the other specfp benchmarks and you'll see some of the trends I mentioned earlier...CFD is "different". I wrote a python script that scraped all of the specfp benchmarks and sorted/averaged the relevant ones based on many different parameters. Unfortunately, I couldn't answer the memory bandwidth question because most of the benchmarks were done with the fastest memory rated for those CPUs. If I grouped the memory bandwidth in with the inherent generation performance increase, I was able to determine some relative performance differences between processor generations. For example, given a haswell and a sandy bridge processor with the same core*GHz (or scaled to the same core*GHz), I was able to determine an average % performance increase to be expected of the haswell. The error on this number was very high, though. I believe part of the problem is that the specfp CFD benchmarks were created many many years ago for much slower processors, and thus they don't scale well, particularly for the very high core count systems from the past few years.
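The generation comparison described above boils down to scaling each score to the same core*GHz and then averaging per generation. A minimal sketch of that idea follows. It is not the original script: the function names are hypothetical and the records are placeholder numbers, not real specfp results.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records: (generation, cores, GHz, benchmark score).
# These values are invented for illustration only.
EXAMPLE_RECORDS = [
    ("sandy bridge", 8, 2.0, 160.0),
    ("sandy bridge", 4, 2.5, 100.0),
    ("haswell", 8, 2.0, 184.0),
]


def score_per_core_ghz(records):
    """Average score per unit of core*GHz for each processor generation."""
    grouped = defaultdict(list)
    for generation, cores, ghz, score in records:
        grouped[generation].append(score / (cores * ghz))
    return {generation: mean(values) for generation, values in grouped.items()}


def percent_gain(per_generation, baseline, other):
    """Percent improvement of `other` over `baseline` at equal core*GHz."""
    return 100.0 * (per_generation[other] / per_generation[baseline] - 1.0)


if __name__ == "__main__":
    per_gen = score_per_core_ghz(EXAMPLE_RECORDS)
    print(percent_gain(per_gen, "sandy bridge", "haswell"))
```

Because memory speed can't be separated out in this kind of data, the resulting percentage lumps memory bandwidth in with the generation's inherent improvement, as noted above.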

It was at this point that this research endeavor kind of fell apart. There just isn't enough publicly available CFD benchmark data. I was, however, able to determine some basic trends (for CFD): 1. Each generation is inherently a little faster than the previous...this can be as low as ~3%, but as high as ~15-30%. 2. Memory bandwidth is almost as important as core*GHz. 3. Higher (10+) core count CPUs aren't very useful because the memory bottleneck is reached before all of the cores can be utilized...better off with higher GHz, lower core count processors. I decided that was enough information to get started. I don't think I've exhausted all of my options for this project yet, so I'll probably come back to it in the future.

If you're a computer/server savvy person, you may have already guessed which generation the magic ratio would favor. Sandy bridge processors won hands down. Comparable ivy bridge processors are about 2x the price new, for maybe 5-20% performance increase. Haswell and Broadwell systems were significantly faster, maybe 30-60%, but their cost is 4-20 times more, not accounting for the absurd costs of DDR4 ram at this time. I didn't bother considering the newer skylakes. I haven't examined the i5/i7/i9's, or AMD's architectures in any meaningful way yet, but I probably will one day. AMD EPYC seems very promising...2-3x the memory bandwidth of Broadwell at way way lower cost.
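As a quick sanity check on those ranges, here is the performance-per-dollar arithmetic relative to a comparable sandy bridge part (defined as 1.0). The speedup and price ranges are the rough ones quoted above; the dictionary and function names are just for illustration.

```python
# (speedup_low, speedup_high), (price_ratio_low, price_ratio_high),
# all relative to a comparable sandy bridge processor = 1.0.
CANDIDATES = {
    "ivy bridge": ((1.05, 1.20), (2.0, 2.0)),
    "haswell/broadwell": ((1.30, 1.60), (4.0, 20.0)),
}


def perf_per_dollar_range(speedup_range, price_range):
    """Return (worst, best) performance per dollar relative to the baseline."""
    worst = speedup_range[0] / price_range[1]  # smallest speedup, highest price
    best = speedup_range[1] / price_range[0]   # largest speedup, lowest price
    return worst, best


if __name__ == "__main__":
    for name, (speedup, price) in CANDIDATES.items():
        worst, best = perf_per_dollar_range(speedup, price)
        print(f"{name}: {worst:.3f} to {best:.3f} (sandy bridge = 1.000)")
```

Even in the best case, every newer option lands well below 1.0, so none of them beats sandy bridge on this ratio at the prices I was seeing.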

I did not consider electricity usage, but I probably should have. Another way of looking at the % faster numbers is % less electricity of the CPU for the same computation. The power consumption of the rest of the server is probably about the same, but since the CPU takes up a large portion of that, it's probably important.
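A small caveat on that conversion: if the CPU draws the same power and simply finishes sooner (an assumption, since newer chips often have different power draw), the energy saved is a bit less than the "% faster" figure. A CPU that is 30% faster finishes in 1/1.3 of the time, which is about 23% less energy, not 30%. A sketch, with a hypothetical function name:

```python
def energy_saving_fraction(speedup_fraction):
    """Fraction of CPU energy saved for the same computation.

    Assumes equal power draw, so energy is proportional to run time:
    saving = 1 - 1 / (1 + speedup) = speedup / (1 + speedup).
    """
    return speedup_fraction / (1.0 + speedup_fraction)


if __name__ == "__main__":
    for pct_faster in (5, 20, 30, 60):
        saved = 100.0 * energy_saving_fraction(pct_faster / 100.0)
        print(f"{pct_faster}% faster -> about {saved:.1f}% less CPU energy")
```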

Anyways, since initial costs seemed more important than eventual electricity costs, I had my generation chosen. The E5-2690 is the fastest sandy bridge (8 cores at ~3.2GHz all core turbo), so I figured I'd aim for that. Most sandy bridge systems are upgradeable to ivy bridge, so I figured I could do that in the future when the prices come down.

So the first computer I bought was based on an i7-5960X. Haha, oops. I found an incredibly good deal on it, so I decided I could use it as my head node and graphics processing node. i7-5960X with liquid cooler, 32GB 2400MHz RAM, 256GB M.2 SSD, 3TB RAID 1 for storage, 1000W Platinum Super Flower PSU, dual GTX Titan GPUs. The GTX Titan, Titan Black, and Titan Z have the same chip as the Tesla K20/K40/K80. Incredible double precision floating point performance. The only downside is they don't have ECC RAM, but they were significantly cheaper than the Teslas when I bought them. The prices have shot up with the recent crypto mining craziness.

Desktop. I needed 12V from a molex connector for another project

The following is a summary of my first few months with server hardware.

Servers are loud. The noise doesn't matter in a data center, but it really matters in a homelab. I managed to score a ~$2000 APC Netshelter CX 12U soundproof server cabinet for ~$130. The thing is stupid heavy, and I had to take the castor wheels off to get it into my office. Unfortunately, my office has carpet, and this ~200kg behemoth doesn't roll on the carpet. I tried furniture sliders, but it's too heavy for them, so they don't slide. I ended up buying some plywood cut offs from eBay a few months later to lay down. Here's how it looks now:

Now it rolls

It killed these furniture sliders

Now it was time to fill it up. Well, sort of...my time line is all sorts of messed up. Anyways...

I've spent the last few months buying and selling various servers and server hardware. I made a rule not to buy unless it was an absurdly good deal. I won't go through all of the details....rather dull anyways. I started with an HP DL380p G8 and DL360p G8, both with 2x E5-2690s and 64GB of DDR3 1600MHz (PC3-12800R) RAM. Unfortunately, something was messed up with the DL360p's onboard memory, so I had some serious problems updating firmware. Eventually figured it out. I also got a Sun QDR Infiniband switch, a Sun HCA for the desktop, and two HP Infiniband cards for the servers. I got lucky...it only took about a week to get QDR speeds working. I ran CentOS 7 on all three. Right about the time I got that system fully working, I bought a Supermicro 6027TR-HTRF for less than the price I paid for one of the HP servers. The 6027TR-HTRF contains 4x dual CPU sandy/ivy nodes, all in a 2U chassis. This particular one came with 8x E5-2650's, 256GB of PC3-12800R RAM, and 4x QLE7340 QDR Infiniband cards. I haven't seen a deal that good since. The seller had ~5 of them originally and they all sold in a few days. I would have bought more, but my cabinet's thermal rating was only 800W, and this new server would be 1600W by itself. I had to build a more powerful fan system for the cabinet because I had doubled the thermal load. See this post.

In the middle of all of this, I won an auction for 12x Xeon Phi untested Co-Processor pcie cards. More on that here.

So I sold all of the HP servers after switching the E5-2690's for E5-2650's (made a little money if accounting for processors). I also switched out the RAM since I had previously managed to get 16x 16GB PC3L-12800R sticks for way less than it was worth. I got 2x more E5-2690's by smart bidding on servers/cpus on eBay. So now I had a kick ass 4 node server.

I then ran into problems with the Infiniband system. Basically, while Infiniband is supposed to be a standard similar to USB, it turns out that it's really not followed closely enough by all of the manufacturers. Intel/QLogic does not usually play nice with Mellanox and its rebrands. This is partially due to differences in architecture and the way MPI is handled (Intel uses PSM, Mellanox the traditional offload processing and Verbs). See this post for more details on my quest for Infiniband.

And now we're up to the present! Currently, I have the i7 desktop, the Supermicro 4 node server, the Sun Infiniband switch, an unmanaged 1GbE switch, and a big Riello UPS (which I got new for ~$100). Near future changes include the 4x HCAs for the Supermicro nodes, which should get the Infiniband system back up and running. I can't expand anymore because I'm at the power limit of the circuit in my office, so this should be pretty close to the final configuration. My plan is to run some CFD benchmarks on these once I get everything set up.

Front view of inside of cabinet

3D printer

Stuff cleaned up

I also just bought another 140 of the Xeon Phi coprocessors, but that's a story for another time, haha. 3D printer in above photos is related.

If you'd like to start your own homelab, the /r/homelab and /r/homelabsales subreddits are great places to start.

Wednesday, March 21, 2018
