
Thursday, September 13, 2018

Infiniband Upgrade: FDR

In a previous post, I showed that I had perfect performance scaling with QDR Infiniband. That means the interconnect is no longer the performance bottleneck, so I didn't need anything faster. Thus, I upgraded to a faster FDR Infiniband system. ...shhh...

I purchased 5x Sun 7046442 rev. A3 HCAs. These are re-branded Mellanox CX354A-Q HCAs: dual-port CX-3 cards, which are PCIe 3.0 (CX-2 was PCIe 2.0). They're pretty cheap now; I got these for an average of about $28 each. You can reflash these with Mellanox stock firmware of the -F variety, which is the FDR speed version (see one of my previous infiniband posts on how to burn new firmware to these), so that's what I was planning to do. I also picked up 5 FDR-rated cables for $18 each and an EMC SX6005 "FDR 56Gb" switch (these are going for <$100 now, with the managed versions going for just over that).

The first thing I tested was all of the Sun HCAs' ports and the cables. To my surprise, "ibstat" showed full FDR 56 Gbit/s link up. I guess the Sun firmware (2.11.1280) supports FDR. Lucky! Now I don't need to reflash their firmware. All of the cards and cables just worked. 

Bench testing HCAs and cables
I didn't have such luck with the switch. Both PSUs arrived half dead: the switch would pulse on and off when plugged in, so I had to send it back, and they sent a replacement. The replacement worked, but the links would not negotiate to anything faster than FDR10. ibportstate (lid) (port) is a good tool for checking what speeds should be available for your HCAs and switches (ibswitches gives the LID of the switch and ibstat gives the LIDs of the HCAs). I tried forcing the port speed using ibportstate (lid) (port) espeed 31 and other things (see the opensm.conf file for details), but nothing worked. I then did some research. This is an interesting thread for the managed EMC switches... turns out you can burn mlnx-os to them, overwriting the crappy EMC OS. That doesn't really apply to me though, since the SX6005 is unmanaged, so I'm running OpenSM.
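Checking the negotiated rate on every port by eye gets tedious with five dual-port HCAs, so here is a minimal Python sketch that pulls the "Rate:" value for each port out of ibstat-style text and flags anything below 56. The sample text is hand-written to resemble ibstat output for a single HCA and was not captured from my machines; the function names are made up for illustration.

```python
import re

# Hand-written sample shaped like `ibstat` output for one HCA (an assumption,
# not real captured output). Port 2 stands in for a link stuck at 40 Gbit/s.
SAMPLE_IBSTAT = """\
CA 'mlx4_0'
    Number of ports: 2
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 40
"""


def port_rates(ibstat_text):
    """Return {port_number: rate_in_gbit} for each 'Port N:' block with a 'Rate:' line."""
    rates = {}
    current_port = None
    for raw_line in ibstat_text.splitlines():
        line = raw_line.strip()
        port_match = re.fullmatch(r"Port (\d+):", line)
        if port_match:
            current_port = int(port_match.group(1))
            continue
        rate_match = re.match(r"Rate:\s*(\d+)", line)
        if rate_match and current_port is not None:
            rates[current_port] = int(rate_match.group(1))
    return rates


def ports_below(ibstat_text, expected_gbit=56):
    """Return the sorted port numbers whose rate is below the expected value."""
    return sorted(p for p, r in port_rates(ibstat_text).items() if r < expected_gbit)


if __name__ == "__main__":
    print(port_rates(SAMPLE_IBSTAT))
    print(ports_below(SAMPLE_IBSTAT))
```

This only handles one HCA per text blob, since port numbers from a second HCA would overwrite the first.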

I installed MFT and read the MFT manual and the SX6005 manual. I found the LID of the switch using ibswitches, then queried the switch using flint: flint -d lid-X query full. This showed a slightly outdated firmware, as well as the PSID: EMC1260110026. Cross-reference that with the PSID on the firmware download for the Mellanox SX6005T (the FDR10 version), MT_1260110026, and you can clearly see that it's the FDR10 version. THAT's why the switch was auto-negotiating to FDR10 and not FDR. Turns out that you can update the firmware "inband", i.e. across an active infiniband connection. What's cooler: it's the exact same process as for the HCAs! HA! I'm in business. I downloaded the MSX6005F (not MSX6005T) firmware, PSID MT_1260110021, and followed my previous instructions with a slight modification to the burn step:
flint -d lid-X -i fw.bin -allow_psid_change burn
Here, X is the LID of the switch. I rebooted the switch (pulled the plugs, waited a minute, plugged it back in, waited a few minutes), then queried the switch again, and it showed the new firmware and new PSID. I then checked ibstat, and BAM: 56 Gbit/s, full FDR. I posted this solution to the "beware of EMC switches" thread I linked earlier.
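The PSID cross-referencing above boils down to stripping the vendor prefix and comparing the numeric tail. Here is a minimal Python sketch of that comparison, using only the PSIDs quoted in this post. That the numeric tail identifies the hardware variant is my inference from these examples, not a documented Mellanox rule, and the function names are made up for illustration.

```python
import re

# Numeric PSID tails taken from this post, mapped to the firmware variant
# they were listed under. Anything else is reported as unknown.
KNOWN_VARIANTS = {
    "1260110026": "SX6005T (FDR10)",
    "1260110021": "MSX6005F (FDR)",
}


def split_psid(psid):
    """Split a PSID into (vendor_prefix, numeric_tail).

    Assumes the PSID is letters/underscores followed by digits, which holds
    for the examples in this post (e.g. 'MT_1260110026', 'EMC1260110026').
    """
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)", psid)
    if match is None:
        raise ValueError(f"unexpected PSID format: {psid!r}")
    return match.group(1), match.group(2)


def identify_variant(psid):
    """Look up the variant by numeric tail, ignoring the vendor prefix."""
    _, tail = split_psid(psid)
    return KNOWN_VARIANTS.get(tail, "unknown")


if __name__ == "__main__":
    for psid in ("EMC1260110026", "MT_1260110026", "MT_1260110021"):
        print(psid, "->", identify_variant(psid))
```

Run against the PSID the switch reported before the burn, this lands on the FDR10 entry, which matches what the links were doing.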

Another advantage of this switch over my current QDR switch is that this one only has 12 ports and is much smaller. It's also quieter, though that's like comparing a large jet engine to a small jet engine.

Now I just have to integrate all the new hardware into the cluster. 

Before I sell the QDR cables, I'm going to try running a dual rail setup (2 infiniband cables from each HCA) just to see what happens. Supposedly OpenMPI will automatically use both, which would be awesome because that'd max out the PCIe 3.0 x8 slot bandwidth, which is roughly 63 Gbit/s usable. We'll see...
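To see why two rails should be enough to saturate the slot, here is a rough Python sketch of the arithmetic. It uses the standard per-lane signaling rates and line encodings for 4x InfiniBand links and for PCIe 3.0, and ignores protocol overhead beyond the line encoding, so real throughput will be somewhat lower. The helper name is made up for illustration.

```python
def usable_gbit(rate_per_lane, lanes, payload_bits, encoded_bits):
    """Usable bandwidth in Gbit/s after line encoding (e.g. 8b/10b, 64b/66b)."""
    return rate_per_lane * lanes * payload_bits / encoded_bits


# 4x InfiniBand links: per-lane signaling rate in Gbit/s and line encoding.
QDR = usable_gbit(10.0, 4, 8, 10)          # 8b/10b encoding
FDR10 = usable_gbit(10.3125, 4, 64, 66)    # 64b/66b encoding
FDR = usable_gbit(14.0625, 4, 64, 66)      # 64b/66b encoding

# PCIe 3.0 x8: 8 GT/s per lane, 128b/130b encoding.
PCIE3_X8 = usable_gbit(8.0, 8, 128, 130)

if __name__ == "__main__":
    print(f"QDR 4x:        {QDR:.1f} Gbit/s")
    print(f"FDR10 4x:      {FDR10:.1f} Gbit/s")
    print(f"FDR 4x:        {FDR:.1f} Gbit/s")
    print(f"PCIe 3.0 x8:   {PCIE3_X8:.1f} Gbit/s")
    print(f"Dual-rail FDR: {2 * FDR:.1f} Gbit/s")
    print("Dual rail exceeds slot:", 2 * FDR > PCIE3_X8)
```

A single FDR link fits inside the slot with a bit of room to spare, while two of them together ask for more than the slot can carry, so the slot becomes the limit.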
