
Wednesday, March 21, 2018

Xeon Phi Co-processor Testing, part 1

I won an auction for 12 Xeon Phi 71S1p coprocessor cards.


These are awesome: each one is basically a 61-core, ~1.1 GHz Linux server with on-card, ultra-high-bandwidth memory in a double-wide PCIe card package. They're very picky about motherboards and thermals, though. The motherboard must support "above 4G decoding" or "large memory BAR support" (the name varies by vendor). Most Supermicro boards do, and all of the ASUS WS boards seem to. Even then, it's not guaranteed that every Phi will work. The "p" Phis are passively cooled, which means they're really meant for servers with forced airflow. For anything else, you can build cooling fan ducts for them, or buy or 3D print them.

I tested them in another DL380p G8 I had purchased (with 2x E5-2690s that I later harvested). I bought the complete GPU package (see this post) for it, but the DL380p G8 only supports the 5110p Phi, which has a lower wattage rating than the 71S1p. To power the cards, I hacked an extra ATX PSU (shorted the sense pins) and ran its cables out through the other PCI riser's slots so I could close the lid.

The testing and firmware update procedure was fairly straightforward once I figured it out. It can probably be adapted for your own system. Most of this follows the readme text file and user guide that come with the MPSS software.
  1. Update DL380p's firmware
  2. Install CentOS 7
  3. Install a Phi
  4. Change BIOS settings to enable large BAR support (in the advanced menu) and set the fans to max.
  5. Disable SELinux (re-enable it after you're done testing the Phis).
  6. Log in as root and create an RSA key with ssh-keygen so you can use SSH later. Do this before configuring MPSS for the first time; otherwise you have to load the key manually (see the readme text file).
  7. Download the MPSS software, readme, and user guide. If your firmware is older than that in the readme, try starting with an older MPSS. If you're using a kernel that isn't listed, then you can recompile the rpms using the instructions in the readme.
  8. Install MPSS (see the readme and user guide). I suggest rebooting.
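
Steps 5 and 6 can be scripted. The sketch below is one way to do it, not something taken from the MPSS readme. The function names are hypothetical, `setenforce 0` only switches SELinux to permissive mode until the next reboot (an assumption about what "disable" needs to mean here), and the key is created without a passphrase at root's usual `~/.ssh/id_rsa` unless you pass another path.

```shell
#!/bin/sh
# Hypothetical helpers for steps 5 and 6; run them as root on the host.

# Switch SELinux to permissive mode until the next reboot.
# Assumption: permissive is enough for testing; the post only says "disable".
selinux_permissive() {
    if command -v setenforce >/dev/null 2>&1; then
        setenforce 0
        getenforce
    else
        echo "setenforce not found; SELinux tools are not installed" >&2
    fi
}

# Create an RSA key pair only if one does not already exist, so an
# existing key is never overwritten.
# Assumption: an empty passphrase (-N "") is acceptable for this test box.
ensure_rsa_key() {
    key="${1:-$HOME/.ssh/id_rsa}"
    if [ -f "$key" ]; then
        echo "key already exists: $key"
        return 0
    fi
    mkdir -p "$(dirname "$key")"
    ssh-keygen -t rsa -N "" -f "$key" -q && echo "created key: $key"
}

# Usage (commented out so that sourcing this file changes nothing):
# selinux_permissive
# ensure_rsa_key
```

Checking for an existing key first means the helper is safe to rerun after reinstalling MPSS.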
On the host, from a terminal run: 
lspci | grep -i Co-processor
That will tell you which PCI bus address the card is at. Mine was 24:00.0, so I did:
lspci -s 24:00.0 -vv
If lspci doesn't list the card, then there's a problem with your card (assuming your motherboard is compatible). A likely culprit is thermal overload, especially if you're trying to use a passive "P" card without a cooling system. I actually went back into the BIOS and enabled maximum cooling to help with this. If you have a desktop, you'll need to construct a custom cooling system (see above). Another possibility is that the card isn't seated well, so try reseating it. When none of that worked, I gave up on the card. I'm sure there is more advanced troubleshooting you could do, but I don't know how to do it. Intel tech support seems to be pretty good, so it might be worth asking them.
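
The two lspci lookups can be folded into one helper, which is handy when swapping several cards through the same slot. This is a minimal sketch with hypothetical function names. It assumes the card's lspci line contains "Co-processor", as in the grep above.

```shell
#!/bin/sh
# Print the bus address (first field) of every lspci-style line on stdin
# that mentions a co-processor. Reading stdin lets you try it on saved output.
phi_slots() {
    grep -i 'Co-processor' | awk '{print $1}'
}

# Show the verbose lspci details for each detected card.
# Prints nothing if no card is recognized.
phi_details() {
    lspci | phi_slots | while read -r slot; do
        lspci -s "$slot" -vv
    done
}

# Usage on the host (commented out so that sourcing this file runs nothing):
# lspci | phi_slots      # e.g. prints 24:00.0 on my machine
# phi_details
```

An empty result from `phi_slots` is the "lspci doesn't recognize it" case described above.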

Next, type:
modprobe mic
This loads the mic kernel module. If you have just installed or reinstalled MPSS, then you also need to run:
micctrl --initdefaults
Then:
micflash -getversion 
This must be 375 for the latest MPSS release. Mine were 390. Then:
micctrl -s
This should return "ready". I'm not sure what to do if it does not.
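
If you are testing several cards at once, a small check can flag any that are not ready. This is a sketch that assumes `micctrl -s` prints one line per card in the form `mic0: ready`, so check that against your own output. `all_ready` is a hypothetical name.

```shell
#!/bin/sh
# Read status lines on stdin and print every card whose state is not "ready".
# Exit status is 0 only if at least one card was seen and all were ready.
# Assumption: each line looks like "mic0: ready" (name, colon, state).
all_ready() {
    awk '
        NF == 0 { next }                              # skip blank lines
        { seen++ }                                    # count every card line
        $2 != "ready" { bad++; print "not ready: " $0 }
        END { if (seen == 0 || bad > 0) exit 1 }
    '
}

# Usage on the host (commented out so that sourcing this file runs nothing):
# micctrl -s | all_ready && echo "all cards ready"
```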

Run:
micinfo -group Board
This should return a bunch of information about your Phi, though not all of it will be available because MPSS isn't running yet. Next:
micflash -update -device all -smcbootloader
Then restart the host, and:
modprobe mic 
micflash -getversion 
This should show the new firmware version. Next, start MPSS:
systemctl start mpss
Now you should be able to ssh into the Phi's filesystem:
ssh mic0
If that doesn't work, see the readme section about ssh keys and how to load them.

Now, from the host, run:
miccheck
This should show all passes. Then run:
micinfo
This will show a lot of information about your Phi. You can launch a monitoring GUI with:
micsmc
That's it. If your Phi passed all of that, you should be able to install software on it. I haven't done this yet...that will be the topic of another post.

You should also go to /etc/sysconfig/network-scripts/micX and change "onboot" to "no". I can't remember the exact reason for this, but it's in my notes.
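
That edit can be scripted too. This is a minimal sketch that assumes the file uses the standard upper-case `ONBOOT=...` key of ifcfg-style files (the "onboot" above). `set_onboot_no` is a hypothetical helper, and the path is left as an argument because the exact file name depends on your card.

```shell
#!/bin/sh
# Set ONBOOT=no in the given network config file, keeping a .bak copy.
# Assumption: the key is spelled ONBOOT and starts its own line.
set_onboot_no() {
    cfg="$1"
    if [ ! -f "$cfg" ]; then
        echo "no such file: $cfg" >&2
        return 1
    fi
    cp "$cfg" "$cfg.bak"
    # Rewrite from the backup so the edit does not rely on sed -i.
    sed 's/^ONBOOT=.*/ONBOOT=no/' "$cfg.bak" > "$cfg"
}

# Usage, as root (substitute your card's file for micX):
# set_onboot_no /etc/sysconfig/network-scripts/micX
```

The backup copy makes it easy to restore the original setting once testing is finished.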



For my lot, after all was said and done, 8 of the 12 were recognized by lspci and tested to be working. The lspci recognition was spotty, though...probably because these weren't 5110p's. I managed to sell all of them for a profit, apart from the one I kept: the 7110p that was in the lot. While probably not useful for conventional CFD, it might be useful for something like OpenLB or anything super vectorizable that needs more oomph per core than a GPU can provide. They're also supposedly really good for mining some cryptocurrencies, though I haven't tried.
