
Thursday, May 31, 2018

Making more things

I made some things recently.

My wife likes pink, so I painted the desktop "candy pink" for her. She's been running astrophysics-y python codes, and her laptop just isn't cutting it anymore.

I also printed out the sugar glider decal.
Paint and decals make things faster, yes?

I modeled and 3D printed a Delta II rocket fairing for a downloadable 3D model of the Kepler Space Telescope, which I also 3D printed.





I put it on Thingiverse (see here for more info).

Monday, May 28, 2018

Clustering the Cluster and Software Part 2

This post will cover setting up the cluster networks and getting it working with OpenMPI.

First though, I have a major hardware change for the headnode. I managed to snag an Asus Z10PE-D8 with dual E5 v4 support (manufactured after Oct 2015) for a great price. It also has a PCIe 3.0 x4 M.2 slot for NVMe drives and 7 x16 slots, which are major improvements. Since I already benchmarked the one E5-2667v3 QS I have, I installed the two E5-4627v3 QS processors in the new motherboard straightaway. Oddly, the latest BIOS (3703) for this motherboard made it unstable, so I downgraded to the previous one. Not sure why. Maybe the "increase performance" entry in the release notes is some sort of default overclock that causes my system to be unstable? idk.

First, I was able to boot the new workstation from the old SSD. However, attempts to clone the SSD to the NVMe with Clonezilla failed. Thus, I had to install CentOS from scratch again. I then booted into the newly installed OS, mounted the old SSD, and copied over any directories I had modified, e.g. /opt, /usr/local, etc. Then I unmounted the old SSD, shut down, and removed it. Everything worked great.

Now that I have an operational headnode, there are some things that need to be done to make it work in a cluster.

Headnode Cluster Setup

RAID

For my headnode, the OS is installed on an NVMe, but that's not big enough to store many CFD cases. For long term storage, I had two 3TB HDDs that I put in RAID1. The desktop did not have an onboard RAID controller, so I had to use mdadm. The Z10PE-D8 has a hardware RAID controller, but it only works with Windows according to the manual. It also has a software "LSI MegaRAID" controller. To use this, connect at least two HDDs to SATA ports. Make sure they are all on either the SATA or the sSATA ports, not split across both. If you need more drives than there are SATA (or sSATA) ports, then you can use mdadm to create a software RAID that spans the multiple SATA controllers. Anyways, since I'm just using two in RAID1, I put them both in SATA ports.

Next, go into the BIOS, enable RAID in the PCH settings menu, then save changes and reset. The motherboard manual is incorrect after this point, though. There is no option to press "Ctrl+M" to start the LSI MegaRAID utility. Instead, it seems to have been moved inside the BIOS. Boot into the BIOS again, and this time look for the "LSI MegaRAID ..." entry in the Advanced menu. Go into that. The menus and explanations are fairly obvious. Select the RAID level, select the drives you want to put in RAID, name it, then create the RAID. Save changes and reset.

Boot into the OS. "lsblk" should show the drives with the same virtual volume. In my case, I had sda1/md127 and sdb1/md127. So I created a directory "data" at root and mounted /dev/md127 to /data. Running "df -Th" shows /dev/md127 with a size of 2.7TB and filesystem ext4 mounted at /data. Success! To make the mount persist across reboots, you must edit fstab:
  1. vi /etc/fstab 
  2. add line: /dev/md127 /data ext4 defaults 0 0
You also need to rebuild the initial ramdisk image (initramfs). Create a backup first, then rebuild.
  1. link 1 
  2. link 2
  3. cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
  4. ll /boot/initramfs-$(uname -r).img*
  5. dracut -f
Try rebooting. If the /data folder is still present, then it worked. It's probably also a good idea to chown the /data folder for the cluster user to make copying stuff to it easier.
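Here's a minimal sketch of the fstab step that can be run more than once without adding a duplicate line. The helper name add_fstab_entry is made up for illustration, and it takes the fstab path as an argument so it can be tried on a copy first. Running "mount -a" afterwards mounts everything listed in fstab, which is a quick way to catch a bad entry before rebooting.

```shell
# Hypothetical helper: append an fstab entry only if the mount point is not
# already listed. Arguments: fstab path, device, mount point, filesystem type.
add_fstab_entry() {
    fstab=$1; device=$2; mountpoint=$3; fstype=$4
    # Field 2 of a non-comment fstab line is the mount point.
    if awk -v mp="$mountpoint" \
        '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
        echo "already present: $mountpoint"
    else
        printf '%s %s %s defaults 0 0\n' "$device" "$mountpoint" "$fstype" >> "$fstab"
        echo "added: $mountpoint"
    fi
}

# Example on the real file (as root), using the values from this post:
#   cp /etc/fstab /etc/fstab.bak
#   add_fstab_entry /etc/fstab /dev/md127 /data ext4
#   mount -a && df -Th /data
```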

Hostfile

Edit the /etc/hosts file to look like this:
127.0.0.1 localhost .... (etc)
::1....
192.168.2.1 headnode
192.168.2.2 node002
192.168.2.3 node003
192.168.2.4 node004
192.168.2.5 node005
 
In addition to the above file, which is for network names, you need an MPI hostfile located somewhere easy to access, like the home directory. The name isn't important. You will pass it to mpirun with the "-hostfile" parameter.
headnode slots=20
node002 slots=16
node003 slots=16
...etc...
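If you'd rather generate the MPI hostfile than type it, something like the following sketch works. The helper name make_hostfile and the "name:slots" argument format are my own invention here, not anything OpenMPI defines. The slot counts are whatever you want MPI to use on each node, typically the number of physical cores.

```shell
# Hypothetical helper: print Open MPI hostfile lines from "name:slots" pairs.
make_hostfile() {
    for pair in "$@"; do
        name=${pair%%:*}     # text before the colon
        slots=${pair##*:}    # text after the colon
        printf '%s slots=%s\n' "$name" "$slots"
    done
}

# Example, using the slot counts from this post:
#   make_hostfile headnode:20 node002:16 node003:16 > ~/cluster_hostfile
```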
 

Network and Intranet

Because of where this cluster is located, I do not have access to an internet ethernet port. I thus use a USB wifi dongle for internet. One of the headnode's LAN ports is connected to an 8 port unmanaged switch, along with one port from each of the slave nodes. This is going to be the gigabit intranet for SSH and MPI communications. In the future, I will create another network for IPMI using the other LAN port, but I don't have enough ports in the switch for it at the moment. I used the Network Manager GUI to do the following:
  1. Under default Ethernet connection, uncheck start automatically.  
  2. Also turn off both IPv4 and IPv6. This essentially disables the Ethernet profile. Alternatively, you could delete it.
  3. Create a new profile named intranet with IP 192.168.2.1, 255.255.255.0 subnet mask, 0.0.0.0 gateway, blank dns, connect automatically, and check "use this connection only for resources on its network"
  4. Create another profile named ipmi with IP 192.168.0.1, 255.255.255.0 subnet mask, 0.0.0.0 gateway, blank dns, do not check connect automatically, and check "use this connection only for resources on its network". This will be used later when I get more switch ports (another switch or a bigger switch) for talking to the slave node's management ports.  
I left the wifi settings default. My internet router assigns static IP's in the 192.168.1.X space, so that's why I skipped from 0 to 2. This allowed me to use internet and the intranet at the same time. I believe it should also be possible to use the other LAN port for internet at the same time as well if you turn IPv4 back on, but I have not tested this.

Firewalls

Because the headnode is connected to the internet, the firewall needs to stay active. However, the firewall will block MPI traffic. You have to assign ports for OpenMPI to use and then open those ports in the firewall. I'm not really sure how many ports to use...5 for both seemed to work ok.
  1. Create or edit a ~/.openmpi/mca-params.conf file. 
  2. Set the btl parameter "btl_tcp_port_min_v4" to some high port, e.g. 12341, and set "btl_tcp_port_range_v4" to the number of ports to use, which sometimes needs to be greater than the number of processes you plan on running. If it's 5, that makes the ports 12341-12345. Each parameter goes on its own line as name=value, e.g. btl_tcp_port_min_v4=12341.
  3. Set the oob parameter "oob_tcp_dynamic_ipv4_ports" to a range of ports. A small range works for this, like 5.
  4. Add the line: btl_tcp_if_include=192.168.2.0/24 . This restricts MPI TCP communication to the intranet subnet (change to your intranet subnet). An example of when this is important: Say you have two network interfaces and you don't want MPI using one of them, e.g. my ipmi interface, yet you want that interface to stay up all the time. If MPI sees that the interface is up, it will try to use it, even if there isn't a route between the nodes on that interface, so MPI will fail. This parameter prevents that error.
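Put together, the four steps above give a file along these lines. This is a sketch, not a copy of my actual file: the helper name write_mca_params is hypothetical, and the OOB range 12346-12350 in the example is an assumed placeholder (as far as I know the parameter takes a range or a comma-separated list of ranges). Whatever ranges you pick need to be opened in the firewall too.

```shell
# Hypothetical helper: write the MCA parameters described above to
# <dir>/mca-params.conf. Arguments: target dir, first BTL port, number of
# BTL ports, intranet subnet in CIDR form, OOB port range.
write_mca_params() {
    dir=$1; port_min=$2; port_count=$3; subnet=$4; oob_range=$5
    mkdir -p "$dir"
    cat > "$dir/mca-params.conf" <<EOF
# TCP ports Open MPI may use for MPI traffic (btl)
btl_tcp_port_min_v4=$port_min
btl_tcp_port_range_v4=$port_count
# Ports for out-of-band (launch/control) traffic
oob_tcp_dynamic_ipv4_ports=$oob_range
# Keep MPI TCP traffic on the intranet subnet only
btl_tcp_if_include=$subnet
EOF
}

# Example with the values from this post; the OOB range is an assumption:
#   write_mca_params ~/.openmpi 12341 5 192.168.2.0/24 12346-12350
```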
Those ports then need to be opened. Guide. Last time I did this, I was able to add the profiles intranet and ipmi to the home zone of the firewall. This was possible because the profiles were "logical devices". However, I am now unable to repeat this and must add the names of the interfaces, i.e. the ethernet adapter port names. Not sure what is different.
  1. sudo firewall-cmd --permanent --zone=home --change-interface=eth0 
  2. Then add all of the MPI TCP ports to the home zone's open ports. sudo firewall-cmd --permanent --zone=home --add-port=12341-12345/tcp 
If any ethernet port associated with those profiles is ever connected to the internet, those ports will need to be closed. It's also possible to define MPI as a service (see previously linked guide).
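To avoid an off-by-one between the MPI port settings and the firewall rule, the range can be computed from the first port and the port count. A small sketch follows; port_spec is a made-up helper name, and eth0 stands in for whatever your intranet adapter is actually called.

```shell
# Hypothetical helper: turn a first port and a port count into the
# "first-last/tcp" form that firewall-cmd --add-port accepts.
port_spec() {
    first=$1; count=$2
    last=$((first + count - 1))
    printf '%s-%s/tcp\n' "$first" "$last"
}

# Example (as root); eth0 is a placeholder for the intranet adapter name:
#   firewall-cmd --permanent --zone=home --change-interface=eth0
#   firewall-cmd --permanent --zone=home --add-port="$(port_spec 12341 5)"
#   firewall-cmd --reload
#   firewall-cmd --zone=home --list-all
```

The last command lists the zone's interfaces and open ports, which is a quick way to confirm the permanent rules were actually loaded.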

NFS

Make sure to do the above firewall setup before doing NFS setup. Guides: 1, 2

The nfs-utils package should already be installed. If not, install that now. The user (cluster) home folder will be shared over the intranet link with the slave nodes. The directory must be completely owned by the user:
  • chmod -R 755 /home/cluster
  • chown -R cluster:cluster /home/cluster
The following services need to be enabled and started:
  • systemctl enable rpcbind
  • systemctl enable nfs-server
  • systemctl enable nfs-lock
  • systemctl enable nfs-idmap
  • systemctl start rpcbind
  • systemctl start nfs-server
  • systemctl start nfs-lock
  • systemctl start nfs-idmap
Next, edit the file /etc/exports, and add the following line to it. The "*" means "all IP addresses can mount this", which is fine for an intranet.
  • /home/cluster *(rw,sync,no_root_squash,no_all_squash,no_subtree_check)
Then do the following:
  1. systemctl restart nfs-server
  2. firewall-cmd --permanent --zone=home --add-service=nfs
  3. firewall-cmd --permanent --zone=home --add-service=mountd
  4. firewall-cmd --permanent --zone=home --add-service=rpc-bind
  5. firewall-cmd --reload

SSH

Passwordless SSH needs to be set up so the nodes can communicate. Guide. Make sure ssh is installed on all nodes. As the user (cluster), run "ssh-keygen" and don't set a passphrase (hit enter). This will create an SSH key that is automatically shared to all nodes because the home directory is shared. Next, as cluster on the headnode, run "ssh-copy-id localhost". This copies the headnode's public key to the authorized keys file, which is then shared to all of the slave nodes via NFS. After setting up slave nodes (see below), it should be possible to SSH to all nodes, e.g. "ssh node002". MPI communication between nodes should also be possible.

Troubleshooting: make sure that your permissions are correct for the /home/cluster directory, /home/cluster/.ssh directory, and all of the files in .ssh. There are many forum posts on what these should be. If those are all correct and it doesn't work, try deleting all of the files in the .ssh folder and redoing the above steps. If that doesn't work, and you're getting something about "signing failed, agent refused operation", try running on headnode as cluster "ssh-add". I had to do the latter during one iteration of this cluster, but not during another...no idea why.
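For reference, here is a sketch of the permissions that the usual sshd strict-mode checks are happy with: home directory not writable by group or others, .ssh at 700, private files at 600, public keys at 644. The helper name fix_ssh_perms is hypothetical. Note that re-running the recursive "chmod -R 755" from the NFS section after generating keys would loosen the key files again, and ssh refuses private keys that others can read.

```shell
# Hypothetical helper: apply conservative SSH permissions.
# Argument: the user's home directory.
fix_ssh_perms() {
    home=$1
    chmod go-w "$home"                # home must not be group/world writable
    chmod 700 "$home/.ssh"            # only the owner may enter .ssh
    for f in "$home"/.ssh/*; do
        [ -f "$f" ] || continue
        case $f in
            *.pub) chmod 644 "$f" ;;  # public keys may be world readable
            *)     chmod 600 "$f" ;;  # private keys, authorized_keys, known_hosts
        esac
    done
}

# Example (as cluster on the headnode):
#   fix_ssh_perms /home/cluster
```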

Slave Nodes

The plan is to setup one of the nodes first and get everything working with it, then clone the drive to three more SSDs for the other nodes. The only things that should have to be changed after cloning are hostname and network.

Hostfile

Edit (or copy from headnode) the /etc/hosts file to look like this:
127.0.0.1 localhost .... (etc)
::1....
192.168.2.1 headnode
192.168.2.2 node002
192.168.2.3 node003
192.168.2.4 node004
192.168.2.5 node005
 

Network and Intranet

The slave nodes will never be connected to the internet. They will only be connected to the intranet. Since there is no GUI, the settings have to be made with "nmtui" or manually using the /etc/sysconfig/network-scripts/ifcfg-ens2fX files, where X=0,1 for the two LAN ports. Change the IP settings to correspond to each node, then change ONBOOT to yes. The interface can be turned on from a terminal with ifup ens2f1.
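As a rough sketch of the manual route, a static-IP ifcfg file only needs a handful of keys. The two helper names below are hypothetical, and node_ip assumes the convention that the node number matches the last octet on the 192.168.2.X intranet. Redirecting into the real ifcfg file replaces whatever the installer wrote there, so back it up first.

```shell
# Hypothetical helper: derive the intranet IP from a nodeXXX hostname,
# assuming XXX matches the last octet of the address.
node_ip() {
    n=${1#node}
    n=$(printf '%s' "$n" | sed 's/^0*//')   # strip leading zeros: 003 -> 3
    printf '192.168.2.%s\n' "${n:-0}"
}

# Hypothetical helper: print a static-IP ifcfg file for one interface.
make_ifcfg() {
    device=$1; ip=$2
    cat <<EOF
DEVICE=$device
TYPE=Ethernet
BOOTPROTO=none
IPADDR=$ip
NETMASK=255.255.255.0
ONBOOT=yes
EOF
}

# Example (as root on node003):
#   cd /etc/sysconfig/network-scripts
#   cp ifcfg-ens2f1 ~/ifcfg-ens2f1.bak
#   make_ifcfg ens2f1 "$(node_ip node003)" > ifcfg-ens2f1
#   ifup ens2f1
```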

Firewall

Kill the firewall on the slave nodes: systemctl stop firewalld, systemctl disable firewalld

NFS

Make sure to do the above firewall setup before doing NFS setup.

The nfs-utils package should already be installed. If not, install that now. Then run the following command to mount the NFS shared directory: sudo mount headnode:/home/cluster /home/cluster

To make this permanent, edit the /etc/fstab file and add the following line: headnode:/home/cluster /home/cluster nfs defaults 0 0

Repeat the above for all slave nodes. Test the NFS connection by creating a file in the home directory and check to see if it has propagated to all nodes. The following command may need to be run as root on all slave nodes to set SELinux policy:  "setsebool -P use_nfs_home_dirs 1". I don't remember the reason for this, but I think I had to do this to get it working before, though this time it seemed to work without setting that.


Testing

I received a SELinux alert about rpc on the headnode when I established the NFS connection on the slave node. I followed the instructions provided for changing the policy to allow this.

Now that everything is setup, you should be able to run the mpi_hello_world script across multiple nodes. Note: if not all slave nodes are hooked up yet, you must comment out their lines in the mpi hostfile with a "#". mpirun first tries to establish a connection to all nodes, so if one of the nodes in the list is offline, it throws an error.
  • mpirun -np 36 -mca btl ^openib -hostfile ~/cluster_hostfile ./mpi_hello_world
This should work with no errors.

OpenFOAM will not work across nodes though, because the environment is not copied. For that, I'll need to set up SLURM. As a quick fix, I changed the slave node .bashrc to source the OF bashrc directly instead of setting it up as an alias. Ultimately I will still need SLURM because I will likely be using multiple environments in the future. I ran the previous motorBike tutorial benchmark on 36 processes. This required changing the ./Allmesh script to call mpirun since the "runParallel" OF function does not use hostfiles. I commented out the runParallel line and added these two:
  • nProcs=$(getNumberOfProcessors system/decomposeParDict)
  • mpirun -np 36 -hostfile /home/cluster/cluster_hostfile snappyHexMesh -parallel -overwrite > log.snappyHexMesh 2>&1
Actually, running snappyHexMesh across 36 processes on a mesh this small is probably counterproductive, but it made a good test. IIRC, decomposePar can be run with a smaller number of partitions, snappyHexMesh can be run, then the mesh can be re-decomposed with more partitions for the actual case. Running the case was done with the following command:
  • mpirun -np 36 -hostfile /home/cluster/cluster_hostfile simpleFoam -parallel > log.simpleFoam 2>&1
This completed in 64.24s, or 1.56 iter/s. This is fast, but it's about 21% less than the sum of the two individual nodes' max iter/s. Part of the reason for this is that the slave node is about 20% slower than the headnode on a core-by-core basis, so the headnode's processes must wait for the slave node's processes to finish. The primary reason is likely communication delay caused by the gigabit network. Typically, 60-100k cells per process is close to optimal (it can be more or less depending on solver, network, and many other things). 2M/20 is 100k, and 2M/36 is 55k, so it makes sense that the speedup is not perfect. Faster networks allow for lower numbers of cells/process while maintaining speedup scaling.
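The cells-per-process arithmetic is easy to script if you want to try other process counts. This is just integer division in the shell; the function names are made up, and the 2M cell count and the 60-100k rule of thumb are the figures quoted above.

```shell
# Cells per process for a given mesh size and process count (integer division).
cells_per_proc() {
    echo $(( $1 / $2 ))
}

# Process count that gives roughly a target number of cells per process.
procs_for_target() {
    echo $(( $1 / $2 ))
}

cells_per_proc 2000000 20        # headnode only
cells_per_proc 2000000 36        # headnode + one slave node
procs_for_target 2000000 100000  # process count at 100k cells each
procs_for_target 2000000 60000   # process count at 60k cells each
```

By that rule of thumb a 2M cell mesh wants somewhere around 20 to 33 processes, so 36 is already a bit past the comfortable range for a slow network.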

Infiniband

I've had a few posts discussing this, but to summarize: I have a working QDR Infiniband system based on used Sun Infiniband cards (actually rebranded Mellanox MHQH29B) and a Sun switch. I've tested it with various performance tests, and it achieves the expected QDR performance. Now I need to make it work with MPI and OpenFOAM.

First, RDMA communications require memory, and if you use a lot of memory (like when running CFD), you will probably hit the system limits. You need to add a rdma.conf file to  /etc/security/limits.d that has the following:
# configuration for rdma tuning
* soft memlock unlimited
* hard memlock unlimited
# rdma tuning end
That file needs to be copied to the slave nodes:
  • (as root) scp /etc/security/limits.d/rdma.conf root@node002:/etc/security/limits.d/rdma.conf
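Once more nodes are online, the same copy can be looped over the MPI hostfile instead of typing each node name. A sketch, with a made-up helper name slave_nodes; it skips blank lines, lines commented out with "#", and the headnode itself.

```shell
# Hypothetical helper: list the slave nodes named in an MPI hostfile,
# skipping blank lines, "#" comments, and the headnode.
slave_nodes() {
    awk '$1 !~ /^#/ && NF > 0 && $1 != "headnode" { print $1 }' "$1"
}

# Example (as root on the headnode):
#   for node in $(slave_nodes /home/cluster/cluster_hostfile); do
#       scp /etc/security/limits.d/rdma.conf \
#           "root@$node:/etc/security/limits.d/rdma.conf"
#   done
```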
Now reboot all nodes. Supposedly logging out and logging back in will work, and there's probably a clever way to reload that limit, but I don't know it. "ulimit -l" should give "unlimited". Now to run the mpi_hello_world script again, but this time without the mca parameter excluding openib:
  • mpirun -n 36 -hostfile ~/cluster_hostfile ./mpi_hello_world
That should run without errors. You can also just remove the "^" from the previous command to ensure that infiniband is being used.

Now to try that OpenFOAM benchmark again. Everything is the same as before (see "Testing" section above), except now the log file shouldn't have any warnings about not using verbs devices, and it should be faster. Run time was 51.21 s (1.95 iter/s), which is about 20% less run time than over gigabit (about 25% more iter/s). In fact, it's only about 1% slower than the sum of the two nodes' individual max iter/s. This means scaling is almost perfect, which is excellent. 
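For anyone checking the numbers, here is the arithmetic as a small awk sketch. It assumes the benchmark runs 100 iterations, which is what the quoted times and iter/s figures imply; the function names are just for illustration.

```shell
# Iterations per second for a run of N iterations taking T seconds.
iter_rate() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", n / t }'
}

# Speedup of a new run time relative to an old one (old / new).
speedup() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", old / new }'
}

# Percent reduction in run time going from old to new.
time_saved_pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (1 - new / old) * 100 }'
}

iter_rate 100 64.24        # gigabit ethernet
iter_rate 100 51.21        # Infiniband
speedup 64.24 51.21
time_saved_pct 64.24 51.21
```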

Summary

I now have OpenFOAM working over Infiniband with a headnode and one slave node. There are still more things to do though:
  1. Setup environment module for openmpi
  2. Get SLURM working
  3. Copy the slave node's SSD 3 times and install those SSDs in the other slave nodes
  4. Modify node003-node005's hostname and network IP addresses.
  5. Test MPI over intranet and Infiniband with all nodes
I think I'm going to skip trying to get disk-less boot working. It's very complicated. Some of the software I eventually want to install on the slave nodes is rather heavy, and I don't want the OS and installation taking up a big chunk of the RAM.



Update: A few days after setting all this up (and the server being off), I had a weird error. SSH started asking for a password, and I couldn't access the /home/cluster directory on the slave node. I got an NFS error "Stale file handle". I fixed this by doing the following as root on the slave node:
  1. umount -f /home/cluster
  2. mount -t nfs headnode:/home/cluster /home/cluster
It seems to work now, even after rebooting both headnode and slave node. Not sure why it happens.
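If it keeps happening, the check and remount could be wrapped up like this. This is only a sketch: needs_remount is a made-up helper that treats "the directory can't be listed" as the symptom, so it won't notice the case where the share simply isn't mounted and the empty mount point lists fine.

```shell
# Hypothetical helper: succeed (exit 0) when a directory cannot be listed,
# which is how a stale NFS handle shows up from the shell.
needs_remount() {
    ! ls "$1" > /dev/null 2>&1
}

# Example (as root on a slave node):
#   if needs_remount /home/cluster; then
#       umount -f /home/cluster
#       mount -t nfs headnode:/home/cluster /home/cluster
#   fi
```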

Homelab Cluster: Software Part 1

This is going to be a mess, curt, and full of errors. Fucking blogger blanked out then autosaved my nice draft of detailed instructions, so I had to recreate all of this from memory. I'm going to be updating it in stages since that's the only way to prevent autosave from deleting everything. There will be many places where I note I can't remember something exactly.

I finished updating BIOS and IPMI on all slave nodes. Downgraded all Sun Infiniband cards to the 2010 firmware and installed them. I built the new workstation that will serve as a headnode. Its first configuration had a SM X10DAi motherboard and a single E5-2667v3 QS. Its newest configuration will be discussed in the next post.

Since the slave and head nodes have different software requirements, I've split them up.

Headnode Software

OS and Initial Setup

Most NVMe's and the X10DAi do not play nice together, so I ended up using a standard SATA SSD. Create a CentOS DVD USB installer. Install it with development tools, infiniband support, a gui, etc. Create an admin user "cluster". Do yum update. You may have to do "sudo systemctl stop packagekit" to stop the auto-updater from locking yum. Reboot. Install the following packages:
zlib-devel libXext-devel libGLU-devel libXt-devel libXrender-devel libXinerama-devel libpng-devel libXrandr-devel libXi-devel libXft-devel libjpeg-turbo-devel libXcursor-devel readline-devel ncurses-devel python python-devel qt-devel qt-assistant mpfr-devel gmp-devel libibverbs-devel numactl numactl-devel boost boost-devel environment-modules ntp
Most of the above are from step 2 here (scroll down to CentOS 7.4 instructions). If you'll be using slurm, you need to install the following:

  • yum install epel-release
  • yum install munge munge-libs munge-devel rpm-build gcc openssl openssl-devel libssh2-devel pam-devel hwloc hwloc-devel lua lua-devel rrdtool-devel gtk2-devel man2html libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
Most of those will probably already be installed. If using an nvidia graphics card, you may need to install CentOS in basic graphics mode or with another simpler graphics card, then install the Nvidia driver before yum update all and reboot. Note: if you install the nvidia driver with a non-nvidia graphics card, you will need to put the nvidia gpu in between shutting down and rebooting or CentOS boot will hang. Uninstall the system cmake (sudo yum remove cmake). Name the headnode headnode using the hostnamectl command (hostnamectl set-hostname headnode).

I'll be installing pretty much everything as root instead of user, so that changes some things, but I can't remember what exactly.

Install cmake:
  1. Download the latest source and install script from their website. 
  2. Move the tar and .sh script to the /usr/local directory
  3. Run the .sh install script
  4. Keep hitting enter until you see kitware. Hit enter 1 more time
  5. Type y to accept license agreement
  6. Type n to install in /usr/local so the system can find cmake
  7. cmake -version should return cmake's version

OpenMPI

The OpenFOAM ThirdParty folder contains an old version of OpenMPI. It's better to use a newer system install of OpenMPI. Download the latest version source and put it in the /opt folder. Instructions for building OpenMPI are here. Configure with "--prefix=/opt/openmpi-3.1.0 --with-verbs --with-slurm --with-pmi=/usr" for infiniband and slurm support. Note that slurm (and slurm's libpmi package) must be installed first if using pmi/pmi2. I detail this in part 3 of this series of posts. If not using slurm, you can leave off the "--with-slurm --with-pmi" configure options. It's probably a good idea to redirect the output of configure and install to files, e.g. >log.install 2>&1. Make sure there are no build errors.

Then add the openmpi bin to PATH, e.g. PATH=/opt/openmpi-3.1.0/bin:$PATH, and lib to LD_LIBRARY_PATH, e.g. LD_LIBRARY_PATH=/opt/openmpi-3.1.0/lib:$LD_LIBRARY_PATH, in both root's and the user's .bashrc. If you build in /usr/local, then you don't have to do that, but then it's harder to maintain different versions of MPI. mpirun -version should give the openmpi version. In part 3, I discuss how to use environment modules instead of adding the paths to the .bashrc. Download a test MPI hello world program. Compile and run it:
  • mpicc -o mpi_hello_world mpi_hello_world.c 
  • mpirun -n 16 -mca btl ^openib ./mpi_hello_world
You will get a warning about not using infiniband if you don't exclude openib.
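The OpenMPI build described earlier in this section boils down to something like the sketch below. The helper name ompi_configure_args, the source directory /opt/openmpi-3.1.0-src, and the 8 parallel make jobs are all assumptions for illustration; only the configure options come from the text above.

```shell
# Hypothetical helper: print the configure arguments for an Open MPI build.
# Arguments: install prefix, "yes" to add the slurm/PMI options.
ompi_configure_args() {
    prefix=$1; with_slurm=$2
    args="--prefix=$prefix --with-verbs"
    if [ "$with_slurm" = "yes" ]; then
        args="$args --with-slurm --with-pmi=/usr"
    fi
    printf '%s\n' "$args"
}

# Example (as root); the source directory name is an assumption:
#   cd /opt/openmpi-3.1.0-src
#   ./configure $(ompi_configure_args /opt/openmpi-3.1.0 no) > log.configure 2>&1
#   make -j 8 > log.make 2>&1
#   make install > log.install 2>&1
#   export PATH=/opt/openmpi-3.1.0/bin:$PATH
#   export LD_LIBRARY_PATH=/opt/openmpi-3.1.0/lib:$LD_LIBRARY_PATH
#   mpirun --version
```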

OpenMPI can use different core bindings and distribution methods. I tried a bunch of these and had two long paragraphs detailing them, but again, this got deleted. The conclusion was that the v3.1.0 defaults seem to work the best for OpenFOAM.

OpenFOAM

These instructions loosely follow these links: wiki, install guide, build guide, system requirements.

The differences between OpenFOAM+, e.g. v1712, and OpenFOAM, e.g. v5, are not clear. They are maintained by different companies, but they work together. The two types of OpenFOAM share most of the code. OpenFOAM+ is released every 6 months, while OpenFOAM is released more often, but they both have development git repositories where you can download dev versions. The last version I used was OpenFOAM v4, so I decided to try OpenFOAM+ this time. They're both using Docker now with precompiled binaries. This is very convenient for single machine installs. However, Docker does not work well on clusters, which is why I have to install from source.

Download and untar v1712 according to the instructions. I'm installing OpenFOAM in /opt. You must modify the install directory in the OpenFOAM/etc/bashrc file to be /opt (uncomment one line and comment out another). While in there, change WM_LABEL_SIZE to 64 and make sure the MPI type (WM_MPLIB) is SYSTEMOPENMPI.

CentOS 7.5's system installed stuff is all recent enough versions except for cmake, which was already taken care of above. CGAL is installed automatically by the ThirdParty folder, but you must modify its OpenFOAM/etc/config.sh file to set the boost library to boost-system so the ThirdParty boost isn't installed. The ThirdParty folder is missing METIS. This needs to be downloaded and unpacked in the ThirdParty folder, which is already set up to install it, so nothing further needs to be done (unless the version is different, in which case the config file needs to be changed). MESA is also missing, but you only need that for the GPU-less slave nodes.

The shipped version of cfmesh does not build with OpenFOAM. You must update it from the github repository. Go to OpenFOAM/modules and mv cfmesh /opt/oldcfmesh. The old directory must be moved to a not-openfoam directory or Allwmake will find it and try to build it. Then clone the latest github version of the cfmesh repository to "cfmesh" in the same location. Now you should have a cfmesh folder with the new files. This will build automatically and shouldn't have any errors. This wasn't well documented until I filed a bug report, though it had been fixed back in January.

Need to source the OpenFOAM bashrc in the root and user bashrc. See the wiki guide for how to set this up as a convenient alias. Example:
  • alias of1712='source /opt/OpenFOAM-v1712/etc/bashrc FOAMY_HEX_MESH=yes'
Source it before continuing installation. You will probably see this warning: "No completion added for /opt/OpenFOAM-v1712/platforms/linux64GccDPInt64Opt/bin". It can be ignored. It should go away after re-sourcing the bashrc after building OpenFOAM or after a reboot.

Do not install the ThirdParty first. Follow these instructions, which are a mix of the wiki and official instructions:
  1. cd to the ThirdParty directory
  2. ./makeParaView -mpi -python -qmake $(which qmake-qt4) > log.makePV 2>&1
  3. check log for errors
  4. wmRefresh
  5. foam
  6. the above should change to the OpenFOAM directory. If it does not, something is wrong.
  7. foamSystemCheck
  8. export WM_NCOMPPROCS=8
  9. ./Allwmake > log.make 2>&1 (this should be the openfoam Allwmake, not the thirdparty folder one)
Check log for any errors.
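The logs are long, so a crude grep is a reasonable first pass for checking them. This is only a sketch with a made-up helper name; the pattern will also flag harmless lines that merely mention "error" (file names, for example), so treat hits as things to look at, not as proof the build failed.

```shell
# Hypothetical helper: print numbered lines in a build log that look like
# errors. Exit status is 0 when at least one suspicious line is found.
log_errors() {
    grep -n -i -E 'error|fatal' "$1"
}

# Example:
#   log_errors log.makePV
#   log_errors log.make
```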

After successful install:
  1. become user, source openfoam alias
  2. foamInstallationTest 
  3. mkdir -p $FOAM_RUN
  4. sudo chown -R cluster:cluster ~/OpenFOAM
  5. run 
  6. cp -r $FOAM_TUTORIALS/incompressible/simpleFoam/pitzDaily ./ 
  7. chown -R cluster pitzDaily
  8. cd pitzDaily 
  9. blockMesh 
  10. simpleFoam 
  11. paraFoam
  12. cp -r $FOAM_TUTORIALS/incompressible/simpleFoam/motorBike ./ 
  13. chown -R cluster motorBike
  14. cd motorBike 
  15. disable streamlines stuff in the controlDict
  16. ./Allrun
If the above works, then your OpenFOAM installation is working. The streamlines functions syntax has changed, but hasn't been updated in the tutorials, so it causes errors.

Slave Node Software

The plan is to create an installation on a small SSD on one slave node, make sure it is fully working, then clone it for all of the other slave nodes. If you do this, make sure that you are using the smallest SSD you have. Cloning from a smaller to a larger drive is easy, but cloning a larger drive to a smaller is almost impossible. There is another way which involves storing the configured slave node OS on the headnode, serving to each of the slave nodes with PXEBoot, and booting the OS into their RAM. This is better for small compute node OS installations, but ends up taking up a lot of RAM for larger installations. I'm going to use the cloned SSD's method for now, but I may try the network booting method later.

OS

Install CentOS compute node type with Infiniband support, development tools, etc, but no gui. Create the partitions manually. Use ext4 and no logical volumes because XFS is less flexible and logical volumes can be difficult to deal with for something this simple. Create a user "cluster". Make sure that the UID and GID of this user are the same on all nodes, including the headnode: "id (username)". You may have to set them. Do yum update. Reboot. Install the following packages:
zlib-devel libXext-devel libGLU-devel libXt-devel libXrender-devel libXinerama-devel libpng-devel libXrandr-devel libXi-devel libXft-devel libjpeg-turbo-devel libXcursor-devel readline-devel ncurses-devel python python-devel mpfr-devel gmp-devel libibverbs-devel numactl numactl-devel boost boost-devel environment-modules ntp
Most of the above are from step 2 here (scroll down to CentOS 7.4 instructions). This is similar to the headnode except no qt because only need qt for ParaView, which will not be installed on the slave nodes. If you'll be using slurm, you need to install the following:

  • yum install epel-release
  • yum install munge munge-libs munge-devel rpm-build gcc openssl openssl-devel libssh2-devel pam-devel hwloc hwloc-devel lua lua-devel rrdtool-devel gtk2-devel man2html libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
Most of those will probably already be installed. Name the slave nodes sequentially with the nodeXXX format using the "hostnamectl set-hostname" command. I like to make the XXX the same as the last three digits of the static intranet IP or IPMI IP address I set.

Install cmake as with the headnode.

OpenMPI

Install OpenMPI like on the headnode.

OpenFOAM

Follow the headnode instructions until it mentions MESA, then come here.

MESA is a graphics driver library thing for linux machines that do not have graphics hardware, like these slave nodes. Download MESA and unpack it in the third party folder. The rest of these instructions I can't remember clearly. The basic idea is to compile the VTK libraries and MESA so that the slave nodes can do stuff like write VTK files. There are some text files (build and readmes?) in the ThirdParty folder that help. You need to create a symbolic link in the main ThirdParty folder to the VTK library in the ParaView folder. Then you might need to change the MESA and VTK versions in some config files and/or in some files in the ThirdParty folder. Then you need to make MESA with the ThirdParty make MESA file, and make VTK with the ThirdParty make VTK file. I think there is an example you can use for this. ParaView is not built on the slave nodes.

Follow the headnode instructions concerning cfmesh and the OpenFOAM bashrc. Follow the build instructions, except don't make ParaView. Follow the same test instructions, except don't do paraFoam (since ParaView isn't installed).

Benchmarks and conclusions

Someone on CFDOnline created a convenient benchmark for OpenFOAM based on the motorBike tutorial. I downloaded this and ran it on various node configurations. Remember to change the controlDict's streamlines stuff (mentioned above). I had very nice, multi-paragraph, well-laid-out instructions, along with a presentation and analysis of the results, but again, it all got deleted by the fucking autosave. The main conclusion was that my results make sense compared to the prior results, and that the AMD EPYCs are awesome for CFD due to their high memory bandwidth.

I made sure to set all of the slave node's BIOS to performance, but I didn't see a setting for that with the X10DAi. Turns out it's hidden until you select custom in power management. Link. Doing the things in that link improved iterations/second by 22%.

Next Steps

Now that OpenMPI and OpenFOAM are working on individual nodes, the next steps will be getting them working over ethernet, then Infiniband.

Making a RAID1 volume on Centos7 with mdadm that stays persistent upon reboot

The normal online guides didn’t work for me (example links 1 2). I could create an array, and it would work fine. But the raid array, which I wasn’t even booting off of, would fail during boot, which caused a bunch of other failures, which would cause CentOS to boot into emergency mode. No way to even get the gui to boot (I tried). When I commented out the raid array line in fstab and rebooted, CentOS would boot fine, but there was no sign of the previously perfectly working raid array…no recognized superblocks, nothing, so I couldn’t assemble or start it again. Here’s the process that eventually worked.
  1. lsblk –figure out which disks you want to raid 
  2. If you already have a raid running that you want to kill and remake, follow this guide
  3. Delete partitions on the disks using ?? or the disk utility. 
  4. Run the command sudo dd if=/dev/zero of=/dev/sda bs=1M count=500000 , where sda is the disk you want to overwrite with zeros, and count is the number of 1 MiB blocks to write (use the size of the device in MB to cover the whole disk). The goal is to wipe out any metadata (filesystems, etc.) left on the disk from previous failed attempts. You can kill it after a few minutes; usually only the first few sectors need to be wiped. 
  5. Contents in the disk utility should say “unknown” now. If it says free space or unallocated space, go back and run the zeroing command longer 
  6. Now create the partitions. There are pros and cons to making the raid of partitions or the actual disks. I used partitions. 
    1. link
    2. sudo parted /dev/sda
      1. (parted) print – to check to see if a partition exists. If these are blank drives there should be no partition. If a partition exists, go back to main step 2. If a file system exists, go back to main step 3.
      2. (parted) mklabel gpt – sets disk to GUID partition label. 
      3.  (parted) print – check to see if a GPT partition label was created. 
      4. (parted) mkpart primary 0% 100% - Create a single primary partition aligning sectors and using all available space on drive. 
      5. (parted) set 1 raid on – Marks partition of type "Linux raid". 
      6. (parted) align-check optimal 1 – Checks for alignment of partition “1” we just created. 
      7. (parted) print – Final check to determine if the primary partition was created properly with the appropriate partition label and marked for software raid. 
      8. (parted) quit 
    3. sudo gdisk -l /dev/sda (checks partition information)
    4. Repeat above for other disk (sdb for me) 
  7. Create raid array: sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1 
    1. Make sure to use sda1, not sda, unless you are doing this on the drives and not the partitions. 
    2. Commands for checking progress of syncing (will take a few hours): 
      1. cat /proc/mdstat 
      2. mdadm --detail /dev/md0 
    3. You may not have to wait for it to finish to continue, but I do to be safe. 
  8. Make filesystem and mount: 
    1. Could create logical volumes before creating the filesystem and mounting, but don’t have to: link 
    2. sudo mkfs.ext4 -F /dev/md0 
    3. sudo mkdir /data 
    4. sudo chown (username) /data (use your username if you don’t want to have to sudo every time you copy something to this, otherwise leave this step out) 
    5. sudo chmod 775 /data (use whatever you need here) 
    6. sudo mount /dev/md0 /data 
    7. Try writing something to the raid. “touch test” or something like that. 
  9. Create mdadm.conf file, which is used at boot to build the array: 
    1. mdadm --detail --scan > /etc/mdadm.conf 
    2. I did not use this one: mdadm --examine --scan --config=mdadm.conf >> /etc/mdadm.conf 
    3. explanation  of differences
    4. Note: > : if file exists, it will be replaced. >>: if file exists, it will be appended 
    5. possibly useful link 
  10. Edit fstab so it is persistent across reboots 
    1. vi /etc/fstab 
    2. add line: /dev/md0 /data ext4 defaults 0 0 
  11.  Need to rebuild initial ramdisk image (initramfs). Create backup first, then rebuild. 
    1. link 1 
    2. link 2
    3. cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
    4. ll /boot/initramfs-$(uname -r).img*
    5. dracut -f
  12. Reboot. If it works, YAY! If it doesn’t, try again!
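
The steps above can be collected into one place. Here is a minimal sketch, assuming two blank disks and the same /dev/md0 and /data names used above. The function name raid1_plan is made up for illustration, and it only prints the commands (using parted's non-interactive -s mode instead of the interactive prompts) so nothing touches a disk until you have read the output. The zeroing step is left out on purpose.

```shell
#!/bin/sh
# raid1_plan: print (do not run) the RAID1 commands from the steps above.
# Usage: raid1_plan DISK_A DISK_B [MOUNT_POINT]
# The function name and the default mount point are illustrative assumptions.
raid1_plan() {
    disk_a=$1
    disk_b=$2
    mount_point=${3:-/data}

    # Step 6: one GPT partition per disk, flagged for software raid.
    for disk in "$disk_a" "$disk_b"; do
        echo "parted -s /dev/$disk mklabel gpt"
        echo "parted -s /dev/$disk mkpart primary 0% 100%"
        echo "parted -s /dev/$disk set 1 raid on"
    done

    # Step 7: build the array from the partitions (sda1, not sda).
    echo "mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/${disk_a}1 /dev/${disk_b}1"

    # Step 8: filesystem and mount.
    echo "mkfs.ext4 -F /dev/md0"
    echo "mkdir -p $mount_point"
    echo "mount /dev/md0 $mount_point"

    # Steps 9-11: mdadm.conf, fstab, initramfs backup and rebuild.
    echo "mdadm --detail --scan > /etc/mdadm.conf"
    echo "echo '/dev/md0 $mount_point ext4 defaults 0 0' >> /etc/fstab"
    echo "cp /boot/initramfs-\$(uname -r).img /boot/initramfs-\$(uname -r).img.bak"
    echo "dracut -f"
}
```

Run raid1_plan sda sdb, check that the device names are the ones lsblk showed, then run the printed lines by hand as root.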

Friday, May 25, 2018

Homelab Cluster: Growing pains. NVMe boot with the X10DAi

This post is entirely about trying to get an NVMe SSD to boot from an X10DAi motherboard.

Apparently installing to and booting from an NVMe drive with Linux is a common problem. The NVMe drive, motherboard BIOS, and OS all have to have compatible drivers for it to work. My workstation motherboard is an X10DAi, and the NVMe drive I'm trying to use is the Samsung 960 Evo. This FAQ says to just enable EFI Option ROM for the PCIe slot the NVMe drive (well, its adapter) is in, then boot from a UEFI DVD/USB installer and install. However, this did not work for me with CentOS 7.3 because CentOS couldn't see the NVMe drive. I activated a shell from the CentOS installer (Ctrl + Alt + F2) and ran "lspci", but it didn't detect the NVMe drive, which means the motherboard/BIOS isn't seeing it either.

The BIOS was a version out of date (2.0a instead of 3.0a), so I downloaded the new BIOS image and followed the instructions to update it, which went smoothly. I loaded the default BIOS settings, changed the EFI OPROM setting on the appropriate slot again, then booted to the CentOS UEFI installer. But it still didn't see the NVMe drive. Back to the shell...nope, nothing. "modprobe nvme" doesn't help either. "lsblk" doesn't see it. Well this sucks. So either the motherboard doesn't work with NVMe drives (despite the FAQ), the PCIe adapter is bad, the drive is bad, or the drive is incompatible with Linux (unlikely).

To test whether it's CentOS, I tried installing Ubuntu 18. It also did not see the drive. So then I took it out of the workstation and put it in the desktop. The desktop's motherboard has an M.2 PCIe drive slot, so it should be compatible with PCIe SSDs in PCIe slots. Bingo, it showed up in Ubuntu 16.04 "lsblk" without having to do anything. "lspci" shows the Samsung controller. I was able to partition it and write files to it. So the drive and the adapter are fine. The X10DAi is not compatible with most NVMe PCIe SSDs for boot, despite what the FAQ says. It may only work with certain ones, or it may only boot with Windows, which is annoying. 
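
The order of checks above (PCIe bus first, then the kernel) can be sketched as a small script. This is a rough sketch, not something from the FAQ: the function names are made up, and it relies on lspci listing NVMe devices under the class "Non-Volatile memory controller".

```shell
#!/bin/sh
# nvme_visible: succeeds if lspci output on stdin lists an NVMe controller.
# lspci reports NVMe devices under the class "Non-Volatile memory controller".
nvme_visible() {
    grep -qi 'non-volatile memory controller'
}

# nvme_report: hypothetical wrapper that runs the live checks in order.
nvme_report() {
    if lspci | nvme_visible; then
        echo "PCIe: NVMe controller found"
        # -d lists whole disks only; NVMe disks show up as nvme0n1, nvme1n1, ...
        lsblk -d -o NAME,SIZE,MODEL | grep -i '^nvme' \
            || echo "kernel: no nvme block device yet (try: modprobe nvme)"
    else
        echo "PCIe: no NVMe controller; check the slot, adapter, and BIOS before blaming the OS"
    fi
}
```

If lspci shows nothing, no amount of "modprobe nvme" will help, because the problem is below the OS.
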
This thread suggests the BIOS might have to be modded, which I really want to avoid. A post in that thread said that the 950 Pro works with an X10 and the latest BIOS. This thread has an interesting post in reply to someone else saying a 950 Pro worked in their X10DAX (which is same line as my X10DAi):
Contrary to nearly all other NVMe SSDs the Samsung 950 Pro has an NVMe Option ROM in the box. That is why you can boot off that SSD in LEGACY mode.
Generally you need a suitable NVMe EFI module within the mainboard BIOS, if you want to be able to boot off an NVMe SSD in UEFI mode.
I guess that explains it. The fancy Intel drives they suggest buying probably have the NVMe Option ROM, too. Here's another good post, this time about the X10DRi-T. He managed to get SM to send him a BIOS with NVMe support for the 960 Evo. Unfortunately, it seems they haven't updated the X10DAi's BIOS with the same code. Moving forward, I'll either have to:
  1. Mod BIOS
  2. Get a Samsung 950 Pro
  3. Get a different motherboard, like an ASUS Z10PE-D8/D16
I looked into the ASUS Z10PE-D8/D16. According to a few threads on servethehome, it turns out that unless you buy one from after Nov. 2015, they do not support dual E5 V4 CPUs, even with a BIOS update. Turns out a chip has to be replaced on the motherboard to enable dual E5 V4's. Since most of the used boards were probably produced before that, and given my luck so far with this project, I think I won't do that. Unfortunately, the X10DRi-T's are expensive. None of the other dual SM X10 motherboards are confirmed to work with the 960 Evo or other consumer NVMe SSDs. That pretty much leaves getting a Samsung 950 Pro. For now, I'll have to use a regular SATA SSD for the OS. So I installed the OS, did updates, etc. I also contacted Supermicro Support to see if they could do anything.

Then I had a derp moment. I only had one CPU installed, which meant that the PCIe slot I was trying to use for the NVMe was not active. (Oops). I put it in another slot and booted to the SATA SSD I had already installed CentOS on. "lsblk" showed the NVMe drive, and "lspci" showed the Samsung controller, but oddly for the PM961. I suppose the PM961, like the 950 Pro, is supported then? Maybe the PM951 is, too? Who knows. I shut down, disconnected the SATA drive, put the CentOS installer USB back in, and booted to the UEFI installer, which also saw the drive. I then installed CentOS 7 minimal to it, shut down, and rebooted. However, the NVMe drive was not listed as a boot option. So close!

So I tried again, but this time with an extra USB drive inserted. I did custom standard partitioning and placed the /boot and /boot/efi partitions (1 GiB each) only on the USB drive, but placed everything else only on the NVMe drive (root 50GiB, swap 8GiB, home ~407 GiB), switched the USB drive to the boot device, installed, and rebooted to the USB drive. THIS WORKED. CentOS minimal booted from the USB to the NVMe. Heck yes. It's ugly, but it works, which is what matters. I'll probably get a low-profile 4-8GB USB 3.0 drive for the boot partition drive and just leave it in my computer forever.
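
A quick way to confirm the split landed where intended is to look at which device backs each mount point. A minimal sketch, assuming standard partitions (no LVM) so that findmnt reports plain /dev/... sources; the function name check_boot_layout is made up.

```shell
#!/bin/sh
# check_boot_layout: reads "TARGET SOURCE" pairs on stdin (the format printed by
# `findmnt -rn -o TARGET,SOURCE`) and succeeds only if / is on an NVMe device
# while /boot and /boot/efi both exist and are NOT on an NVMe device.
check_boot_layout() {
    awk '
        $1 == "/"         { root = $2 }
        $1 == "/boot"     { boot = $2 }
        $1 == "/boot/efi" { efi  = $2 }
        END {
            if (root !~ /^\/dev\/nvme/) exit 1
            if (boot == "" || boot ~ /^\/dev\/nvme/) exit 1
            if (efi == "" || efi ~ /^\/dev\/nvme/) exit 1
            exit 0
        }
    '
}
```

On the running system: findmnt -rn -o TARGET,SOURCE | check_boot_layout && echo "layout ok".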

Hopefully I'll be able to enable SATA RAID in the BIOS so I can use the hardware RAID controller for the storage drives. There seems to be some incompatibility with the RAID setting (it's on AHCI now).

Updates:

I've been in contact with Supermicro support. The X10DAi simply doesn't have the right code to talk to most consumer PCIe SSD's. I'd have to purchase a custom BIOS from them to enable booting from the 960 Evo.

That's kind of irrelevant though because I happened upon a crazy good deal for a Z10PE-D8 manufactured after Oct 2015, meaning it has the correct BIOS chip to handle dual v4 Xeons. So I bought it. It also has a PCIe 3.0 x4 M.2 slot that works with NVMe's, as well as 7 x16 slots. So overall, major upgrade. I'll be selling the X10DAi. 

Saturday, May 19, 2018

A few more successful prints

I designed and printed a couple things.

1. A better filament guide for the Wanhao i3.


It's shorter, so it aligns the filament with the extruder better. First attempt interfered with the spool holder, but I was able to modify it. The thingiverse model has been corrected. I made it so it would fit the teflon tube holder that fixes the coiling problem I mentioned earlier.

2. An SSD adapter for a Supermicro drive tray. I needed this for the homelab cluster.


This one is a bit tricky. Ideally, I'd have plastic on both sides and/or the bottom of the drive, but the way HDD's fit in the trays and the lack of space in the tray bays prevent this. Thus, the SSD is only held in by two side screws. Some support at the back helps keep it in place while being loaded. It works well, and is easy to adjust for thicker SFF drives. I thought about printing and selling them on eBay for less than the SM equivalent (MCP-220-00043-0N), but I'd barely break even.

My Phi adapters are selling ok on eBay. It's a small market, but I think my design is superior to all of the alternatives, so I should be able to make a tiny amount of money and hopefully partially pay off this printer.

Tuesday, May 15, 2018

Flashing Rebranded Mellanox Infiniband Cards and Homelab Infiniband, Part 3

I did the performance testing: Link. Bandwidth was about 3.3 GB/s, which is close to the theoretical maximum of 4 GB/s (32 Gbit/s) of a QDR link. Primary conclusion: No significant difference between the firmware versions.
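
For reference, here is where the 32 Gbit/s comes from: a 4x QDR link signals at 10 Gbit/s per lane, and 8b/10b encoding leaves 8 of every 10 bits for data. The sketch below just does that arithmetic in integer shell math; the function names are made up.

```shell
#!/bin/sh
# qdr_data_rate_gbit: usable data rate of a 4x QDR link in Gbit/s.
# 4 lanes x 10 Gbit/s signalling, 8b/10b encoding (8 data bits per 10 line bits).
qdr_data_rate_gbit() {
    echo $(( 4 * 10 * 8 / 10 ))
}

# link_efficiency_pct: measured rate as a whole-number percentage of the
# theoretical rate. Both arguments must be integers in the same unit (e.g. MB/s).
link_efficiency_pct() {
    echo $(( $1 * 100 / $2 ))
}
```

qdr_data_rate_gbit prints 32, and 32 Gbit/s divided by 8 bits per byte is the 4 GB/s ceiling. link_efficiency_pct 3300 4000 prints 82, so the measured 3.3 GB/s is a bit over 80% of the theoretical maximum.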

Next steps:
  1. Flash all cards with 2.11.2010 firmware
  2. Update IPMI and BIOS in other nodes, change BIOS settings to maximum performance
  3. Install all Infiniband cards
  4. Get OpenMPI and OpenFOAM working on one node. 
  5. Run OpenFOAM benchmarks
  6. Learn how to do network boot
    1. If this fails, mirror node installation to other nodes' ssd's
  7. Get clustered OpenMPI working over ethernet
  8. Get clustered OpenFOAM working over ethernet
  9. Get clustered OpenFOAM working over Infiniband

Monday, May 14, 2018

Flashing Rebranded Mellanox Infiniband Cards and Homelab Infiniband Part 2

When I last left off with this, I said I was going to purchase 4x more identical Sun QDR HCAs because I knew they worked with my Sun switch. So I did that. Specifically, I bought Sun/Oracle X4242A 375-3696-01 Rev. 51 Dual Port QDR Infiniband HCAs, which, according to my research, are equivalent to a Mellanox MHQH29B. Turns out the cards had little stickers on them saying they were MHQH29B-XSR rev A3's, so my research was right.

This should be easy, right? I mean, they work with my desktop and the Sun switch at QDR speeds, what could go wrong? Hahaha...nope. Time for another installment of Infiniband Nightmares.

As a reminder, the server I want to put them in is a 4 node Supermicro 6027TR-HTR (motherboards: X9DRT-HF) server. I installed them all, but 3/4 caused boot to hang at post code 91, which is when the PCI stuff is loaded. Shit. The fourth one seems to let the node boot fine, though. Switching nodes/pci slots doesn't help, and I know all the pci slots work fine because I had the QLE7340's in them. "OK, so you have 3 dead cards."...except not. Here's the weird part: they all boot fine in my desktop (I7-5960x, X99-SLI motherboard) and are recognized by lspci and ibstat.

I thought it might be a BIOS problem, so I upgraded IPMI and the BIOS of one of the nodes. Unfortunately, that didn't help. 

I also tried a bunch of different BIOS settings, none of which helped (UPDATE: not true anymore, see bottom of this post). I tried taping the PCIe SMBus pins. Also didn't help. 

One last thing to try: card firmware. The only differences between the card that allowed boot and the 3/4 that did not are as follows:
  • GUIDs, MACs, Serial numbers (all duhs. Those will be different for every card)
  • Firmware Version. 2.11.2012 vs 2.11.2010
  • MIC Version: 1.5.0 vs. 1.2.0 (same order)
    • not sure what this is
  • One line in the .ini files: "log2_uar_bar_megabytes = 7" vs. "sriov_en=true" (same order)
    • not sure what this does
  • The 2.11.2012 cards' firmware .bin file is about 12% larger. It's binary, so can't examine it.
The device ID (26428), PSID (SUN0170000009), HW revision, flint hardware info, mlxburn vpd (minus serial number), are all the same.
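
Differences like the .ini line above are easy to find once the configuration from each card has been dumped to a file (the "flint ... dc" command in the guide below does that). A minimal sketch of the comparison; the function name ini_diff and the file names are made up.

```shell
#!/bin/sh
# ini_diff: print only the lines that differ between two dumped .ini files.
# Lines only in the first file are prefixed "<", lines only in the second ">".
# Always returns success; empty output means the files are identical.
ini_diff() {
    { diff "$1" "$2" || true; } | { grep '^[<>]' || true; }
}
```

For example, ini_diff card_2012.ini card_2010.ini shows just the changed lines instead of the whole file.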

I did a ton of research about flashing firmware to rebranded Mellanox ConnectX-2 and ConnectX-3 HCAs. This information is relevant for various HP, Dell, IBM, and Sun branded Mellanox cards. Here is a list of the links I found most useful for writing the following guides:
  • 1
  • 2 Look for post by TeeJayHoward. His website has some of the MHQH19B and MHQH29B firmware files. 
  • 3 Look for post by izx
  • 4 
  • 5 Mellanox guide
  • 6 Post I started to deal with this problem
  • 7 Connectx3 firmware post.
Potentially useful files:
  • Most of TeeJayHoward's relevant files mirrored
  • Sun 2.11.2010 firmware
  • Sun 2.11.2012 firmware
  • Mellanox MFT 4.9 (in case they take it down for some dumb reason)
  • CX3 firmware mirror.
  • See Mellanox's firmware download site for what they consider current (the available firmware probably is not actually current).
Unfortunately, Mellanox has taken down their custom firmware table, so the mlx files are no longer available. Thus, you're stuck with whatever bin Mellanox provides you on their firmware download pages. HP offers free downloads of their firmware through their firmware website. Not sure about Dell. Sun's and IBM's firmware is locked behind expensive support contracts. Also unfortunately, the official Mellanox firmware isn't always the most up to date. For example, for the MHQH29B, the firmware revision for download is 2.9.1000, which is actually pretty old. It's so old that you can't use RDMA with Windows. If you want to change the firmware of your HCA, you're left with only a few options:
  1. Download the Mellanox firmware bin for your specific card from their firmware download page and burn it to your Mellanox card. This will be a .bin file that only works with one specific PSID. You can burn this to a rebranded Mellanox card using the process described below.
  2. Hope somebody downloaded the custom newer firmware for your card from Mellanox's table and that they're hosting those files somewhere. For example, in link 2 above, TeeJayHoward is hosting the 2.10.720 firmware for all revisions of MHQH19B and MHQH29B's. You have to build and burn the firmware (see process below).
  3. Find the firmware you want from your brand, e.g. HP. They might offer a newer version of the firmware, and they might not. Even with a support contract, they likely cannot provide you a different firmware version. 
  4.  Find the branded firmware from someone else hosting it. For example, the 2.11.2010 sun firmware from a different server is hosted in the last post of this thread. This is exceedingly rare.
  5. Transfer the firmware from one card to another (see process below). This requires you to be lucky enough to already have a card with a working firmware version. This is what I ended up doing.
  6. Buy newer cards (what all of the companies want you to do)
There really isn't much else you can do. 

The process for flashing Mellanox firmware from the Mellanox firmware website to Mellanox cards is straightforward and explained in link 5 above. 

The process for flashing any compatible Mellanox firmware version to a Mellanox or re-branded card is more difficult. This mainly follows link 3 mentioned above. Link 1 is good to read if you use Windows. Note: This may brick your card. Use at your own risk. 
  1. Download and install MFT (see link 5 above for download and guide)
  2. Command: mst start
  3. Figure out your device. There will be two. a pci_crX and a pciconfX or something like that. You want to use the crX device unless otherwise noted. Use whole path, e.g. /dev/mst/mt26428_pci_cr0.
    Command: mst status
  4. Save basic info such as GUIDs, MACs, etc.:
    Command: flint -d (device) query full > flint_query.txt
  5. Save low-level flash chip info:
    Command: flint -d (device) hw query > flint_hwinfo.txt
  6. Save existing firmware. This is very helpful if you have a card that works with your system and some that do not:
    Command: flint -d (device)  ri orig_firmware.bin
  7. Save existing FW configuration:
    Command: flint -d (device) dc orig_firmware.ini
  8. Save existing PXE ROM image (if any- mine didn't have this):
    Command: flint -d (device)  rrom orig_rom.bin
  9. Save existing PCI VPD (vital product data):
    Command: mlxburn -d (device pciconfX)  -vpd > orig_vpd.txt
  10. Now things are a little different. The link 3 guide shows you how to build your own .bin from a mlx and ini file. The mlx is a multi-adapter file. You have to create a .bin file specific for your adapter. The mlx files used to be available, but now only specific bin files are available. However, if you do manage to obtain a mlx, then you also need to obtain the ini file corresponding to your specific adapter and build the .bin from the two.
    Example command: mlxburn -fw fw-ConnectX3-rel.mlx -conf MCX312A-XCB_A2-A6.ini -wrimage mlnx_firmware.bin
  11. Once you have the .bin, then you need to verify it (bootable, all pass):
    Command: flint -i mlnx_firmware.bin verify
  12. Then you need to double check the firmware version and PSID.
    Command: flint -i mlnx_firmware.bin query full
  13. Finally, burn the new firmware image. If you are flashing an identical card that has a different PSID, e.g. flashing a Mellanox card that has been rebranded as HP with Mellanox firmware, then you need the -allow_psid_change flag. Otherwise, you do not need it.
    Command: flint -d (device)  -i mlnx_firmware.bin -allow_psid_change burn
  14. Reboot and run the query full command again to make sure the flash worked. Also verify that the PSID has changed if you were cross-brand flashing. 
The process for flashing a branded firmware version to the same brand card is a simplified version of the above. The major differences are that you will most likely have the bin file pre-made, and you do not need the -allow_psid_change flag since you are not changing the PSID. This is what I did to downgrade my 2.11.2012 cards to 2.11.2010. 
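
With several cards to go through, it helps to generate the same command list for each one. This is a minimal sketch that only prints the commands from steps 4-9 and the same-brand flash described above; it runs nothing. The function names and the output file naming scheme are made up, and the device paths are whatever mst status reports on your machine.

```shell
#!/bin/sh
# hca_backup_plan: print the backup commands (steps 4-9) for one card.
# Usage: hca_backup_plan CR_DEVICE PCICONF_DEVICE TAG
# CR_DEVICE is the pci_crX device, PCICONF_DEVICE the pciconfX device,
# TAG a label used to keep each card's files apart (illustrative assumption).
hca_backup_plan() {
    cr_dev=$1
    conf_dev=$2
    tag=$3
    echo "flint -d $cr_dev query full > ${tag}_flint_query.txt"
    echo "flint -d $cr_dev hw query > ${tag}_flint_hwinfo.txt"
    echo "flint -d $cr_dev ri ${tag}_firmware.bin"
    echo "flint -d $cr_dev dc ${tag}_firmware.ini"
    echo "flint -d $cr_dev rrom ${tag}_rom.bin"
    echo "mlxburn -d $conf_dev -vpd > ${tag}_vpd.txt"
}

# hca_flash_plan: print the verify, query, and burn commands for a same-brand
# .bin image. No -allow_psid_change, because the PSID stays the same.
hca_flash_plan() {
    cr_dev=$1
    image=$2
    echo "flint -i $image verify"
    echo "flint -i $image query full"
    echo "flint -d $cr_dev -i $image burn"
}
```

Run the backup plan for every card before burning anything, so there is always an original image to go back to.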

UPDATE:

If you have the same problem I have, where a newer firmware version is preventing your server from booting, it may have something to do with BAR-space. In post 12 of link 6, Andreas mentions what the Sun release notes say about firmware 2.11.2012. The only difference is that the BAR-space has been increased from 8MB to 128MB. He also quotes what to do to make Sun servers boot with the new firmware. Unfortunately, my motherboard's BIOS does not have these settings. They sound similar to something I came across with the Xeon Phi's, though. To get a Xeon Phi Coprocessor to work, you need a motherboard with something like "above 4G decoding", "large PCI MMIO", or "large BAR support". I tried enabling "above 4G decoding" under PCIe configuration in my BIOS, and it worked! The 2.11.2012 firmware card allowed boot, and ibstat showed link up with rate 40 (QDR). So if your motherboard allows for the adjustment of BAR size, then try that before messing with firmware flashing.
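
On a machine where the card does boot, the BAR sizes and the link rate can be read back from the OS. A minimal sketch, assuming the usual lspci -v layout ("Memory at ... [size=128M]") and the "Rate: 40" line that ibstat prints for a QDR link; the function names are made up.

```shell
#!/bin/sh
# bar_sizes: read `lspci -v` output for a device on stdin and print the size
# of each memory BAR, one per line (e.g. 1M, 128M).
bar_sizes() {
    sed -n 's/.*Memory at .*\[size=\([0-9]*[KMG]\)].*/\1/p'
}

# link_is_qdr: read ibstat output on stdin; succeeds if a port reports Rate: 40.
link_is_qdr() {
    grep -q 'Rate: 40'
}
```

Usage: lspci -v -d 15b3: | bar_sizes (15b3 is the Mellanox PCI vendor ID), and ibstat | link_is_qdr && echo "QDR link up".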

From that same thread: We've determined that these Sun firmware versions (2.11.2010, 2.11.2012) are the latest. It sounds like they created the 2.11.2012 version specifically for motherboards that allow for BAR-space increases, and that 2.11.2010 should be used for all normal motherboards. If we ignore the special large BAR-space firmwares, then Sun, like Mellanox, does not provide more than 1 firmware version per card type. It's also likely that the Sun numbering scheme is the same as Mellanox's, meaning that the firmware they provide is much newer than the version Mellanox does (2.9.1000). Also interesting is that the newer MHQH29C has the same Mellanox firmware listed (2.9.1000). The latest one HP lists for their equivalent to the MHQH29B/C cards is 2.9.1530. Digging through IBM's release documentation for their proprietary updater, their equivalent card's latest firmware seems to be 2.9.1000. The latest Mellanox firmware used commonly here to enable RDMA for Windows on these cards is 2.10.720. So Sun/Oracle's seems to be the most recent for the MHQH29B. Interesting.

Moving forward: I'm going to do back-to-back performance testing between cards with 2012 firmware with above 4G decoding enabled, and between cards with 2010 firmware (no above 4G decoding). Then I will either upgrade or downgrade all of the cards so they are equivalent.

One thing is bothering me, though. My desktop does not support above 4G decoding. In fact, that's why I didn't try that setting in my SM BIOS in the first place. So why does a 2.11.2012 firmware card work in it? The only thing I can think of is that it inherently allows 128MB BAR support.

Updating the IPMI and BIOS of a Supermicro X9 Motherboard

I have a 4 node Supermicro 6027TR-HTR (motherboards: X9DRT-HF) server, and I thought I needed to update IPMI and the BIOS for it. The following is a process for doing that, but there are multiple ways.

Supermicro SMT ATEN X9 (there are many different versions, I've only verified this process on mine) IPMI update: 
  1. Hook an ethernet cable up to the IPMI LAN port and connect it to your computer.
  2. Boot server into BIOS (hit "DEL") 
  3. Go to IPMI tab and note down the IP address info. If it's set to DHCP, set it to static, and enter in the IP, subnet, and gateway (make sure you use 3 digits for each entry, so add extra 0's). If you had to change the IP info, save changes and reset (reboot). 
  4. On your computer, set up the ethernet to work with the static IP info that you just noted down from the server. You'll need to set an IP on the same subnet, and use the same subnet mask and gateway.
  5. Go to a browser on your computer and type in the static IP. The IPMI login screen should show up. Log in with the username and password. The default is ADMIN/ADMIN. 
  6. Now you should see the browser interface for your server. You can do lots of things with this, but the thing we want to do is update the IPMI firmware.
  7. Go to your Supermicro motherboard's web page and download the most current SMT or IPMI firmware. In this zip, there should be instructions for your operating system. For mine, there was a Word document with pictures of how to use the browser to update the firmware. The following steps are the text versions of this.
  8. First, check the "Firmware Revision". Mine was 2.26. The firmware folder I downloaded was SMT_X9_352, which means it's firmware version 3.52, so mine was definitely out of date.
  9. Go to Maintenance->Firmware Update. 
  10. Enter update mode.
  11. Browse to the downloaded firmware file. Mine was SMT_X9_352.bin. Upload.
  12. Click OK and upload firmware
  13. Uncheck the preserve configuration box. This is apparently important.
  14. Click start upgrade. 
  15. During this process, the browser will lose connection. The IPMI system will reboot, though not the server. The browser will not come back up, though, because the static IP was reset to DHCP. After a few minutes, shut down server manually.
  16. Go to Step 2. Repeat steps 3, 5, 6, and 8. You should see new firmware in the BIOS and in the Web-GUI. 
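
The "3 digits for each entry" rule in step 3 is easy to get wrong by hand. Here is a minimal sketch that pads an address the way the BIOS fields want it. The function name pad_ip is made up, and it assumes the input octets are written without leading zeros.

```shell
#!/bin/sh
# pad_ip: zero-pad each octet of a dotted-quad IPv4 address to 3 digits.
# Assumes octets have no leading zeros already (printf would read those as octal).
# The body runs in a subshell so the IFS change does not leak into the caller.
pad_ip() (
    IFS=.
    set -- $1
    printf '%03d.%03d.%03d.%03d\n' "$1" "$2" "$3" "$4"
)
```

pad_ip 192.168.1.10 prints 192.168.001.010. For step 4 on a Linux machine, something like sudo ip addr add 192.168.1.20/24 dev eth0 puts your computer on the same subnet; the address and the interface name here are placeholders, so use whatever matches your setup.
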
Now that IPMI is updated, you can update the BIOS. 
  1. Go to your motherboard's webpage and download the latest BIOS package. In mine, there was a text document with instructions. In that text document, there was a warning that the IPMI firmware revision must be 2.0 or higher before upgrading the BIOS. 
  2. I used RUFUS to create a DOS bootable USB device and copied over all of the files that came with the zip download.
  3. Boot the server with this USB drive. It should boot to a DOS prompt. Type "DIR" to make sure all of the bios files you copied over are there.
  4. Type "AMI.bat BIOSNAME.XXX" to start the BIOS Update. 
  5. When it is complete (you will get C:\> DOS prompt again), shutdown server, unplug AC, clear the CMOS (pull battery, short jumper, put battery back), plug in AC, power on.
  6. Go to BIOS, load default settings, save and reset.
Done.

Friday, May 11, 2018

Xeon Phi Co-processor Testing, part 2

I purchased ~140 7-series Xeon Phi Co-processors from a reseller who got them from a laboratory that was liquidating them. I tested them in a similar manner to part 1, but with an ASUS Z10PA-D8 instead of an HP DL380p server. Unfortunately, work stations don't have cooling. That's why I designed and tested the fan adapters, which are now for sale on eBay.

New single Phi adapter design. The blowers are a little longer,
but they're flatter than the axial fan single Phi adapters.

It took about 5 days to test them all, primarily because of how long it takes the ASUS computer to boot. I couldn't figure out how to make it boot faster unfortunately.

Ghetto test setup

By the end, I had the rhythm down:

1. Install power cables
2. Install fan adapter
3. Plug in Phi
4. Boot
5. Check lspci. If this fails, shut down and put the card in the for-parts box.
6. If pass lspci, run firmware update bash script. This has all the commands to update firmware and reboot.
7. Once rebooted, run post firmware update bash script. This starts mpss, runs miccheck, etc.
8. Wrap up card and put in working box.
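
The pass/fail decision in step 5 can be sketched in a few lines. This is not the actual test script, just the triage logic, and it assumes the system's PCI ID database names the card with "Xeon Phi" in the lspci output. The function name phi_triage is made up.

```shell
#!/bin/sh
# phi_triage: read lspci output on stdin and print where the card goes next:
# "update-firmware" if a Xeon Phi is listed, "for-parts" otherwise.
phi_triage() {
    if grep -qi 'xeon phi'; then
        echo "update-firmware"
    else
        echo "for-parts"
    fi
}
```

Usage: lspci | phi_triage.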

I tried to test two at once with a dual Phi setup. It wasn't too difficult to get them installed with the fan adapter I designed, but handling two Phi's at once was unwieldy, so I went back to doing one at a time.

Some cards prevented computer power on, which was kind of scary...means there was a bad power fault somewhere. Some would be recognized by lspci, but would fail a firmware update...these usually had the F2 post code error, which is related to the memory system. I couldn't figure out how to fix that. A bunch just weren't recognized by lspci. But the majority were in good working condition, so I got lucky there.

Here's the homelabsales post and table if you're interested in buying one or more. I managed to sell about 40 of them in the ~1.5 weeks I had in FL, mostly broken ones to a museum in Oregon, haha. I put them in storage bins inside so they won't corrode in the FL humidity. I'll be listing them again a few months from now.