- Head node only, n=20: 1.12 iter/s
- Compute node only, n=20: 1.015 iter/s
- Head node + node002, n=40, 1 GbE: 1.75 iter/s
- Head node + node002, n=40, QDR InfiniBand: 2.18 iter/s
- All 5 nodes, n=100, 1 GbE: 1.56 iter/s
- All 5 nodes, n=100, QDR InfiniBand: 5.24 iter/s
You can see that the 1 Gb Ethernet link is definitely the bottleneck. In fact, it's so restrictive that going all the way to 5 nodes actually hurts performance compared to 2. My guess is that throughput over the 1 GbE link would peak at around 3 nodes. The QDR InfiniBand link is a different story entirely. It shows essentially perfect scaling (the sum of the head node's and each compute node's standalone iter/s) up to 5 nodes, and it'd probably continue to scale well to many more nodes, particularly for larger meshes.
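Here's a quick back-of-the-envelope check of that "perfect scaling" claim in Python, using the numbers from the list above and assuming all four compute nodes match node002's standalone rate:

```python
# Ideal throughput for N nodes is just the sum of each node's standalone iter/s.
# Assumption: every compute node performs like node002 (1.015 iter/s alone).

head_ips = 1.12       # head node alone, n=20
compute_ips = 1.015   # one compute node alone, n=20

def ideal_ips(num_compute_nodes):
    """Perfectly scaled iter/s for the head node plus num_compute_nodes."""
    return head_ips + num_compute_nodes * compute_ips

for label, nodes, measured in [
    ("head + node002, QDR IB", 1, 2.18),
    ("all 5 nodes, QDR IB",    4, 5.24),
]:
    ideal = ideal_ips(nodes)
    print(f"{label}: measured {measured:.2f} iter/s, "
          f"ideal {ideal:.2f} iter/s, efficiency {measured / ideal:.0%}")
```

Both InfiniBand runs come out at or slightly above 100% of the ideal rate, which is what I mean by perfect scaling.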
Feels good man...
Update (3 months later): The n=100 result is not realistic. Coincidentally, a (corrected-method) n=108 case with FDR InfiniBand ended up with almost the same iter/s (5.29), so just imagine the caption replaced. See this post for an explanation.
Still have some stuff to do:
- Clean up the wiring
- Get everything situated in the soundproof cabinet
- Fix the heat extraction system if it isn't sufficient
- Fix the RAID1 data array in the headnode so it stops failing
- Compile these blog posts into step-by-step guides
- Use the cluster