Sunday, February 3, 2019

More fun with systemd services

I obtained some software that I wanted to install on the cluster. This software requires the installation of a license server as a systemd service. Unfortunately, following the instructions that came with this software resulted in a failed installation of the license server. I'm going to detail the steps I took to troubleshoot this problem here.

These commands were very useful for figuring out what was going on:
  • systemctl stop [or start or status] (name).service
  • journalctl -u (name).service
  • ps aux | grep (command)
  • gedit /var/log/messages
  • Also just running the command with its arguments outside of the service
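The checks above can be bundled into one small helper. This is only a sketch: the function name diagnose_service and the unit name example.service are placeholders, not anything that ships with the license server.

```shell
# Sketch: gather the usual troubleshooting output for one systemd unit.
# "diagnose_service" and "example.service" are hypothetical names.
diagnose_service() {
    unit="$1"      # e.g. example.service
    pattern="$2"   # part of the command line to look for in the process list

    # Current state plus the most recent log lines, without opening a pager.
    systemctl --no-pager status "$unit" || true

    # Everything journald recorded for this unit since the last boot.
    journalctl --no-pager -b -u "$unit" || true

    # Is anything matching the command still running? (grep -v grep hides
    # the grep process itself.)
    ps aux | grep -- "$pattern" | grep -v grep || true
}

# Usage (run as root so the journal is readable):
#   diagnose_service example.service /absolute/path/to/command
```

The "|| true" after each step keeps the helper going when a check reports a failure, which is the usual case when the service is broken.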
The license server came with a .sh install script that offers very little debugging output. It installs the program, creates the systemd service file in /usr/lib/systemd/system, and then creates a symlink to that file in /etc/systemd/system/multi-user.target.wants/. It then attempts to start the service, which failed.
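To make that layout concrete, the sketch below recreates it in a scratch directory instead of touching the real /usr/lib and /etc trees. The unit name example.service is a placeholder. The link lives in multi-user.target.wants and points back at the unit file, which is the same kind of link that systemctl enable creates for a unit whose [Install] section says WantedBy=multi-user.target.

```shell
# Sketch: mimic the installer's file layout inside a throwaway directory.
# "example.service" is a hypothetical unit name.
root=$(mktemp -d)
unit_dir="$root/usr/lib/systemd/system"
wants_dir="$root/etc/systemd/system/multi-user.target.wants"
mkdir -p "$unit_dir" "$wants_dir"

# Empty stand-in for the service file the installer writes.
: > "$unit_dir/example.service"

# The link is created in the .wants directory and points at the unit file.
ln -s "$unit_dir/example.service" "$wants_dir/example.service"

# Show where the link points.
ls -l "$wants_dir"
```

On the real system, ls -l /etc/systemd/system/multi-user.target.wants/ shows the same kind of entry for every enabled service.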

As with any debugging that involves networking, I started by stopping the firewall: systemctl stop firewalld

The first errors I got were related to a missing /lib/ld-lsb.so.3. This is because the license server is a 32-bit program and CentOS 7 is 64-bit, so the 32-bit libraries aren't installed by default. Theoretically, you should be able to just install redhat-lsb.i686, but that caused a bunch of unresolvable conflicts on my system, particularly with the nvidia drivers. Instead, I ran yum provides /lib/ld-lsb.so.3 and read the output to figure out which package contains that file. The answer will differ between systems. For me, it meant running yum install redhat-lsb-core-4.1-27.el7.centos.1.i686, which should pull in a bunch of other packages, including libstdc++.i686, glibc.i686, and libgcc.i686. No unresolvable dependencies, yay. This solved that problem.
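A quick way to confirm that a binary really is 32-bit is to look at the fifth byte of its ELF header (1 means 32-bit, 2 means 64-bit). The helper below is a sketch of that check. The name elf_class is made up, and the path in the usage comment is a placeholder.

```shell
# Sketch: report whether a file is a 32-bit or 64-bit ELF binary.
# "elf_class" is a hypothetical helper name.
elf_class() {
    # The first four bytes of an ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        return 1
    fi
    # Byte at offset 4 (EI_CLASS): 1 = 32-bit, 2 = 64-bit.
    case "$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')" in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown" ;;
    esac
}

# Usage:
#   elf_class /absolute/path/to/command
# If that prints 32-bit on a 64-bit install, look up which package ships
# the missing loader:
#   yum provides /lib/ld-lsb.so.3
```

The file command reports the same information in a friendlier form; the helper only avoids depending on it.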

The next error occurred because the systemd service file that the install script creates was not formatted correctly. This took a few hours to figure out. In a previous post, I mentioned creating a service file for a simple script that runs at startup to force the CPU mode to "performance". That service file was a "oneshot" type, i.e. it runs once and is done, and the ExecStart command was a simple call to a bash script. In this service file, the type is "forking" because the server spawns more processes, and the ExecStart command requires multiple arguments. systemd does not run ExecStart lines through a shell: it splits the line on whitespace itself, treats the first word as the executable (which must be an absolute path), and passes the remaining words to it as arguments. Whoever wrote the install script for this program thought that you could just enclose the whole command in quotes, which honestly makes a lot of sense...but nope, can't do that: quoting turns the entire string into a single word, so systemd treats the command plus its arguments as the name of the executable, and that fails. Here's what worked for me in the [Service] block:
Environment='ARGS=-c whatever -l whatever2'
ExecStart=/absolute/path/to/command $ARGS
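That is the format that worked for me. Adding quotes after the = around the arguments causes failure, as does putting quotes around the command. Written as $ARGS, the variable is split on whitespace into separate arguments; written as ${ARGS}, it would be passed as one single argument. Another option is to create an EnvironmentFile containing the text between the ' ', but that's overkill for only a couple of arguments.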
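For context, here is a sketch of how those two lines might sit inside a complete unit file. It writes the file to a scratch location rather than to /usr/lib/systemd/system, and the description, command path, and arguments are all placeholders. The real file generated by the installer will contain other settings.

```shell
# Sketch: write an example unit file to a scratch location for inspection.
# The description, path, and arguments are placeholders.
unit_file="$(mktemp -d)/example.service"

cat > "$unit_file" <<'EOF'
[Unit]
Description=Example license server (placeholder)

[Service]
Type=forking
# One variable holds all of the arguments.
Environment='ARGS=-c whatever -l whatever2'
# Unbraced $ARGS is split on whitespace into separate arguments.
ExecStart=/absolute/path/to/command $ARGS

[Install]
WantedBy=multi-user.target
EOF

cat "$unit_file"

# On a machine with systemd, the file can be linted with:
#   systemd-analyze verify "$unit_file"
# After editing a unit that is already installed, reload and retry:
#   systemctl daemon-reload
#   systemctl start example.service
```

The heredoc delimiter is quoted ('EOF') so that the shell leaves $ARGS alone and the literal text reaches the file.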

The next problem was a host communication error. I uncovered it by attempting to ping localhost, 127.0.0.1, and "headnode", which is the name of the node I was installing this software on. I had my network adapters switched off. I could still ping localhost and 127.0.0.1, but because of the way my /etc/hosts file is set up for openmpi, I can't have "headnode" as an alias for localhost. Instead, "headnode" is an alias for the first node of the 192.168.2.X subnet, which is what I have openmpi use for administrative messages, and what I use for ssh, internode commands, etc. Because I had the network adapter associated with that subnet disabled, ping couldn't reach "headnode". Also, because this license server uses the name of the computer as the host name, that name has to resolve to a reachable address. When I switched that adapter on, headnode was ping-able again and that solved the communication error.
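When a name like this misbehaves, it helps to separate "what address does the name map to" from "is that address up". The sketch below handles the first half by reading a hosts-format file. hosts_lookup is a made-up helper name, and it only consults the file it is given, not DNS.

```shell
# Sketch: find which address a name maps to in a hosts-format file.
# "hosts_lookup" is a hypothetical helper name.
hosts_lookup() {
    # $1 = host name or alias, $2 = hosts file (defaults to /etc/hosts)
    awk -v name="$1" '
        { sub(/#.*/, "") }   # drop comments
        {
            # Field 1 is the address; every later field is a name or alias.
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}

# Usage:
#   hosts_lookup headnode     # which address does the name map to?
#   ip addr show              # is that address assigned to an adapter that is up?
#   ping -c 1 headnode        # can it actually be reached?
```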

Finally, it was time to add the firewall back in. I stopped the license server service, turned on the firewall, and surprisingly the license server service didn't fail. My first guess was that it uses the subnet that "headnode" is on for communication, which I had to whitelist for slurm/openmpi (talked about in a previous post) because there was no way to specify a port range. That isn't necessarily a security problem because that subnet is on an adapter isolated from the internet (a private internal network). According to the license server log, it listens on one fixed TCP port (the log doesn't mention other networking info) and spawns another process that uses a random TCP port. This means that, if I hadn't already whitelisted that subnet, I'd probably have to for this, because I don't think there's a way to pin down the currently random TCP port.

I ran the commands netstat -plnt and lsof -c (program) -a -i. The results showed that it was listening on :::XXXXX (where XXXXX is the port number). The :: means every IPv6 address, and because Linux by default lets an IPv6 socket bound to :: also accept IPv4 connections (as IPv4-mapped addresses, i.e. "dual-stack" behavior), it effectively covers every IPv4 address too. In the lsof output, the server has established connections to its spawned process on localhost:XXXXX as well as on other ports.

After some digging, it seems that firewalld does not police the loopback interface (lo) by default, i.e. it's not even assigned to a zone, so localhost (or 127.0.0.1) traffic is not subject to firewall rules. THAT is why turning the firewall on didn't affect the license server: it only had localhost connections. It also means that no requests from external sources to the ports it's listening on should get through the firewall, except from the whitelisted subnet (which is the compute nodes' subnet).
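To pick the wildcard listeners out of a long netstat listing, a small filter helps. This is a sketch that assumes the usual netstat -plnt column layout (local address in the fourth column); wildcard_listeners is a made-up name.

```shell
# Sketch: list TCP ports that are listening on every address.
# "wildcard_listeners" is a hypothetical helper; it reads `netstat -plnt`
# style output on stdin.
wildcard_listeners() {
    # Field 4 is the local address: ":::PORT" is the IPv6 wildcard,
    # "0.0.0.0:PORT" is the IPv4 wildcard.
    awk '$1 ~ /^tcp/ && $4 ~ /^(:::|0\.0\.0\.0:)[0-9]+$/ {
        port = $4
        sub(/^.*:/, "", port)   # keep only what follows the last colon
        print $1, port
    }'
}

# Usage (as root, so the PID/program column is filled in):
#   netstat -plnt | wildcard_listeners
#   sysctl net.ipv6.bindv6only                 # 0 means :: also accepts IPv4
#   firewall-cmd --get-active-zones            # check whether lo is listed
#   firewall-cmd --get-zone-of-interface=lo    # check whether lo has a zone
```

A port that shows up here is reachable on every interface as far as the program is concerned, so whether outside machines can connect comes down to the firewall zones.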

The next step was to install the program that uses the license server. The installation seemed to go fine, but when I tried to run the program, it said that it could not find the license. During installation, I had specified the license server as 127.0.0.1, which should be localhost. I tried changing it to localhost, too, but that didn't work either. To get it to work, I had to change it to (port)@localhost, where (port) is the listening port number discovered above, e.g. 11111@localhost. For the compute nodes, I expect that I will have to use (port)@headnode because the license server will be on the headnode. Anyways, netstat and lsof don't show any new connections, so I guess it just connects to the server once at startup to make sure the license is there.

Hopefully this helps someone in the future with debugging.

I created an environment module for the program that uses the license server. I decided not to enable the license server systemd service, which means I'll have to start it manually before launching the program. If I were planning to use the program a lot, I would enable it.
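Since the service is not enabled, a small wrapper can take care of the "start it first" step. This is a sketch, not what I actually run: run_with_service, example.service, and the program path are placeholders.

```shell
# Sketch: start a unit only if it is not already active, then run a command.
# "run_with_service" and "example.service" are hypothetical names.
run_with_service() {
    unit="$1"
    shift
    # is-active exits 0 only when the unit is currently active.
    if ! systemctl is-active --quiet "$unit"; then
        systemctl start "$unit" || return 1
    fi
    # Run whatever command was passed after the unit name.
    "$@"
}

# Usage (starting a unit needs root):
#   run_with_service example.service /absolute/path/to/program
# To have the service start at every boot instead:
#   systemctl enable example.service
```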

It turns out that the version of the software I have is not multi-node (MPI) capable. It has a remote server option that's convenient if you don't want to run on the computer you launch from, but it can't link multiple computers/servers into a cluster. I think I'll just leave it on the headnode for now.


