Interviewing for a successor

Posted on Fri 11 May 2018 in work

I left my job at the ICG in March 2018. One of my last tasks there was helping to search for a successor to whom I could hand over my responsibilities with as few worries as possible. I took the same job posting that had been used to announce the opening when I applied and updated it with new phrasing. I wanted to emphasize that a lot of learning can be done on the job. Experience in the comprehensive list of open source technologies the institute uses was a definite plus, but I was certain that a minimal understanding of Linux, good written and spoken English, and the willingness to learn were enough to grow into the job. After all, the people who apply usually do not match every qualification on your list but bring some that are not on it and that help them anyway.

I wanted to make sure that we had as objective a method for judging the applicants as possible. Therefore I put together a questionnaire containing two real-life scenarios as well as a short list of bonus points. These questions were discussed with the applicants, and I decided which topics were sufficiently answered. I held the entire technical part of each interview.

I want to point out that my goal was not, as some of my colleagues joked, to create a test which one could “pass” or “fail”. I simply wanted to judge applicants by a more meaningful measure than “they were good” or “they were ok”. I hoped that my scenarios would indicate whose technical knowledge was better when applicants were subjectively close to each other.

Section 1 - VM diagnosis & rescue

You have a physical machine running a Hypervisor (e.g. Xen) and a virtual machine running a Debian based Linux distribution (e.g. Ubuntu). You notice that the VM has stopped checking in with your monitoring solution. What do you do?

- contact via SSH
- check if the machine is listening (e.g. `ping`, `nmap`)
- check if the machine is running (e.g. `xl list`, `xl top`)
- send out notice that you're working on said machine (*bonus*)

The initial step of the diagnosis covers checks one can make really quickly. I accepted answers that named command line utilities other than the ones suggested if they served a similar purpose (e.g. `VBoxManage` would be fine). Bonus items give additional points that can raise the score above the maximum points of a given question.
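The "is the machine listening" check can also be scripted. The following is only a minimal sketch in Python, not something that was part of the questionnaire: it assumes that a successful TCP connection to the SSH port (22 by default) is a good enough signal, and the function name `port_is_open` is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: an accepted TCP connection means the service is listening.
    It says nothing about whether a login would actually work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling `port_is_open("vm.example.org")` (a placeholder host name) returns `False` for a VM that is down, but also for one that is merely firewalled, so it complements rather than replaces a look at `xl list` on the host.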

You have established that the machine is indeed not running. When you try to restart the machine via the hypervisor, it shows activity in the hypervisor output, but it is neither accessible remotely (via SSH) nor does it show up in the monitoring solution. What are your next steps?

- check log files
  - host logs => there is nothing relevant in them
  - guest logs
  - centralized logging solution (*bonus*)
- try starting the machine with more verbose output from the hypervisor (*bonus*)
- check with some tool that displays the screen of the VM (e.g. VNC with SSH forwarding, `virt-manager`)

The second step is trying to figure out the cause of the issue after having verified the issue in step one.

You realize that the machine is not booting. It looks like a problem with GRUB but you are not entirely sure. You’d like to access the guest logs, just to be sure. The guest’s entire disk is an LVM logical volume mounted directly into the VM by the hypervisor. How do you proceed?

- find a tool to mount the logical volume on the host
  - read-only (*bonus*)
  - `kpartx` (*bonus*)
- check the logs in `/var/log/syslog` and similar files in `/var/log`. Check `/var/log/dpkg.log`.

Step number 3 is to make reasonably sure that the problem is caused by GRUB and has not been triggered by something else entirely.
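To illustrate why `/var/log/dpkg.log` is worth a look: once the guest's volume is mounted on the host, the log can be filtered for recent changes to GRUB packages. This is a rough sketch under assumptions, not a tool we used. It assumes the usual `dpkg.log` layout of date, time, action and package name separated by spaces, and the function name `grub_related_changes` is hypothetical.

```python
from typing import Iterable, List


def grub_related_changes(lines: Iterable[str]) -> List[str]:
    """Return dpkg.log lines that record a change to a GRUB package."""
    actions = {"install", "upgrade", "remove", "purge"}
    hits = []
    for line in lines:
        fields = line.split()
        # Assumed layout: date, time, action, package[:arch], versions...
        if (
            len(fields) >= 4
            and fields[2] in actions
            and fields[3].startswith("grub")
        ):
            hits.append(line.rstrip("\n"))
    return hits
```

With the guest mounted at a placeholder path such as `/mnt/guest`, one would pass `open("/mnt/guest/var/log/dpkg.log")` to the function. A GRUB upgrade shortly before the VM stopped booting would support the suspicion, while an empty result points towards looking elsewhere.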

It now seems more likely than ever that this is a GRUB problem. How do you proceed to try and fix the VM?

- boot from ISO (or remount read-write on host)
- `boot-repair` (*bonus*)
- reinstall GRUB

The last step of the first scenario deals with an actual attempt at fixing the VM. The infrastructure at ICG is built in a way that makes repairs more feasible than spinning up and configuring new machines without data loss.

Open question: What do you think could be the cause of such an issue?

No points were given for this question, but I noted down what the applicants came up with and commented on the likelihood of their ideas, so they had some immediate feedback.

Section 2 - Server best practices

You have a service that you need to provide to the whole internet (or rather, your colleagues who are currently abroad). It has at least one component accessible by a web browser and one more component (e.g. SSH, IMAP, POP) that needs to be protected. How would you make reasonably sure that things are protected?

- protect the web service with a TLS certificate [and encryption]
- redirect port 80 to 443 to always enforce encryption
- implement a rate limit against brute force attacks (e.g. `fail2ban`, builtin software)
- have the server update the software on its own (or have a way to be notified of updates, e.g. mail, RSS)
- implement a backup strategy [and test it]
- provide VPN access or suggest using TU VPN and restrict firewall settings (*bonus*)
- **set up monitoring for aforementioned things**

The server best practices section was my attempt to get a feel for what the applicant knows about operations. While the previous scenario revolved around troubleshooting, this one is focused on knowledge and understanding of running servers in production. This was a question where I almost always received additional answers to the ones I hoped for.
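To make the rate-limiting item from the list above a bit more concrete, here is a small sketch of the underlying idea: count failed logins per address within a time window and block once a threshold is reached. This is not how `fail2ban` is implemented, only an illustration. The class name `FailureTracker` and its default values are invented for this example.

```python
from collections import defaultdict, deque


class FailureTracker:
    """Track failed logins per address in a sliding time window."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 600.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)

    def record_failure(self, address: str, timestamp: float) -> bool:
        """Record a failed login; return True if address should be blocked."""
        attempts = self._failures[address]
        attempts.append(timestamp)
        # Drop attempts that are older than the window.
        while attempts and timestamp - attempts[0] > self.window_seconds:
            attempts.popleft()
        return len(attempts) >= self.max_failures
```

A real setup would feed this from the service's authentication log and turn a `True` result into a temporary firewall rule, which is the part that existing tools already handle well.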

Section 3 - Short questions

Do you have any experience with:

- Git
- Continuous integration (e.g. GitLab CI, Jenkins)
- Configuration management (e.g. Puppet, Chef, Salt)
- standard monitoring tools (e.g. Nagios, Sensu, Elastic products)
- NFS and auto-mounting
- web servers (e.g. Apache, Nginx)
- debugging software not written by you (e.g. Python code that shipped with your distribution)

This last section of questions aims to establish in which topics the applicant might need training in order to fully understand and utilize the existing ICG infrastructure.

Conclusion

After careful review of all applicants, their technical skills and their demonstrated understanding of the systems in use, I gave an informed recommendation on whom to hire. I had a very short opportunity to introduce my successor to the most critical systems. For everything else they will have to rely on the documentation I wrote, their team members and their own skillset.

I certainly wish them all the best.