Improving our Xen Usage

Posted on Tue 31 May 2016 in work • Tagged with Institute for Computer Vision and Computer Graphics

The software we use at ICG for virtualizing servers is Xen. That's fine with me because it hasn't made my life unnecessarily hard yet. There are, however, some things that could be done better - especially when handling things with Puppet.

How it used to work

When I first looked at the infrastructure for our Xen guests, the configuration files were located in a subdirectory of /srv that turned out to be an NFS share (mounted with the hard option, because that's the default). This was the same for all our Xen hosts apart from one, which had local configurations symlinked into a similar folder.

Inside these folders were many leftover configuration files of VMs that had been retired long ago, which made finding the currently used files an annoying task.

The main reason I chose to rework this was the NFS mount - when the host providing the NFS share wouldn't reboot during a standard maintenance I had no configuration for any guest on all but one of our Xen hosts. That was an inconvenient situation I hoped to avoid in the future.

How it works right now

One of my issues with the previous solution was that it left important configuration files in a non-standard path instead of /etc/…. Furthermore I wanted to use version control (Git, to be precise) in order to keep the directory clean and current while also maintaining the file histories.

I integrated everything into our Git versioned Puppet code by writing a class which installs the xen-hypervisor-4.4-amd64 package, maintains the /etc/xen/xl.conf and /etc/xen/xend-config.sxp files as well as the directory /etc/xen/domains - the latter of which is a single flat directory where I keep all Xen guest configuration files.
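
Roughly sketched, the class looks like this (not the actual code; the module name, the file sources and the recursive purge of the domains directory are assumptions for illustration):

class xen_host {
  package { 'xen-hypervisor-4.4-amd64':
    ensure => installed,
  }

  file { '/etc/xen/xl.conf':
    ensure => file,
    source => 'puppet:///modules/xen_host/xl.conf',
  }

  file { '/etc/xen/xend-config.sxp':
    ensure => file,
    source => 'puppet:///modules/xen_host/xend-config.sxp',
  }

  # single flat directory holding all guest configuration files
  file { '/etc/xen/domains':
    ensure  => directory,
    recurse => true,
    purge   => true,
    source  => 'puppet:///modules/xen_host/domains',
  }
}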

The files are named according to a special syntax so that it's possible to see at a glance where the domains are supposed to run (e.g. 02-example_domain.cfg).

While further improving our Xen hosts with additional monitoring, unattended upgrades and optimizing the DRBD running on some of them I soon found out that this solution wasn't great either. The flat directory prevented me from writing simple Puppet code to use Xen's /etc/xen/auto directory to have certain guests automatically started (or resumed, depending on circumstances) on boot of the host.

How the suggested solution looks

Since Puppet is not a scripting language, your established way of thinking (mine being, "I know, I'll use a 'for' loop") often can't solve the problem, and you either have to research new ways of working around it or find idiomatic ways to solve it.

I needed a way to make sure the right Xen configurations would end up in each host's /etc/xen/auto without them trying to start configurations for other hosts. Given the naming scheme this could be as easy as the following snippet.

# NOTE: untested and only here for illustration purposes
#       You need to get the host number from somewhere
#       but that wouldn't be the main issue.

exec { 'link-xen-configurations':
  refreshonly => true,
  command     => '/usr/bin/find /etc/xen/domains -type f -name "NUMBER-*.cfg" | /usr/bin/xargs -I FILENAME -n1 -t ln -f -s FILENAME /etc/xen/auto/',
  provider    => 'shell',
  user        => 'root',
}

Of course you would need to remove existing links first, and using execs is a messy business after all. Besides - something I hadn't touched on yet - there are also VM configurations with two prefixes to signify on which hosts they can run (e.g. 01-03-other_example.cfg), because DRBD syncs their contents on the block level between two hosts.

Given this, it's even more complex to build such a system well - in a way that won't break in spectacular fashion the first time you look away after a deploy.

My plan is to create host-specific folders in our Puppet code and have Puppet symlink those, since the $::hostname variable provided by Puppet's Facter makes this extremely easy. In addition, disentangling the multiple-host configurations will be necessary - this will avoid DRBD-capable hosts starting the same VM at the same time.
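
A minimal sketch of that idea (untested, and the per-host directory layout is an assumption on my part):

# Untested: pick the host-specific folder via Facter's hostname and
# link it into place so Xen's autostart mechanism sees exactly these guests.
file { '/etc/xen/auto':
  ensure => link,
  target => "/etc/xen/domains/${::hostname}",
  force  => true,
}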

I might combine this with changing the device specified in the Xen configurations:

-disk = ["phy:/dev/drbd13,xvda,w"]
+disk = ["drbd:myresource,xvda,w"]

"This will direct Xen to put the DRBD resource named 'myresource' into the Primary role, and configure it as device xvda in your domU." - /etc/xen/scripts/block-drbd (slightly changed to refer to the whole disk instead of a partition)

The interesting thing here is that the resource will automatically become primary when the Xen domain is started - there is no need to have DRBD itself promote the resource on a particular node at startup; this is done on demand as soon as a Xen guest requires it.
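
Purely for illustration, a guest configuration using that disk specification might look like the hypothetical example below (memory, vcpus and the bridge name are invented):

# hypothetical /etc/xen/domains/01-03-other_example.cfg
name       = "other_example"
memory     = 2048
vcpus      = 2
bootloader = "pygrub"
vif        = ["bridge=xenbr0"]
# DRBD resource instead of a fixed /dev/drbdNN device
disk       = ["drbd:myresource,xvda,w"]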

In time - with DRBD 9 - it might even be reasonable to have all VM hosts be able to run all guests due to cluster-mode block syncing.


On changing hard disks

Posted on Fri 22 January 2016 in work • Tagged with Institute for Computer Vision and Computer Graphics

Now, I might have mentioned in the past that despite working as a system administrator, I dislike dealing with actual hardware and prefer to touch machines only once they are SSH-ready or, at the earliest, once an operating system can be installed.

  • This post has been updated once.

Well, let's assume for a moment that a disk needs changing and neither the senior admin at my current job nor my predecessor are available. Let's also assume that this has happened twice already and led to rather amusing stories both times.

first time's always fun

The first time I was to change a disk I had help from my colleague Daniel Brajko who accompanied me to the server room, but let's start at the beginning.

I noticed that something was up and that a disk had an error when I wrote my script to check the output of the RAID controllers' status information and notify me automatically when something was wrong. I decided to tackle the task since it was one important piece of work that my senior admin had assigned to me during his absence.

After checking the serial number and the size of the faulty drive, we headed to the storage space and picked up several disks, since we were not sure which one was to go into that particular server. We couldn't be entirely sure anyway, because some of the disks were not labelled with their capacity (looking at you, Seagate). With the disks and more equipment in a backpack, we ventured to the server room, which is conveniently located within walking distance of our office.

We only made it as far as the server room door, though. Neither my employee card nor my colleague's was authorized to enter, even though he, at least, had been in this job for over a year. Great. Alright, the helpful thing was that the authorization had not been transferred from my predecessor to me yet, and he still worked at our institute in a different position. He knew us and lent us his card so we could change the disks, as he clearly recognized the need for such maintenance. Still, I had a bad feeling the whole time that someone would "catch" us and we'd have to explain in an extremely awkward situation why we were using this card.

With this card - impersonating our former colleague - we ventured into the server room, only to find that the machine in question was in our secondary server room - the one that is multiple blocks away. Alright, this wasn't going to be easy.

So we packed everything back up and walked to the secondary building. Daniel had only ever been there once, I had never been there. The building has two basement levels which are not particularly well lit nor particularly easy to find your way around in. I wouldn't necessarily call it a maze but it's certainly not far from that. After 15 minutes of running around without any clue we surrendered and went up to the ground floor to consult one of the university's information terminals to find our own server room. A glorious day, let me tell you.

After finding our server room and granting ourselves access with the borrowed card, we entered, looked for our server cabinet (of course it was the only unlabelled one) and, well… uhm. That was the point when Daniel pointed out that, yes, we did need the keychain that I had told him to leave behind because "I already have everything we need".

And back we went. *sigh*. After fetching the keychain we also borrowed my predecessor's bike as well as another one, went back, descended into the basement again, changed the drive - which was relatively painless once we realized we only had one disk with the correct capacity with us - and returned.

And that's how it took two sysadmins a whole afternoon to change a damaged disk. After that episode we phoned the person in charge and got ourselves assigned the server room access permissions. But…

second time you're quickly done

Today this little e-mail arrived. That's the second time it did, and I always like it when my efforts pay off. :)

RAID status problematic.

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       256K    5587.88   RiW    ON     

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     931.51 GB   1953525168    [REDACTED]            
p1     OK               u0     931.51 GB   1953525168    [REDACTED]            
p2     DEVICE-ERROR     u0     931.51 GB   1953525168    [REDACTED]        
p3     OK               u0     931.51 GB   1953525168    [REDACTED]            
p4     OK               u0     931.51 GB   1953525168    [REDACTED]      
p5     OK               u0     931.51 GB   1953525168    [REDACTED]            
p6     OK               u0     931.51 GB   1953525168    [REDACTED]            
p7     OK               u0     931.51 GB   1953525168    [REDACTED]  
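
The e-mail above comes from the check script mentioned earlier. The actual script isn't shown here, but a minimal version - assuming a 3ware controller at /c0 managed via tw_cli, which is what the output looks like - could be as simple as:

#!/bin/bash
# Minimal sketch, not the actual script: mail the full controller status
# whenever a unit or port reports a known-bad state.
set -euo pipefail

status=$(tw_cli /c0 show)

if echo "$status" | grep -qE 'DEGRADED|DEVICE-ERROR|REBUILDING|ECC-ERROR'; then
    echo "$status" | mail -s "RAID status problematic." root
fi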

Okay. So. Senior admin is absent again, disk fails again. This time Daniel is also not there. "Fine," I tell myself, it will be painless this time. I was so, so wrong.

After making a quick joke with the researchers that maybe they should go home early - because if I failed at replacing the disk, we wouldn't have any e-mail service - I grabbed the keys and a replacement disk. Once again I couldn't find one labelled with the right capacity, but I got smarter and made an educated guess based on 5 of 8 characters of the serial number matching. I headed to the next building, ran into the admin from the other institute and joked about whether they also had "those mean things lacking a storage capacity description". He helpfully stated that they use the same model and that theirs were 1 TB drives, which gave me some relief. After opening our server racks and checking all devices in there I came to a terrible realization: of course I was in the wrong building. Again. (This time I made a list of all our devices in this building for our internal docs.)

Alright, back up from the basement; I notified the office that the keychain had not gone missing and that I was taking it to the other building. I walked through the cold winter air, entered the basement and found the server room on the first try. This is a thing that tends to happen: if I am ever required to find my way to a place by myself, I will keep finding the way there in the future. Anyway, I held my card to the scanner and… nothing happened. I cursed, waited a bit and tried again. Again, nothing. There's an emergency contact on the door, and after returning to the ground floor to get cellphone reception I called it. We had a longer conversation, and apparently I hadn't received all the permissions I should have gotten when the issue arose the first time. Shall we say I was a little annoyed that not both permissions had been transferred from my predecessor directly to me?

Update: It turns out I am again to blame for something, as I did have the permissions. However, I didn't know that the card activation only works for the building whose sensor you last checked in at. So, believing my card was supposed to work since I had just used a sensor, I obviously didn't visit the sensor at the other building.

After managing emergency access I scoured the room for our server rack. I panicked a little when there was nothing where I remembered seeing it last time. I mean, yes, it had been a while, but my memory for locations is pretty accurate and I don't think anyone would've moved the machines without the admins noticing. Good thing no one else was in the room, since I must've looked like a burglar using my iPhone's flashlight to search the seemingly empty server cabinet where our machines were supposed to be. Then I noticed that there were indeed machines in it. It was just that both were in really slim chassis and sat in the topmost and bottommost slots. In addition, one was turned off, so I missed both when looking less carefully. Oh, yeah: our stuff was in the only unlabelled rack, because of course it still was. I really hope the people in charge don't have microphones there, since I might've been swearing quite a lot.

The rest was easy work: change the disk, make sure the RAID recognizes the new drive, pack everything up and go home.

I'm morbidly curious what surprises the next drive change will offer me.

PS: Yes, labelling our rack is on top of my TODOs.


Unattended-Upgrades patch for Remove-unused-dependencies backported to Trusty, Precise

Posted on Wed 06 January 2016 in work • Tagged with Institute for Computer Vision and Computer Graphics

As of today my contribution to unattended-upgrades has been backported to Ubuntu Trusty Tahr and Ubuntu Precise Pangolin, which are both LTS versions currently in use. I'm probably more proud of myself than I should be, but it was a great feeling to be of help to a global community and to prevent further issues with automatic updates making systems unusable.

I will be removing the manually patched packages at the ICG soon and look forward to not maintaining a fork of the software for internal use as that tended to eat up valuable time from other projects.


Media Recap 2015 - II

Posted on Wed 02 December 2015 in media recap

After watching TotalBiscuit's video for There Came an Echo I wasn't really into playing the game, but when the soundtrack went up at Big Giant Circles I couldn't pass it up. Bought it some time ago and still love it, especially "Ignite Defense" and "LAX" (those are great Audiosurf tracks, BTW).

You should really listen to the soundtrack.

Video Games

  • Audiosurf 2 (Steam, formerly Early Access)
  • Deponia (Steam) - Hard to like the game given that none of its characters is written in a likable way. It does contain some memorable scenes though. "Rufus has stolen the screws from the children's merry-go-round."
  • Dragon Age 2 (Xbox 360) - Playing it again with all the DLC to show the girlfriend how constrained the team was while making this, as well as how great the dialogue was.
  • Guild of Dungeoneering (Steam) - Yes, you can actually sell games based on their trailer soundtrack.
  • Halo 1 (Xbox One, Master Chief Collection) - Bought this one together with my Xbox One in order to relive the old times. Have fond memories of ploughing through the actual Halo 1 with Martin.
  • Halo Spartan Ops (Xbox One, Master Chief Collection) - Unless I got something wrong this seems to be the multiplayer replacement for Firefight. I loved ODST's Firefight and am deeply disappointed by this. I used Firefight as a kind of training ground for the campaign, but the Spartan Op I played solo was boring.
  • Ironcast (Steam) - I couldn't resist buying a new "match 3" game, especially one with roguelike elements. It was marked down during the Steam Exploration sale. I like this one quite a lot, though I wish I had found the 'skip' button in dialogues earlier - I accidentally clicked away quite a few choices.
  • Kingdom (Steam) - Beautiful indie title which is deeper than one would expect at first sight.
  • Kingdom Hearts Re:Chain of Memories HD (Playstation 3) - The last time I played this game was a pirated version of the Game Boy Advance edition some years ago. Still, the later boss fights were as tough as I remembered them, and in the mid-section I tended to switch off the PS3 out of rising anger at least once per boss fight.
  • Life is Strange (Xbox One) - While I am not as heavily into Life is Strange as my girlfriend, I can acknowledge it for the interesting and original game that it is. Its contemporary theme struck a nerve for the both of us.
  • Rune Factory 4 (Nintendo 3DS)
  • Secret Files 3 (Steam) - Disappointing. Feels incomplete, almost like this sentence.
  • Starbound (Steam, Early Access)
  • Startopia (Steam) - Felt nostalgic. Initially played this title years ago when I borrowed it from Lukas.
  • Terraria (Steam) - Terraria has arrived on the Mac. I don't need to say more.
  • The Witcher 3 (Xbox One) - Holy… I adore the Witcher books and absolutely, wholeheartedly recommend The Witcher 3 to anyone on the lookout for a gritty, mature and sarcastic fantasy adventure RPG. I played this with the girlfriend on a completionist run. The game and its awesome first expansion Hearts of Stone kept us busy from June to November.

Books

There's a Witcher book aside from the main five which was an entertaining read. Then there's Jennifer Estep's series about an assassin, which has the usual fault of her books: she explains everything again in every book in minuscule detail, even though many readers will either still remember it or read the books in a binge.

Books by Richard Schwartz

I bought a stack of books from the friend of a friend who wanted to clear house. Those turned out to be very entertaining fantasy novels by Richard Schwartz. I haven't read the standalone titles in this universe yet, but due to travelling I spent some iTunes credits on the later books in order to avoid packing more. I even turned on data roaming and bought one book on the train in Germany - that should give you a good impression of how much I've enjoyed the series so far.

  • Das erste Horn
  • Die zweite Legion
  • Das Auge der Wüste
  • Der Herr der Puppen
  • Die Feuerinseln
  • Der Kronrat
  • Die Rose von Illian
  • Die weiße Flamme
  • Das blutige Land
  • Die Festung der Titanen
  • Die Macht der Alten

Movies

I suggested watching the Fast and the Furious movies since I like them, and in turn I watched the Harry Potter ones since I didn't know them. Due to conflicts of time and interest we haven't seen the last two Potters yet.

I recommend Inside Out. I can't remember the last time I had such a nice time at the cinema. It's easily my favorite movie of the year. Yeah, don't go and watch Minions - it's disappointing and weak.

  • Fast and the Furious, The
  • Fast and the Furious, The: Tokyo Drift
  • Fracture (Netflix, DE: Das perfekte Verbrechen)
  • Harry Potter and the Philosopher's Stone
  • Harry Potter and the Chamber of Secrets
  • Harry Potter and the Prisoner of Azkaban
  • Harry Potter and the Goblet of Fire
  • Harry Potter and the Order of the Phoenix
  • Harry Potter and the Half-Blood Prince
  • Inside Out (cinema, DE: Alles steht Kopf)
  • Jumper (Netflix)
  • Minions (cinema)
  • Transporter, The (Netflix)
  • V for Vendetta (Netflix)
  • xXx (Netflix)

Videos on Netflix

The Netflix series consumption has been more or less the same: some Grimm, some Sherlock, a lot of Elementary. I have also checked out a documentary series about famous chefs, which proved to be interesting.

Presentations

Podcasts


Using Continuous Integration for puppet

Posted on Sun 01 November 2015 in work • Tagged with Institute for Computer Vision and Computer Graphics

I'll admit the bad stuff right away: I've checked in bad code, I've had wrong configuration files deployed to our services, and it's happened quite often that files referenced in .pp manifests had a different name than the one specified or were not moved to the correct directory during refactoring. I've made mistakes that in other languages would've been considered "breaking the build".

Given that most of the time I'm both developing and deploying our puppet code, I've found many of my mistakes the hard way. Still, I've wished for a kind of safety net for some time. Gitlab 8.0 finally gave me the chance by integrating easy-to-use CI.

Getting started with Gitlab CI

  1. Set up a runner. We use a private runner on a separate machine for our administrative configuration (puppet, etc.) to have a barrier from the regular CI our researchers are provided with (or, as of the time of this writing, will be provided with soonish). I haven't had any problems with our docker runners yet.
  2. Enable Continuous Integration for your project in the gitlab webinterface.
  3. Add a .gitlab-ci.yml file to the root of your repository to give instructions to the CI (a minimal skeleton follows below).
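
For orientation, a stripped-down skeleton of such a file - not our actual one, just the overall shape with the two stages and jobs described further down - looks like this:

stages:
  - build
  - test

before_script:
  - apt-get -qq update

puppet:
  stage: build
  script:
    - tests/puppet-validate.sh

applications:
  stage: test
  script:
    - tests/postfix-aliases.py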

Test setup

I've improved the test setup quite a bit before writing this and aim to improve it further. I've also considered making the tests completely public on my github account, parameterizing some scripts, handling configuration-specific data in .gitlab-ci.yml and using the github repository as a git submodule.

before_script

In the before_script section, which is run in every instance immediately before a job, I set some environment variables and run apt's update procedure once to ensure only the latest versions of packages are installed when packages are requested.

before_script:
  - export DEBIAN_FRONTEND=noninteractive
  - export NOKOGIRI_USE_SYSTEM_LIBRARIES=true
  - apt-get -qq update

  • DEBIAN_FRONTEND is set to suppress configuration prompts and just tell dpkg to use safe defaults.
  • NOKOGIRI_USE_SYSTEM_LIBRARIES greatly reduces build time for ruby's native extensions by linking against libraries already present on the system instead of building bundled copies.

Optimizations

  • Whenever apt-get install is called, I supply -qq and -o=Dpkg::Use-Pty=0 to reduce the amount of text output generated.
  • Whenever gem install is called, I supply --no-rdoc and --no-ri to improve installation speed.

Puppet tests

All tests which I consider to belong to puppet itself run in the build stage. As is usual with Gitlab CI, the tests in the next stage only run if all tests in this stage pass. Given that it doesn't make much sense to sanity-check application configurations which puppet won't even be able to apply, I've moved those checks into a later stage.

I employ two of the three default stages for gitlab-ci: build and test. I haven't had the time yet to build everything for automatic deployment after all tests pass using the deploy stage.

puppet:
  stage: build
  script:
    - apt-get -qq -o=Dpkg::Use-Pty=0 install puppet ruby-dev
    - gem install --no-rdoc --no-ri rails-erb-lint puppet-lint
    - make libraries
    - make links
    - tests/puppet-validate.sh
    - tests/puppet-lint.sh
    - tests/erb-syntax.sh
    - tests/puppet-missing-files.py
    - tests/puppet-apply-noop.sh
    - tests/documentation.sh

While puppet-lint exists as a .deb file, this installs it as a gem in order to have the Ubuntu docker containers run the latest puppet-lint.

I use a Makefile in order to install the dependencies of our puppet code quickly as well as to create symlinks to simplify the test process instead of copying files around the test VM.

libraries:
  @echo "Info: Installing required puppet modules from forge.puppetlabs.com."
  puppet module install puppetlabs/stdlib
  puppet module install puppetlabs/ntp
  puppet module install puppetlabs/apt --version 1.8.0
  puppet module install puppetlabs/vcsrepo

links:
  @echo "Info: Symlinking provided modules for CI."
  ln -s `pwd`/modules/core /etc/puppet/modules/core
  ln -s `pwd`/modules/automation /etc/puppet/modules/automation
  ln -s `pwd`/modules/packages /etc/puppet/modules/packages
  ln -s `pwd`/modules/services /etc/puppet/modules/services
  ln -s `pwd`/modules/users /etc/puppet/modules/users
  ln -s `pwd`/hiera.yaml /etc/puppet/hiera.yaml

As you can see, I haven't had the chance to migrate to puppetlabs/apt 2.x yet.

puppet-validate

I run puppet parser validate on every .pp file I come across in order to make sure it is parseable. It is my first line of defense, given that files which can't even make it past the parser are certainly not going to do what I want in production.

#!/bin/bash
set -euo pipefail

find . -type f -name "*.pp" | xargs puppet parser validate --debug

puppet-lint

While puppet-lint is by no means perfect, I like to make it a habit to enable linters for most languages I work with in order for others to have an easier time reading my code should the need arise. I'm not above asking for help in a difficult situation and having readable output available means getting help for your problems will be much easier.

#!/bin/bash
set -euo pipefail

# allow lines longer then 80 characters
# code should be clean of warnings

puppet-lint . \
  --no-80chars-check \
  --fail-on-warnings

As you can see I like to consider everything apart from the 80 characters per line check to be a deadly sin. Well, I'm exaggerating but as I said, I like to have things clean when working.

erb-syntax

ERB is a Ruby templating language which is used by puppet. I have only ventured into using templates two or three times, but that has been enough to make me wish for extra checking there too. I initially wanted to use rails-erb-check but after much cursing rails-erb-lint turned out to be easier to use. Helpfully it will just scan the whole directory recursively.

#!/bin/bash
set -euo pipefail

rails-erb-lint check

puppet-missing-files

While I've used puppet-lint locally before, it caught fewer errors than I would've liked since it doesn't check whether the files or templates I source actually exist. I was negatively surprised to realize that puppet parser validate doesn't do that either, so I slapped together my own checker for that in Python.

Basically, the script first builds a set of all .pp files and then greps them for lines containing either puppet: or template(, which are the telltale signs of sourced files and templates respectively. Each path in the resulting set is then verified by checking for its existence as either a regular path or a symlink.

#!/usr/bin/env python2
"""Test puppet sourced files and templates for existence."""

import os.path
import subprocess
import sys


def main():
    """The main flow."""

    manifests = get_manifests()
    paths = get_paths(manifests)
    check_paths(paths)


def check_paths(paths):
    """Check the set of paths for existence (or symlinked existence)."""

    for path in paths:
        if not os.path.exists(path) and not os.path.islink(path):
            sys.exit("{} does not exist.".format(path))


def get_manifests():
    """Find all .pp files in the current working directory and subfolders."""

    try:
        manifests = subprocess.check_output(["find", ".", "-type", "f",
                                             "-name", "*.pp"])
        manifests = manifests.strip().splitlines()
        return manifests
    except subprocess.CalledProcessError as error:
        sys.exit(error)


def get_paths(manifests):
    """Extract and construct paths to check."""

    paths = set()

    for line in manifests:
        try:
            results = subprocess.check_output(["grep", "puppet:", line])
            hits = results.splitlines()

            for hit in hits:
                working_copy = hit.strip()
                working_copy = working_copy.split("'")[1]
                working_copy = working_copy.replace("puppet://", ".")

                segments = working_copy.split("/", 3)
                segments.insert(3, "files")

                path = "/".join(segments)
                paths.add(path)

        # we don't care if grep does not find any matches in a file
        except subprocess.CalledProcessError:
            pass

        try:
            results = subprocess.check_output(["grep", "template(", line])
            hits = results.splitlines()

            for hit in hits:
                working_copy = hit.strip()
                working_copy = working_copy.split("'")[1]

                segments = working_copy.split("/", 1)
                segments.insert(0, ".")
                segments.insert(1, "modules")
                segments.insert(3, "templates")

                path = "/".join(segments)
                paths.add(path)

        # we don't care if grep does not find any matches in a file
        except subprocess.CalledProcessError:
            pass

    return paths

if __name__ == "__main__":
    main()

puppet-apply-noop

In order to run the most common kind of test in the puppet world, I wanted to check every .pp file in a module's tests directory with puppet apply --noop, which is a kind of dry run. This outputs information about what would be done in case of a real run. Unfortunately, this information is highly misleading.

#!/bin/bash
set -euo pipefail

content=(core automation packages services users)

for item in ${content[*]}
do
  printf "Info: Running tests for module $item.\n"
  find modules -type f -path "modules/$item/tests/*.pp" -execdir puppet apply --modulepath=/etc/puppet/modules --noop {} \;
done

When run in this mode, puppet does not seem to perform any sanity checks at all. For example, it can be instructed to install a package with an arbitrary name regardless of the package's existence in the specified (or default) package manager.
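
To illustrate (a contrived example, not from our code base): puppet apply --noop happily "applies" the following resource even though no package by that name exists.

package { 'surely-not-a-real-package':
  ensure => installed,  # passes --noop; a real run would fail in the package manager
}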

Upon deciding this mode was not providing any value to my testing process, I took a stab at implementing "real" tests by running puppet apply without --noop. The value added by this procedure is mediocre at best, given that puppet returns 0 even if it fails to apply all given instructions. Your CI will not realize that there have been puppet failures at all and will happily report your build as passing.

puppet provides the --detailed-exitcodes flag for checking failure to apply changes. Let me quote the manual for you:

Provide transaction information via exit codes. If this is enabled, an exit code of '2' means there were changes, an exit code of '4' means there were failures during the transaction, and an exit code of '6' means there were both changes and failures.

I'm sure I don't need to point out that this mode is not suitable for testing either given that there will always be changes in a testing VM.

Now, one could solve this by writing a small wrapper around the puppet apply --detailed-exitcodes call which checks for 4 and 6 and fails accordingly. I was tempted to do that and might still do it in the future. The reason I haven't implemented it already is that actually applying the changes slowed things down to a crawl - the installation and configuration of a gitlab instance alone added more than 90 seconds to each build.
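
For reference, such a wrapper could be as small as this untested sketch (it expects the manifest to apply as its only argument):

#!/bin/bash
# Untested sketch: map puppet's --detailed-exitcodes semantics onto pass/fail.
# 0 = no changes, 2 = changes applied, 4/6 = failures occurred.
set -uo pipefail

puppet apply --detailed-exitcodes --modulepath=/etc/puppet/modules "$1"
code=$?

case $code in
  0|2) exit 0 ;;
  *)   printf "puppet apply reported failures (exit code %s).\n" "$code"
       exit 1 ;;
esac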

A shortened sample of what is done in the gitlab build:

  • add gitlab repository
  • make sure apt-transport-https is installed
  • install gitlab
  • overwrite gitlab.rb
  • provide TLS certificate
  • start gitlab

Should I ever decide to implement tests which really apply their changes, the infrastructure needed to run those checks for everything we do with puppet in a timely manner would drastically increase.

documentation

I am adamant when it comes to documenting software since I don't want to imagine working without docs, ever.

In my Readme.markdown each H3 header is equivalent to one puppet class.

This test checks whether the amount of documentation in my preferred style matches the number of puppet manifest files (.pp). If the Readme.markdown does not contain exactly as many ### headers as there are manifest files, it counts as a build failure, since someone obviously forgot to update the documentation.

#!/bin/bash
set -euo pipefail

count_headers=`grep -e "^### " Readme.markdown|wc -l|awk {'print $1'}`
count_manifests=`find . -type f -name "*.pp" |grep -v "tests"|wc -l|awk {'print $1'}`

if test $count_manifests -eq $count_headers
  then printf "Documentation matches number of manifests.\n"
  exit 0
else
  printf "Documentation does not match number of manifests.\n"
  printf "There might be missing manifests or missing documentation entries.\n"
  printf "Manifests: $count_manifests, h3 documentation sections: $count_headers\n"
  exit 1
fi

Application tests

As previously said, I use the test stage for testing the configurations of other applications. Currently I only test postfix's /etc/aliases file as well as our /etc/postfix/forwards, which is an extension of the former.

applications:
  stage: test
  script:
      - apt-get -qq -o=Dpkg::Use-Pty=0 install postfix
      - tests/postfix-aliases.py

Future: There are plans for handling both shorewall as well as isc-dhcp-server configurations with puppet. Both of those would profit from having automated tests available.

Future: The different software setups will probably be done in different jobs to allow concurrent running as soon as the CI solution is ready for general use by our researchers.

postfix-aliases

In order to test the aliases, an extremely minimalistic configuration for postfix is written and newaliases is run against it. If there is any output whatsoever, I assume that the test failed.

Future: I plan to automatically apply both a minimal configuration and a full configuration in order to test both the main server and relay configurations for postfix.

#!/usr/bin/env python2
"""Test postfix aliases and forwards syntax."""

import subprocess
import sys


def main():
    """The main flow."""
    write_configuration()
    copy_aliases()
    copy_forwards()
    run_newaliases()


def write_configuration():
    """Write /etc/postfix/main.cf file."""

    configuration_stub = ("alias_maps = hash:/etc/aliases, "
                          "hash:/etc/postfix/forwards\n"

                          "alias_database = hash:/etc/aliases, "
                          "hash:/etc/postfix/forwards")

    with open("/etc/postfix/main.cf", "w") as configuration:
        configuration.write(configuration_stub)


def copy_aliases():
    """Find and copy aliases file."""

    aliases = subprocess.check_output(["find", ".", "-type", "f", "-name",
                                       "aliases"])
    subprocess.call(["cp", aliases.strip(), "/etc/"])


def copy_forwards():
    """Find and copy forwards file."""

    forwards = subprocess.check_output(["find", ".", "-type", "f", "-name",
                                        "forwards"])
    subprocess.call(["cp", forwards.strip(), "/etc/postfix/"])


def run_newaliases():
    """Run newaliases and report errors."""

    result = subprocess.check_output(["newaliases"], stderr=subprocess.STDOUT)
    if result != "":
        print result
        sys.exit(1)

if __name__ == "__main__":
    main()

Conclusion

While I've run into plenty of frustrating moments, building a CI for puppet was quite fun and I'm constantly thinking about how to improve it further. One way would be to create "real" test instances for configurations, like "spin up one gitlab server with all its required classes".

The main drawback of our current setup is two-fold:

  1. I haven't enabled more than one concurrent instance of our private runner.
  2. I haven't considered the performance impact of moving to whole instance testing in other stages and parallelizing those tests.

I look forward to implementing deployment on passing tests instead of my current method of automatically deploying every change in master.


Notes

  • Build stages run one after the other; however, they do not use the same instance of the docker container and are therefore not suited for installing prerequisites in one stage and running tests in another. Read: if you need an additional package in every stage, you need to install it during every stage.
  • If you are curious what the set -euo pipefail commands on top of all my shell scripts do, refer to Aaron Maxwell's Use the Unofficial Bash Strict Mode.
  • Our runners as of the time of this writing use buildpack-deps:trusty as their image.