Reading recommendations (2016-07-18)

Posted on Mon 18 July 2016 in reading recommendations

It's kind of amazing that there's still some time left between work, playing Black Desert Online and doing household chores. I have a little idea about these link posts in the back of my head but don't know yet how much coding effort is required to make it work, so I won't go into specifics just yet. Let's get to it.

  • DroidJack Uses Side-Load…It's Super Effective! Backdoored Pokemon GO Android App Found by Proofpoint Staff (via Polygon.com RSS)
    In the craze that is Pokémon GO and its staggered release around the world, it is not surprising to see criminals jumping to exploit people's impatience. Personally, I saw more than a few players in Graz last week even though the game was only released in most of Europe on Saturday. A DDoS immediately followed the release and prevented my girlfriend and me from trying it ourselves.

  • The UX Secret That Will Ruin Apps For You by Mark Wilson (via mjtsai.com RSS)
    While I can imagine that checks which take - relatively speaking - too little time can make people feel insecure, I think that's just an assumption we have grown accustomed to after years of slower internet connections and systems (e.g. without Solid State Drives). Now, if things are done in an instant, it 'seems wrong'. As someone working in tech, it's different because there's a lot more understanding of how fast computers have become and how much you can optimize a problem.

  • STARCRAFT: GHOST: WHAT WENT WRONG by Patrick Stafford (RSS, Polygon)
    I really wanted Starcraft: Ghost to become a real thing. The Starcraft lore is overall very good and, with its distinct Ghost units, would've provided ample room for such a stealth-based game. It's saddening to read about the multiple failures of a project with such potential.

  • The Psychological Benefits of Writing Regularly by Gregory Ciotti (RSS, Lifehacker)
    I can attest to that. While I do love writing in general, there's writing that feels nice and writing that is a drag. Writing technical documentation is a bit of a letdown - you have to be precise and think hard about whether what you write is understandable to your target audience. Prose, on the other hand, feels like what I imagine painting is for an artist. I just grab the (virtual) pen and let loose.

  • Answer to 'In a nutshell, why do a lot of developers dislike Agile?' by Miles English (via Fefes Blog RSS)
    Have you ever wondered why so many things seem to go wrong when developing software and planning is done not before, but during the project?

  • I'm a black ex-cop, and this is the real truth about race and policing by Redditt Hudson (via Fefes Blog RSS)
    Horrifying read. These problems in America's police force are of nightmarish dimensions. Abuse of power in many, many forms. Reminds me of a tweet I read recently which discussed new gun regulations for officers in another country. A commenter added that they act 'like they were ashamed of gun use'. Well, yes. In countries other than the US, guns are not glorified. They are to be used with caution and preferably not at all by the police.

  • TA Top Five: Main Menu Themes by Marc Hollinshead (RSS, TrueAchievements)
    TA's nice feature on video game menu music has some gems. I didn't know the Dark Souls III one before and was surprised. Oblivion's theme and Mass Effect's theme were immediately recognizable to me, having played many hours of both.

I picked 7 links for some additional commentary. Further candidate links can be found below for archival purposes.


Sidenotes.


Building and Deploying a C++ library with GitLab

Posted on Thu 14 July 2016 in work • Tagged with Institute for Computer Vision and Computer Graphics

I've already written once before that I like working with GitLab's Continuous Integration (CI) technology. I've now had the chance to set up a project for one of our research teams using GitLab CI and it's been a true success.

In short: We're using GitLab CI to build and deploy a C++ library. We are downloading its dependencies, compiling them, compiling our library, creating Debian packages and installing them on the 6 servers we use for heavy-duty computing.

  • This writeup contains notes on Vagrant, Docker, gitlab-ci-multi-runner, fpm and CMake.
  • I worked with Christian Mostegel and Markus Rumpler on this project.

Sections:

  • Preparation
  • Automatic Building
  • Docker
  • GitLab Runners
  • GitLab CI
    • GitLab CI: Details on Jobs
    • GitLab CI: Details on Script
    • GitLab CI: Details on FPM
    • GitLab CI: Building the library
  • Automatic Deployment
  • Deployment: Sudoers
  • Deployment: Jobs
  • Summary
  • What went wrong

Preparation

We already had a GitLab instance that I'm very fond of and some knowledge on how to set up automatic builds from previous projects.

First we needed to verify that the library would build at all under the given conditions (Debian Jessie, amd64, specified dependencies). To ensure this, we used Vagrant to create a virtual machine whose configuration in terms of installed development packages was similar to our local environment.

Using this VM as a testbed, we wrote a simple shell script to download and build those dependencies for which Debian packages didn't readily exist in the configuration the researchers specified.

Next we tried to build the library in this machine, adding the required packages bit by bit after verifying what was really needed for the build. We collected these packages since they would form the base of the Docker image we would later build to speed up the CI runs.

The final result of our preparation was a Vagrantfile which set up the machine and compiled as well as packaged our library with a simple vagrant up.

Automatic Building

Docker

The next step was to build the Docker image. This was fairly simple given that we already relied on two other automatically built Docker images from previous projects. We created another repository on GitHub to link with Docker Hub and waited for everything to be built (of course it didn't work perfectly and there was quite a lot of iteration in about every step I mention).

We typically build our images on top of buildpack-deps given that it's an official, somewhat slim, development-oriented image. Here's the Dockerfile in use at the time of writing:

FROM buildpack-deps:jessie
MAINTAINER Alexander Skiba <alexander.skiba@icg.tugraz.at>

ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    freeglut3 \
    freeglut3-dev \
    gcc \
    git \
    g++ \
    libatlas-dev \
    libatlas-base-dev \
    libboost-all-dev \
    libblas-dev \
    libcgal-dev \
    libdevil-dev \
    libeigen3-dev \
    libexiv2-dev \
    libglew-dev \
    libgoogle-glog-dev \
    liblapack-dev \
    liblas-dev \
    liblas-c-dev \
    libpcl-dev \
    libproj-dev \
    libprotobuf-dev \
    libqglviewer-dev \
    libsuitesparse-dev \
    libtclap-dev \
    libtinyxml-dev \
    mlocate \
    ruby \
    ruby-dev \
    unzip \
    wget \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/* \
  && gem install --no-rdoc --no-ri fpm

GitLab Runners

After the image was built, we set up GitLab runners on each of our computing machines so that we could also use their cores and memory to speed up building the project itself. On each of these machines two runners were configured - one with shell, the other with docker as executor. Here's an example /etc/gitlab-runner/config.toml from production.

concurrent = 1

[[runners]]
  name = "example1"
  url = "REDACTED"
  token = "REDACTED"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "icgoperations/3dlib"
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
  [runners.cache]
    Insecure = false

[[runners]]
  name = "example1-shell"
  url = "REDACTED"
  token = "REDACTED"
  executor = "shell"
  [runners.ssh]
  [runners.docker]
    tls_verify = false
    image = ""
    privileged = false
    disable_cache = false
  [runners.parallels]
    base_name = ""
    disable_snapshots = false
  [runners.virtualbox]
    base_name = ""
    disable_snapshots = false
  [runners.cache]
    Insecure = false

GitLab CI

Next we converted the previously written Bash script into the format needed by .gitlab-ci.yml. In the *-build jobs we download, build and package our dependencies.

stages:
  - dependencies
  - build
  - deploy

ceres-build:
  stage: dependencies
  script:
    - export I3D_CERES_VERSION=1.11.0
    - wget --quiet https://github.com/ceres-solver/ceres-solver/archive/$I3D_CERES_VERSION.tar.gz
    - mkdir ceres-source ceres-build ceres-install
    - tar xvfz $I3D_CERES_VERSION.tar.gz -C ceres-source --strip-components=1
    - cmake -Bceres-build -Hceres-source
    - make -j$(nproc) -C ceres-build
    - make -C ceres-build install DESTDIR=../ceres-install
    - bash .gitlab_build_files/build_ceres_debian_pkg.sh
  artifacts:
    paths:
    - i3d-ceres_*_amd64.deb
  tags:
    - linux
    - debian-jessie

opencv-build:
  stage: dependencies
  script:
    - export I3D_OPENCV_VERSION=2.4.10
    - wget --quiet https://github.com/Itseez/opencv/archive/$I3D_OPENCV_VERSION.tar.gz
    - mkdir opencv-source opencv-build opencv-install
    - tar xvfz $I3D_OPENCV_VERSION.tar.gz -C opencv-source --strip-components=1
    - cmake -Bopencv-build -Hopencv-source -DCMAKE_BUILD_TYPE=RELEASE -DBUILD_DOCS=OFF -DBUILD_PERF_TESTS=OFF -DBUILD_JPEG=ON -DBUILD_PNG=ON -DBUILD_TIFF=ON -DBUILD_opencv_gpu=OFF -DWITH_FFMPEG=OFF
    - make -j$(nproc) -C opencv-build
    - make -C opencv-build install DESTDIR=../opencv-install
    - bash .gitlab_build_files/build_opencv_debian_pkg.sh
  artifacts:
    paths:
    - i3d-opencv_*_amd64.deb
  tags:
    - linux
    - debian-jessie

GitLab CI: Details on Jobs

Since this can be a little overwhelming, I'd like to explain one section in detail. I'll write GitLab terms italicized.

ceres-build:
  stage: dependencies
  script:
    - export I3D_CERES_VERSION=1.11.0
    - wget --quiet https://github.com/ceres-solver/ceres-solver/archive/$I3D_CERES_VERSION.tar.gz
    - mkdir ceres-source ceres-build ceres-install
    - tar xvfz $I3D_CERES_VERSION.tar.gz -C ceres-source --strip-components=1
    - cmake -Bceres-build -Hceres-source
    - make -j$(nproc) -C ceres-build
    - make -C ceres-build install DESTDIR=../ceres-install
    - bash .gitlab_build_files/build_ceres_debian_pkg.sh
  artifacts:
    paths:
    - i3d-ceres_*_amd64.deb
  tags:
    - linux
    - debian-jessie

Here ceres-build is the unique name for the job (one step of a build process) - it's the identifier of one build unit.

The stage describes where in the build pipeline (all jobs combined for one project) you want this job to be executed. The default stages are build => test => deploy. We defined our stages for this project as dependencies => build => deploy, so ceres-build is executed in the first stage. Jobs in the same stage will run in parallel if possible (e.g. if you have more than one runner, or at least one runner configured for concurrent builds).

After a build has finished, you may choose not to have all files discarded automatically (as happens when using Docker) and opt to keep selected ones. This is done by specifying artifacts. We specified that all files matching i3d-ceres_*_amd64.deb shall be kept - a matching example would be i3d-ceres_1.11.0_amd64.deb. We also experimented with the newly released option of automatically expiring artifacts after a given amount of time but were not convinced it was mature enough just yet.
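The paths entries behave like shell-style globs. As a quick illustration, the following sketch uses bash's own glob matching via case, which follows the same basic rules (GitLab's exact matcher may differ in edge cases):

```shell
#!/usr/bin/env bash
# Illustrative only: check what the artifact pattern would match
# using bash's built-in glob matching.
matches() {
  # $1: glob pattern, $2: candidate filename
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

matches 'i3d-ceres_*_amd64.deb' 'i3d-ceres_1.11.0_amd64.deb' && echo "kept"
matches 'i3d-ceres_*_amd64.deb' 'i3d-opencv_2.4.10_amd64.deb' || echo "discarded"
```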

The tags section helps GitLab Runner decide which machine to run the job on. We tagged our runners with linux and debian-jessie before. By using the same tags in .gitlab-ci.yml we make sure that one of the prepared machines is used.

The script section consists of a list of shell commands to run for the build. If one exits with a code other than 0, the build is considered failed; otherwise it passes.
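That behaviour is easy to emulate. The following is a minimal sketch of the idea, not GitLab's actual implementation: run each command in order and abort at the first nonzero exit code.

```shell
#!/usr/bin/env bash
# Run commands in order; stop at the first one that exits nonzero,
# mirroring how a CI script section decides success or failure.
run_job() {
  local cmd
  for cmd in "$@"; do
    if ! eval "$cmd"; then
      echo "job failed on: $cmd" >&2
      return 1
    fi
  done
  echo "job succeeded"
}

run_job "echo configure" "echo build"                 # both commands succeed
run_job "echo configure" "false" "echo unreached" || true
```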

GitLab CI: Details on Script

# Use an environment variable to avoid hardcoding the version number
export I3D_CERES_VERSION=1.11.0

# Download the given release directly from the GitHub project releases
wget --quiet https://github.com/ceres-solver/ceres-solver/archive/$I3D_CERES_VERSION.tar.gz

# Prepare the directories we are going to be working in
mkdir ceres-source ceres-build ceres-install

# Unpack the archive and strip the folder containing the version number
tar xvfz $I3D_CERES_VERSION.tar.gz -C ceres-source --strip-components=1

# Configure Makefiles with given Build directory and Home (source) directory
cmake -Bceres-build -Hceres-source

# Build the project with the same number of threads as the machine has cores
make -j$(nproc) -C ceres-build

# "Install" the finished product locally into a given directory
make -C ceres-build install DESTDIR=../ceres-install

# Package the result with FPM (see FPM section)
bash .gitlab_build_files/build_ceres_debian_pkg.sh

GitLab CI: Details on FPM

As you might have already seen, I use long option names where possible to improve readability. In the following fpm command there are some short names however.

  • -t is the type of the target
  • -s is the type of the source
  • -C is the source directory

You can find even more parameters in the fpm wiki.

#! /usr/bin/env bash

fpm \
-t deb \
-s dir \
-C ceres-install \
--name "i3d-ceres" \
--version $I3D_CERES_VERSION \
--license "BSD" \
--vendor "ICG TU Graz" \
--category "devel" \
--architecture "amd64" \
--maintainer "Aerial Vision Group <aerial@icg.tugraz.at>" \
--url "https://aerial.icg.tugraz.at/" \
--description "Compiled Ceres solver for i3d library" \
--depends cmake \
--depends libatlas-dev \
--depends libatlas-base-dev \
--depends libblas-dev \
--depends libeigen3-dev \
--depends libgoogle-glog-dev \
--depends liblapack-dev \
--depends libsuitesparse-dev \
--verbose \
.

After running FPM we have a nice, installable Debian package.

GitLab CI: Building the library

While building our own library is similar to the previously shown build, there are some differences worth mentioning.

  • There is no need to download/fetch/checkout the source, GitLab Runner already does that automatically.
  • The version which is later used for building the package includes the current date, time and the commit hash (example: 2016-07-10~1138.ea246ba). This ensures that the package version is both ever-increasing and easily resolved to the source commit.
  • We install the .deb packages built in the previous stage. This is simple since GitLab CI makes artifacts of the previous stage available to the current stage automatically. Notice we also avoid specifying a version number by using a wildcard.
  • While I prefer not to jump around with cd during the build process and would suggest using flags instead, we had some trouble making that work with our library, so the cd statements stuck around.

icg3-build:
  stage: build
  script:
    - export I3D_CORE_VERSION="$(date +%Y-%m-%d~%H%M)"."$(git rev-parse --short HEAD)"
    - dpkg -i i3d-*.deb
    - mkdir icg3d-build icg3d-install
    - cd icg3d-build
    - cmake -DUSE_CUDA=OFF -DAPP_SFM=OFF -DWITH_CGAL=ON -DWITH_QGL_VIEWER=ON -DWITH_QT=ON -DCORE_WITH_LAS=ON ..
    - cd ..
    - make -j$(nproc) -C icg3d-build/ICG_3DLib
    - make -C icg3d-build/ICG_3DLib install DESTDIR=../../icg3d-install
    - bash .gitlab_build_files/build_icg3d_debian_pkg.sh
  artifacts:
    paths:
    - i3d-core_*_amd64.deb
  tags:
    - linux
    - debian-jessie
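The version scheme from the export line above can be sanity-checked in isolation. In this sketch the commit hash is hard-coded for illustration; in the job it comes from git rev-parse --short HEAD.

```shell
#!/usr/bin/env bash
# Compose a package version from date, time and commit hash,
# e.g. 2016-07-10~1138.ea246ba - ever-increasing and traceable
# back to the source commit.
short_hash="ea246ba"  # in CI: $(git rev-parse --short HEAD)
version="$(date +%Y-%m-%d~%H%M).${short_hash}"
echo "$version"
```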

Automatic Deployment

Automatic deployment was a greater issue than automatic building due to security and infrastructure considerations.

Since we already used Puppet, one idea was to use Puppet's ensure => latest on the packages we were building ourselves. However, the apt provider needs a repository, and we were not sure the dpkg provider supported versioned packages and latest. In order to use our own repository we would have had to set it up to automatically sign new packages. Furthermore, we would've had to run apt-get update on every machine virtually all the time, which is bad practice since we're not using an Apt proxy or similar.

Another idea was to have locally scheduled execution via cron but that amounted to the same thing in my opinion.

Essentially I disliked any solution based on polling since this meant additional waiting time for the researchers with each build. When doing everything via the GitLab CI system, they would be able to configure notifications when all servers have received the newest build.

Deployment: Sudoers

However, to install something one needs sudo rights. Essentially, one would have to configure a special /etc/sudoers.d entry for gitlab-runner to be able to install the packages that we built previously and stored via the artifacts feature. The required entry looked like this for our machines:

gitlab-runner machine[1-6]= NOPASSWD: /usr/bin/dpkg -i i3d-core_*_amd64.deb i3d-opencv_*_amd64.deb i3d-ceres_*_amd64.deb

In other words: for the user gitlab-runner, allow the command /usr/bin/dpkg -i i3d-core_*_amd64.deb i3d-opencv_*_amd64.deb i3d-ceres_*_amd64.deb on the machines machine[1-6] without asking for a password.

Additionally, the shell runners are private runners and must be explicitly whitelisted by an admin or the owner for use on other projects (read: should we get another project which needs automatic deployment).

Deployment: Jobs

Deployment jobs at the ICG work by specifying the host to deploy to as a tag. In case something unstable hits a development branch, we only deploy the master branch; compilation and packaging, however, are enabled for all branches, and if needed, one could download those packages via the GitLab web interface. The unfortunate side-effect of this is that the approach scales really, really badly. We have this section 6 times in our CI file - once for each machine we deploy to.

deploy-machine1:
  stage: deploy
  script:
    - sudo dpkg -i i3d-core_*_amd64.deb i3d-opencv_*_amd64.deb i3d-ceres_*_amd64.deb
  tags:
    - machine1
  only:
    - master
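We maintain these six jobs by hand. A hypothetical way to at least generate the repetition (we did not adopt this; the layout below simply mirrors the job above) would be a few lines of shell emitting the YAML:

```shell
#!/usr/bin/env bash
# Hypothetical: emit the six near-identical deploy jobs instead of
# copy-pasting them into .gitlab-ci.yml by hand.
generate_deploy_jobs() {
  local i
  for i in 1 2 3 4 5 6; do
    cat <<EOF
deploy-machine${i}:
  stage: deploy
  script:
    - sudo dpkg -i i3d-core_*_amd64.deb i3d-opencv_*_amd64.deb i3d-ceres_*_amd64.deb
  tags:
    - machine${i}
  only:
    - master

EOF
  done
}

generate_deploy_jobs
```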

Summary

It is possible to build a pipeline with GitLab CI which fetches, compiles and packages dependencies, builds and packages your own C/C++ project and deploys it to multiple machines. In our case, the whole process takes about 8 minutes or less, depending on which machines are picked for the build process and whether a new version of the Docker image has been built and must be fetched first.

If you need to distribute to more machines and immediate feedback is not that important, uploading to the import folder of an automated Debian repository (e.g. Reprepro) should scale really well.

Building for other Linux platforms (e.g. Ubuntu instead of Debian) should be easily solved via Docker images, while different architectures (e.g. i386 instead of amd64) would require another host or a VM. The build process could even be modified to build for Windows or macOS with prepared VMs or hosts. We currently don't have any experience with either, though.


What went wrong

As I said, not everything went right from the beginning. I thought it might be interesting to add some notes on possible issues.

  • The .debs ended up empty. => The source directory was empty to begin with since make install had been forgotten.
  • The .debs were extraordinarily large. => The wrong source directory was chosen - probably the *_build folder instead of *_install.
  • CMake didn't pick up the Build/Home directory and refused to run. => There mustn't be any spaces between -B and the target directory.
  • CMake refused to acknowledge options. => There mustn't be any spaces between -D and the option; there must be a = between the option and its value.
  • Jobs were picked up by the wrong runner and failed. => Improve labeling of different private and public runners so that e.g. linux & debian-jessie are only labels of the Docker runner, not the Shell one; configure runners as private runners to not pick up unlabeled jobs.

Reading recommendations (2016-07-11)

Posted on Mon 11 July 2016 in reading recommendations

While I'm waiting for review of another article containing information from a project done at my current employer, I wanted to share some reading recommendations for articles I came across recently.

I'm not yet sure if this is going to become a regular thing, but it might - sharing links via Twitter or other platforms is something I tend to dislike more and more, while other low-friction options have let me down in the past, so the blog seems the most likely home for such content.

Besides the link, I'll try to include the author's name or nickname and how I found the piece - maybe a little description for each of them where it feels appropriate.

  • Start Every Meeting with a Personal Check-in by Mathias Meyer (RSS, Travis CI blog)
    I find this concept to be very smart but am reluctant to suggest such a thing for fear of being ridiculed. Being honest, I wouldn't care too much about others' feelings in general, but being forewarned that I shouldn't expect them to be at their best when they're feeling low is a major selling point for me.

  • Riding Immortal on the Seeking Road by ~Saint Arthur (RSS, I'm a candle blog, Fallen London universe)
    A chronicle of what is in my opinion Fallen London's most fascinating piece of lore. Seeking the Name, from its inception via the hiatus to its conclusion in 2016. If you're not into Fallen London or soaking up details in wikis about things you like, you will want to skip this one.

  • Answer to What lead to the Ottoman Empire decriminalizing homosexuality in 1858? Was there a lot of opposition and controversy around this? on Reddit by ~PaxOttomanica (newsletter, Reddit upvoted weekly)
    Interesting, detailed answer that mentions a fact we currently seem to be forgetting: Sometimes, laws need to be adjusted and modernized in order to avoid criminalizing everyone for something that has become common practice.

  • Once Upon A Time in the Valley by various Twitter users (unearthed from the depths of my Instapaper, originally almost certainly via Twitter)
    A humorous, cynical critique from InfoSec about how silly Silicon Valley startup culture can appear. Good for a laugh or two.

  • GamerGate is killing video games by ~Zennistrad, ~segoli, Jay Rachel Eddin (unearthed from the depths of my Instapaper, originally via Twitter)
    Although GG is thankfully a thing of the past, this is a very thoughtful piece on the effect of GamerGate on video game archiving and the perception of video games in academia. I think the original article I read was by ~Zennistrad, but this one has additional commentary and ~Zennistrad's appears to be offline.

  • Hiring in tech should prioritize skill, not charisma by Thomas H. Ptacek (unearthed from the depth of my Instapaper, probably via Twitter)
    Interestingly, although I don't currently plan on changing employers soon, I keep coming back to the idea of what I'd try to do differently when interviewing and choosing my successor. The potential scenarios are all interesting.

  • Reverse Turing testing tech support by Rob Graham (unearthed from the depth of my Instapaper, original via RSS, Errata Security blog)
    A critical experience report about getting tech support from official sources and how doing your own research saves time if you already have a clue what you're looking for. Might make you cringe and smile at the same time.

7 is a good number. Let's stick with that. Good night.


Improving our Xen Usage

Posted on Tue 31 May 2016 in work • Tagged with Institute for Computer Vision and Computer Graphics

The software we use at the ICG for virtualizing servers is Xen. That's fine, because it hasn't made my life unnecessarily hard yet. There are, however, some things that could be done better - especially when handling things with Puppet.

How it used to work

When I initially found the infrastructure for our Xen guests, the configuration files were located in a subdirectory of /srv that turned out to be an NFS share (mounted with the hard option because that's the default). This was the same for all our Xen hosts apart from one, which had local configurations but symlinked to a similar folder.

Inside these folders were many configuration files of VMs that had been retired long ago, which made finding the currently used files a somewhat annoying task.

The main reason I chose to rework this was the NFS mount - when the host providing the NFS share wouldn't reboot during a standard maintenance I had no configuration for any guest on all but one of our Xen hosts. That was an inconvenient situation I hoped to avoid in the future.

How it works right now

One of my issues with the previous solution was that it left important configuration files in a non-standard path instead of /etc/…. Furthermore I wanted to use version control (Git, to be precise) in order to keep the directory clean and current while also maintaining the file histories.

I integrated everything into our Git versioned Puppet code by writing a class which installs the xen-hypervisor-4.4-amd64 package, maintains the /etc/xen/xl.conf and /etc/xen/xend-config.sxp files as well as the directory /etc/xen/domains - the latter of which is a single flat directory where I keep all Xen guest configuration files.

The files are named according to a special syntax so that it's possible to see at a glance where the domains are supposed to run (e.g. 02-example_domain.cfg).
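Because the scheme is fixed, the host prefix can be split off with plain parameter expansion. A small sketch, using the example filename from above:

```shell
#!/usr/bin/env bash
# Split a guest config filename of the form <host>-<domain>.cfg
# into its host prefix and domain name.
cfg="02-example_domain.cfg"
host_number="${cfg%%-*}"                      # "02"
domain="${cfg#*-}"; domain="${domain%.cfg}"   # "example_domain"
echo "host ${host_number} runs ${domain}"     # prints: host 02 runs example_domain
```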

While further improving our Xen hosts with additional monitoring, unattended upgrades and optimizing the DRBD running on some of them I soon found out that this solution wasn't great either. The flat directory prevented me from writing simple Puppet code to use Xen's /etc/xen/auto directory to have certain guests automatically started (or resumed, depending on circumstances) on boot of the host.

What the suggested solution looks like

Since Puppet is not a scripting language, your established way of thinking (mine being: "I know, I'll use a 'for' loop") often can't solve the problem, and you either have to research ways of working around it or find idiomatic ways to solve it.

I needed a way to make sure the right Xen configurations would end up in each host's /etc/xen/auto without them trying to start configurations for other hosts. Given the naming scheme this could be as easy as the following snippet.

# NOTE: untested and only here for illustration purposes
#       You need to get the host number from somewhere
#       but that wouldn't be the main issue.

exec { 'link-xen-configurations':
  command     => '/usr/bin/find /etc/xen/domains -type f -name "NUMBER-*.cfg" | /usr/bin/xargs -I FILENAME -n1 -t ln -f -s FILENAME /etc/xen/auto/',
  provider    => 'shell',
  refreshonly => true,
  user        => 'root',
}

Of course you would need to remove existing links first, and using execs is a messy business after all. Besides - something I hadn't touched on yet - there are also VM configurations with two prefixes to signify on which hosts they can run (e.g. 01-03-other_example.cfg), due to DRBD syncing their contents on a block level between two hosts.
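Deciding whether a given host may start a configuration then means checking every numeric prefix. A sketch of that check (not taken from our actual Puppet code) which handles both single- and dual-prefix names:

```shell
#!/usr/bin/env bash
# Check whether a host number appears among the numeric prefixes of
# a guest config, e.g. 02-example_domain.cfg or 01-03-other_example.cfg.
may_run() {
  local host="$1" part
  local -a parts
  IFS='-' read -ra parts <<< "${2%.cfg}"
  for part in "${parts[@]}"; do
    [[ "$part" =~ ^[0-9]+$ ]] || break  # prefixes end where the name begins
    [ "$part" = "$host" ] && return 0
  done
  return 1
}

may_run 01 "01-03-other_example.cfg" && echo "host 01: start"
may_run 02 "01-03-other_example.cfg" || echo "host 02: skip"
```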

Given this, it's even more complex to build such a system in a way that won't break in spectacular fashion the first time you look away after a deploy.

My plan is to create host-specific folders in our Puppet code and have Puppet symlink those, since the $::hostname variable provided by Puppet's Facter makes this extremely easy. In addition, disentangling the multiple-host configurations will be necessary - this will avoid DRBD-capable hosts starting the same VM at the same time. I might combine this with changing the device specified in the Xen configurations.

-disk = ["phy:/dev/drbd13,xvda,w"]
+disk = ["drbd:myresource,xvda,w"]

This will direct Xen to put the DRBD resource named 'myresource' into the Primary role, and configure it as device xvda in your domU. ~/etc/xen/scripts/block-drbd (slightly changed to use whole disk instead of partition)

The interesting thing here is that the resource automatically becomes primary when the Xen domain is started - there is no need to have DRBD itself promote it on startup of a particular node; this is done on demand as soon as a Xen guest requires it.

In time - with DRBD 9 - it might even be reasonable to have all VM hosts be able to run all guests due to cluster-mode block syncing.


On changing hard disks

Posted on Fri 22 January 2016 in work • Tagged with Institute for Computer Vision and Computer Graphics

Now, I might have mentioned in the past that despite working as a system administrator, I dislike working with actual hardware and prefer to touch machines only once they are SSH-ready, or at most once an operating system can be installed.

  • This post has been updated once.

Well, let's assume for a moment that a disk needs changing and neither the senior admin at my current job nor my predecessor is available. Let's assume that this has happened twice already and led to rather amusing stories both times.

first time's always fun

The first time I was to change a disk I had help from my colleague Daniel Brajko who accompanied me to the server room, but let's start at the beginning.

I noticed that something was up and a disk had an error when I wrote my script to check the RAID controllers' status output and get notified automatically when something is wrong. I decided to tackle this task since it was one important piece of work my senior admin had assigned me during his absence.
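The core of such a check is small. Here's a hedged sketch of the idea only; our actual script wraps the RAID controller's CLI, whose name and output format vary per vendor:

```shell
#!/usr/bin/env bash
# Flag a RAID as problematic if the controller's status output
# contains a known problem keyword.
raid_status_ok() {
  ! grep -Eq 'DEGRADED|DEVICE-ERROR|REBUILDING|FAILED' <<< "$1"
}

if raid_status_ok "u0  RAID-6  OK"; then
  echo "all good"
fi
if ! raid_status_ok "u0  RAID-6  DEGRADED"; then
  echo "RAID status problematic - send notification"
fi
```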

After checking the serial number and the size of the faulty drive, we headed to the storage space and picked several disks, since we were not sure which one was to go into exactly that server. Actually, at that time we couldn't be sure because some of the disks were not labelled with their size (looking at you, Seagate). With the disks and more equipment in a backpack, we ventured to the server room, which is conveniently located within walking distance of our office.

We only made it as far as the server room door, though. Neither my employee card nor my colleague's was authorized to enter, even though he's been in this job for over a year. Great. The helpful thing was that authorization had not yet been transferred from my predecessor to me, and he still worked at our institute in a different position. He knew us and lent us his card so we could change the disks, as he clearly recognized the need for such maintenance. I had a bad feeling the whole time that someone would "catch" us and we'd have to explain in an extremely awkward situation why we were using this card.

With this card - impersonating our former colleague - we ventured into the server room, only to find that the machine in question was in our secondary server room - the one that is multiple blocks away. Alright, this wasn't going to be easy.

So we packed everything back up and walked to the secondary building. Daniel had only ever been there once, I had never been there. The building has two basement levels which are not particularly well lit nor particularly easy to find your way around in. I wouldn't necessarily call it a maze but it's certainly not far from that. After 15 minutes of running around without any clue we surrendered and went up to the ground floor to consult one of the university's information terminals to find our own server room. A glorious day, let me tell you.

After finding our server room and granting ourselves access with the borrowed card, we entered, looked for our server cabinet (of course it was the only unlabelled one) and well… uhm. That was the point where Daniel pointed out that, yes, we did need the keychain that I had told him to leave behind because "I already have everything we need".

And back we went. *sigh* After fetching the keychain we also borrowed my predecessor's bike as well as another one, went back, back into the basement, changed the drive - which was relatively painless once we realized that only one of the disks we had brought had the correct capacity - and returned.

And that's how it took two sysadmins a whole afternoon to change a damaged disk. After that episode we phoned the person in charge and got ourselves assigned the server room access permissions. But...

second time you're quickly done

Today this little e-mail arrived. That's the second time it did, and I always like it when my efforts pay off. :)

RAID status problematic.

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       256K    5587.88   RiW    ON     

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     931.51 GB   1953525168    [REDACTED]            
p1     OK               u0     931.51 GB   1953525168    [REDACTED]            
p2     DEVICE-ERROR     u0     931.51 GB   1953525168    [REDACTED]        
p3     OK               u0     931.51 GB   1953525168    [REDACTED]            
p4     OK               u0     931.51 GB   1953525168    [REDACTED]      
p5     OK               u0     931.51 GB   1953525168    [REDACTED]            
p6     OK               u0     931.51 GB   1953525168    [REDACTED]            
p7     OK               u0     931.51 GB   1953525168    [REDACTED]  
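The status block above is easy to scan by eye, but when these mails arrive while you're half-asleep, it helps if a script pulls out the failing port for you. Here's a minimal sketch of such a parser, assuming the fixed-width layout shown above; the function name and the placeholder serials in the sample are my own, not from the actual monitoring setup:

```python
# Sketch: extract degraded units and failed ports from a RAID status
# report like the one above. Relies on the column layout shown in the
# sample output; serial numbers below are placeholders.

def parse_raid_status(report: str):
    """Return (degraded_units, failed_ports) from a status report."""
    degraded_units = []
    failed_ports = []
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("u") and len(fields) >= 3:
            # Unit line: u0  RAID-6  DEGRADED  ...
            if fields[2] != "OK":
                degraded_units.append((fields[0], fields[2]))
        elif fields[0].startswith("p") and len(fields) >= 3:
            # Port line: p2  DEVICE-ERROR  u0  ...
            if fields[1] != "OK":
                failed_ports.append((fields[0], fields[1], fields[2]))
    return degraded_units, failed_ports

sample = """\
u0    RAID-6    DEGRADED       -       -       256K    5587.88   RiW    ON
p0     OK               u0     931.51 GB   1953525168    AAAAAAAA
p2     DEVICE-ERROR     u0     931.51 GB   1953525168    BBBBBBBB
p3     OK               u0     931.51 GB   1953525168    CCCCCCCC"""

units, ports = parse_raid_status(sample)
print(units)  # [('u0', 'DEGRADED')]
print(ports)  # [('p2', 'DEVICE-ERROR', 'u0')]
```

The header lines ("Unit", "Port") and separator rows fall through both checks because they don't start with a lowercase `u` or `p`, so the raw e-mail body can be fed in as-is.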

Okay. So. Senior admin is absent again, disk fails again. This time Daniel is also not there. "Fine," I tell myself, it will be painless this time. I was so, so wrong.

After making a quick joke with the researchers that maybe they should go home early - because if I fail at replacing the disk, we won't have any e-mail service - I grabbed the keys and a replacement disk. This time I again couldn't find one labelled with the right storage capacity, but I got smarter and made an educated guess based on 5 of 8 characters of the serial number matching. I headed to the next building, ran into the admin from the other institute and joked about whether they also had "those mean things lacking a storage capacity description". He helpfully stated that they use the same model and that theirs were 1 TB models, which gave me some relief. After opening our server racks and checking all devices in there I came to a terrible realization: of course I was in the wrong building. Again. (This time I made a list of all our devices in this building for our internal docs.)

Alright, back up from the basement. I notified the office that the keychain had not gone missing and that I was taking it to the other building. I walked through the cold winter air, entered the basement and found the server room on the first try. This is a thing that tends to happen: if I am ever required to find my way to a place by myself, I will keep finding the way there in the future. Anyway, I held my card to the scanner and… nothing happened. I cursed, waited a bit and tried again. Again, nothing. There's an emergency contact on the door, and after returning to the ground floor to get cellphone reception I called it. We had a longer conversation, and apparently I hadn't received all the permissions I should have gotten when the issue arose the first time. Shall we say I was a little annoyed that not both permissions had been transferred from my predecessor directly to me?

Update: It turns out I am again to blame for something, as I did have the permissions. However, I didn't know that the card activation only works for the building whose sensor you last checked in at. So, believing my card was supposed to work after having just used one sensor, I obviously didn't visit the sensor at the other building.

After managing to get emergency access I scoured the room for our server rack. I panicked a little when there was nothing where I remembered seeing it last time. I mean, yes, it had been a while, but my memory for locations is pretty accurate, and I don't think anyone would've moved the machines without the admins noticing. Good thing no one else was in the room, since I must've looked like a burglar, using my iPhone's flashlight to search the empty server cabinet where our machines were supposed to be. Then I noticed that there were indeed machines there. It was just that both were in really slim chassis and located in the topmost and bottommost slots. In addition, one was turned off, so I missed both when looking less carefully. Oh, yeah. Our stuff was in the only unlabelled rack, because of course it still was. I really hope the people in charge don't have microphones there, since I might've been swearing quite a lot.

The rest was easy work: change the disk, make sure the RAID recognized the new drive, pack everything up and go home.

I'm morbidly curious what surprises the next drive change will offer me.

PS: Yes, labelling our rack is on top of my TODOs.