On changing hard disks
Posted on Fri 22 January 2016 • Tagged with Institute for Computer Vision and Computer Graphics, Work
Now, I might have mentioned in the past that despite working as a system administrator, I dislike working with actual hardware and prefer to touch machines only once they are SSH-ready, or at the earliest once an operating system can be installed.
- This post has been updated once.
Well, let’s assume for a moment that a disk needs changing and neither the senior admin in my current job nor my predecessor is available. Let’s assume that this has happened twice already and led to rather amusing stories both times.
first time’s always fun
The first time I was to change a disk I had help from my colleague Daniel Brajko who accompanied me to the server room, but let’s start at the beginning.
I noticed that something was up and a disk had an error when I wrote my script to check the output of the RAID controllers’ status information to get notified automatically when something was wrong. I decided to tackle this task since it was one important piece of work that my senior admin had assigned me during his absence.
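For the curious, a check like that can be sketched roughly as follows. This is not my actual script, just a minimal Python illustration: it assumes the controller's status text has already been captured as a string, the names `find_problems`, `HEALTHY` and `SAMPLE` are made up for this example, and the set of statuses treated as healthy is a guess that would need adjusting for a real controller. The sample text mirrors the layout of the notification e-mail quoted later in this post.

```python
import re

# Statuses treated as healthy. This set is an assumption for the sketch,
# not something taken from the controller's documentation.
HEALTHY = {"OK"}

# Unit lines look like "u0 RAID-6 DEGRADED ...": name, type, status.
UNIT_RE = re.compile(r"^(u\d+)\s+\S+\s+(\S+)")
# Port lines look like "p2 DEVICE-ERROR u0 ...": name, status.
PORT_RE = re.compile(r"^(p\d+)\s+(\S+)")

# Shortened sample in the same layout as the e-mail shown in this post.
SAMPLE = """\
Unit  UnitType  Status    %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED  -       -       256K    5587.88   RiW    ON

Port  Status        Unit  Size       Blocks      Serial
---------------------------------------------------------------
p0    OK            u0    931.51 GB  1953525168  [REDACTED]
p1    OK            u0    931.51 GB  1953525168  [REDACTED]
p2    DEVICE-ERROR  u0    931.51 GB  1953525168  [REDACTED]
"""


def find_problems(status_text):
    """Return (name, status) pairs for every unit or port that is not healthy."""
    problems = []
    for line in status_text.splitlines():
        line = line.strip()
        # Header and separator lines match neither pattern and are skipped.
        match = UNIT_RE.match(line) or PORT_RE.match(line)
        if match and match.group(2) not in HEALTHY:
            problems.append((match.group(1), match.group(2)))
    return problems


if __name__ == "__main__":
    problems = find_problems(SAMPLE)
    if problems:
        print("RAID status problematic.")
        for name, status in problems:
            print(f"{name}: {status}")
```

Run from cron with the output mailed to the admins, something along these lines is enough to get nagged whenever a unit or port reports anything other than OK.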
After checking the serial and the disk size of the faulty drive we headed to the storage space and picked several disks, since we were not sure which one was to go into exactly that server. Part of the reason we were not sure was that some of the disks were not labelled with their disk size (looking at you, Seagate). With the disks and more equipment in a backpack, we ventured to the server room which is conveniently located within walking distance of our office.
We only came as far as the server room door, though. Neither my employee card nor my colleague’s was authorized to enter, even though he, at least, had been in this job for over a year. Great. Alright, the helpful thing was that the authorization had not yet been transferred from my predecessor to me and he still worked at our institute in a different position. He knew us and lent us his card in order to change disks, as he clearly recognized the need for such maintenance. I had a bad feeling the whole time that someone would “catch” us and we’d have to explain why we were using this card in an extremely awkward situation.
With this card - impersonating our former colleague - we ventured into the server room, only to find that the machine in question was in our secondary server room - the one that is multiple blocks away. Alright, this wasn’t going to be easy.
So we packed everything back up and walked to the secondary building. Daniel had only ever been there once, I had never been there. The building has two basement levels which are not particularly well lit nor particularly easy to find your way around in. I wouldn’t necessarily call it a maze but it’s certainly not far from that. After 15 minutes of running around without any clue we surrendered and went up to the ground floor to consult one of the university’s information terminals to find our own server room. A glorious day, let me tell you.
After finding our server room and granting ourselves access with the borrowed card we entered the room, looked for our server cabinet (of course it was the only unlabelled one) and well… uhm. That was the point where Daniel pointed out that, yes, we did need the keychain that I had told him to leave behind because, “I already have everything we need”.
And back we went. *sigh*. After fetching the keychain we also borrowed my predecessor’s bike as well as another one and went back, back into the basement. We changed the drive, which was relatively painless once we realized we only had one disk with the correct storage capacity with us, and returned.
And that’s how it took two sysadmins a whole afternoon to change a damaged disk. After that episode we phoned the person in charge and got ourselves assigned the server room access permissions. But…
second time you’re quickly done
Today this little e-mail arrived. That’s the second time it did and I always like it when my efforts pay off. :)
RAID status problematic.

Unit  UnitType  Status    %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED  -       -       256K    5587.88   RiW    ON

Port  Status        Unit  Size       Blocks      Serial
---------------------------------------------------------------
p0    OK            u0    931.51 GB  1953525168  [REDACTED]
p1    OK            u0    931.51 GB  1953525168  [REDACTED]
p2    DEVICE-ERROR  u0    931.51 GB  1953525168  [REDACTED]
p3    OK            u0    931.51 GB  1953525168  [REDACTED]
p4    OK            u0    931.51 GB  1953525168  [REDACTED]
p5    OK            u0    931.51 GB  1953525168  [REDACTED]
p6    OK            u0    931.51 GB  1953525168  [REDACTED]
p7    OK            u0    931.51 GB  1953525168  [REDACTED]
Okay. So. Senior admin is absent again, disk fails again. This time Daniel is also not there. “Fine,” I tell myself, it will be painless this time. I was so, so wrong.
After making a quick joke with the researchers that maybe they should go home early, because if I failed at replacing the disk we wouldn’t have any e-mail service, I grabbed the keys and a replacement disk. Once again I couldn’t find one labelled with the right storage capacity, but I had gotten smarter and made an educated guess based on 5 of 8 characters of the serial number matching. I headed to the next building, ran into the admin from the other institute and jokingly asked whether they also had “those mean things lacking a storage capacity description”. He helpfully stated that they use the same model and that they were 1 TB models, which gave me some relief. After opening our server racks and checking all devices in there I came to a terrible realization: Of course I was in the wrong building. Again. (This time I made a list of all our devices in this building for our internal
Alright, back up from the basement, notified the office that the keychain has not gone missing and I’m taking it to go to the other building. I walked through the cold winter air to the other building, entered the basement and found the server room on the first try. This is a thing that tends to happen. If I am ever required to find my way to a place by myself I will keep finding the way there in the future. Anyway, so I hold my card to the scanner and… nothing happens. I cursed, waited a bit and tried again. Again, nothing. There’s an emergency contact on the door and after returning to the ground floor in order to have cellphone reception I called that, we had a longer conversation and obviously I didn’t receive all the permissions I should have gotten when the issue arose the first time. Shall we say I was a little annoyed that not both permissions had been transferred from my predecessor directly to me?
Update: It turns out I am again to blame for something, as I did have the permissions. However, I didn’t know that the card activation via sensor only works for the building you last checked in at the sensor. So, believing my card is supposed to work after having just been at one sensor I obviously didn’t visit the sensor at the other building.
After managing emergency access I scoured the room for our server rack. I panicked a little when there was nothing where I remembered seeing it last time. I mean, yes, it had been a while but my memory for locations is pretty accurate and I don’t think anyone would’ve moved the machines without the admins noticing. Good thing no one else was in the room since I must’ve looked like a burglar using my iPhone’s flashlight to search the empty server cabinet where our machines were supposed to be. Then I noticed that there were indeed machines. It was just that both were in real slim chassis and they were located in the topmost and bottommost slots. In addition one was turned off and so I missed both when looking less carefully. Oh, yeah. Our stuff was in the only unlabelled rack, because of course it still was. I really hope the people in charge don’t have microphones there since I might’ve been swearing quite a lot.
The rest was easy work. Change the disk, make sure the RAID recognizes the new drive, pack everything up and go home.
I’m morbidly curious what surprises the next drive change will offer me.
PS: Yes, labelling our rack is on top of my TODOs.