Saturday, August 6, 2011

Unix and the Hole Hawg

In his article "The Hole Hawg of Operating Systems", Neal Stephenson compares Unix to an industrial-strength drill that is "too powerful and too expensive for the homeowner" and that is "like the genie of the ancient fairy tales, who carries out his master's instructions literally and precisely and with unlimited power, often with disastrous, unforeseen consequences".

Our reader Paul F. commanded the "Unix genie" to look for log files older than 90 days and erase them. Just a simple "find" command:


find / -name *log* -atime 90 -exec rm -rf {} \;


This command looks for files or directories with "log" in their names that haven't been accessed in 90 days, and runs "rm -rf" on each of them. (Strictly speaking, "-atime 90" matches an access time of 90 days ago; "more than 90 days" is written "-atime +90".) Files are erased, and directories have their contents erased recursively before being removed themselves.

Our reader issued this command as part of his night-shift routine, then went to sleep, unaware of the consequences. When the day shift came to the office and started work, all hell broke loose.

It turns out that the server where the command was executed was a database server, which happens to have very important files with "log" in their names: the redo logs. The command, which started its search for something-log files in the root directory, found those files and utterly destroyed them. The database server uses these files to log its actions before updating the data files, so that it can recover from possible errors. When these files don't exist, the database can't operate at all. And if the database can't operate, the users can't operate.

And when the users arrived in the morning and tried to use their precious database, they got errors instead of their data. So they called their boss. Their boss called his boss, and that boss called the vice president of operations, who demanded the head of whoever was responsible for paralyzing the company.

Actually, Paul F. didn't paralyze the company (only a few users were affected), and the vice president of operations was told that the database server involved was not critical: it was only a backup server, and the actual data was safe elsewhere. So the head of Paul F. was spared after all. The incident also started a good discussion about disaster recovery, should this ever happen on a real server.

Moral of the story: never, EVER run find with the rm command starting in the root directory, for it will have unforeseen consequences.
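A more cautious version of that cleanup might look like the sketch below. It is only an illustration under stated assumptions: the function name list_old_logs is hypothetical, and the idea is to search one specific directory instead of /, quote the pattern, match regular files only, and print the candidates without deleting anything.

```shell
#!/bin/sh
# Hypothetical helper (not from the original story): list regular files
# under ONE directory whose names contain "log" and whose last access
# was more than N days ago. It only prints; nothing is deleted.
list_old_logs() {
    dir=$1          # directory to search, never "/"
    days=${2:-90}   # age threshold in days, defaults to 90
    # The quoted pattern keeps the shell from expanding *log* itself.
    # -type f skips directories; +N means "more than N days".
    find "$dir" -type f -name '*log*' -atime +"$days" -print
}
```

For example, list_old_logs /var/log/myapp 90 (the path is made up) would print the candidates. Only after reading that list would it make sense to swap -print for a delete step, and "rm -f" without -r is enough once directories are excluded.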

Thursday, August 4, 2011

Unix Horror Stories: The good thing about Unix is that when it screws up, it does so very quickly

The project to deploy a new, multi-million-dollar commercial system on two big, brand-new HP-UX servers at a brewing company that shall not be named had been running on time and within budget for several months. Just a few steps remained, among them the migration of users from the old servers to the new ones.

The task was going to be simple: just copy the home directory of each user from the old server to the new ones, then run a simple script to change the owner, making sure that each home directory was owned by the correct user. The script went something like this:


#!/bin/bash

cat /etc/passwd|while read line
do
USER=$(echo $line|cut -d: -f1)
HOME=$(echo $line|cut -d: -f6)
chown -R $USER $HOME
done


As you see, this script is pretty simple: obtain the user and the home directory from the password file, and then execute the chown command recursively on the home directory. I copied the files, executed the script, and thought, great, just 10 minutes and all is done.

That's when the calls started.

It turns out that while I was executing those seemingly harmless commands, the server was under acceptance test. You see, we were just one week away from going live, and only the final touches remained. So the users at the brewing company had started testing whether everything they needed worked as it did on the old servers. Suddenly, the users noticed that their system was malfunctioning and started making furious phone calls to my boss, and then my boss started calling me.

And then I realized I had trashed the server. Completely. My console was still open, and I could see processes failing one by one, printing very strange messages that didn't look good at all. I started to panic. My workmate Ayelen (who had just copied my script and run it on the mirror server) and I realized only too late that the home directory of the root user was /, the root filesystem, so we had changed the owner of every single file in the filesystem to root!!! That's what I love about Unix: when it screws up, it does so very quickly, and thoroughly.
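In hindsight, a quick look at the password file would have revealed the trap. The sketch below is an illustration rather than something we ran at the time; the function name homes_at_root is hypothetical. It prints every account whose home directory field is exactly "/".

```shell
#!/bin/sh
# Hypothetical check (not part of the original script): print the name of
# every account in a passwd-format file whose home directory is "/".
# Field 1 is the user name and field 6 is the home directory.
homes_at_root() {
    awk -F: '$6 == "/" { print $1 }' "${1:-/etc/passwd}"
}
```

Any name this prints is an account for which a recursive chown on its home directory would walk the entire filesystem.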

There must be a way to fix this, I thought. Like any modern Linux/Unix distribution, HP-UX has a package installer, swinstall. That utility has a repair command, swrepair. The following command put the system files back in order; only the application directories, which weren't installed with the package manager, needed a few permission changes afterwards:


swrepair -F


But the story doesn't end here. The next week we were going live, and I knew that the migration of the users would be for real this time, not just a test. My boss and I were on our way to the brewing company when he received a phone call. He turned to me and asked, "What was the command that you used last week?" I told him, and I noticed that he was dictating it very carefully. When we arrived, we saw why: before the final deployment, a Unix administrator from the company had made the same mistake I did, but this time people from the whole country were connecting to the system, and he received phone calls from a lot of angry users. Luckily, the mistake could be fixed, and we all, young and old, went back to reading the HP-UX manual. Those things can come in handy sometimes!

Moral of this story: before doing something to the users' directories, take the time to check the user IDs of the actual users. They usually start at 500, although this is configuration-dependent, and system users' IDs are lower than that.
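Putting that moral into practice, the loop could be sketched roughly as follows. This is not the script we used: the function name plan_home_chown and the default threshold of 500 are assumptions for illustration, and the function only prints the chown commands it would run, so the plan can be reviewed first.

```shell
#!/bin/sh
# Hypothetical dry run (an illustration, not the original script):
# print one chown command per regular user instead of executing it.
# Arguments: passwd-format file, minimum UID (500 is an assumed default).
plan_home_chown() {
    passwd_file=${1:-/etc/passwd}
    min_uid=${2:-500}
    # Fields: name:password:uid:gid:gecos:home:shell
    while IFS=: read -r name _ uid _ _ home _; do
        # Skip system accounts and lines whose UID is not a number.
        [ "$uid" -ge "$min_uid" ] 2>/dev/null || continue
        # Never touch an empty home directory or the root filesystem.
        case $home in
            ''|/) continue ;;
        esac
        printf 'chown -R %s %s\n' "$name" "$home"
    done < "$passwd_file"
}
```

Unlike the original loop, this one does not assign to USER or HOME, which are the shell's own environment variables, and it reads the fields directly with IFS=: instead of calling cut twice per line.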

Send in your Unix horror story, and it will be featured here in the blog!

Greetings,
Agustin