The OS Quest Trail Log #39: Long Lingering Mistake Edition

Today wasn’t a good day for site uptime stats. After about 2 months of continuous server uptime there were a couple of planned outages and one unplanned one. I decided to upgrade from Ubuntu 8.04 to Ubuntu 8.10. Yeah, I know 9.04 was just released, but I don’t want to be that bleeding edge on my server and 8.10 had some features that would make my life a bit easier. Besides, the path to 9.04 goes through 8.10 anyway.

The upgrade itself was relatively painless and the downtime was limited to about 10 minutes plus a second quick reboot later on. The real problem came later and was actually unrelated to the upgrade, although I spent a lot of time thinking it was related, which extended the outage.

Ubuntu 8.04 to 8.10 Upgrade

The Ubuntu upgrade was simple. I’m running Ubuntu 8.04 Server, which is a Long Term Support (LTS) edition. So the steps were:

  1. Make sure update-manager-core is installed: sudo aptitude install update-manager-core
  2. Edit the update manager configuration to allow the upgrade from the LTS edition to a regular release (typically LTS editions default to not offering non-LTS upgrades, since the point is a long term install with minimal changes). So I run sudo nano /etc/update-manager/release-upgrades and change the “prompt” setting to be Prompt=normal
  3. Then I run the upgrade: sudo do-release-upgrade
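
The configuration edit in step 2 can also be scripted instead of done by hand in nano. This is only a minimal sketch, assuming GNU sed (the default on Ubuntu). The function name set_prompt_normal is made up for illustration, and the commented commands at the bottom are the same ones from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: switch the release-upgrades "Prompt" setting to
# "normal" so do-release-upgrade will offer the next regular release.
# $1 is the path to the config file (normally
# /etc/update-manager/release-upgrades).
set_prompt_normal() {
    # Replace whatever value Prompt currently has (for example "lts").
    sed -i 's/^Prompt=.*/Prompt=normal/' "$1"
}

# Usage, run as root on the server (left commented out so that sourcing
# this file does not start an upgrade):
#   aptitude install update-manager-core
#   set_prompt_normal /etc/update-manager/release-upgrades
#   do-release-upgrade
```

Editing the file in nano works just as well. The only point of the function is that the change is repeatable and easy to check afterwards with grep Prompt /etc/update-manager/release-upgrades.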

I follow the instructions as the install progresses. The upgrade completes and I reboot about 40 minutes after I issued the command. Apache and my sites were down for about 10 minutes of that time. I was asked a couple of questions during the install and accepted the defaults. Basically I kept my existing configuration files rather than replacing them with the ones included in the new packages.

Everything was fine after the reboot.

Because I run on a virtual private server (VPS), the kernel wasn’t upgraded, so that comes next.

Kernel Upgrade

This was the easy part. All I had to do was open a support ticket with Slicehost and ask them to upgrade the kernel. They got back to me within minutes letting me know the kernel version and that there would be a reboot. I confirmed it was OK and the work was done a few minutes later. The whole thing took less than 20 minutes from when I submitted the request, and the downtime was about a minute for the reboot.

Everything was fine after the reboot.

And Then The Problems…

I’ve been tweaking Apache to see what the performance impact will be. I made a change and restarted Apache, but it didn’t start. Naturally I looked at the settings I’d just changed and backed them off. It still wouldn’t start. So then I started looking for something that might have changed due to the upgrade, despite everything working after the reboots.

Finally I checked the Apache log and noticed errors saying that “rotatelogs” couldn’t be found. I use it to manage the logs and it’s worked in the past. And it’s still there. Then I noticed the leading “/” was missing from the rotatelogs path shown in the log. Sure enough, it was missing from the configuration file too. Stick it in and all is well.
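
A quick way to hunt for this kind of mistake is to grep the Apache configuration for piped log directives whose program path does not start with “/”. The following is a rough sketch only: the function name find_relative_piped_logs is invented for illustration, the pattern only covers the CustomLog, ErrorLog and TransferLog directives, and the /etc/apache2 location in the usage comment is the usual Ubuntu layout rather than something taken from my actual configuration.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) any piped-log directive
# whose program path is not absolute, e.g.
#   CustomLog "|usr/sbin/rotatelogs ..." combined
# Arguments are passed straight to grep, so files or -r <dir> both work.
find_relative_piped_logs() {
    grep -En \
        '^[[:space:]]*(CustomLog|ErrorLog|TransferLog)[[:space:]]+"?[|][[:space:]]*[^/|$[:space:]]' \
        "$@"
}

# Usage (assumed Ubuntu layout):
#   find_relative_piped_logs -r /etc/apache2
```

No output means nothing suspicious matched. Anything it does print is worth a look, since a relative path there only works if Apache happens to be started from the right directory.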

I’d rarely restarted Apache in the past. The reboots were the first time it had been restarted since the server first came up. I’d do reloads, but no restarts. This time I used the “restart” command and got the error. I figure the working directory happened to be right on the reboots, so the lack of an absolute path went unnoticed all this time.

A relatively easy fix, but there were about 20 minutes of downtime before I found it. Problems are supposed to be caused by the last thing changed, not long lingering mistakes just waiting to explode on the scene with the right trigger.