Results matching “time capsule” from Bill's Words

launchd Vagaries

|

A few evenings ago, I was the most frustrated person on the face of the planet. I can tell you this with some assurance because I have a meter on my Dashboard that tells me so. It measures my typing pace and the frequency of command line errors and... I digress.

The root cause of my frustrations was a file named #org.clamav.freshclam.plist# (whose origins I'm still unclear on, though I assume it's a leftover emacs temporary file or something). Every time the system booted, launchd tried to load that plist file, located in the directory /System/Library/LaunchDaemons/, and the net result was that postfix wouldn't start because amavisd wouldn't start. Why amavisd wouldn't start, I still don't know, because it seems to me it shouldn't depend on freshclam to run. But I couldn't tell what was going on by looking at any of the logs because there is not enough information shown.

And therein lies the heart of the matter: launchd, though it doesn't operate quite silently, doesn't have a debug mode. Furthermore, even in normal operation, it doesn't quite give enough information to make for productive bug hunting. I thought of these problems after the frustration level quieted down the following morning when I figured out what was going on.

First, What was going on?

What I knew was that postfix wasn't launching and neither was amavisd. I also knew from these entries in the system log

Oct 8 19:26:03 shr-g5 org.clamav.freshclam[3087]: ERROR: Incorrect argument format for option <strike>-checks (</strike>c) Oct 8 19:26:03 shr-g5 org.clamav.freshclam[3087]: ERROR: Can't parse command line options Oct 8 19:26:03 shr-g5 com.apple.launchd[1] (org.clamav.freshclam[3087]): Exited with exit code: 1 Otc 8 19:26:03 shr-g5 com.apple.launchd[1] (org.clamav.freshclam): Throttling respawn: Will start in 10 seconds that there was a problem with the file /System/Library/LaunchDaemons/org.clamav.freshclam.plist (or so I thought). I have mentioned this problem before, as well as the fix.

Great! I know how to fix that! So, armed with the confidence similar to that elicited by carrying a howitzer into a rock/paper/scissors match, I edited the file at fault because I had indeed messed around in Server Admin and knew that SA rewrites that file and screws it up again.

So I was perplexed when I kept seeing these errors over and over and over again. Stranger still, I could do a

sudo launchctl unload org.clamav.freshclam.plist followed by a sudo launchctl load org.clamav.freshclam.plist and freshclam was happy! I could then manually launch amavisd and postfix and life was good, until I rebooted and the same frickin' problem reappeared.

And that left me extremely frustrated, even more so than usual because I had a rented Bobcat-style digger sitting out in the driveway consuming my money without being productive for me at all. Eventually, I threw in the towel and started digging, which turns out to be very therapeutic.

[The paragraph that belongs here was removed because you really don't want to hear me gripe about restoring a MacOS X Server box from a Time Machine backup and all of the problems that that process entails. I've mentioned that before. Suffice it to say, I did so, it worked, but the problem remained.]

The following morning, again, with the digger outside consuming money at an alarming rate, I gave myself until 9am to figure out the problem. After that, my users, a.k.a. family, would have to live without E-mail until the digger was returned around 5pm. Fortunately, I did a ls without my usual -la and saw the aforementioned #'d file. I guessed that it was the cause of all my troubles, moved it out of the way, and the machine booted happily with no launchd errors.

So that leads me to the second part of my treatise, namely, What would have made this experience much less frustrating?

I have three suggestions. The first is that launchd should report the path to and name of the file it's parsing so that I would have seen immediately where the suspect file is. This information is especially important as there are several locations that launchd looks for plist files.

The second is that lauchd should tell me what command it tried that caused the problem. That's just good debugging practice.

Resulting logs would look like this in Bill's World:

Oct 8 19:26:03 shr-g5 org.clamav.freshclam[3087]: ERROR: While processing '/System/Library/LaunchDaemons/#org.clamav.freshclam.plist#' Oct 8 19:26:03 shr-g5 org.clamav.freshclam[3087]: ERROR: Issued command '/usr/bin/freshclam (rest of options from plist file here)' Oct 8 19:26:03 shr-g5 org.clamav.freshclam[3087]: ERROR: Incorrect argument format for option <strike>-checks (</strike>c) Oct 8 19:26:03 shr-g5 org.clamav.freshclam[3087]: ERROR: Can't parse command line options Oct 8 19:26:03 shr-g5 com.apple.launchd[1] (org.clamav.freshclam[3087]): Exited with exit code: 1

The third thing I'd like to see is more info output by launchctl list. As it is, it doesn't show any info about what's loaded except what the base name of the plist file is. No information about the issued command, no information about the path to the plist file, names of plist files which were tried and found wanting, nothing else is available that would have made this debugging process any easier. And that would be helpful, too.

launchd is a really cool new thing, but it lacks some of the polish and usability features that the more mature things have in place already. These are my suggestions which, I think, will help improve launchd and make it a cool, mature thing.

In my original posting, I gripe that Time Capsule makes a royal mess of things for a MacOS X Server box. And it does, still, unless you do some system file editing.

But I’ve had a few thoughts since then, especially since I just went through this whole scenario again last night. This time, I was better prepared, though, as the system file editing I describe in that article saved my bacon.

Or my users’ IMAP data. Bacon, IMAP data… if only IMAP data tasted as good… mmm… bacon…

Sorry. Back to reality.

Apple has chosen to keep several /var directories out of the Time Machine backups because they change rapidly, there’s a lot of data, and it would fill up the backup disk PDQ. Wisely, they also include some preferences in the exclusions file to take care of this little problem, namely the keys “PathsExcluded”, “ContentsExcluded”, and “FileContentsExcluded”.

Entire directories can be excluded with the first, the contents of directories can be excluded with the second (i.e., it preserves the top directory structure, but doesn’t backup any files or subdirectories), and the third backs up all of the subdirectory structures, but still no files. Googing these three terms yields nothing, so I assume I have a correct understanding of what they do.

If I do understand these correctly, then there’s a fourth kind that needs to be created, namely something which means “backup only the file permissions, names, ACLs, etc., but don’t backup the data in the files.” This key would allow the /var/log directory backup to maintain zero-length backup files (hence, they never change) but allow not-so-smart software which doesn’t/can’t create its own logfiles when missing (Apache, I’m lookin’ at you) to use them when restored from backup.

Maybe one of these keys means exactly that, but I’ll be darned if I know which one, and, like I said, there doesn’t seem to be any documentation on it.

I have a 1TB Apple Time Capsule which I say by way of introduction and not as a means to lord over you, dear reader, that I’ve got something that you might not have. I use this Time Capsule to backup my MacOS X Server.

According to the product page, here’s what Time Capsule can do:

Time Capsule is a revolutionary backup device that works wirelessly with Time Machine in Mac OS X Leopard. It automatically backs up everything, so you no longer have to worry about losing your digital life.

and

Backing up is something we all know we should do, but often don’t. And while disaster is a great motivator, now it doesn’t have to be. Because with Time Capsule, the nagging need to back up has been replaced by automatic, constant protection. And even better, it all happens wirelessly, saving everything important, including your sanity.

The only requirement for your Mac is that it is running MacOS X 10.5.2 or later.

That’s because MacOS X 10.5 introduces Time Machine like this:

For the initial backup, Time Machine copies the entire contents of the computer to your backup drive. It copies every file exactly (without compression), skipping caches and other files that aren’t required to restore your Mac to its original state.

and

By default, Time Machine backs up everything on your Mac.

(Emphasis mine.)

So when I hooked it up to my MacOS X Server 10.5.4 server and turned on Time Machine (one click!… and some typing to set the password, but we’ll overlook that), I was pleased to see that, indeed, my server was backed up. I could browse the backup and see it was good. Cool!

But then the other day, disaster struck only hours before we were to leave for the airport: my server had crashed in such a way as to be unrecoverable. Things were grim, and the only thing I could do was to restart from the Server installation DVD and restore the machine from the Time Capsule. Well, it did that in about 9 hours and with some remote system administration from my beautiful wife, the machine was back up and running.

Except it wasn’t.

E-mail didn’t work. Web services wouldn’t start. All hell was breaking loose in the system logs as various services which were trying to start up just plain wouldn’t start up. Things which previously had a quiet existence on this machine were suddenly vociferously complaining about a plethora of problems. While each had its own gripe, most were unhappy about the nonexistence of a log directory, /var/log. “Huh?” I thought to myself, “I thought I read that Time Machine backed up my entire machine to get it back to its original state. What’d I miss? And why do I feel like it’s my fault all of a sudden?”

A little searching on the web reveals that there’s a list of stuff Time Machine doesn’t back up which, on a normal MacOS client machine might be OK, but for a server is disastrous. The list, stored at

/System/Library/CoreServices/backupd.bundle/Contents/Resources/StdExclusions.plist

has, among other things, these items which are excluded

/private/var/log

and

/private/var/spool

OK, log files don’t necessarily need to be backed up, maybe only the last one so you can see what happened before the crash would be nice. Nonetheless, if there are various services which need log files or log directories to exist to run, then something, somewhere must recreate these logfiles or the system never gets up and running, and the backup has, in fact, failed. Apache is quite content to gripe that it can’t make logfiles. Amavis, too, can’t do anything unless the directory is there. Sorting out which logs need to be where and who owns them and what their permissions should be took me the better part of two hours, and I’m know I don’t have them all right. (And that’s only for the few services I’m running. God only knows what I’m missing for the others.)

But… and this one is inexcusable… not backing up /var/spool, which includes /var/spool/imap which is where my IMAP users’ E-mail is stored!! is insane and has my blood boiling. This is an oversight which is completely uncharacteristic of Apple but for which there is no excuse.

The next four hours I spent trying to recover my IMAP users and getting Postfix to run were maddening. I had lost nearly a gigabyte of E-mail. Multiple IMAP directories had to be “reconstructed,” sometimes a success and sometimes not, according to the many webpages out there. Even Apple’s own webpage describing this process failed. (Something about partition “/var/spool/imap/user” not existing.) My users were similarly peeved. “You mean we bought that expensive Time Capsule instead of a simple external hard drive for mirroring and it didn’t work?!

Never mind the fact that it’s not the Time Capsule at fault, it’s Time Machine. They didn’t see it that way. They saw an expenditure that was unjustified because it simply didn’t work. I saw a maddening amount of work on a weekend, on my vacation, from 1000 miles away, because it simply didn’t work. And that’s just wrong, wrong, wrong.

Note to Steve Jobs You would be incensed, too.

My users and I are angry, and rightfully so. You, Apple, make a promise about a very important function and you don’t keep it. This is backup, for goodness’ sake! This is the kind of thing that has to work. If you say “backup,” it implies that it will, it shall, it must work.

And it didn’t.

Another note to Steve Jobs: If this had happened to you, you would have seen to it that somebody got yelled at and it would have been fixed immediately. Really. This is the kind of thing you hate. (I’ve read enough of Fake Steve Jobs to know how you think, man.)

Coming in an article “real soon now:” an article about how to recover from the various messes left behind by a Time Machine restore of MacOS X Server. (Just as soon as someone answers the question I pose here, that is.)