Should the “Reboot! Shut up and reboot!” theory be applied to programs?


Tech-savvy website Ars Technica recently invited comments on an interesting thought about programming.

“Should programs randomly fall on their swords?”

Actually, they didn’t quite put it like that – indeed, they didn’t make it clear whether programs ought to exit gracefully but needlessly after a random time, or whether they ought to be asynchronously killed off on a random basis by some monitor process.

→Such a monitor would be the opposite of a traditional watchdog, a process that keeps its eye on other programs and warns you when they break. This would be a process that breaks other programs, then tells you it’s done so.

But they did wonder about making programs exit even if they didn’t need or want to, for the greater good of the operating system as a whole.

My first reaction was, “Why not?”

There’s a school of thought that says a degree of unpredictability in software, especially long-running network software, can be very handy indeed.

Don’t wait exactly two seconds after every failed connection attempt, say, or you may end up permanently in step with some other process that hits a similar problem every two seconds. Wait two seconds plus a random interval that’s different every time.
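As a rough sketch of that idea in Python, the delay below is the fixed two seconds plus a fresh random amount on every retry. The function names, the one-second jitter cap and the five-attempt limit are assumptions made for illustration, not values from any particular product.

```python
import random
import time


def retry_delay(base_seconds=2.0, max_jitter_seconds=1.0, rng=random):
    """Return the base delay plus a fresh random interval of up to max_jitter_seconds."""
    return base_seconds + rng.uniform(0.0, max_jitter_seconds)


def connect_with_retry(connect, attempts=5, sleep=time.sleep):
    """Call connect() until it succeeds, waiting a jittered delay between tries.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(retry_delay())
```

Because every caller picks its own random extra wait, two processes that fail at the same moment are unlikely to keep retrying at the same moment.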

Don’t arrange everything so predictably in memory that if there’s an exploitable bug, hackers can reliably work out where to poke their knitting needles. Mix things up a bit so an attacker has to guess, and might very well get it wrong.

And, of course, in anything cryptographic, good quality randomness is vital, lest you turn a problem that should be computationally infeasible into one that is merely difficult or time-consuming.

→Debian once removed code from its version of OpenSSL because it looked unpredictable. It was supposed to be – it was part of the random number generator. After getting “fixed” it became so predictable that cryptographic keys that should have been unguessable could be brute-forced in seconds or minutes.
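Here is a toy Python illustration of why a predictable seed is fatal. It assumes, purely for the sake of the example, that a key is derived from a generator whose seed can take only 32,768 different values. Python's `random` module stands in for the generator; it is not suitable for real cryptography, and `make_key` and `brute_force` are hypothetical names.

```python
import random

SEED_SPACE = 32768  # assumption: only this many possible seeds


def make_key(seed, nbytes=16):
    """Derive a 'key' from a generator seeded with a small, guessable value."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(nbytes))


def brute_force(target_key):
    """Try every possible seed until the derived key matches the target."""
    for seed in range(SEED_SPACE):
        if make_key(seed, len(target_key)) == target_key:
            return seed
    return None  # no seed in the assumed range produces this key
```

A 16-byte key ought to take an infeasible number of guesses, but here an attacker needs at most 32,768. In real Python code, something like `secrets.token_bytes(16)` draws from the operating system's random source instead.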

Forcing programs to have a short outage every now and then is a bit like the policy at companies that require senior executives to use at least some of their annual vacation time each year in unbroken chunks.

Not only does it force the individual to take a much-needed rest, it also militates against corruption in the company by getting an alternative hand on the tiller every now and then.

But my second opinion was, “No way!”

Naturally, you should subject your code to randomly-generated failures as a regular and important part of testing. (You do test your software against the sort of error you might never have experienced in real life, such as “disk full,” don’t you?)
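A “disk full” test needn’t involve filling a real disk. As a minimal sketch in Python, the standard `unittest.mock` module can make `open` fail with the operating system’s “no space left on device” error. The function `save_report` is a hypothetical stand-in for your own code, and this version injects the failure deterministically rather than at random.

```python
import errno
from unittest import mock


def save_report(path, text):
    """Hypothetical code under test: report failure instead of crashing on a full disk."""
    try:
        with open(path, "w") as f:
            f.write(text)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False  # disk full: tell the caller rather than blowing up
        raise  # any other I/O error is not handled here


def test_save_report_disk_full():
    # Simulate the error the OS would raise when the disk is full.
    disk_full = OSError(errno.ENOSPC, "No space left on device")
    with mock.patch("builtins.open", side_effect=disk_full):
        assert save_report("report.txt", "hello") is False
```

The same pattern can be extended to inject other rare errors, or to trigger them at random points during a longer test run.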

This is especially true for online software, which is frequently developed on a fast, reliable, state-of-the-art local area network, but deployed over slow, laggy, flaky links.

But deliberately breaking code just to make it restart, hopefully with any ills of the past behind it, could ironically make things worse.

That’s a little bit like pulling your car to the side of the road every few minutes to make sure the tyres don’t overheat: a useful precaution in an emergency where you know there’s a tyre fault, but a pointless waste of time if there isn’t.

In fact, you can argue that getting into the habit of random “corrective process termination” could actually mask the symptoms of a fault, or lead to known problems being mitigated by accident, and thus never getting proper corrective attention.

→Tech support staff don’t usually say “shut up and reboot” (with apologies to Dogbert) because it’s scientific. They say it because it isn’t scientific, but it very often works, and improves their call closure rates in the long run.

So randomly self-breaking programs sound a little bit like those rules that say things like, “You must change your password every 45 days.”

When an online service tells you that, are they implying that they actually get breached fairly frequently? That if they do get breached they probably won’t realise?

Actually, you should change your password if you think you need to.

And if you think you need to, you should change it then and there, rather than saying to yourself, “My next 45-day mandatory password update is coming in a while, so I’ll wait until then.”