NASA prepares for serious sysadmin work – reimaging Opportunity Rover out on MARS!

Admit it: whenever you apply a Patch Tuesday update that requires a restart, you suck in a great big breath just before you hit the [Reboot] button.

There’s always a palpable worry that things won’t come back up, and your Patch Tuesday will be followed by Sleepless Wednesday, Recrimination Thursday, and so on.

Having a bigger-than-usual lungful of air just when you reboot won’t actually help, of course, because you’re never going to be able to hold your breath clear through to Face-The-Music Friday.

But taking that oversized breath is a visceral reaction to the risk of a botched patch, and you’re welcome to it.

So spare a thought for the guys at NASA’s Jet Propulsion Laboratory (JPL) in Pasadena, California.

They’ve just announced that they’re planning to do a complete reformat and restore of one of their computer systems.

The difference is that their system isn’t in the server room down on Level 9.

It’s on board the Opportunity Rover, one of the two currently-functioning exploration vehicles roaming Mars, approximately 200,000,000km away.

→ Because the equator is approximately 40,000km around, terrestrial sysadmins can never be further than 20,000km from their most distant servers. As for the International Space Station, that’s a mere 400km out into space, easily close enough for social networking.
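
To put those distances in perspective, here is a quick back-of-the-envelope sketch using the article's approximate figures. The speed-of-light constant and the idea of expressing the gap as a one-way signal delay are our own additions for illustration, not something JPL has published; the function names are made up.

```python
# Approximate distances quoted in the article, in kilometres.
EQUATOR_KM = 40_000              # rough circumference of the Earth
ISS_ALTITUDE_KM = 400            # rough height of the International Space Station
MARS_DISTANCE_KM = 200_000_000   # the article's approximate figure for Mars

# Assumption (not from the article): radio signals travel at the speed of light.
SPEED_OF_LIGHT_KM_S = 299_792.458


def farthest_server_km(circumference_km: float) -> float:
    """Greatest surface distance between two points on a sphere: half way round."""
    return circumference_km / 2


def one_way_delay_s(distance_km: float) -> float:
    """Seconds for a signal to cover the distance at the speed of light."""
    return distance_km / SPEED_OF_LIGHT_KM_S


print(farthest_server_km(EQUATOR_KM))                     # 20000.0
print(MARS_DISTANCE_KM // ISS_ALTITUDE_KM)                # 500000 (times further than the ISS)
print(round(one_way_delay_s(MARS_DISTANCE_KM) / 60, 1))   # 11.1 (minutes, one way)
```

In other words, under these assumptions, every keystroke would take about eleven minutes to arrive, and the response just as long to come back.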

Reimaging troublesome computers is a popular IT technique, because it removes numerous unpredictable variables from any troubleshooting equation.

Malware infection? Gone. Inexpertly-edited configuration file? Restored. Inadvertently deleted system library? Recovered. Security sins of the recent past? Absolved, although your stolen data remains stolen.

But reimaging isn’t without its own problems, for example:

  • It destroys forensic evidence that might help explain what went wrong in the first place.
  • It may reintroduce old security holes by reverting to software versions that have since been patched.
  • It treats only symptoms, not causes.
  • It can turn into a “silver bullet” that is treated as operationally normal instead of an emergency measure.

Apparently, in Neiman Marcus’s 2013 credit card breach, infected computers were reimaged regularly, but the crooks just broke back in the next day and reinstalled their malware.

The reimaging process in that environment, it seems, turned out to be a backward security step (covered in the accompanying podcast, starting at 1’48”).

Just why, then, are the JPL scientists adopting an approach that might easily be written off as unscientific?

It’s down to the flash memory.

Opportunity has just 256MB of flash, a modest amount by today’s standards, but a healthy amount to send into space when the rover was launched more than a decade ago (XP was young, and Vista still many years away).

Flash memory has no moving parts, at least in the traditional sense.

Like regular memory, flash relies only on shovelling electrons around, forming pockets of electrical charge that signal the value of stored data bits.

What’s special about flash is that these charged pockets don’t dissipate when voltage is removed from the chip, so it retains its data when the power is turned off, just like a hard disk.

To achieve this result, flash chips use transistors with so-called floating gates, which are not electrically connected to the rest of the chip at all; they’re separated from it by a thin layer of insulation just 20 nanometres across.

With the right sort of voltage surge – “like the flash of a camera,” which is allegedly where the name came from – the circuitry can poke (or tunnel) electrons through this insulating layer, allowing the charge on each floating gate to be changed deliberately.

Over time, however, tunnelled electrons can cause the insulation to be physically altered and to break down.

Charge can then leak away from individual floating gates in the chip, so that those memory cells change value unexpectedly and produce data errors.

(Metaphorically speaking, the floating gate no longer floats reliably on its nanometric sea of insulation.)

In other words, flash memory can, figuratively and literally, wear out.
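
To see why a single leaky cell matters, here is a minimal sketch of one flipped bit spoiling a stored record, and of a checksum catching it. The payload, the `flip_bit` helper and the use of a CRC32 checksum are all assumptions made for illustration; nothing here describes how Opportunity actually stores or verifies its data.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted, mimicking a leaky cell."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


record = b"soil sample 42: iron oxide"   # made-up payload
stored_checksum = zlib.crc32(record)     # computed when the data was written

leaked = flip_bit(record, 13)            # later on, one cell loses its charge

print(zlib.crc32(record) == stored_checksum)   # True: the data is intact
print(zlib.crc32(leaked) == stored_checksum)   # False: the corruption is detected
```

A checksum like this can only tell you that something went wrong, not which cell is to blame or what the data used to be.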

That’s the best guess at what’s wrong on Opportunity, which has suffered a dozen unplanned system reboots in the past month.

The rover can run entirely from non-flash memory, but the faults and the reboots are getting in the way of science, because the flash is where experimental results are recorded for later transmission back to Earth.

So JPL’s reformat-and-restore is not merely the “hit and hope” exercise that it may feel like when someone from your IT team reimages your laptop.

While they are reformatting the flash on board Opportunity, the JPL sysadmins can also test its performance, build up a map of unreliable areas, and avoid them in future.

That should give the flash a new lease of life, albeit with a slightly reduced capacity, allowing scientific and exploratory work to be resumed in full.
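
The test-map-avoid idea can be sketched in a few lines. This is a toy model only: the `ToyFlash` class, the block size, the test patterns and the notion of a "stuck" block are all hypothetical stand-ins, and JPL's real procedure is not described in this article.

```python
BLOCK_SIZE = 4                            # bytes per block in this toy model
TEST_PATTERNS = (0x55, 0xAA, 0x00, 0xFF)  # alternating bits, then all-off and all-on


class ToyFlash:
    """In-memory stand-in for a flash chip; blocks listed in `stuck` ignore writes."""

    def __init__(self, n_blocks, stuck):
        self.blocks = [bytearray(BLOCK_SIZE) for _ in range(n_blocks)]
        self.stuck = set(stuck)

    def write(self, index, data):
        if index not in self.stuck:       # a faulty block silently drops the write
            self.blocks[index][:] = data

    def read(self, index):
        return bytes(self.blocks[index])


def build_bad_block_map(flash):
    """Write each test pattern to each block and record blocks that misread."""
    bad = set()
    for index in range(len(flash.blocks)):
        for pattern in TEST_PATTERNS:
            data = bytes([pattern]) * BLOCK_SIZE
            flash.write(index, data)
            if flash.read(index) != data:
                bad.add(index)
                break                     # one failure is enough to condemn the block
    return bad


def usable_blocks(n_blocks, bad):
    """Blocks that remain available once the unreliable ones are skipped."""
    return [i for i in range(n_blocks) if i not in bad]


flash = ToyFlash(8, stuck={2, 5})
bad = build_bad_block_map(flash)
print(sorted(bad))                        # [2, 5]
print(usable_blocks(8, bad))              # [0, 1, 3, 4, 6, 7]
```

Six usable blocks out of eight is the "slightly reduced capacity" trade-off in miniature: less room, but what's left can be trusted.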

Anyway, if you live in the Pasadena area and you notice a sudden downward blip in barometric pressure in the next few days…

…that’ll be the JPL techies all taking big breaths at the same time.

Let’s hope the NASA geeks who coded the operational smarts into the Opportunity Rover aren’t like these users we jested about back at Sysmas:

Image of Mars courtesy NASA/JPL/MSSS