Admit it: whenever you apply a Patch Tuesday update that requires a restart, you suck in a great big breath just before you hit the [Reboot] button.
There’s always a palpable worry that things won’t come back up, and your Patch Tuesday will be followed by Sleepless Wednesday, Recrimination Thursday, and so on.
Having a bigger-than-usual lungful of air just when you reboot won’t actually help, of course, because you’re never going to be able to hold your breath clear through to Face-The-Music Friday.
But taking that oversized breath is a visceral reaction to the risk of a botched patch, and you’re welcome to it.
So spare a thought for the guys at NASA’s Jet Propulsion Laboratory (JPL) in Pasadena, California.
They’ve just announced that they’re planning to do a complete reformat and restore of one of their computer systems.
The difference is that their system isn’t in the server room down on Level 9.
It’s on board the Opportunity Rover, one of the two currently-functioning exploration vehicles roaming Mars, approximately 200,000,000km away.
→ Because the equator is approximately 40,000km around, terrestrial sysadmins can never be further than 20,000km from their most distant servers. As for the International Space Station, that’s a mere 400km out into space, easily close enough for social networking.
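To put those distances in perspective, here is a rough back-of-the-envelope sketch. It uses the article's round figures plus one addition of ours that the article does not mention: the time a radio signal, travelling at the speed of light, needs to cover each distance. The constant and function names are purely illustrative.

```python
# Rough arithmetic only: distances are the article's round figures.
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum, km/s
EQUATOR_KM = 40_000                 # approximate circumference of the Earth
MARS_DISTANCE_KM = 200_000_000      # approximate Earth-Mars distance quoted above


def one_way_delay_seconds(distance_km):
    """Time for a radio signal to cover distance_km, ignoring all other latency."""
    return distance_km / SPEED_OF_LIGHT_KM_S


# The farthest point on Earth is half the circumference away.
farthest_server_km = EQUATOR_KM / 2

# How many times farther away Opportunity is than any terrestrial server.
ratio = MARS_DISTANCE_KM / farthest_server_km

# One-way signal time to Mars, in minutes.
mars_delay_min = one_way_delay_seconds(MARS_DISTANCE_KM) / 60

print(f"Mars is {ratio:,.0f} times farther than the most distant server on Earth")
print(f"One-way signal time to Mars: about {mars_delay_min:.1f} minutes")
```

In other words, by this estimate every command sent to the rover takes several minutes to arrive, and the confirmation takes just as long to come back.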
Reimaging troublesome computers is a popular IT technique, because it removes numerous unpredictable variables from any troubleshooting equation.
Malware infection? Gone. Inexpertly-edited configuration file? Restored. Inadvertently deleted system library? Recovered. Security sins of the recent past? Absolved, albeit that your stolen data remains stolen.
But reimaging isn’t without its own problems, for example:
- It destroys forensic evidence that might help explain what went wrong in the first place.
- It may reintroduce old security holes by reverting to software versions that have since been patched.
- It treats only symptoms, not causes.
- It can turn into a “silver bullet” that is treated as operationally normal instead of an emergency measure.
Apparently, in Neiman Marcus’s 2013 credit card breach, infected computers were reimaged regularly, but the crooks just broke back in the next day and reinstalled their malware.
The reimaging process in that environment, it seems, turned out to be a backward security step (listen to the podcast below, starting at 1’48”).
Just why, then, are the JPL scientists adopting an approach that might easily be written off as unscientific?
It’s down to the flash memory.
Opportunity has just 256MB of flash, a modest amount by today’s standards, but a healthy amount to send into space when the rover was launched more than a decade ago (XP was young, and Vista still many years away).
Flash memory has no moving parts, at least in the traditional sense.
Like regular memory, flash relies only on shovelling electrons around, forming pockets of electrical charge that signal the value of stored data bits.
What’s special about flash is that these charged pockets don’t dissipate when voltage is removed from the chip, so it retains its data when the power is turned off, just like a hard disk.
To achieve this result, flash chips use floating-gate transistors, in which the gate that holds the charge is not electrically connected to the rest of the chip at all; it is separated by a thin layer of insulation just 20 nanometres across.
With the right sort of voltage surge – “like the flash of a camera,” which is allegedly where the name came from – the circuitry can poke (or tunnel) electrons through this insulating layer, allowing the charge on each floating gate to be changed deliberately.
Over time, however, tunnelled electrons can cause the insulation to be physically altered and to break down.
Charge can then leak away from individual floating gates in the chip, causing those memory cells to change value unexpectedly and producing data errors.
(Metaphorically speaking, the floating gate no longer floats reliably on its nanometric sea of insulation.)
In other words, flash memory can, figuratively and literally, wear out.
That’s the best guess at what’s wrong on Opportunity, with a dozen unplanned system reboots in the past month.
The rover can run entirely from non-flash memory, but the faults and the reboots are getting in the way of science, because the flash is where experimental results are recorded for later transmission back to Earth.
So JPL’s reformat-and-restore is not merely the “hit and hope” exercise that it may feel like when someone from your IT team reimages your laptop.
While they are reformatting the flash on board Opportunity, the JPL sysadmins can also test its performance, build up a map of unreliable areas, and avoid them in future.
That should give the flash a new lease of life, albeit with a slightly reduced capacity, allowing scientific and exploratory work to be resumed in full.
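The idea of testing the flash and mapping out the unreliable areas can be sketched in a few lines. This is a toy simulation only, not JPL's actual procedure or anything resembling the rover's flight software: the `SimulatedFlash` class, the block size, the test patterns and the "stuck bit" failure model are all assumptions made up for illustration.

```python
BLOCK_SIZE = 16                            # bytes per simulated block (hypothetical)
TEST_PATTERNS = (0x00, 0xFF, 0xAA, 0x55)   # simple write-and-verify patterns


class SimulatedFlash:
    """Toy flash device in which worn blocks have one bit stuck at 1."""

    def __init__(self, num_blocks, worn_blocks):
        self.num_blocks = num_blocks
        self.worn_blocks = set(worn_blocks)
        self.data = [bytearray(BLOCK_SIZE) for _ in range(num_blocks)]

    def write(self, block, payload):
        stored = bytearray(payload)
        if block in self.worn_blocks:
            # Assumed failure model: lowest bit of the first byte ignores writes.
            stored[0] |= 0x01
        self.data[block] = stored

    def read(self, block):
        return bytes(self.data[block])


def build_bad_block_map(flash):
    """Write each test pattern to every block and record blocks that read back wrong."""
    bad = set()
    for block in range(flash.num_blocks):
        for pattern in TEST_PATTERNS:
            expected = bytes([pattern]) * BLOCK_SIZE
            flash.write(block, expected)
            if flash.read(block) != expected:
                bad.add(block)
                break   # one failed pattern is enough to retire the block
    return bad


def usable_blocks(flash, bad):
    """Blocks that remain available once the unreliable ones are avoided."""
    return [b for b in range(flash.num_blocks) if b not in bad]


if __name__ == "__main__":
    flash = SimulatedFlash(num_blocks=8, worn_blocks={2, 5})
    bad = build_bad_block_map(flash)
    print("Unreliable blocks:", sorted(bad))
    print("Still usable:", usable_blocks(flash, bad))
```

The trade-off is the one described above: every block added to the map is capacity lost for good, but the blocks that remain can be trusted again.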
Anyway, if you live in the Pasadena area and you notice a sudden downward blip in barometric pressure in the next few days…
…that’ll be the JPL techies all taking big breaths at the same time.
Let’s hope the NASA geeks who coded the operational smarts into the Opportunity Rover aren’t like these users we jested about back at Sysmas:
Image of Mars courtesy NASA/JPL/MSSS
Why does this mention different Windoze versions when NASA used VxWorks for Spirit, Opportunity and Curiosity? Actually, VxWorks is used for many hard-science space programs. Specifically, it is a specially qualified version that is tested to death and run on radiation-hardened processors in multiply redundant hardware. Nothing you’d ever use Windoze for. Talk to Wind River if you want a really reliable embedded O/S that does what it is supposed to when it is supposed to.
See my reply below to @microfish. That part of the article deals with a “when,” not a “what.” It’s pretty usual to pick a well-known event (whether you loved it or hated it) to help you with chronology, like “the Victorian era” to mean the back end of the 19th century; or “the year England last won the Football World Cup” for 1996; or “the date that Zimbabwe last thrashed Australia at cricket.” (Sorry, Aussies. That was yesterday. Last time before yesterday, 31 years ago :-)
As for vulnerabilities – Curiosity Rover ended up with a buffer overflow bug in its data compression code that had been there since “before Windows 95 came out”, to use another Microsoft release as a historical milestone.
No endorsement, expressed or implied.
The 1996 World Cup. Who could forget it…
Hey, don’t knock it, we won apparently ; )
Great plug for your company! I will definitely buy VxWorks from you when I purchase my next space probe.
Purchase? *Purchase*?
Private space exploration is all the rage these days (ask Elon Musk :-).
You don’t want to be *purchasing* a space probe, you want to be *building* one. Remember the old proverb: a journey of 150 million miles starts with a single rocket motor…
“…is not merely the “hit and hope” exercise that it may feel like when someone from your IT team reimages your laptop.”
Well if you stopped going to questionable sites, and stopped letting your kids surf the web on your company laptop your IT team wouldn’t have to keep reimaging your virus-laden laptop.
Drive-by downloads don’t restrict themselves to ‘questionable’ websites.
JPL uses Windows on mission critical systems? Are you serious?
I don’t think the article suggests for a moment that any of the Mars Rovers uses Windows. (I don’t think you think it does, either :-)
It’s a historical reference aimed at helping readers decide how big or small 256MB of flash would have seemed back then. Like it or not, XP is more useful as a historical milestone than…well, than pretty much every other piece of software out there.