On Call Sure, you might use words like "boom" and "explode" when it comes to errors with your system. But could a whoopsie have the potential to render a chunk of a country uninhabitable? Welcome to On Call.
Our story comes from a reader Regomized as "Ellen" who spent the early part of the 1980s toiling away in the IT department of a company producing software responsible (in part) for running nuclear power stations.
A brand new system was in the process of being rolled out, which would keep track of which stations were online, how much power they could provide, and so on.
Commissioning was underway using a test rig connected to a new reactor under construction. "A team of Americans were in the building teaching the company's managers how to use their system," explained Ellen, "which was to sit on top of ours."
"They were providing the reactor control equipment that our system talked to."
It wasn't going well. Despite some lovely gear for the time (think a curved wooden desk with inset DEC Rainbow PCs in the control room and a bunch of VAX/VMS systems in a fail-safe cluster), there were problems getting the VAX to talk to the power station. The line was up, but there was no communication.
"Hair was being pulled out, time scales were collapsing, and I was getting white hair with stress," said Ellen, "not least because I was on a successful completion bonus."
As is so often the case, a seemingly inconsequential setting was changed and everything sprang to life. The equivalent of a ping was sent and the reactor responded: "Yes, I'm here."
"Strictly speaking, this was a reactor simulator," Ellen added, "a fact that will become important later."
However, for now, things were online, the software was working, and while there were only three days to plow through 10 days' worth of tests, the team at least had a fighting chance. Tests were set to execute sequentially overnight.
The Rainbow PC in the comms room would run them and dump the results to the printer. Ellen and co explained the approach to the site manager at the end-of-day meeting, which the Americans also attended. Another important point.
Yet despite the money-no-object approach, bizarrely there was no lock fitted to the computer room door. While nobody was supposed to touch the equipment, Ellen's team took no chances and stuck a cardboard box over the PC with the words "System Under Test - DO NOT TOUCH" scrawled over it.
"It was now well around 10:30pm," she recalled, "so the team and I set off for our hotel, aiming to reconvene at the 8:30am Morning Meeting."
Sadly, The Call would come in a good few hours before that morning meeting. This being before the days of the ubiquitous mobile phone, Ellen had a pager which chirped urgently at 6am. She had to attend the site NOW!
When she arrived, the tension in the atmosphere was palpable. Something had gone terribly, terribly wrong. The site manager was also in attendance, as was the biggest of all cheeses - the Director of Power Generation.
"A sort of deathly silence fell over the room," she recalled, "the sort just before a public hanging takes place."
"Yes, I was nervous."
"It seemed that our software had experienced some sort of problem and as a result the reactor had gone offline, the control rods had slammed in, and it was now no more than an oversized kettle."
At this point we must remind readers that this was a simulator, not the real thing.
Had this been a real reactor, it would have taken months to recover, at a cost of millions of pounds.
And Ellen and her software were clearly to blame.
"No one could tell me exactly what had happened. Just that it was my fault," she said.
Seeking to delay her execution, Ellen asked if she could review the output of the line printer to get an idea of what might have happened. The bosses agreed and gave her an hour's reprieve as she scuttled off to the comms room.
Upon entering the room (the one without a lock), she and the team were greeted by a scene of utter devastation. The box with the "Do Not Touch" lettering had been discarded. The test PC was in bits and the disk was missing entirely. The line printer had stopped mid-line when the PC had been attacked.
Alarming, but not something that would cause a reactor scram, just a delay in testing.
"I asked one of my team to connect the printer back to the VAX and dump the application logs for me," recalled Ellen. "He was told to bring them to me even if I was in a meeting - especially if I was in what was going to be a stressful meeting, to say the least."
The investigation continued and got stranger still. The other Rainbow PCs were all up and running. They shouldn't have been - Ellen's team had yet to commission them, merely setting them up for cable routing purposes. And yet there they were, humming away.
Ellen returned to the meeting with her findings. One of the US team was in attendance, and confessed to switching on the PCs.
"When asked why," said Ellen, "he responded that because us amateur-hour Brits were so far behind schedule he wanted to get started training the control room staff, so he wanted all the PCs booted and ready."
So... how was this achieved? The media to boot up the PCs was locked up in Ellen's safe store ("the back of my car," she confessed).
No problem. The US tech had simply grabbed the disk from the PC running the testing and copied it to the other computers. "Obviously it worked because they are all up and running," he said.
Suddenly, everything became clear. Had Ellen a Poirot-style mustache, some serious twirling would have been called for.
The logs arrived and were handed over. Ellen pretended to study them, but already knew what the evidence was going to show.
Ellen: "So, you cloned the disk on to all the PCs..."
US Engineer (proudly): "Yes, and saved several days".
Ellen (looking at the log): "And you went to one PC and asked for a reactor status from the power station."
US Engineer: "Yes, but it didn't work - your software is so full of bugs, it's total crap."
Ellen: "And when you cloned the disks, you changed the DECNet address on each PC?"
She, of course, knew that he hadn't. The log said as much.
US Engineer: "Err, no, what's that? Is it important?"
It was indeed.
The protocol used for communication was designed to avoid hacking. "There were multiple control commands," explained Ellen, "to eliminate any false commands that could, quite literally, cause a bomb to go off."
In this instance, all the PCs now had the same address, meaning that when communication was attempted (for example, a simple status request from the reactor), all manner of nonsense would bounce around the network. The reactor (or, to be clear, the simulation) software decided that something weird was happening and correctly triggered its safeties. In this case, an immediate shutdown.
An extended recovery time (had this been a real reactor) was of no consequence compared to safety in the face of what might be an attack.
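For the technically curious, the failure mode can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `ReactorSimulator` class, the port numbers, the `STATUS` command, and the address `1.10` are all hypothetical, and the real protocol Ellen describes is not documented here. The idea is simply that an endpoint which sees one node address arriving from two different places treats that as a possible attack and fails safe.

```python
from dataclasses import dataclass, field


@dataclass
class ReactorSimulator:
    """Hypothetical fail-safe endpoint; NOT the real protocol from the story."""

    seen: dict = field(default_factory=dict)  # node address -> physical port
    scrammed: bool = False

    def receive(self, address: str, port: int, command: str) -> str:
        # Once tripped, the simulator refuses everything until recommissioned.
        if self.scrammed:
            return "OFFLINE"
        # Remember where each address was first seen.
        known_port = self.seen.setdefault(address, port)
        if known_port != port:
            # Same address arriving from two places: assume the worst and shut down.
            self.scrammed = True
            return "SCRAM"
        return f"OK {command}"


sim = ReactorSimulator()
replies = [
    sim.receive("1.10", port=1, command="STATUS"),  # original test PC
    sim.receive("1.10", port=2, command="STATUS"),  # cloned PC, same address
    sim.receive("1.10", port=1, command="STATUS"),  # everything is now refused
]
print(replies)  # ['OK STATUS', 'SCRAM', 'OFFLINE']
```

Giving each cloned PC its own address would keep every request in the "OK" branch, which is the step the US engineer skipped.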
"After explaining all this to the now-silent room," Ellen said, "I finished with telling the Director of Power Generation that it was not our fault."
"It was someone, mentioning no names, who had disassembled our equipment and had misused our software and hardware, all before we had handed it over. The system did exactly what it was supposed to."
"And whilst simulating a reactor scram was not part of the tests, we now knew it worked."
The US contractor did the equivalent of falling on his sword. Puce-faced, he left the room, was apparently fired the same day and packed on the next plane home.
Again, this was not a real reactor and Ellen knew that the team could get back online in a matter of hours. However, "I shamelessly lied through my teeth, told the assembled team it would take me at least two weeks to reassemble the equipment, recommission all of our test and control equipment, and that I was declaring force majeure as per the contract, but I would not report the damage back to my head office."
The room was filled with apologies and gratitude that she would not be taking the issue further and that there had been no unpleasantness. The time extension? No problem - it was granted.
The team finished well ahead of time and bonuses were dispensed all round.
"And that," she said, "is how I was accused of nearly wiping [region redacted] off the map."
Ever had your bottom rescued by a fail-safe? Or been called out at an ungodly hour to deal with someone else's mistake? Of course you have, and you should share your story with an email to On Call. ®