Can you
provide the synopsis, or must all others dig it up in places like the
comp.risks archives?
I read the report hosted here some years ago:
http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html . It's rather
long, but definitely worth the read. In brief:
-- The THERAC-25 was a radiation machine used for treating cancerous tumors;
it was designed by AECL in the late 1970s and deployed during the 1980s
-- It was based on earlier successful models, but had many more of the
safety controls and interlocks implemented in software rather than hardware
(motivated by cost reductions and -- at least at some level -- a belief that
the software approach could improve safety).
-- Due to software bugs, the THERAC-25 delivered massive overdoses of
radiation to six patients; several of them died, some rather painfully.
-- The THERAC-25 software was known for exhibiting slightly flaky and
non-intuitive behavior (error codes were usually given as numbers, but
originally the operators' manual didn't list what the various numbers meant),
so many operators didn't take "weird" misbehavior as seriously as they should
have initially; this arguably led to delays in AECL being contacted about
problems.
-- The direct cause of the deaths was the software not always detecting
overdose levels and shutting off the radiation generator quickly enough, as
well as sometimes indicating only a "minor" problem (recovered from by
pressing a single key) when overdoses were detected.
-- The cause of the machine inadvertently attempting to overdose in the first
place was bugs that allowed the *displayed* dose parameters to differ from
the *programmed* dose parameters: There was an 8-second "window", while the
control software was performing other tasks triggered by a mode change,
during which any control parameters changed on-screen and "sent" would go
unnoticed by the software. It would keep using whatever old parameters had
previously been set... including dosage parameters that were allowable when
used in different modes with different attenuators/scanning/etc., but deadly
in the new mode.
-- AECL did extensive testing but initially was unable to reproduce the deadly
behavior that occurred in hospital settings; this was largely due to the
different usage patterns by test engineers at AECL vs. technicians at
hospitals (who, as regular/expert users, would enter data very quickly and
fall prey to the "8-second window" trap). As such, early on AECL firmly
denied that their machine could have been at fault for causing the deaths,
stating that "It [is] not possible for the THERAC-25 to overdose a patient."
Eventually one machine operator was able to cause the THERAC-25 to malfunction
repeatedly at will, which caused AECL to change their tune.
-- Through 1987, AECL made numerous software and hardware changes
(including adding hardware interlocks) to improve reliability and greatly
reduce the likelihood of the machine being a hazard to humans.
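To make the "8-second window" item a bit more concrete, here is a toy Python
sketch of that class of bug. It is emphatically NOT the THERAC-25's actual
code or design: the names (Params, Console, run, recheck_edits), the mode
labels, and the dose numbers are all made up for illustration, and the
"fixed" variant is just one plausible remedy, not what AECL actually did.

```python
from dataclasses import dataclass

SETUP_SECONDS = 8.0  # assumed length of the busy window, per the summary above


@dataclass(frozen=True)
class Params:
    mode: str  # illustrative label only
    dose: int  # arbitrary units, not real dosimetry


class Console:
    """Toy model: the screen and the controller keep separate copies."""

    def __init__(self, recheck_edits: bool):
        self.displayed = None    # what the operator sees
        self.programmed = None   # what the beam would actually use
        self.pending = None      # edit that arrived during the busy window
        self.busy_until = 0.0
        self.recheck_edits = recheck_edits

    def enter(self, now: float, params: Params) -> None:
        """Operator types new parameters and 'sends' them at time `now`."""
        self.displayed = params            # the screen always updates
        if now < self.busy_until:
            # Controller is still busy with the previous mode change.
            if self.recheck_edits:
                self.pending = params      # hypothetical fix: remember the edit
            return                         # buggy variant: edit silently lost
        self.programmed = params
        self.busy_until = now + SETUP_SECONDS

    def fire(self, now: float) -> Params:
        """Return the parameters the controller would really use."""
        if self.pending is not None and now >= self.busy_until:
            self.programmed, self.pending = self.pending, None
        return self.programmed


def run(recheck_edits: bool, gap_seconds: float):
    """Enter one prescription, correct it `gap_seconds` later, then fire."""
    console = Console(recheck_edits)
    console.enter(0.0, Params("mode A", 25000))
    console.enter(gap_seconds, Params("mode B", 200))
    programmed = console.fire(gap_seconds + SETUP_SECONDS + 1.0)
    return console.displayed, programmed


if __name__ == "__main__":
    for label, fixed, gap in [("fast operator, buggy", False, 2.0),
                              ("slow tester, buggy", False, 20.0),
                              ("fast operator, fixed", True, 2.0)]:
        shown, used = run(fixed, gap)
        print(f"{label}: screen shows {shown}, controller uses {used}")
```

The point of the gap_seconds knob is the same one made above about testers
vs. hospital technicians: with a 20-second gap the buggy variant behaves
perfectly, and only a fast (2-second) correction leaves the screen and the
controller disagreeing.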
The report linked above points out that summaries like the one I just wrote
often suggest that the causes were more freestanding and simplistic than they
really were -- hence the suggestion to read the entire article. (One of the
things I identify with here is how you need to test with Real Users and not
just your internal techs/engineers -- Real Users are often far better at
exposing bugs, both initially when they "don't know what they're doing" and
try to do "nonsensical" things to your software, as well as later once they
become "expert users" and run through the software much more
quickly/cavalierly than internal testers usually do.)
---Joel