Software Disasters
ARNO HUETTER
About the Author
Arno Huetter
Arno wrote his first lines of code on a Sinclair ZX80 in
1984.
Over the years, he has programmed in C/C++,
Java and C#, and has also done a fair amount of database
development.
Today he is Development Lead at Dynatrace (APM
vendor).
OS/2 (1985-2001)
The PC World in 1985
 PC (with DOS) is the clear market leader, but the Apple Macintosh is the new cool thing
 Windows 1.0 is merely a DOS GUI extension
 IBM's TopView has flopped (a rudimentary shell that allowed multitasking of DOS programs
and copy/paste between them)
Windows 1.0
30 years of innovation
Enter OS/2
 OS/2 intended as the protected mode successor of DOS
 IBM decides to form another partnership with Microsoft
 The Plan:
 IBM programmers would develop significant parts
 Microsoft to be paid contractor rates per kLOC
 Must run on the 286, be compatible with TopView, and run DOS programs in a
"compatibility box"
 Presentation Manager should allow recompiled Windows applications to run
(never worked that way; required a rewrite, or the VDM starting with OS/2 2.0)
1987: OS/2 1.0
1988: OS/2 1.1
1987 to 1991
 Marketing together with IBM's PS/2 platform (although PS/2 is not required) leads to
customer confusion
 RAM prices shoot up in 1987 (USD 133 for 1MB); OS/2 requires 4MB compared to the
usual 1MB for DOS
 USD 340 for retail copy (DOS shipped for free with new PCs)
 USD 3,000 for OS/2 SDK
 No printer support except IBM printers, no drivers for common devices
 Missing guidance / support / ecosystem for 3rd party software vendors
 1989: OS/2 1.2 introduces HPFS, Ethernet, TCP/IP
 1990: Windows 3.0 takes off, IBM/Microsoft collaboration unravels
 1991: OS/2 1.3 turns out to be a modest success, but fades compared to Windows 3.x
1992: OS/2 2.0
1994: OS/2 Warp
1992 to 1994
 1992: Windows 3.1 released
 1992: OS/2 2.0, a true 32-bit operating system, taking full advantage of the 386 and
technically ahead of Windows 3.x (preemptive multitasking, memory protection)
 Workplace Shell with "object-oriented" UI behavior and 32-bit API
 Multiple DOS programs running side by side
 Windows 3.0/3.1 compatibility via VDM; Windows code included in OS/2
 Because of that Windows compatibility, developers simply decided to develop for Windows only
(and could state "it runs on OS/2 as well")
 OS/2 versions of Lotus 1-2-3 or Corel Draw sluggish compared to their Windows versions
 1993: Mainframe market collapses. IBM CEO John Akers ousted, replaced by Louis
Gerstner. Gerstner turns struggling company around
 1994: OS/2 3.0 (Warp) introduced
 1994: Windows NT 3.5 introduced (modern, rock-solid, multiprocessor support)
1995 to 2001
 1995: Windows 95 hits market, becomes instant success
 IBM weak on marketing, getting hardly any PC clone makers on board
 OS/2 sold mainly to corporate customers for networking environments, but eventually loses
there as well to Windows NT
 Even IBM's "Mr. OS/2", David Barnes, is quoted as saying: "OS/2 is great, but then Sony's
Betamax was way better than VHS…"
 1996: OS/2 Warp 4 released, adds Java and speech recognition
 IBM finally stops development, but continues to sell OS/2 until 2001
 Gerstner quote #1: “The pro-OS/2 argument was based on technical superiority... What
my colleagues seemed unwilling or unable to accept was that the war was already over
and was a resounding defeat”
 Gerstner quote #2: “The battle between OS/2 and Microsoft Windows was draining tens
of millions of dollars, absorbing huge chunks of senior management’s time, and making
a mockery of our image.”
1998 to 2002: Netscape
 1998: Consensus: Netscape 4 code base is
pretty bad. So let’s do a complete rewrite! Mozilla
organization formed.
 Code base might have been bad, but it worked
quite well for most users (browser market share
at 50%)
 1999: Netscape acquired by AOL
 2000: Netscape 6 released. Wasn’t really ready,
fails miserably
 2002: Mozilla 1.0 released. First real release in
four years. Browser market share at 6%
 2003: AOL closes Netscape division, Mozilla
Foundation continues independently
 2004: Resurrection: Firefox 1.0 based on Mozilla
Ariane 5 (1996)
declare
   vertical_veloc_sensor: float;
   horizontal_veloc_sensor: float;
   vertical_veloc_bias: integer;
   horizontal_veloc_bias: integer;
   ...
begin
   declare
      -- the overflow check is suppressed for the horizontal bias only
      pragma suppress(numeric_error, horizontal_veloc_bias);
   begin
      sensor_get(vertical_veloc_sensor);
      sensor_get(horizontal_veloc_sensor);
      vertical_veloc_bias := integer(vertical_veloc_sensor);
      -- unchecked conversion of the horizontal value (the one that overflowed)
      horizontal_veloc_bias := integer(horizontal_veloc_sensor);
      ...
   exception
      when numeric_error => calculate_vertical_veloc();
      when others => use_irs1();
   end;
end irs2;
Ariane 5 - Summary of Events
 64-bit floating point to 16-bit signed integer conversion
 Numeric overflow when horizontal velocity sensor value > 32767 (internal unit)
 Exception handling deactivated
 Redundant system ran identical hardware and software, hence ran into the
same problem
 Unhandled exception triggered self destruction in order to avoid rocket breaking apart
 Code originated from Ariane 4, which was slower and flew at different angle
 Calculation not even needed during flight (just during prep), but still running
 USD 5 billion overall development costs
 USD 500 million for rocket + satellites
 Program delayed by years
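The narrowing conversion described above can be sketched outside Ada. The following Python snippet is only an illustration under stated assumptions, not the flight code: the helper names to_int16_checked and to_int16_saturating are hypothetical. It contrasts a conversion that raises on out-of-range input with one that clamps to the 16-bit limits.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit into a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping to the representable range."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

Which variant is appropriate depends on the system: raising makes the problem visible to a handler, while clamping keeps the computation running with a degraded value.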
2000 to 2005: FBI Virtual Case File
 Software system to manage all documents relating to cases
being investigated by the FBI
 Modern web interface for 22,000 users to replace the previous
ACS system (which was already obsolete at its introduction due
to outdated technology)
 Estimated completion time: 22 months
 By 2005, 700,000 lines of code written, with five different project
leads in charge
 VCF turns out to be incomplete, inadequate and poorly
designed, essentially unusable under real-world conditions
 Even in rudimentary tests the system did not comply with basic
requirements
 After having invested USD 170 million, the FBI decided to buy
off-the-shelf software instead
 Causes: No architecture blueprints, repeated changes in
specification, engineers with little or no computer science
training, code bloat, scope creep
2003: US Northeast Blackout
 Race condition in General Electric's Unix-based XA/21 energy
management system
 Bug stalls FirstEnergy's control room alarm system – operators
do not receive alerts any more
 Unprocessed events queued up and the primary server failed
within 30 minutes
 Applications automatically transferred to the backup server,
which itself failed
 Operator screen refresh rate drops from 1sec to 1min
 Operators hence dismiss a call about the tripping and
reclosure of a 345 kV shared line
 More lines go offline in a chain reaction, with undervoltage and
overcurrent interpreted as a short circuit
 30 minutes later 256 power plants are off-line, most due to
automatic protective controls
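One generic safeguard against a silently stalled alarm processor is a watchdog that checks whether queued events are still being worked off. The sketch below is an assumption for illustration only and does not reflect the XA/21 design; AlarmWatchdog, its method names, and the threshold parameter are hypothetical.

```python
import time


class AlarmWatchdog:
    """Flags an alarm processor as stalled when events are pending
    but none has been processed for more than max_silence_s seconds."""

    def __init__(self, max_silence_s: float, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_progress = clock()
        self._pending = 0

    def record_queued(self) -> None:
        # Start the silence timer when the queue goes from empty to non-empty,
        # so that a long idle period is not mistaken for a stall.
        if self._pending == 0:
            self._last_progress = self._clock()
        self._pending += 1

    def record_processed(self) -> None:
        self._pending = max(0, self._pending - 1)
        self._last_progress = self._clock()

    def is_stalled(self) -> bool:
        silence = self._clock() - self._last_progress
        return self._pending > 0 and silence > self.max_silence_s
```

A separate process would poll is_stalled() and notify operators through a channel that does not depend on the alarm system itself.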
2005: WoW Glitch
 Game update on September 13th introduced the new boss character
"Hakkar"
 Hakkar was able to inflict a disease, "Corrupted Blood", on
player characters, draining their health points and finally
killing them
 Disease could be passed to other players
 Effect was meant to be localized to one game area
 Developers didn't consider WoW teleporting functionality
 Infected players teleported into other areas, soon leading to
corpses littering the streets
 Fortunately, player death is not permanent in WoW and
admins reset the game
 (Virtual) death toll: unknown
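The containment failure can be illustrated with a toy model in which every status effect records the zone it is allowed in and is dropped when the character leaves that zone. This is a hypothetical sketch, not Blizzard's code; Effect, Character, teleport, and the zone names are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Effect:
    name: str
    allowed_zone: str  # the only zone in which this effect may persist


@dataclass
class Character:
    name: str
    zone: str
    effects: set = field(default_factory=set)

    def teleport(self, destination: str) -> None:
        """Move to another zone and drop effects that must not leave their zone."""
        self.zone = destination
        self.effects = {e for e in self.effects if e.allowed_zone == destination}


# Usage: the disease is stripped as soon as the character leaves the raid zone.
plague = Effect("Corrupted Blood", allowed_zone="raid_instance")
hero = Character("hero", zone="raid_instance", effects={plague})
hero.teleport("city")
```

Enforcing the rule at the single point where zone changes happen means a new travel mechanism cannot bypass it, provided all movement goes through that point.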
2012: Knight Capital loses 440M USD
 August 1st: New trading software installed
 Administrator forgets to deploy on one out of eight server
nodes
 New code repurposed a flag previously used for testing
scenarios
 On that one server node, old trading algorithm interprets flag
differently and starts buying and selling 100 different stocks
randomly without human verification
 NYSE has to suspend trade of several stocks
 Knight Capital loses USD 440 million in only 45 minutes, until
system is suspended
 Investors have to raise USD 400 million in order to rescue the
company
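A deployment gate that refuses to enable a repurposed flag unless every node reports the expected build is one way to catch a partial rollout like this. The following is a minimal sketch under that assumption; nodes_missing_build, can_enable_flag, the node names, and the version strings are hypothetical and unrelated to Knight Capital's actual systems.

```python
def nodes_missing_build(reported: dict[str, str], expected: str) -> list[str]:
    """Return the sorted names of nodes whose reported build differs from expected."""
    return sorted(node for node, build in reported.items() if build != expected)


def can_enable_flag(reported: dict[str, str], expected: str) -> bool:
    """Allow activation only when every node runs the expected build."""
    return not nodes_missing_build(reported, expected)


# Usage: eight nodes, one of which was skipped during the rollout.
fleet = {f"node{i}": "2.0" for i in range(1, 9)}
fleet["node8"] = "1.9"
stale = nodes_missing_build(fleet, "2.0")
```

Here stale names the node that still runs the old build, and can_enable_flag(fleet, "2.0") stays false until that node is updated.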
Source: http://www.typemock.com/software-bugs-infographic
Why do SW projects fail (IEEE)
 Unrealistic or unarticulated project goals
 Inaccurate estimates of needed resources
 Badly defined system requirements
 Poor reporting of the project's status
 Unmanaged risks
 Poor communication among customers, developers, and users
 Use of immature technology
 Inability to handle the project's complexity
 Sloppy development practices
 Poor project management
 Stakeholder politics
 Commercial pressures
Thank you!
Twitter: https://twitter.com/ArnoHu
Blog: http://arnosoftwaredev.blogspot.com
