OPS Forum Embracing the future - a retrospective look 05.09.2008


Preparing for the future in ESA's Operations Directorate is more important than ever. This forum reviews how we have handled change in the past, to help us better prepare for future transformations. In particular, we must prepare for changes in ESOC's workload during 2012-18 and cope with organisational changes such as financial reform happening at the Agency level. Coping with change is nothing new to ESOC – this ability has been there from the beginning. The speaker has at various points in his career been involved in transformations and he has selected four subjects for this forum, all of which bring useful lessons for the future.


  1. 1. Embracing the future - a retrospective look Michael Jones OPS-G Forum 5th September 2008
  2. 2. Contents <ul><li>1. “Governance” </li></ul><ul><li>2. Change can be slower than you think! </li></ul><ul><li>3. FFP contracts – the magic bullet? </li></ul><ul><li>4. The Black Swan: the improbable in operations </li></ul><ul><li>5. Software Dependability </li></ul>
  3. 3. Governance <ul><li>“The use of institutions, structures of authority and even collaboration to allocate resources and coordinate or control activity [in society or the economy].” (Wikipedia) </li></ul>
  4. 4. Governance in the SSA Programme <ul><li>What is the SSA Programme? </li></ul><ul><ul><li>Provides a systematic capability for surveillance of man-made objects in the space around the earth; </li></ul></ul><ul><ul><li>provides warnings of collisions that may endanger space activities or even life on earth. </li></ul></ul><ul><li>Governance = making decisions on how the programme and deployed assets are to be run. </li></ul>
  5. 5. Data Systems Governance: Data Systems Task Force <ul><li>The Data Systems Task Force (DSTF) must: </li></ul><ul><li>“Ensure the availability of adequate strategy and plans for the mission data infrastructure and monitor the execution of those plans in order to ensure timely availability of the mission data infrastructure.” </li></ul><ul><li>This means that the DSTF in effect carries out governance of the data systems infrastructure. </li></ul>
  6. 6. Conclusions on Governance <ul><li>Governance is a buzzword that you will continue to hear! </li></ul><ul><li>Recent or emerging examples are: </li></ul><ul><ul><li>Establishment of the ESA Security Office; </li></ul></ul><ul><ul><li>Software licence governance for ESA and Third Party Software. </li></ul></ul><ul><li>Mike Jones’s proposed definition of “governance” to fit its usage in ESA: </li></ul><ul><li>“The process of making decisions, the oversight of the results of those decisions and also the oversight of organisations or structures of authority for decision making.” </li></ul>
  7. 7. Change can be slower than you think: Example 1: SCOS-2000 <ul><li>SCOS-2 (which became SCOS-2000) was a new MCS infrastructure developed from scratch. </li></ul><ul><li>A very brief summary of the timeline of the project up to 2002: </li></ul><ul><ul><li>Start of project as SCOS-2: 1992; </li></ul></ul><ul><ul><li>Version 1 (as SCOS-2) used for Huygens, MTP and Teamsat: late 1997; </li></ul></ul><ul><ul><li>Re-engineering of SCOS-2 (mainly TC chain): 1997-1998; </li></ul></ul><ul><ul><li>Parallel production of architectural designs for both SCOS-1 and SCOS-2 baselines for the Integral MCS: 2nd half of 1998; </li></ul></ul><ul><ul><li>Adoption of SCOS-2 as the Integral MCS baseline: January 1999; </li></ul></ul><ul><ul><ul><li>Integral was the first major ESA science spacecraft based on SCOS-2; </li></ul></ul></ul><ul><ul><li>SCOS-2 renamed SCOS-2000: 2000; </li></ul></ul><ul><ul><li>Supported INTEGRAL LEOP: 17th October 2002, using SCOS-2000 rel. 2.3. </li></ul></ul><ul><li>So it took 10 years to reach the point at which the new infrastructure became generally accepted; the original plan was 5 years. </li></ul>
  8. 8. Change can be slower than you think: Example 1: SCOS-2000 - Conclusion <ul><li>Developing a new mission control system infrastructure from scratch is difficult and time consuming. </li></ul><ul><li>First lesson – try to avoid building new MCS infrastructures – “evolution, not revolution”. </li></ul><ul><li>Second lesson – if you have to do it, develop a simple version first. </li></ul>
  9. 9. Change can be slower than you think: Example 2: Intel/Linux MCS Infrastructure <ul><li>In 2001, it was decided to port SCOS-2000 to Linux. </li></ul><ul><li>The port was straightforward: by 2002 a SCOS-2000 version was available which could run on either SUN Solaris or Linux platforms. </li></ul><ul><li>Outside ESOC, S2K became popular as a licensable product and, with one exception, external (non-ESOC) projects using SCOS-2000 have been based on Linux. </li></ul><ul><li>At ESOC, the move to the Linux version proceeded cautiously in two stages: </li></ul><ul><ul><li>1. a pilot project with Linux server and SUN clients (Herschel Planck, S2K rel. 4); </li></ul></ul><ul><ul><li>2. a Linux transition project to install Linux clients in all the common areas. </li></ul></ul><ul><li>Stage 1 was successfully completed ca. 2006. Stage 2, started in 2007, has been completed for the MCR. </li></ul><ul><ul><li>Intel workstations for the remaining common areas will be procured this year together with a reserve of spares. </li></ul></ul><ul><li>We are now aiming at supporting Herschel Planck LEOP using the new Linux infrastructure installed by the LIT project. </li></ul>
  10. 10. Change can be slower than you think: Conclusion – Linux <ul><li>It has taken more than 6 years to reach the point of having a common Intel/Linux infrastructure. </li></ul><ul><li>Where you have a large installed park of workstations (ca. 1400 in this case) change is quite slow, since the missions already installed on the old platforms will not want to, or be able to, change. </li></ul>
  11. 11. FFP Contracts – the Magic Bullet? <ul><li>Before 1996 most software development at ESOC was done under fixed-unit price conditions. </li></ul><ul><li>Implications: </li></ul><ul><ul><li>ESOC “owned the risk” for the software requirements and their implementation. </li></ul></ul><ul><ul><li>The contractor companies took no responsibility - they simply provided man-hours of staff. </li></ul></ul><ul><li>In 1996 firm-fixed price (FFP) contracts for development of spacecraft control systems and simulators were introduced with the new frame contracts. </li></ul><ul><li>Prime motivation: </li></ul><ul><ul><li>Move contract staff off-site to their own companies’ premises; </li></ul></ul><ul><ul><li>FFP regime much more suitable for off-site work. </li></ul></ul><ul><li>FFP became the rule for most work awarded under these frame contracts, achieving: </li></ul><ul><ul><li>Far more rigorous scrutiny of requirements by frame contractors; </li></ul></ul><ul><ul><li>Better competition; </li></ul></ul><ul><ul><li>Equitable risk sharing between ESA and its suppliers; </li></ul></ul><ul><ul><li>Formal change control (contract change notices - CCNs). </li></ul></ul>
  12. 12. FFP Contracts – the Magic Bullet? <ul><li>Firm-fixed price contracts have been rather successful for MCS, simulator and station back-end software. </li></ul><ul><li>But: </li></ul><ul><li>Firm Fixed Price does not mean Firm Fixed Schedule! </li></ul><ul><ul><li>Contractor can underestimate the work to be done. </li></ul></ul><ul><ul><li>Recent example: Herschel Planck MPS, where the cheapest offer was taken and the contractor had underestimated the budget by a factor of nearly 10. </li></ul></ul>
  13. 13. FFP Contracts – The Magic Bullet? Conclusions <ul><li>1. The lowest acceptable offer may not always be the right choice, particularly if the schedule is important. </li></ul><ul><ul><li>A careful evaluation of management plan and technical solution is needed to ensure that the schedule can be met. </li></ul></ul><ul><li>2. For schedule-critical developments, a look at more sophisticated techniques such as Earned Value Analysis may be needed. </li></ul>
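
As a rough illustration of the Earned Value Analysis mentioned in the slide above, the sketch below computes the standard schedule and cost indicators from planned value, earned value and actual cost. The class name and all figures are invented for illustration; they are not taken from any ESOC contract, and under an FFP regime the actual cost would be the contractor's own figure.

```python
from dataclasses import dataclass


@dataclass
class EarnedValueSnapshot:
    """Status of one work package at a reporting date (hypothetical figures)."""
    planned_value: float  # budgeted cost of the work scheduled to date
    earned_value: float   # budgeted cost of the work actually completed to date
    actual_cost: float    # actual cost incurred for the completed work

    @property
    def schedule_variance(self) -> float:
        # Negative means less work has been completed than planned.
        return self.earned_value - self.planned_value

    @property
    def cost_variance(self) -> float:
        # Negative means the completed work cost more than budgeted.
        return self.earned_value - self.actual_cost

    @property
    def spi(self) -> float:
        # Schedule performance index: below 1.0 means behind schedule.
        return self.earned_value / self.planned_value

    @property
    def cpi(self) -> float:
        # Cost performance index: below 1.0 means over budget.
        return self.earned_value / self.actual_cost


# Invented example: 500 units of work planned, 300 completed, 450 spent.
snapshot = EarnedValueSnapshot(planned_value=500.0, earned_value=300.0, actual_cost=450.0)
print(f"SV={snapshot.schedule_variance:+.0f}  SPI={snapshot.spi:.2f}")
print(f"CV={snapshot.cost_variance:+.0f}  CPI={snapshot.cpi:.2f}")
```

An SPI well below 1.0 early in a development is the kind of signal that could flag a schedule problem before a delivery milestone is missed.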
  14. 14. Black Swans <ul><li>“The Black Swan” is the title of a book by Nassim Nicholas Taleb. </li></ul><ul><li>A black swan is a large-impact, hard-to-predict, and rare event beyond the realm of normal expectations. </li></ul><ul><li>The term comes from the ancient Western conception that 'All swans are white'. </li></ul>
  15. 15. Black Swan: The Turkey Example
  16. 16. The Black Swan: How we deal with the unexpected in operations <ul><li>Try to make operations fully predictable: </li></ul><ul><li>Plan and prepare ground segments very carefully. </li></ul><ul><li>Technically validate them thoroughly. </li></ul><ul><li>Prepare procedures and plans for operations. </li></ul><ul><li>Operationally validate them with an extensive simulations programme aimed at training all the teams and ensuring that systems, documentation and operations staff all work together. </li></ul><ul><li>The operations validation also includes contingencies or anomaly cases to ensure the unexpected can be handled. </li></ul><ul><li>This is the discipline of Operations Engineering. </li></ul>
  17. 17. The Black Swan and Software <ul><li>ESOC uses systems containing lots of software. </li></ul><ul><li>In the real world much software is complex - </li></ul><ul><ul><ul><li>no single person can understand it completely. </li></ul></ul></ul><ul><li>“Complex” in this case means “Big” - </li></ul><ul><ul><ul><li>complexity varies as a power of the size. </li></ul></ul></ul><ul><li>Behaviour of any complex software system cannot be fully understood - </li></ul><ul><ul><li>highly improbable or “black swan” events may occur. </li></ul></ul>
  18. 18. Black Swan: The MCS Incident during the MSG-1 LEOP (28th August 2002) <ul><li>A number of the client workstations in the MCR, PSR and SSR suddenly became unusable - went to the SUN login. </li></ul><ul><li>Softcoor logged into the A server from the SSR and restarted the system. </li></ul><ul><li>This appeared to work, but then </li></ul><ul><ul><li>a SCOS-2000 communications task stopped processing on the server; </li></ul></ul><ul><ul><li>two telecommanding tasks (multiplexer and releaser) crashed. </li></ul></ul><ul><li>Attempts to switch clients to the redundant B server also failed. </li></ul><ul><li>Fortunately in the meantime the spacecraft was safe - </li></ul><ul><ul><li>despite the problems with the clients, telemetry was received and processed on both A and B servers. </li></ul></ul><ul><li>Softcoor then took the decision to move to a third chain, the C-system. </li></ul><ul><li>He was then able to log out all clients on the A and B chains and to restart the servers on both of them. </li></ul><ul><li>The systems were made available to the flight control and project teams about 20 minutes later. </li></ul>
  19. 19. Black Swan: MSG-1 Diagnosis and Conclusions <ul><li>Diagnosis: </li></ul><ul><li>The server had been started as a foreground task remotely from a SUN WS in the SSR – this created a dependency between the server and the SSR SUN WS. </li></ul><ul><li>For reasons unknown, this SUN had a problem and went to “login” status, resulting in the stopping of the server tasks started directly from this SUN. </li></ul><ul><li>There was an implementation error in the MISCdynamic server relating to CORBA event processing. </li></ul><ul><li>Problem resolution: </li></ul><ul><ul><li>Start the server as a background task. </li></ul></ul><ul><ul><li>Correct one CORBA call in the MISCdyn server. </li></ul></ul><ul><li>A full explanation of everything that happened was not possible - for example, why the SSR SUN went to “login” in the first place - since the logs were inadequate. </li></ul>
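
The general idea behind "start the server as a background task" can be sketched as follows: launch the process in its own session, with its input and output detached from the launching terminal, so that it no longer depends on the workstation it was started from. This is only a generic POSIX illustration; the function name, the stand-in command and the log path are hypothetical, and it does not describe how SCOS-2000 servers are actually started.

```python
import os
import subprocess
import sys
import tempfile


def start_detached(command: list[str], log_path: str) -> subprocess.Popen:
    """Start `command` in its own session, logging to a file (POSIX only)."""
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,   # no input from the launching terminal
            stdout=log_file,            # output goes to a log file instead
            stderr=subprocess.STDOUT,
            start_new_session=True,     # setsid() in the child: new session, no controlling terminal
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent can close its copy.
        log_file.close()


# Stand-in for a real server command (hypothetical): a short Python one-liner.
log_path = os.path.join(tempfile.mkdtemp(), "server.log")
proc = start_detached([sys.executable, "-c", "print('server up')"], log_path)
proc.wait(timeout=30)
with open(log_path) as handle:
    print(handle.read().strip())
```

Writing output to a persistent log file also speaks to the last point of the slide: a diagnosis is only possible if the logs survive the incident.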
  20. 20. MSG-1 Incident: Discussion <ul><li>Problems in complex software cannot be excluded. </li></ul><ul><li>ESOC approach is very practical and sound: </li></ul><ul><ul><li>a software coordinator thoroughly familiar with the system; </li></ul></ul><ul><ul><li>assisted by a very qualified software support team; </li></ul></ul><ul><ul><ul><li>both fully involved in the sim campaign; </li></ul></ul></ul><ul><ul><li>Ensured quick recovery in MSG case. </li></ul></ul><ul><li>An operations engineering technique is applied to software engineering. </li></ul>
  21. 21. Software Dependability <ul><li>“Software dependability” seeks to quantify how much we can rely on a software system to function as required. </li></ul><ul><li>However, it is impossible with any reasonable effort to ensure there are no errors in a large software system, e.g. </li></ul><ul><ul><li>SCOS-2000, which comprises several millions of lines of software code written since the mid-1990s. </li></ul></ul><ul><li>There is a widespread misapprehension that it is possible to quantify the errors in computer code. </li></ul>
  22. 22. Software Dependability: Example 1 – Misunderstanding Software Bugs <ul><li>“Even if the tools are better, the number of bugs in newly written code has remained constant at around five per “function point”. . . Worse,. . . only about 85% of these bugs are eliminated before software is put into use.” [my underlining] (Economist Technology Quarterly, March 6, 2008) </li></ul><ul><li>You can measure the number of bugs found before putting the software into use; </li></ul><ul><li>But you cannot know how many bugs remain, unless the software is very simple. </li></ul>
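
To make the arithmetic behind the quoted figures explicit, the sketch below applies them to a hypothetical system of 1,000 function points (the size is an assumption, as are the function and variable names). Note what the calculation takes for granted: the "remaining" figure only follows if the total number of injected bugs is known, which is exactly the quantity that cannot be observed in practice.

```python
def residual_bug_estimate(function_points: int,
                          bugs_per_function_point: float = 5.0,
                          removal_rate: float = 0.85) -> dict[str, float]:
    """Apply the quoted industry averages to a system of a given size."""
    injected = function_points * bugs_per_function_point  # assumed, never observed
    found = injected * removal_rate                        # the only measurable part
    return {"injected": injected, "found": found, "remaining": injected - found}


# Hypothetical system size; the defaults are the figures from the quotation.
estimate = residual_bug_estimate(function_points=1_000)
print({name: round(value) for name, value in estimate.items()})
```

With these inputs the averages imply roughly 750 bugs still present at delivery, but this is an extrapolation from an assumed total, not a measurement of the delivered software.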
  23. 23. Black Swan: Example 2 – Misunderstanding Software Bugs <ul><li>“The supplier shall verify the software code ensuring: . . . 7. absence of run-time errors; 8. absence of memory leaks . . .” (Source: ECSS-E-40C) </li></ul><ul><li>It is impossible to demonstrate a negative proposition such as this, e.g. no run-time errors. </li></ul><ul><li>Absence of evidence is not evidence of absence. </li></ul>
  24. 24. Black Swan: Conclusion on Example 2 “There are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.” (Donald Rumsfeld, U.S. Secretary of Defense, 2001 to 2006)
  25. 25. Software Dependability: Example 3 – Software Criticality <ul><li>ECSS-E-40C puts tailoring according to software criticality in a normative annex. For example the standard requires 100% path coverage testing for Class B criticality software. </li></ul><ul><li>Critique: </li></ul><ul><li>100% coverage testing is, in practice, impossible for very complex systems; </li></ul><ul><li>Even if you ensure 100% coverage testing, there is still no guarantee that the software is free from error. </li></ul><ul><li>Discussion: </li></ul><ul><li>For on-board software it is reasonable that quite heavy measures are taken in development to ensure dependable software. </li></ul><ul><li>ECSS-E-40C </li></ul><ul><ul><li>Shows a very strong influence from on-board development practice. </li></ul></ul><ul><ul><li>Does not take into account the impacts for ground software, which typically is much bigger and more complex. </li></ul></ul>
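
A simple counting argument suggests why 100% path coverage is out of reach for large systems. The sketch below assumes the most favourable case, straight-line code with a number of independent two-way decisions and no loops, in which the number of distinct execution paths is 2 to the power of the number of decisions. The function names and the test rate are illustrative assumptions; real software, with loops and interacting conditions, generally behaves differently.

```python
from itertools import product

SECONDS_PER_YEAR = 365 * 24 * 3600


def path_count(decisions: int) -> int:
    """Paths through `decisions` independent two-way branches in sequence."""
    return 2 ** decisions


def enumerate_paths(decisions: int) -> list[tuple[bool, ...]]:
    """List every combination of branch outcomes (only feasible for small inputs)."""
    return list(product((False, True), repeat=decisions))


def years_to_test_all_paths(decisions: int, tests_per_second: float) -> float:
    """Time needed to execute each path once at a given (assumed) test rate."""
    return path_count(decisions) / tests_per_second / SECONDS_PER_YEAR


print(len(enumerate_paths(3)))  # small enough to enumerate explicitly
print(path_count(64))
print(f"{years_to_test_all_paths(64, 1e6):,.0f} years at one million tests per second")
```

Even with only 64 independent decisions, which is tiny compared with a system of millions of lines, exhaustive path testing at this assumed rate would take several hundred thousand years.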
  26. 26. Final Conclusions <ul><li>You can </li></ul><ul><ul><li>Have good governance; </li></ul></ul><ul><ul><li>Develop your ground systems in a careful way, piloting new technology and taking plenty of time; </li></ul></ul><ul><ul><li>Ensure our industrial partners are fully motivated via competitive firm-fixed price contracts; </li></ul></ul><ul><li>But you can still be hit by unexpected problems in operations, especially in complex software. </li></ul><ul><li>The way to successfully tackle these unpredictable anomalies or incidents is to have a skilled team, fully familiar with the software and fully involved in the sims campaign. </li></ul>
  27. 27. Questions?