HAB Software Woes



My talk from the UKHAS 2012 conference about problems in HAB software.



  1. HAB Software Woes
     John Graham-Cumming, September 2012
     Or “My capsule didn’t crash but my software did”
  2. Background
     • > 30 years of programming experience
     • One HAB flight
       ◦ GAGA-1
     http://blog.jgc.org/2011/04/gaga-1-flight.html
     https://github.com/jgrahamc/gaga
  3. Where’s your flight’s complexity?
     • Example: GAGA-1
       ◦ One balloon, parachute, polystyrene box
       ◦ Many metres of cord attached with knots
       ◦ An off-the-shelf camera
       ◦ 2,836 lines of code
       ◦ Common to see defect rates of 2 to 4 per KLOC
       ◦ So GAGA-1 likely has 5 to 10 errors in it
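The estimate on this slide is simple arithmetic, sketched below. The line count and the 2 to 4 defects per KLOC range come from the slide; the function name `expected_defects` is made up for illustration.

```python
# Back-of-envelope defect estimate using the figures quoted on the slide.
LINES_OF_CODE = 2836  # GAGA-1 line count from the slide


def expected_defects(lines, defects_per_kloc):
    """Return the expected number of defects for a code base of `lines` lines."""
    return lines / 1000 * defects_per_kloc


low = expected_defects(LINES_OF_CODE, 2)
high = expected_defects(LINES_OF_CODE, 4)
print("Expected defects: %.1f to %.1f" % (low, high))
```

This gives about 5.7 to 11.3, which the slide rounds to 5 to 10.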
  4. Real Stuff Seen on HAB flights
     • Complete computer crash
     • Altitude going negative
     • Latitude and longitude garbled
     • Cutdown triggered in back of car
     • Long periods of no transmission
     • Not setting the GPS up before launch
     • Not turning the camera on
     • Running out of camera disk space
     • Altitude jumping around rhythmically
  5. The Curse and Joy of Determinism
     • Computers do what you tell them to
       ◦ Precisely what you tell them to
       ◦ Not what you think you told them to do
     • A Curse
       ◦ Will do things you don’t expect
       ◦ Will process bogus input without complaint
     • The Joy
       ◦ Easy to test that it does what’s expected
  6. HAB Is A Harsh Environment
     • Cold
     • Vibration
     • Stuff breaks in flight
     • Software needs to be able to cope with failing hardware
     • Very important to think about failure modes
     • YOUR CODE IS ON ITS OWN OUT THERE
  7. Deadly Sins
     • The “It works!” Fallacy
     • The Last Minute Change
     • Being Far Too Clever
     • Overlooking Odd Behaviour
     • Copying Other People’s Code
     • Assuming Finding A Bug Solves The Problem
  8. The “It works!” Fallacy
     • If you’re an inexperienced (and sometimes experienced) programmer…
       ◦ You hack some code together
       ◦ It works once
       ◦ You assume it will always work
     • Only solution to this is
       ◦ Testing
       ◦ Paranoia
  9. The Last Minute Change
     • Never, ever change anything in code at the last minute, no matter how simple.
     • Example: HABE 1
       ◦ Complete camera failure
       ◦ Maximum integer size in uBASIC on CHDK is 999,999
       ◦ Last minute change of integer from 600,000 to 1,000,000 caused total failure
  10. Being Far Too Clever
      • Example: GAGA-1
        ◦ Entered the wrong value of 2 * pi in code to do GPS position conversion from radians to degrees
        ◦ Caught before flight because I verified the location of my own back garden
        ◦ Note to self: 2 * pi != 6.2818.
      https://github.com/jgrahamc/gaga/blob/master/gaga-1/flight/gaga1/gps.cpp#L113
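To get a feel for how much a mistyped 2 * pi matters, here is a minimal sketch. It is not the GAGA-1 code (that is C++ at the link above); the function name and the test latitude of 0.9 radians are assumptions chosen for illustration.

```python
import math

WRONG_TWO_PI = 6.2818  # the mistyped constant quoted on the slide


def radians_to_degrees(rad, two_pi=2 * math.pi):
    """Convert radians to degrees; two_pi is a parameter only so the typo can be tried."""
    return rad * 360.0 / two_pi


lat_rad = 0.9  # hypothetical latitude in radians, roughly 51.57 degrees north
good = radians_to_degrees(lat_rad)
bad = radians_to_degrees(lat_rad, WRONG_TWO_PI)
error_deg = abs(bad - good)

# A known-good reference catches the typo, much like checking a known location.
assert abs(good - math.degrees(lat_rad)) < 1e-9
print("error in degrees: %.4f" % error_deg)
```

The error is roughly a hundredth of a degree of latitude, which is on the order of a kilometre on the ground. That is small enough to look plausible at a glance and large enough to matter when searching for a payload.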
  11. Overlooking Odd Behaviour
      • Example: GAGA-1
        ◦ In tests RTTY output was fine some of the time, garbled at other times
        ◦ Turned out to be interrupts from the GPS messing up the RTTY timing
        ◦ Solution: disable GPS serial interface while sending RTTY string
      • ALWAYS BE HONEST WITH YOURSELF ABOUT YOUR CODE
      • EXPECT THE SPANISH INQUISITION!
      https://github.com/jgrahamc/gaga/blob/master/gaga-1/flight/gaga1/tsip.cpp#L229
  12. Copying Other People’s Code
      • Don’t do this: you have no idea what you are copying or who they copied it from
      • Better practice is to look at other people’s code and…
        ◦ Write your own version
        ◦ That you understand
        ◦ That you are able to test
        ◦ Example: GAGA-1: read lots of people’s RTTY code, wrote my own
      https://github.com/jgrahamc/gaga/blob/master/gaga-
  13. APRS Tracker using copied code
      • If the altitude in metres contained an 8 or a 9, the altitude reported would be wrong
      http://sharon.esrac.ele.tue.nl/users/pe1rxq/aprstracker/aprstracker.html
  14. Assuming Finding The Bug Solves The Problem
      • Just because you’ve found A bug doesn’t mean it was THE bug
      • Lots of research in computer science shows bugs tend to cluster
      • Example: CLOUD1, CLOUD2
        ◦ Three bugs in printing latitude, longitude and altitude
        ◦ One fixed on CLOUD1, …
  15. “The One Thing I Didn’t Test”
      http://ukhas.org.uk/guides:common_coding_errors_payload_testing
  16. Common problems with microcontrollers (uC)
      • Lack of floating point support
      • Small integers
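The “small integers” problem can be sketched in Python by wrapping values the way a 16-bit signed `int` typically behaves (as `int` is on 8-bit AVR Arduinos). The helper names below are made up, and the 3.3 feet per metre factor is the one implied by the unit test slide later in this talk. This is an illustration of the failure mode, not a claim about what caused any particular flight’s bug.

```python
def wrap_int16(value):
    """Wrap an integer into the range of a 16-bit two's-complement int."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value


def m_to_f_int16(metres):
    """Metres to feet as m * 33 / 10, with every intermediate held in 16 bits."""
    product = wrap_int16(metres * 33)      # this multiplication is what overflows
    return wrap_int16(int(product / 10))   # int() truncates toward zero, like C


for metres in (900, 1000, 12000):
    print(metres, "m ->", m_to_f_int16(metres), "ft")
```

With these assumptions the conversion is fine at 900 m but already goes negative at 1,000 m, because 1000 * 33 no longer fits in 16 bits. Doing the multiplication in a wider type (for example `long` on an Arduino) avoids this without needing floating point.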
  17. You might never be a great programmer… but you can be a paranoid tester!
  18. Good Things To Do
      • No infinite loops
      • Self-Checking
      • Unexpected Error Handling
      • Handle Exceptions
      • Simulation
      • Simplify, Simplify, Simplify
      • Unit Test
      • Write Log Files
  19. No Infinite Loops
      • Never sit in a loop waiting forever
      • Example: ATLAS 3

          while (1) {
            // Make sure data is available to read
            if (Serial.available()) {
              b = Serial.read();
              if (bytePos == 8) {
                navmode = b;
                return true;
              }
              bytePos++;
            }
            // Timeout if no valid response in 3 seconds
            if (millis() - startTime > 3000) {
              navmode = 0;
              return false;
            }
          }
        } // closes the enclosing function, whose header is not shown on the slide

      https://github.com/jamescoxon/Atlas-Flight-Computer/blob/master/Atlas3/Atlas3_3.pde#L
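The same idea, a wait that always has a deadline, can be sketched in Python. `read_byte` is a hypothetical callable that returns a byte value, or `None` when nothing is available; the 3 second default mirrors the ATLAS 3 example, and the polling interval is an arbitrary choice.

```python
import time


def read_with_timeout(read_byte, timeout_s=3.0, clock=time.monotonic):
    """Poll read_byte() until it returns a value or timeout_s seconds elapse.

    Returns the value read, or None if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        value = read_byte()
        if value is not None:
            return value
        time.sleep(0.01)  # brief pause so the loop does not spin flat out
    return None


# A device that never answers gives up instead of hanging the flight computer.
print(read_with_timeout(lambda: None, timeout_s=0.05))
```

The caller has to handle the `None` case explicitly, which forces a decision about what the flight computer should do when the hardware stays silent.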
  20. Self-Checking

          -- Now enter a self-check of the manual mode settings
          log( "Self-check started" )
          assert_prop( 49, -32764, "Not in manual mode" )
          assert_prop( 5, 0, "AF Assist Beam should be Off" )
          assert_prop( 6, 0, "Focus Mode should be Normal" )
          assert_prop( 8, 0, "AiAF Mode should be On" )
          assert_prop( 21, 0, "Auto Rotate should be Off" )
          assert_prop( 29, 0, "Bracket Mode should be None" )
          assert_prop( 57, 0, "Picture Mode should be Superfine" )
          assert_prop( 66, 0, "Date Stamp should be Off" )
          assert_prop( 95, 0, "Digital Zoom should be None" )
          assert_prop( 102, 0, "Drive Mode should be Single" )
          assert_prop( 133, 0, "Manual Focus Mode should be Off" )
          assert_prop( 143, 2, "Flash Mode should be Off" )
          assert_prop( 149, 100, "ISO Mode should be 100" )
          assert_prop( 218, 0, "Picture Size should be L" )
          assert_prop( 268, 0, "White Balance Mode should be Auto" )
          assert_gt( get_time("Y"), 2009, "Unexpected year" )
          assert_gt( get_time("h"), 6, "Hour appears too early" )
          assert_lt( get_time("h"), 20, "Hour appears too late" )
          assert_gt( get_vbatt(), 3000, "Batteries seem low" )
          assert_gt( get_jpg_count(), ns, "Insufficient card space" )

      https://github.com/jgrahamc/gaga/blob/master/gaga-1/camera/gaga-1.lua#L96
  21. Self-Checking
      • Example: ATLAS 3
      • Makes sure uBlox GPS will work at high altitude; fixes it if not

          if ((count % 10) == 0) {
            digitalWrite(6, LOW);
            checkNAV();
            delay(1000);
            if (navmode != 6) {
              setupGPS();
              delay(1000);
            }
            checkNAV();
            delay(1000);
            digitalWrite(6, HIGH);
          }

      https://github.com/jamescoxon/Atlas-Flight-Computer/blob/master/Atlas3/Atlas3_3.pde#L3
  22. Unexpected Error Handling

          def temperature():
              t = at.cmd( 'AT#TEMPMON=1' )

              # Command returns something like:
              #
              # #TEMPMEAS: 0,28
              #
              # OK
              #
              # So split on whitespace first to isolate the temperature 0,28
              # and then split on comma to get the temperature

              w = t.split()

              if len(w) < 2:
                  logger.log( "Temperature read returned %s" % t )
                  return -1000

              m = w[1].split(',')

              if len(m) != 2:
                  logger.log( "Temperature read returned %s" % t )
                  return -1000
              else:
                  return int(m[1])

      https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/util.py
  23. Handle Exceptions
      • If your language can generate exceptions then you’d better handle them!
      • Example: GAGA-1
        ◦ Recovery computer used Python
        ◦ Exception could have killed it
        ◦ Global exception handler

          except:
              logger.log( "Caught exception in main loop: %s" % sys.exc_info()[1] )

      • Bonus: What’s wrong with that code?
      https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/gaga-1.py#L144
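A minimal sketch of a main loop with a global handler follows. It is not the GAGA-1 code and is not offered as the answer to the bonus question; `run_main_loop`, `flaky_step` and the bounded `iterations` argument are assumptions made so the example terminates and can be tested.

```python
import traceback


def run_main_loop(step, log, iterations):
    """Call step() repeatedly; log any exception and carry on with the next pass."""
    failures = 0
    for _ in range(iterations):
        try:
            step()
        except Exception:
            # This sketch's choice: `except Exception` leaves KeyboardInterrupt
            # and SystemExit alone, unlike a bare `except:`.
            failures += 1
            try:
                log("Caught exception in main loop: %s" % traceback.format_exc())
            except Exception:
                pass  # a failing logger must not take the loop down with it
    return failures


calls = {"n": 0}


def flaky_step():
    """Stand-in for one pass of flight logic; fails on every second call."""
    calls["n"] += 1
    if calls["n"] % 2 == 0:
        raise ValueError("bad GPS sentence")


messages = []
failures = run_main_loop(flaky_step, messages.append, iterations=4)
print(failures, "failures logged")
```

Logging the full traceback rather than only the exception value makes post-flight debugging easier, at the cost of larger log entries.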
  24. Simulation
      • Simulate a flight
      • Example: UKHAS wiki has example of using a PC as a fake GPS
        http://www.ukhas.org.uk/guides:common_coding_errors_payload_testing
      • Example: GAGA-1
        ◦ To test the embedded Telit module wrote modules that faked the entire Telit Python interface.
      https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/GPS.py
      https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/MDM.py
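One way to simulate a flight is to generate the GPS sentences yourself and feed them to the code under test. The sketch below builds NMEA GGA sentences with a valid checksum. It is neither the UKHAS wiki example nor the GAGA-1 fake modules; the function names, the fixed position and the altitude profile are all assumptions.

```python
def nmea_checksum(body):
    """XOR of every character between '$' and '*', as two upper-case hex digits."""
    checksum = 0
    for ch in body:
        checksum ^= ord(ch)
    return "%02X" % checksum


def to_nmea_angle(value, degree_digits):
    """Format decimal degrees as NMEA (d)ddmm.mmmm."""
    value = abs(value)
    degrees = int(value)
    minutes = (value - degrees) * 60.0
    return "%0*d%07.4f" % (degree_digits, degrees, minutes)


def fake_gga(lat_deg, lon_deg, alt_m, hhmmss="120000"):
    """Build a GGA sentence; fix quality, satellite count and HDOP are made up."""
    body = "GPGGA,%s,%s,%s,%s,%s,1,08,0.9,%.1f,M,0.0,M,," % (
        hhmmss,
        to_nmea_angle(lat_deg, 2), "N" if lat_deg >= 0 else "S",
        to_nmea_angle(lon_deg, 3), "E" if lon_deg >= 0 else "W",
        alt_m,
    )
    return "$%s*%s" % (body, nmea_checksum(body))


def simulated_flight(peak_m=30000, step_m=5000):
    """Yield sentences for a crude ascent to peak_m followed by a descent."""
    altitudes = list(range(0, peak_m + 1, step_m))
    for alt in altitudes + altitudes[-2::-1]:
        yield fake_gga(51.5, -0.25, float(alt))


for sentence in simulated_flight():
    print(sentence)
```

Feeding the parser altitudes well above anything seen on the ground, plus deliberately corrupted sentences, exercises paths that a garden test never reaches.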
  25. Simplify, Simplify, Simplify
      • Make your code as simple as possible
      • Never have ‘duplicated’ or ‘copy and paste’ code
      • Break it up into small functions that you understand
      • Make sure you understand the limitations of the functions you call
  26. Unit Test
      • Break your program up into small, separate functions
      • Write tests that call each function and make sure it does what you expect.
      • Lots of ways to do this
        ◦ Use something like cpptest
        ◦ ArduinoUnit
        ◦ Write your own test program
  27. Unit Test Example
      • In the bad APRS program
      • Turn metres to feet code into a separate function: int m_to_f(int m)

          assertEquals(m_to_f(1000),3300)
          assertEquals(m_to_f(2000),6600)
          assertEquals(m_to_f(3000),9900)
          assertEquals(m_to_f(4000),13200)
          assertEquals(m_to_f(5000),16500)
          assertEquals(m_to_f(6000),19800)
          assertEquals(m_to_f(7000),23100)
          assertEquals(m_to_f(8000),26400)
          assertEquals(m_to_f(9000),29700)
          assertEquals(m_to_f(10000),33000)
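The same test can be written as a runnable sketch with Python’s `unittest`. The implementation of `m_to_f` here is an assumption: it uses integer arithmetic and the 3.3 feet per metre factor implied by the slide’s expected values (the more exact factor is about 3.28084).

```python
import unittest


def m_to_f(m):
    """Convert whole metres to whole feet using the slide's 3.3 ft/m factor."""
    return m * 33 // 10


class MetresToFeetTest(unittest.TestCase):
    def test_every_thousand_metres(self):
        # 8000 and 9000 are included on purpose: the copied APRS code went
        # wrong whenever the altitude contained an 8 or a 9.
        for km in range(1, 11):
            self.assertEqual(m_to_f(km * 1000), km * 3300)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MetresToFeetTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the conversion is now an isolated function, the test runs on a PC in a fraction of a second, with no tracker hardware involved.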
  28. Write Log Files
      • Write detailed log files to non-volatile memory for post flight debugging
      • Data sent via RTTY or APRS is limited
      • Log exceptions and errors in detail
      • Make sure you have a timestamp
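A minimal sketch of timestamped logging that pushes each line out to storage before continuing. The function name, file name and timestamp format are assumptions; on an embedded interpreter the flush and sync calls available may differ.

```python
import os
import tempfile
import time


def log_line(path, message, clock=time.time):
    """Append one UTC-timestamped line to path and force it out to storage."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(clock()))
    with open(path, "a") as f:
        f.write("%s %s\n" % (stamp, message))
        f.flush()              # push Python's buffer to the operating system
        os.fsync(f.fileno())   # ask the OS to write it to the device


log_path = os.path.join(tempfile.mkdtemp(), "flight.log")
log_line(log_path, "Self-check started")
log_line(log_path, "Temperature read returned garbage", clock=lambda: 0)
```

Syncing after every line costs time and flash wear, but a log that is still sitting in a buffer when the power fails is no help after the flight.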
  29. Perform system testing
      • Test your entire system before flight
        ◦ Put your tracker in the garden
        ◦ Get a GPS lock
        ◦ Listen to the RTTY on your radio
        ◦ Look at the decoded RTTY on your computer
        ◦ Test uploaded data on the tracker*
      *I didn’t do that step; on the day, people had to fix the tracker for me.