Make Sure Your Applications Crash

Presentation for PyCon 2012 about application reliability.

  1. Make Sure Your Applications Crash (Moshe Zadka)
  2. True story
  3. Python doesn't crash: memory is managed, and there is no direct pointer arithmetic.
  4. ...except it does: C bugs, untrapped exceptions, infinite loops, blocking calls, thread deadlocks, inconsistent resident state.
  5. Recovery is important: "[S]ystem failure can usually be considered to be the result of two program errors [...] the second, in the recovery routine [...]"
  6. Crashes and inconsistent data: a crash results in data from an arbitrary program state.
  7. Avoid storage: caches are better than master copies.
  8. Databases: transactions maintain consistency. Databases can crash too!
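The transaction claim above can be demonstrated with the standard-library sqlite3 module. This sketch (mine, not the talk's) simulates a crash in the middle of a two-row transfer and shows that the partial update is rolled back, leaving the data consistent:

```python
# Minimal sketch: a "crash" (exception) mid-transaction leaves no partial update.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
        # never reached: the second half of the transfer
        conn.execute(
            "UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except RuntimeError:
    pass

# The half-finished transfer was rolled back: balances are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Of course, as the slide says, the database process itself can crash too; transactions only guarantee that whatever state survives is a consistent one.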
  9. Atomic operations: file rename.
  10. Example: Counting

      import os

      def update_counter():
          fp = file("counter.txt")
          s = fp.read()
          counter = int(s.strip())
          counter += 1
          # If there is a crash before this point,
          # no changes have been done.
          fp = file("counter.txt.tmp", "w")
          print >>fp, counter
          fp.close()
          # If there is a crash before this point,
          # only a temp file has been modified.
          # The following is an atomic operation:
          os.rename("counter.txt.tmp", "counter.txt")
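The slide's code is Python 2. A Python 3 adaptation (mine, not the slide's) looks like this; it uses os.replace, which is atomic and, unlike os.rename, also overwrites an existing file on Windows:

```python
# Python 3 re-sketch of the slide's counter; filenames follow the slide.
import os
import tempfile

def update_counter(path="counter.txt"):
    with open(path) as fp:
        counter = int(fp.read().strip())
    counter += 1
    # If there is a crash before this point, no changes have been made.
    tmp = path + ".tmp"
    with open(tmp, "w") as fp:
        fp.write("%d\n" % counter)
    # If there is a crash before this point, only the temp file exists.
    os.replace(tmp, path)  # the atomic step

# Demonstration in a scratch directory:
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "counter.txt")
with open(path, "w") as fp:
    fp.write("41\n")
update_counter(path)
with open(path) as fp:
    value = int(fp.read().strip())
```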
  11. Efficient caches, reliable masters: mark inconsistency of the cache.
  12. No shutdown: crash in testing.
  13. Availability: if data is consistent, just restart!
  14. Improving availability: limit impact, fast detection, fast start-up.
  15. Vertical splitting: different execution paths, different processes.
  16. Horizontal splitting: different code bases, different processes.
  17. Watchdog: monitor -> flag -> remediate.
  18. Watchdog principles: keep it simple, keep it safe!
  19. Watchdog: Heartbeats

      ## In a Twisted process
      from twisted.internet import task

      def beat():
          file("beats/my-name", "a").close()
      task.LoopingCall(beat).start(30)
  20. Watchdog: Get time-outs

      import glob, time

      def getTimeout():
          timeout = dict()
          now = time.time()
          for heart in glob.glob("hearts/*"):
              beat = int(file(heart).read().strip())
              timeout[heart] = now - beat
          return timeout
  21. Watchdog: Mark problems

      import glob, os

      def markProblems():
          timeout = getTimeout()
          for heart in glob.glob("beats/*"):
              mtime = os.path.getmtime(heart)
              name = os.path.basename(heart)
              problem = "problems/" + name
              cutoff = timeout["hearts/" + name]
              if mtime < cutoff and not os.path.isfile(problem):
                  fp = file(problem, "w")
                  fp.write("watchdog")
                  fp.close()
  22. Watchdog: check solutions

      import glob, os, subprocess, time

      def checkSolutions():
          now = time.time()
          problemTimeout = now - 30
          for problem in glob.glob("problems/*"):
              mtime = os.path.getmtime(problem)
              if mtime < problemTimeout:
                  subprocess.call(["restart-system"])
  23. Watchdog: Loop

      ## Watchdog
      while True:
          markProblems()
          checkSolutions()
          time.sleep(1)
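The marking step above can be exercised end to end. This self-contained Python 3 sketch (mine; the beats/ and problems/ directory layout follows the slides, and the fixed 30-second cutoff is an assumption in place of the per-service time-outs) flags a stale heartbeat while leaving a fresh one alone:

```python
# Watchdog "mark problems" step, demonstrated in a scratch directory.
import os
import tempfile
import time

def mark_problems(beats_dir, problems_dir, timeout=30):
    """Flag every heartbeat file that has not been touched recently."""
    cutoff = time.time() - timeout
    for name in os.listdir(beats_dir):
        heart = os.path.join(beats_dir, name)
        problem = os.path.join(problems_dir, name)
        if os.path.getmtime(heart) < cutoff and not os.path.isfile(problem):
            with open(problem, "w") as fp:
                fp.write("watchdog")

# Demonstration: one stale heartbeat, one fresh one.
root = tempfile.mkdtemp()
beats = os.path.join(root, "beats")
problems = os.path.join(root, "problems")
os.mkdir(beats)
os.mkdir(problems)
for name in ("stale", "fresh"):
    open(os.path.join(beats, name), "w").close()
# Backdate the stale heartbeat by 60 seconds.
old = time.time() - 60
os.utime(os.path.join(beats, "stale"), (old, old))
mark_problems(beats, problems)
flagged = sorted(os.listdir(problems))
```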
  24. Watchdog accuracy: custom checkers can manufacture problems.
  25. Watchdog reliability: use cron for the main loop.
  26. Watchdog reliability: use software/hardware watchdogs.
  27. Conclusions: everything crashes -- plan for it.
  28. Questions?
  29. Welcome to the back-up slides. Extra! Extra!
  30.-31. Example: Counting on Windows

      import os

      def update_counter():
          fp = file("counter.txt")
          s = fp.read()
          counter = int(s.strip())
          counter += 1
          # If there is a crash before this point,
          # no changes have been done.
          fp = file("counter.txt.tmp", "w")
          print >>fp, counter
          fp.close()
          # If there is a crash before this point,
          # only a temp file has been modified.
          os.remove("counter.txt")
          # At this point, the state is inconsistent*
          # The following is an atomic operation:
          os.rename("counter.txt.tmp", "counter.txt")
  32. Example: Counting on Windows (Recovery)

      def recover():
          if not os.path.exists("counter.txt"):
              # The permanent file has been removed;
              # therefore, the temp file is valid.
              os.rename("counter.txt.tmp", "counter.txt")
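The recovery routine can be checked by re-creating the crash window it is meant to repair. In this Python 3 sketch (mine), the master file has already been removed but the temp file has not yet been renamed, and recover() restores a valid state:

```python
# Re-create the mid-update crash state and recover from it.
import os
import tempfile

def recover(path="counter.txt"):
    # If the master is missing, the temp file holds the valid state.
    if not os.path.exists(path):
        os.rename(path + ".tmp", path)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "counter.txt")
# Simulate the crash window: only the temp file exists.
with open(path + ".tmp", "w") as fp:
    fp.write("42\n")
recover(path)
with open(path) as fp:
    value = int(fp.read().strip())
```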
  33.-34. Example: Counting with versions

      import os

      def update_counter():
          files = [int(name.split(".")[-1])
                   for name in os.listdir(".")
                   if name.startswith("counter.")]
          last = max(files)
          counter = int(file("counter.%s" % last).read().strip())
          counter += 1
          # If there is a crash before this point,
          # no changes have been done.
          fp = file("tmp.counter", "w")
          print >>fp, counter
          fp.close()
          # If there is a crash before this point,
          # only a temp file has been modified.
          os.rename("tmp.counter", "counter.%s" % (last + 1))
          os.remove("counter.%s" % last)
  35. Example: Counting with versions (cleanup)

      # This is not a recovery routine, but a cleanup routine.
      # Even in its absence, the state is consistent.
      def cleanup():
          files = [int(name.split(".")[-1])
                   for name in os.listdir(".")
                   if name.startswith("counter.")]
          files.sort()
          files.pop()
          for n in files:
              os.remove("counter.%d" % n)
          if os.path.exists("tmp.counter"):
              os.remove("tmp.counter")
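The versioned scheme can be exercised with a Python 3 sketch (mine; the counter.N naming is the slides', the starting value is arbitrary). The newest counter.N file always wins, so a leftover temp file never corrupts state:

```python
# Versioned counter: write tmp, rename to counter.N+1, then drop counter.N.
import os
import tempfile

def latest(d):
    levels = [int(name.split(".")[-1])
              for name in os.listdir(d) if name.startswith("counter.")]
    return max(levels)

def update_counter(d):
    last = latest(d)
    with open(os.path.join(d, "counter.%d" % last)) as fp:
        counter = int(fp.read().strip()) + 1
    tmp = os.path.join(d, "tmp.counter")
    with open(tmp, "w") as fp:
        fp.write("%d\n" % counter)
    # Atomically publish the new version, then retire the old one.
    os.rename(tmp, os.path.join(d, "counter.%d" % (last + 1)))
    os.remove(os.path.join(d, "counter.%d" % last))

d = tempfile.mkdtemp()
with open(os.path.join(d, "counter.0"), "w") as fp:
    fp.write("7\n")
update_counter(d)
update_counter(d)
with open(os.path.join(d, "counter.%d" % latest(d))) as fp:
    value = int(fp.read().strip())
```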
  36. Correct ordering

      import time

      def activate_due():
          scheduled = rs.smembers("scheduled")
          now = time.time()
          for el in scheduled:
              due = int(rs.get(el + ":due"))
              if now < due:
                  continue
              rs.sadd("activated", el)
              rs.delete(el + ":due")
              rs.srem("scheduled", el)
  37. Correct ordering (recovery)

      def recover():
          inconsistent = rs.sinter("activated", "scheduled")
          for el in inconsistent:
              rs.delete(el + ":due")  #*
              rs.srem("scheduled", el)
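The Redis calls can be modelled with plain Python sets and a dict. This stand-in (mine, not the slides'; no real Redis involved) shows why the recovery routine works: an element found in both "activated" and "scheduled" must have been caught mid-update, and redoing the remaining steps is safe:

```python
# In-memory model of the activate/recover ordering.
import time

scheduled, activated = {"a", "b"}, set()
due = {"a": 0, "b": 10**12}  # "a" is overdue, "b" is far in the future

def activate_due():
    for el in list(scheduled):
        if time.time() < due[el]:
            continue
        activated.add(el)       # 1. mark active
        del due[el]             # 2. drop the due-time
        scheduled.discard(el)   # 3. unschedule -- safe to redo after a crash

def recover():
    for el in activated & scheduled:  # caught between steps 1 and 3
        due.pop(el, None)
        scheduled.discard(el)

activate_due()
# Simulate a crash between steps 1 and 3 for element "c":
scheduled.add("c")
activated.add("c")
due["c"] = 0
recover()
```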
  38.-39. Example: Key/value stores

      0.log:
          ["add", "key-0", "value-0"]
          ["add", "key-1", "value-1"]
          ["add", "key-0", "value-2"]
          ["remove", "key-1"]
          ...
      1.log:
          ...
      2.log:
          ...
  40. Example: Key/value stores (utility functions)

      ## Get the level of a file
      def getLevel(s):
          return int(s.split(".")[0])

      ## Get all files of a given type
      # `files` is the directory listing, set by the caller
      def getType(tp):
          return [(getLevel(s), s) for s in files
                  if s.endswith(tp)]
  41. Example: Key/value stores (classifying files)

      ## Get all relevant files
      def relevant(d):
          global files  # share the listing with getType
          files = os.listdir(d)
          mlevel, master = max(getType(".master"))
          logs = getType(".log")
          logs.sort()
          return [master] + [log for llevel, log in logs
                             if llevel > mlevel]
  42.-43. Example: Key/value stores (reading)

      import json

      ## Read in a single file
      def update(result, fp):
          for line in fp:
              val = json.loads(line)
              if val[0] == "add":
                  result[val[1]] = val[2]
              else:
                  del result[val[1]]

      ## Read in several files
      def read(files):
          result = dict()
          for fname in files:
              try:
                  update(result, file(fname))
              except ValueError:
                  pass
          return result
  44.-45. Example: Key/value stores (writer class)

      import json

      class Writer(object):
          def __init__(self, level):
              self.level = level
              self.fp = None
              self._next()
          def _next(self):
              self.level += 1
              if self.fp:
                  self.fp.close()
              name = "%3d.log" % self.level
              self.fp = file(name, "w")
              self.rows = 0
          def write(self, value):
              print >>self.fp, json.dumps(value)
              self.fp.flush()
              self.rows += 1
              if self.rows > 200:
                  self._next()
  46. Example: Key/value stores (storage class)

      ## The actual data store abstraction.
      class Store(object):
          def __init__(self):
              files = relevant(d)
              self.result = read(files)
              level = getLevel(files[-1])
              self.writer = Writer(level)
          def get(self, key):
              return self.result[key]
          def add(self, key, value):
              self.writer.write(["add", key, value])
              # keep the in-memory view in step with the log
              self.result[key] = value
          def remove(self, key):
              self.writer.write(["remove", key])
              del self.result[key]
  47. Example: Key/value stores (compression code)

      ## This should be run periodically
      ## from a different thread.
      def compress(d):
          files = relevant(d)[:-1]
          if len(files) < 2:
              return
          result = read(files)
          master = getLevel(files[-1]) + 1
          fp = file("%3d.master.tmp" % master, "w")
          for key, value in result.iteritems():
              towrite = ["add", key, value]
              print >>fp, json.dumps(towrite)
          fp.close()
          # Publish the new master atomically:
          os.rename("%3d.master.tmp" % master, "%3d.master" % master)
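The invariant behind compaction is worth stating: replaying the compacted master must give the same state as replaying the original logs. This pure-function model (mine; it drops the file handling to keep the property visible) folds a log into one "add" per surviving key:

```python
# Compaction invariant: replay(compress(log)) == replay(log).

def replay(entries):
    result = {}
    for op in entries:
        if op[0] == "add":
            result[op[1]] = op[2]
        else:
            del result[op[1]]
    return result

def compress(entries):
    # Fold the log into one "add" per surviving key.
    return [["add", k, v] for k, v in sorted(replay(entries).items())]

log = [["add", "k0", "v0"], ["add", "k1", "v1"],
       ["add", "k0", "v2"], ["remove", "k1"]]
compacted = compress(log)
```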
  48. Vertical splitting: Example

      import os, socket

      def forking_server():
          s = socket.socket()
          s.bind(("", 8080))
          s.listen(5)
          while True:
              client, address = s.accept()
              newpid = os.fork()
              if not newpid:  # in the child process
                  f = client.makefile()
                  f.write("Sunday, May 22, 1983 "
                          "18:45:59-PST")
                  f.close()
                  os._exit(0)
  49. Horizontal splitting: front-end

      ## Process one
      from twisted.internet import reactor
      from twisted.python import filepath
      from twisted.web import resource, server

      class SchedulerResource(resource.Resource):
          isLeaf = True
          def __init__(self, filepath):
              resource.Resource.__init__(self)
              self.filepath = filepath
          def render_PUT(self, request):
              uuid, = request.postpath
              content = request.content.read()
              child = self.filepath.child(uuid)
              child.setContent(content)
              return ""

      fp = filepath.FilePath("things")
      r = SchedulerResource(fp)
      s = server.Site(r)
      reactor.listenTCP(8080, s)
  50. Horizontal splitting: scheduler

      ## Process two
      import os, time, redis

      rs = redis.Redis(host="localhost", port=6379, db=9)
      while True:
          for uuid in os.listdir("things"):
              path = os.path.join("things", uuid)
              when = int(file(path).read().strip())
              rs.set(uuid + ":due", when)
              rs.sadd("scheduled", uuid)
              os.remove(path)
          time.sleep(1)
  51. Horizontal splitting: runner

      ## Process three
      import time, redis

      rs = redis.Redis(host="localhost", port=6379, db=9)
      recover()
      while True:
          activate_due()
          time.sleep(1)
  52. Horizontal splitting: message queues: no direct dependencies.
  53.-54. Horizontal splitting: message queues: sender

      ## Process four
      import pika, redis

      rs = redis.Redis(host="localhost", port=6379, db=9)
      params = pika.ConnectionParameters("localhost")
      conn = pika.BlockingConnection(params)
      channel = conn.channel()
      channel.queue_declare(queue="active")
      while True:
          activated = rs.smembers("activated")
          finished = set(rs.smembers("finished"))
          for el in activated:
              if el in finished:
                  continue
              channel.basic_publish(exchange="",
                                    routing_key="active",
                                    body=el)
              rs.sadd("finished", el)
  55. Horizontal splitting: message queues: receiver

      ## Process five
      # It is possible to get "dups" of bodies.
      # Application logic should deal with that.
      import pika, syslog

      params = pika.ConnectionParameters("localhost")
      conn = pika.BlockingConnection(params)
      channel = conn.channel()
      channel.queue_declare(queue="active")
      def callback(ch, method, properties, el):
          syslog.syslog("Activated %s" % el)
      channel.basic_consume(callback, queue="active", no_ack=True)
      channel.start_consuming()
  56. Horizontal splitting: point-to-point: use HTTP (preferably REST).
