Velocity 2012 - Learning WebOps the Hard Way

Working in Web Operations means dealing with production systems that, in most cases, need to be operational 24×7×365.

To reach 99.99999% uptime, you must fail as little as possible.
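
To put that figure in perspective, here is a small arithmetic sketch of the downtime budget an availability target leaves per year. The function name and the 365-day year are assumptions made for illustration, not something taken from the talk.

```python
# Downtime allowed per year by an availability target (illustrative sketch).
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year that still meet the availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.99999):
        print(f"{target}% -> {downtime_budget_seconds(target):.2f} s/year")
```

At 99.99999% the budget is roughly 3.15 seconds per year, so a single reboot already exceeds it.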

This talk will go through a few real-world incidents and failures experienced by our small WebOps team, and outline what we are learning (the hard way), and how we’re trying to improve.

What could possibly go wrong? :-)

Published in: Technology

Velocity 2012 - Learning WebOps the Hard Way

  1. 1. Learning WebOps the Hard Way: What could possibly go wrong? Cosimo Streppone, WebOps Lead – Opera Software
  2. 2. The Hard Way ?
  3. 3. failure
  4. 4. teams organization mail webops sysadmin
  5. 5. 1: a cascade of errors
  6. 6. # Cyrus IMAPD annotation definitions file
      /vendor/messagingengine.com/preview,message,string,backend,value.shared,
  7. 7. misplaced comma + fix didn't make it to master + unintended general rollout + parser choked on comma + fork with no rate limiting + fatal() dumped core + kernel.core_uses_pid = 1 + small SSD metadata partition + index corruption = massive outage (no data loss)
  8. 8. DO: Rate limit fork of children. Test disk full conditions. Master your infrastructure.
  9. 9. DO NOT: Underestimate the Mighty Comma. Roll out everywhere at once. Leave your CI builds messy.
  10. 10. read more: “A cascade of errors” http://blog.fastmail.fm/2011/05/15/outage-report-a-cascade-of-errors/
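
The "rate limit fork of children" advice from this first incident can be sketched as a sliding-window limiter placed in front of whatever call spawns a child. This is a minimal illustration under stated assumptions: `SpawnRateLimiter`, `supervise`, and their parameters are hypothetical names, and nothing here reflects how Cyrus IMAPD itself was fixed.

```python
import time
from collections import deque


class SpawnRateLimiter:
    """Allow at most `max_spawns` child spawns per `window` seconds."""

    def __init__(self, max_spawns, window, clock=time.monotonic):
        self.max_spawns = max_spawns
        self.window = window
        self.clock = clock              # injectable so the limiter can be tested
        self._spawn_times = deque()     # timestamps of recent spawns

    def try_acquire(self):
        """Return True and record a spawn if the budget allows one right now."""
        now = self.clock()
        # Forget spawns that have fallen out of the sliding window.
        while self._spawn_times and now - self._spawn_times[0] >= self.window:
            self._spawn_times.popleft()
        if len(self._spawn_times) >= self.max_spawns:
            return False
        self._spawn_times.append(now)
        return True


def supervise(spawn_child, limiter, attempts, sleep=time.sleep, backoff=1.0):
    """Call `spawn_child` `attempts` times, waiting whenever the limiter refuses."""
    for _ in range(attempts):
        while not limiter.try_acquire():
            sleep(backoff)  # back off instead of forking in a tight loop
        spawn_child()
```

With a limiter like this, a child that dies immediately on startup can only be respawned a bounded number of times per window, which also bounds how fast core dumps can fill a small partition.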
  11. 11. 2: magic numbers
  12. 12. random failures in our infrastructure: physical bladecenters? LVS? network? kernel? solar storms? defective cpus? DDoS? Mayas? bnx2? traffic? recent deploys? WTF?!?
  13. 13. what we experienced: random performance degradation, general instability, steady increase of WTFs/min!
  14. 14. real problem● 2.6.32 = debian squeeze kernel● sched – find_busiest_group()● TSC register wraparound
  15. 15. Proof: $\frac{2^{64}}{2^{10} \cdot 86400 \cdot 10^{9}} \approx 208.49$ (days)
  16. 16. Patch #1, Nov 16th 2011
      Subject: [PATCH] sched: avoid unnecessary overflow in sched_clock
      From: Salman Qazi <sqazi@google.com>
      Date: 2011-11-16 20:55:31

      In hundreds of days, the __cycles_2_ns calculation in sched_clock
      has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
      the final value to become zero. We can solve this without losing
      any precision.

      We can decompose TSC into quotient and remainder of division by the
      scale factor, and then use this to convert TSC into nanoseconds.

      Reviewed-by: Paul Turner <pjt@google.com>
      Acked-by: John Stultz <johnstul@us.ibm.com>
      Signed-off-by: Salman Qazi <sqazi@google.com>
      ---
       arch/x86/include/asm/timer.h | 23 ++++++++++++++++++++++-
       1 files changed, 22 insertions(+), 1 deletions(-)

      diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
      index fa7b917..431793e 100644
      --- a/arch/x86/include/asm/timer.h
      +++ b/arch/x86/include/asm/timer.h
      @@ -32,6 +32,22 @@ extern int no_timer_check;
        * (mathieu.desnoyers@polymtl.ca)
        *
        * -johnstul@us.ibm.com "math is hard, lets go shopping!"
  17. 17. Patch #2, Mar 8th 2012
      --- a/arch/x86/kernel/tsc.c
      +++ b/arch/x86/kernel/tsc.c
      @@ -608,6 +608,8 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)
       {
           unsigned long long tsc_now, ns_now, *offset;
           unsigned long flags, *scale;
      +    unsigned long long quot;
      +    unsigned long long rem;

           local_irq_save(flags);
           sched_clock_idle_sleep_event();
      @@ -620,7 +622,15 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)
           if (cpu_khz) {
               *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
      -        *offset = ns_now - (tsc_now * *scale >> CYC2NS_SCALE_FACTOR);
      +
      +        /*
      +         * Avoid premature overflow by splitting into quotient
      +         * and remainder. See the comment above __cycles_2_ns
      +         */
      +        quot = (tsc_now >> CYC2NS_SCALE_FACTOR);
      +        rem = tsc_now & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);
      +        *offset = ns_now - (quot * *scale +
      +                ((rem * *scale) >> CYC2NS_SCALE_FACTOR));
           }
  18. 18. $\frac{2^{32}}{86400 \cdot 10^{3}} \approx 49.7$ (days)
  19. 19. Sven's explanation video
  20. 20. DO: Be perseverant and creative :) Learn more about your kernel. Improve tools to collect data.
  21. 21. DO NOT: Run servers continuously for more than 208 days?
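
The two "magic numbers" in this section can be checked with a few lines of arithmetic. The sketch below emulates 64-bit unsigned wraparound with a mask and compares the pre-patch multiplication against the quotient/remainder split from the patches. The 2 GHz clock and all function names are assumptions for illustration, and reading the 49.7-day figure as a 32-bit millisecond counter is an interpretation of the slide, not something it states.

```python
MASK64 = (1 << 64) - 1            # emulate unsigned 64-bit arithmetic
CYC2NS_SCALE_FACTOR = 10          # 2^10 fixed-point scale, as in the patch context
NSEC_PER_MSEC = 1_000_000
SECONDS_PER_DAY = 86_400


def days_until_overflow() -> float:
    """Uptime (days) after which ns * 2^10 no longer fits in 64 bits."""
    max_ns = (1 << 64) >> CYC2NS_SCALE_FACTOR
    return max_ns / (SECONDS_PER_DAY * 10**9)


def days_until_ms_counter_wraps() -> float:
    """Uptime (days) after which a 32-bit millisecond counter wraps."""
    return 2**32 / (SECONDS_PER_DAY * 10**3)


def cycles_to_ns_naive(cyc: int, scale: int) -> int:
    """Pre-patch arithmetic: the 64-bit product wraps before the shift."""
    return ((cyc * scale) & MASK64) >> CYC2NS_SCALE_FACTOR


def cycles_to_ns_split(cyc: int, scale: int) -> int:
    """Patched arithmetic: split cyc into quotient and remainder first."""
    quot = cyc >> CYC2NS_SCALE_FACTOR
    rem = cyc & ((1 << CYC2NS_SCALE_FACTOR) - 1)
    return (quot * scale + ((rem * scale) >> CYC2NS_SCALE_FACTOR)) & MASK64


if __name__ == "__main__":
    cpu_khz = 2_000_000                                   # hypothetical 2 GHz CPU
    scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) // cpu_khz
    for days in (208, 209):
        cyc = days * SECONDS_PER_DAY * cpu_khz * 1000     # TSC ticks since boot
        agree = cycles_to_ns_naive(cyc, scale) == cycles_to_ns_split(cyc, scale)
        print(days, "days: naive and split agree?", agree)
```

Under these assumptions the two conversions agree at 208 days of uptime and diverge at 209, which is the boundary the "Proof" slide computes.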
  22. 22. 3: #Leapocalypse
  23. 23. 23:59:60
  24. 24. T - 4y 2m
      From: Roman Zippel <zippel@linux-m68k.org>
      Date: Thu, 1 May 2008 04:34:41 -0700
      Subject: [PATCH] ntp: handle leap second via timer

      Remove the leap second handling from second_overflow(), which doesn't have to
      check for it every second anymore. With CONFIG_NO_HZ this also makes sure the
      leap second is handled close to the full second. Additionally this makes it
      possible to abort a leap second properly by resetting the STA_INS/STA_DEL
      status bits.

      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ---
       include/linux/clocksource.h |   2 +
       include/linux/timex.h       |   1 +
       kernel/time/ntp.c           | 133 +++++++++++++++++++++++++++++--------------
       kernel/time/timekeeping.c   |   4 +-
  25. 25. T - 9m
  26. 26. $\mathrm{lie}(t) = \frac{1 - \cos(\pi t / w)}{2}$ (plot of lie, in seconds, against t)
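
The formula on this slide can be evaluated directly. The slide does not define its symbols, so the sketch below assumes that `t` is the time elapsed inside a smear window of length `w` and that the result is the fraction of the leap second already absorbed, in the style of a cosine "leap smear"; the function signature and the range check are illustrative additions.

```python
import math


def lie(t: float, w: float) -> float:
    """Fraction of the extra second absorbed at time t in a window of length w.

    Assumption: t and w use the same unit and 0 <= t <= w.
    """
    if not 0 <= t <= w:
        raise ValueError("t must lie within the smear window [0, w]")
    return (1 - math.cos(math.pi * t / w)) / 2


if __name__ == "__main__":
    window = 20 * 3600  # hypothetical 20-hour smear window, in seconds
    for hours in (0, 5, 10, 15, 20):
        print(f"t = {hours:2d} h  lie = {lie(hours * 3600, window):.4f}")
```

The curve starts and ends flat, so the clock's rate changes smoothly at both edges of the window instead of jumping.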
  27. 27. T - 6m: http://bit.ly/NmA47E http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp
  28. 28. T - 1 month
      package {
          ntpdate: ensure => installed;
          adjtimex: ensure => installed;
      }
      file { "/usr/local/bin/leap-adjust.pl":
          ensure => present,
          source => "puppet:///modules/ntp/leap-adjust.pl",
      }
      file { "/etc/cron.d/ntp-leap-second":
          ensure  => present,
          source  => "puppet:///modules/ntp/leap-crontab",
          require => [ Package["ntp"], Package["adjtimex"] ],
      }
  29. 29. T - 2d: June 29th
  30. 30. T - 1 day: June 30th 2012, chaos begins
  31. 31. T - 8h: http://bit.ly/PSBMRP http://serverfault.com/questions/403732/leapocalypse
  32. 32. the workaround: # date -s now
  33. 33. T + {1,2}m: {August,September} 1st, 2012, fake leap seconds
  34. 34. read more
      A story of leaping seconds: http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/
      Tips and tricks to deal with leap seconds: http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp
      Serverfault question on random debian crashes: http://serverfault.com/questions/403732/leapocalypse
      Wired article about leap second problems: http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-wreaks-havoc-with-java-linux/
  35. 35. DO: Keep your kernel updated. Use valuable external resources (serverfault etc...).
  36. 36. DO NOT: Underestimate the importance of time.
  37. 37. ¿questions?
  38. 38. failure lessons learned: expect, assume, prepare, simulate, measure, embrace failure
  39. 39. ops lessons learned: Don't repeat yourself (DRY). Always keep it simple (KISS). Separate ops team doesn't work well. Practice Continuous deployment. Now. Communication makes the difference. Learn your tools. Master your infrastructure. RTFM...
  40. 40. Thanks! @cstrep cosimo@opera.com
