SlideShare a Scribd company logo
1 of 47
Learning Webops
  the Hard Way

What could possibly go wrong?

                    Cosimo Streppone
         WebOps Lead – Opera Software
The Hard Way ?
failure
teams organization

     mail




  webops    sysadmin
1
a cascade of errors
#
# Cyrus IMAPD annotation definitions file
#
/vendor/messagingengine.com/
preview,message,string,backend,value.shared,
misplaced comma                           +
 fix didn't make it to master             +
  unintended general rollout              +
    parser choked on comma                +
      fork with no rate limiting          +
        fatal() dumped core               +
         kernel.core_uses_pid = 1         +
           small SSD metadata partition   +
             indexes corruption           =
                massive outage (no data loss)
DO
Rate limit fork of children

Test disk full conditions

Master your infrastructure
DO NOT
Underestimate Mighty Comma

Rollout everywhere at once

Leave your CI builds messy
read more
“A cascade of errors”
http://blog.fastmail.fm/2011/05/15/outage-
report-a-cascade-of-errors/
2
magic numbers
physical bladecenters?
  LVS?                               network?
              kernel?
                                      solar storms?
  WTF?!?
           random failures in our
defective cpus?
                infrastructure
                                       DDoS?
                  Mayas?
     bnx2?                     traffic?
           recent deploys?
what we experienced

random performance degradation
general instability
steady increase of WTFs/min!
real problem
●
  2.6.32 = debian squeeze kernel
● sched – find_busiest_group()

● TSC register wraparound
Proof
            64
        2
                             = 208,49
 10                      9
2     · 86400 · 10
Subject:    [PATCH] sched: avoid unnecessary overflow in sched_clock
From:       Salman Qazi <sqazi@google.com>
Date:       2011-11-16 20:55:31

In hundreds of days, the __cycles_2_ns calculation in sched_clock
has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
the final value to become zero. We can solve this without losing
any precision.

We can decompose TSC into quotient and remainder of division by the
scale factor, and then use this to convert TSC into nanoseconds.

Reviewed-by: Paul Turner <pjt@google.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Salman Qazi <sqazi@google.com>
---
 arch/x86/include/asm/timer.h |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

                                   Patch #1, Nov 16th 2011
diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index fa7b917..431793e 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -32,6 +32,22 @@ extern int no_timer_check;
  * (mathieu.desnoyers@polymtl.ca)
  *
  *        -johnstul@us.ibm.com "math is hard, lets go shopping!"
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -608,6 +608,8 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)
 {
    unsigned long long tsc_now, ns_now, *offset;
    unsigned long flags, *scale;
+ unsigned long long quot;
+ unsigned long long rem;             Patch #2, Mar 8th 2012
    local_irq_save(flags);
    sched_clock_idle_sleep_event();
@@ -620,7 +622,15 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)

    if (cpu_khz) {
        *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
-       *offset = ns_now - (tsc_now * *scale >> CYC2NS_SCALE_FACTOR);
+
+       /*
+        * Avoid premature overflow by splitting into quotient
+        * and remainder. See the comment above __cycles_2_ns
+        */
+       quot = (tsc_now >> CYC2NS_SCALE_FACTOR);
+       rem = tsc_now & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);
+       *offset = ns_now - (quot * *scale +
+                  ((rem * *scale) >> CYC2NS_SCALE_FACTOR));
    }
32
     2
               = 49,7
           3
86400 · 10
sven's explanation video
DO
 Be perseverant and creative :)

 Learn more about your kernel

 Improve tools to collect data
DO NOT
 Run servers continuously
 for more than 208 days?
3
#Leapocalypse
23:59:60
t - 4y 2m
From: Roman Zippel <zippel@linux-m68k.org>
Date: Thu, 1 May 2008 04:34:41 -0700
Subject: [PATCH] ntp: handle leap second via timer

Remove the leap second handling from second_overflow(), which doesn't have to
check for it every second anymore. With CONFIG_NO_HZ this also makes sure the
leap second is handled close to the full second. Additionally this makes it
possible to abort a leap second properly by resetting the STA_INS/STA_DEL status bits.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/clocksource.h |    2 +
 include/linux/timex.h       |    1 +
 kernel/time/ntp.c           | 133 +++++++++++++++++++++++++++++--------------
 kernel/time/timekeeping.c   |    4 +-
t – 9m
lie(t) = (1 – cos(πt / w)) / 2
lie(s)




                                   t
T - 6m




http://bit.ly/NmA47E

http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp
T – 1 month
package {
    ntpdate: ensure => installed;
    adjtimex: ensure => installed;
}

file { "/usr/local/bin/leap-adjust.pl":
    ensure => present,
    source => "puppet:///modules/ntp/leap-adjust.pl",
}

file { "/etc/cron.d/ntp-leap-second":
    ensure => present,
    source => "puppet:///modules/ntp/leap-crontab",
    require => [ Package["ntp"], Package["adjtimex"] ],
}
T - 2d


June
29th
T – 1 day
  June 30th 2012



chaos begins
T - 8h




http://bit.ly/PSBMRP

http://serverfault.com/questions/403732/leapocalypse
the work around

 # date -s now
T + {1,2}m
 {August,September} 1st, 2012



fake leap seconds
read more
A story of leaping seconds
   http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/


Tips and tricks to deal with leap seconds
   http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp



Serverfault question on random debian crashes
   http://serverfault.com/questions/403732/leapocalypse

Wired article about leap second problems
    http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-
wreaks-havoc-with-java-linux/
DO
Keep your kernel updated

Use valuable external resources
(serverfault etc...)
DO NOT
Underestimate the
importance of time
¿questions?
failure lessons learned



             }
    expect
   assume
   prepare
  simulate       failure
  measure
  embrace
ops lessons learned
Don't repeat yourself (DRY)
Always keep it simple (KISS)
Separate ops team doesn't work well
Practice Continuous deployment. Now.
Communication makes the difference
Learn your tools
Master your infrastructure
RTFM
...
Thanks!

@cstrep
cosimo@opera.com

More Related Content

What's hot

NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance AnalysisBrendan Gregg
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureAnne Nicolas
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFBrendan Gregg
 
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterLISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterIvan Babrou
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 InstancesBrendan Gregg
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesIO Visor Project
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsBrendan Gregg
 
DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2Outlyer
 

What's hot (20)

NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and future
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 
Linux System Troubleshooting
Linux System TroubleshootingLinux System Troubleshooting
Linux System Troubleshooting
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterLISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
The Internet
The InternetThe Internet
The Internet
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challenges
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance Tools
 
ZFSperftools2012
ZFSperftools2012ZFSperftools2012
ZFSperftools2012
 
DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2
 

Similar to Velocity 2012 - Learning WebOps the Hard Way

Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTanel Poder
 
Kernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisKernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisAnne Nicolas
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory AnalysisMoabi.com
 
1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera cluster1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera clusterOlinData
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPFIvan Babrou
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Praguetomasbart
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...Amazon Web Services
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...Amazon Web Services
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Keisuke Takahashi
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemCyber Security Alliance
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Ontico
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging RubyAman Gupta
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby SystemsEngine Yard
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Vincent Batts
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesPeter Hlavaty
 

Similar to Velocity 2012 - Learning WebOps the Hard Way (20)

Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contention
 
Kernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisKernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysis
 
Sge
SgeSge
Sge
 
Hacking the swisscom modem
Hacking the swisscom modemHacking the swisscom modem
Hacking the swisscom modem
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis
 
1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera cluster1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera cluster
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPF
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Prague
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Rac 12c optimization
Rac 12c optimizationRac 12c optimization
Rac 12c optimization
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande Modem
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging Ruby
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby Systems
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]
 
Mysql Latency
Mysql LatencyMysql Latency
Mysql Latency
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
 

More from Cosimo Streppone

How we use and deploy Varnish at Opera
How we use and deploy Varnish at OperaHow we use and deploy Varnish at Opera
How we use and deploy Varnish at OperaCosimo Streppone
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Cosimo Streppone
 
VUG5: Varnish at Opera Software
VUG5: Varnish at Opera SoftwareVUG5: Varnish at Opera Software
VUG5: Varnish at Opera SoftwareCosimo Streppone
 
Velocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackVelocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackCosimo Streppone
 
Mojolicious: what works and what doesn't
Mojolicious: what works and what doesn'tMojolicious: what works and what doesn't
Mojolicious: what works and what doesn'tCosimo Streppone
 
Surge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.comSurge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.comCosimo Streppone
 
My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009Cosimo Streppone
 
YAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses PerlYAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses PerlCosimo Streppone
 
NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0Cosimo Streppone
 
IPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalabilityIPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalabilityCosimo Streppone
 

More from Cosimo Streppone (11)

How we use and deploy Varnish at Opera
How we use and deploy Varnish at OperaHow we use and deploy Varnish at Opera
How we use and deploy Varnish at Opera
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013
 
Italian, do you speak it?
Italian, do you speak it?Italian, do you speak it?
Italian, do you speak it?
 
VUG5: Varnish at Opera Software
VUG5: Varnish at Opera SoftwareVUG5: Varnish at Opera Software
VUG5: Varnish at Opera Software
 
Velocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackVelocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attack
 
Mojolicious: what works and what doesn't
Mojolicious: what works and what doesn'tMojolicious: what works and what doesn't
Mojolicious: what works and what doesn't
 
Surge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.comSurge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.com
 
My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009
 
YAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses PerlYAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses Perl
 
NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0
 
IPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalabilityIPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalability
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Velocity 2012 - Learning WebOps the Hard Way

  • 1. Learning Webops the Hard Way What could possibly go wrong? Cosimo Streppone WebOps Lead – Opera Software
  • 2.
  • 3.
  • 6. teams organization mail webops sysadmin
  • 7. 1 a cascade of errors
  • 8. # # Cyrus IMAPD annotation definitions file # /vendor/messagingengine.com/ preview,message,string,backend,value.shared,
  • 9. misplaced comma + fix didn't make it to master + unintended general rollout + parser choked on comma + fork with no rate limiting + fatal() dumped core + kernel.core_uses_pid = 1 + small SSD metadata partition + indexes corruption = massive outage (no data loss)
  • 10.
  • 11. DO Rate limit fork of children Test disk full conditions Master your infrastructure
  • 12. DO NOT Underestimate Mighty Comma Rollout everywhere at once Leave your CI builds messy
  • 13.
  • 14. read more “A cascade of errors” http://blog.fastmail.fm/2011/05/15/outage- report-a-cascade-of-errors/
  • 16. physical bladecenters? LVS? network? kernel? solar storms? WTF?!? random failures in our defective cpus? infrastructure DDoS? Mayas? bnx2? traffic? recent deploys?
  • 17. what we experienced random performance degradation general instability steady increase of WTFs/min!
  • 18. real problem ● 2.6.32 = debian squeeze kernel ● sched – find_busiest_group() ● TSC register wraparound
  • 19. Proof 64 2 = 208,49 10 9 2 · 86400 · 10
  • 20. Subject: [PATCH] sched: avoid unnecessary overflow in sched_clock From: Salman Qazi <sqazi@google.com> Date: 2011-11-16 20:55:31 In hundreds of days, the __cycles_2_ns calculation in sched_clock has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing the final value to become zero. We can solve this without losing any precision. We can decompose TSC into quotient and remainder of division by the scale factor, and then use this to convert TSC into nanoseconds. Reviewed-by: Paul Turner <pjt@google.com> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Salman Qazi <sqazi@google.com> --- arch/x86/include/asm/timer.h | 23 ++++++++++++++++++++++- 1 files changed, 22 insertions(+), 1 deletions(-) Patch #1, Nov 16th 2011 diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h index fa7b917..431793e 100644 --- a/arch/x86/include/asm/timer.h +++ b/arch/x86/include/asm/timer.h @@ -32,6 +32,22 @@ extern int no_timer_check; * (mathieu.desnoyers@polymtl.ca) * * -johnstul@us.ibm.com "math is hard, lets go shopping!"
  • 21. --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -608,6 +608,8 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...) { unsigned long long tsc_now, ns_now, *offset; unsigned long flags, *scale; + unsigned long long quot; + unsigned long long rem; Patch #2, Mar 8th 2012 local_irq_save(flags); sched_clock_idle_sleep_event(); @@ -620,7 +622,15 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...) if (cpu_khz) { *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz; - *offset = ns_now - (tsc_now * *scale >> CYC2NS_SCALE_FACTOR); + + /* + * Avoid premature overflow by splitting into quotient + * and remainder. See the comment above __cycles_2_ns + */ + quot = (tsc_now >> CYC2NS_SCALE_FACTOR); + rem = tsc_now & ((1ULL << CYC2NS_SCALE_FACTOR) - 1); + *offset = ns_now - (quot * *scale + + ((rem * *scale) >> CYC2NS_SCALE_FACTOR)); }
  • 22. 32 2 = 49,7 3 86400 · 10
  • 24. DO Be perseverant and creative :) Learn more about your kernel Improve tools to collect data
  • 25. DO NOT Run servers continuously for more than 208 days?
  • 27.
  • 29. t - 4y 2m From: Roman Zippel <zippel@linux-m68k.org> Date: Thu, 1 May 2008 04:34:41 -0700 Subject: [PATCH] ntp: handle leap second via timer Remove the leap second handling from second_overflow(), which doesn't have to check for it every second anymore. With CONFIG_NO_HZ this also makes sure the leap second is handled close to the full second. Additionally this makes it possible to abort a leap second properly by resetting the STA_INS/STA_DEL status bits. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/linux/clocksource.h | 2 + include/linux/timex.h | 1 + kernel/time/ntp.c | 133 +++++++++++++++++++++++++++++-------------- kernel/time/timekeeping.c | 4 +-
  • 31. lie(t) = (1 – cos(πt / w)) / 2 lie(s) t
  • 33. T – 1 month package { ntpdate: ensure => installed; adjtimex: ensure => installed; } file { "/usr/local/bin/leap-adjust.pl": ensure => present, source => "puppet:///modules/ntp/leap-adjust.pl", } file { "/etc/cron.d/ntp-leap-second": ensure => present, source => "puppet:///modules/ntp/leap-crontab", require => [ Package["ntp"], Package["adjtimex"] ], }
  • 35. T – 1 day June 30th 2012 chaos begins
  • 37.
  • 38.
  • 39. the work around # date -s now
  • 40. T + {1,2}m {August,September} 1st, 2012 fake leap seconds
  • 41. read more A story of leaping seconds http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/ Tips and tricks to deal with leap seconds http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp Serverfault question on random debian crashes http://serverfault.com/questions/403732/leapocalypse Wired article about leap second problems http://www.wired.com/wiredenterprise/2012/07/leap-second-bug- wreaks-havoc-with-java-linux/
  • 42. DO Keep your kernel updated Use valuable external resources (serverfault etc...)
  • 45. failure lessons learned } expect assume prepare simulate failure measure embrace
  • 46. ops lessons learned Don't repeat yourself (DRY) Always keep it simple (KISS) Separate ops team doesn't work well Practice Continuous deployment. Now. Communication makes the difference Learn your tools Master your infrastructure RTFM ...