Operational Efficiency Hacks Web20 Expo2009

John Allspaw
John AllspawEngineer, Researcher, Founder at Adaptive Capacity Labs, LLC
Operational
Efficiency
Hacks
                          John Allspaw
          Operations Engineering, Flickr
who am I?
Manage the Flickr Operations group
Wrote a geeky book:
“Efficiencies”
“Efficiencies”
   Doing more with the robots you’ve got
“Efficiencies”
   Doing more with the robots you’ve got
   Doing more with the humans you’ve got
Some optimization
    “rules”
Some optimization
         “rules”

-   Don’t rely on being able to tweak anything.
Some optimization
         “rules”

-   Don’t rely on being able to tweak anything.
-   Don’t waste too much time tuning when
    you have no evidence it’ll matter.
Optimization “rules”
performance tuning gains




                                 time spent tuning
Optimization “rules”




“We should forget about small efficiencies,
say about 97% of the time: premature
optimization is the root of all evil.”
                         Knuth, (or Hoare)
however...
Optimization “rules”
Optimization “rules”

That doesn’t give us an excuse to be lazy and
inefficient.
Optimization “rules”

That doesn’t give us an excuse to be lazy and
inefficient.
Optimization “rules”

That doesn’t give us an excuse to be lazy and
inefficient.


We lean on the experience of people in the
community for evidence that tuning(s) might
be a worthwhile thing to do.
Optimization “rules”

“Yet we should not pass up our
opportunities in that critical 3 percent.”
                       Knuth, (or Hoare)
So...
          stop somewhere
               in here




                           OMG
obvious
                           I'm wasting
tuning
                           !@#$ time
wins
                           for no reason
Our Context
Our Context

-   24 TB of MySQL data
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
-   6 PB of photos
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
-   6 PB of photos
-   10TB storage eaten per day
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
-   6 PB of photos
-   10TB storage eaten per day
-   15,362 service monitors (alerts)
Infrastructure Hacks
Infrastructure Hacks

-   Examples of what changing software can do
          (plain old-fashioned performance tuning)
Infrastructure Hacks

-   Examples of what changing software can do
          (plain old-fashioned performance tuning)
-   Examples of what changing hardware can do
                              (yay for Mr. Moore!)
Leaning on compilers
(synthetic PHP benchmarks, not real-world)




(http://sebastian-bergmann.de/archives/634-PHP-GCC-
                  ICC-Benchmark.html)
PHP (real-world)




php 4.4.8 to php 5.2.8 migration
Can now handle more with less




same
taste,
 less
filling
Image Processing
Image Processing


-   2004, Flickr was using ImageMagick for
    image processing (version 6.1.9)
Image Processing


-   2004, Flickr was using ImageMagick for
    image processing (version 6.1.9)
-   Changed to GraphicsMagick, about 15%
    faster at the time (version 1.1.5)
Image Processing


-   2004, Flickr was using ImageMagick for
    image processing (version 6.1.9)
-   Changed to GraphicsMagick, about 15%
    faster at the time (version 1.1.5)
-   Only need a subset of ImageMagick features
    anyway for our purposes
Image Processing

-   OpenMP support
    (http://en.wikipedia.org/wiki/Openmp)
    - Allows parallelization of processing jobs,
       using multiple cores working on the same
       image
    - Some algorithms have more parallelization
       than others
Image Processing
-   Test script
     -   7 large-ish DSLR photos
     -   Cascade resizing each to 6 smaller sizes,
         semi-typical for Flickr’s workload
     -   Each resize processed serially
Image Processing
   compiler differences




(GM version 1.1.14, non-OpenMP)
Image Processing
          OpenMP differences


                    OpenMP
                    advantage




(gcc 4.1.2, on quad core Xeon L5335 @ 2.00GHz)
Image Processing
   CPU differences
Diagonal Scaling
Diagonal Scaling
-   Vertically scaling your already horizontally-
    scaled nodes
Diagonal Scaling
-   Vertically scaling your already horizontally-
    scaled nodes
-   a.k.a. “tech refresh”
Diagonal Scaling
-   Vertically scaling your already horizontally-
    scaled nodes
-   a.k.a. “tech refresh”
-   a.k.a. “Moore’s Law Surfing”
Diagonal Scaling
Diagonal Scaling
              67 “old” webservers with 18 “new” :
We replaced
Diagonal Scaling
              67 “old” webservers with 18 “new” :
We replaced


                CPUs         RAM         drives       total power (W)
servers                                                 @60% peak
               per server   per server   per server

   67              2                                    8763.6
                              4GB        1x80GB

   18              8                                    2332.8
                             4GB         1x146GB
Diagonal Scaling
              67 “old” webservers with 18 “new” :
We replaced


                CPUs         RAM         drives       total power (W)
servers                                                 @60% peak

              ~70% LESS power
               per server   per server   per server

   67           2                                       8763.6
                     4GB    1x80GB
              49U LESS rack space
   18              8                                    2332.8
                             4GB         1x146GB
Diagonal Scaling
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min        rack        total power (W)
                                                 @60% peak



      23             1035           23          3008.4
      8              1120            8          1036.8
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min        rack        total power (W)
                                                 @60% peak

                  ~75% FASTER
      23            1035         23             3008.4
                  15U LESS rack space
                  65% LESS power 8
       8            1120                        1036.8
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min        rack        total power (W)

                  ~75% FASTER                    @60% peak



                  15U LESS rack space
      23            1035         23             3008.4
                  65% LESS power 8
       8            1120                        1036.8
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min        rack        total power (W)

                  ~75% FASTER                    @60% peak



                  15U LESS rack space
      23            1035         23             3008.4
                  65% LESS power 8
       8            1120                        1036.8
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min        rack        total power (W)

                  ~75% FASTER                    @60% peak



                  15U LESS rack space
      23            1035         23             3008.4
                  65% LESS power 8
       8            1120                        1036.8

        from this
                           to this
What do you do with
old/slow machines?
What do you do with
    old/slow machines?

-   Liquidate
What do you do with
    old/slow machines?

-   Liquidate
-   Re-purpose as dev/staging/etc
What do you do with
    old/slow machines?

-   Liquidate
-   Re-purpose as dev/staging/etc
-   “offline” tasks
Offline Tasks
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
    http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
    http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/

-   See Myles Grant talk about it more here:
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
    http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/

-   See Myles Grant talk about it more here:
    http://en.oreilly.com/velocity2009/public/schedule/detail/7552
Runbook Hacks




“WTF HAPPENED LAST NIGHT?!”
Why?
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  - teach machines to build themselves
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  - teach machines to build themselves
  - teach machines to watch themselves
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  - teach machines to build themselves
  - teach machines to watch themselves
  - teach machines to fix themselves
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  -   teach machines to build themselves
  -   teach machines to watch themselves
  -   teach machines to fix themselves
  -   reduce MTTR by streamlining
Automated
Infrastructure
Automated
          Infrastructure
-   If there is only one thing you do, automatic
    configuration and deployment management
    should be it.
Automated
              Infrastructure
-   If there is only one thing you do, automatic
    configuration and deployment management
    should be it.
-   See:
     -     Opscode/Chef (http://opscode.com/)

     -     Puppet (http://reductivelabs.com/products/puppet/)

     -     System Imager/Configurator
           (http://wiki.systemimager.org)
Conguration Management
     Codeswarm
Time
Machine time is cheaper than human time.


If a failure results in some commands being
run to ‘fix’ it, make the machines do it.


(i.e., don’t wake people up for stupid things!)
Aggregate Monitoring
Aggregate Monitoring




Don’t care about single nodes, only care about delta
change of metrics/faults
   -   Warn (email) on X % change
   -   Page (wake up) on Y % change
Aggregate Monitoring




Don’t care about single nodes, only care about delta
change of metrics/faults
   -   Warn (email) on X % change
   -   Page (wake up) on Y % change
High and low water marks for some metrics
Self-Healing
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on machines, will take
corrective action under certain conditions, and
report back with what they did.
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on machines, will take
corrective action under certain conditions, and
report back with what they did.
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on machines, will take
corrective action under certain conditions, and
report back with what they did.


Can greatly reduce your mean time to recovery
(MTTR)
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on machines, will take
corrective action under certain conditions, and
report back with what they did.


Can greatly reduce your mean time to recovery
(MTTR)
Basic Apache Example
Basic Apache Example

1. Webserver not running?
Basic Apache Example

1. Webserver not running?
2. Under certain conditions, try to start it, and
   email that this happened. (I’ll read it
   tomorrow)
Basic Apache Example

1. Webserver not running?
2. Under certain conditions, try to start it, and
   email that this happened. (I’ll read it
   tomorrow)
3. Won’t start? Assume something’s really
   wrong, so don’t keep trying (email that, too)
MySQL Self-Healing
MySQL Self-Healing
Some MySQL Issues “fixed” by the machines
MySQL Self-Healing
Some MySQL Issues “fixed” by the machines
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
- Queries not safe to kill are marked by the application
   as “NO KILL” in comments
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
- Queries not safe to kill are marked by the application
   as “NO KILL” in comments
- Run EXPLAIN on killed queries, and report the results
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
- Queries not safe to kill are marked by the application
   as “NO KILL” in comments
- Run EXPLAIN on killed queries, and report the results
- Keep track of the query types and databases that need
   the most killing, produce a “DBs that Suck” report
MySQL Self-Healing
MySQL Self-Healing
Some MySQL Replication issues “fixed” by the
machines, by error
MySQL Self-Healing
  Some MySQL Replication issues “fixed” by the
  machines, by error
- Skip errors that can safely be skipped and
  restart slave threads
MySQL Self-Healing
  Some MySQL Replication issues “fixed” by the
  machines, by error
- Skip errors that can safely be skipped and
  restart slave threads
- Force refetch of replication binlogs on:
       - 1064 (ER_PARSE_ERROR)
MySQL Self-Healing
  Some MySQL Replication issues “fixed” by the
  machines, by error
- Skip errors that can safely be skipped and
  restart slave threads
- Force refetch of replication binlogs on:
       - 1064 (ER_PARSE_ERROR)
- Re-run queries on:
       - 1205 (ER_LOCK_WAIT_TIMEOUT)
       - 1213 (ER_LOCK_DEADLOCK)
Troubleshooting
Code and Config
 Deploy Logs
Code and Config
 Deploy Logs

1. ESSENTIAL
Code and Config
 Deploy Logs

1. ESSENTIAL
2. MANDATORY
Communications
Communications
•   Internal IRC

     -   For ongoing discussions
     -   Logged, so “infinite” scrollback
Communications
•   Internal IRC

      -   For ongoing discussions
      -   Logged, so “infinite” scrollback
•   IM Bot (built on libyahoo2.sf.net)

      -   For production changes
      -   Broadcasts all to all contacts
      -   Logged, and injected into IRC
      -   IM Status = who is in primary/secondary on-call
Communications
•   Internal IRC

      -   For ongoing discussions
      -   Logged, so “infinite” scrollback
•   IM Bot (built on libyahoo2.sf.net)

      -   For production changes
      -   Broadcasts all to all contacts
      -   Logged, and injected into IRC
      -   IM Status = who is in primary/secondary on-call

•   All of IRC and IM Bot slurped into a search index
Operational Efficiency Hacks Web20 Expo2009
when
when   what
when   what

              detailed
               what*
when                 what

                                                        detailed
                                                         what*




*also points to what commands should be used to back out the changes
when                 what

                                                        detailed
                                                         what*



                                          who




*also points to what commands should be used to back out the changes
when                 what

                                                        detailed
                                                         what*



                                          who




*also points to what commands should be used to back out the changes
when                 what

                                                        detailed
                                                         what*



                                          who
     time of last deploy at top of ganglia




*also points to what commands should be used to back out the changes
Operational Efficiency Hacks Web20 Expo2009
Operational Efficiency Hacks Web20 Expo2009
IM Bot (timestamps help correlation)
IM Bot (timestamps help correlation)
IM Bot (timestamps help correlation)




all IRC, IM
      bot
     into
searchable
  history
Morals of Our Stories
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
-   Weigh time spent optimizing against
    expected gains
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
-   Weigh time spent optimizing against
    expected gains
-   Lean on others for how much “expected
    gains” mean for different scenarios
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
-   Weigh time spent optimizing against
    expected gains
-   Lean on others for how much “expected
    gains” mean for different scenarios
-   Plain old-fashioned intuition
Some Wisdom Nuggets


     Jon Prall’s 85 WebOps Rules:
  http://jprall.vox.com/library/post/85-
    operations-rules-to-live-by.html
Questions?

http://www.flickr.com/photos/ebarney/3348965637/
http://www.flickr.com/photos/dgmiller/1606071911/
http://www.flickr.com/photos/dannyboyster/60371673/
http://www.flickr.com/photos/bright/189338394/
http://www.flickr.com/photos/nickwheeleroz/2475011402/
http://www.flickr.com/photos/dramaqueennorma/191063346/
http://www.flickr.com/photos/telstar/2861103147/
http://www.flickr.com/photos/norby/2309046043/
http://www.flickr.com/photos/allysonk/201008992/
1 of 131

Recommended

Grub2 Booting Process by
Grub2 Booting ProcessGrub2 Booting Process
Grub2 Booting ProcessMike Wang
5.5K views38 slides
Multiprocessing with python by
Multiprocessing with pythonMultiprocessing with python
Multiprocessing with pythonPatrick Vergain
54.7K views42 slides
Nagios by
NagiosNagios
Nagiosguest7e7e305
5.4K views18 slides
Qemu device prototyping by
Qemu device prototypingQemu device prototyping
Qemu device prototypingYan Vugenfirer
276 views41 slides
The advantages and disadvantages of video games. by
The advantages and disadvantages of video games.The advantages and disadvantages of video games.
The advantages and disadvantages of video games.Lida Berisha
10.4K views11 slides
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,... by
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...The Linux Foundation
2.6K views13 slides

More Related Content

What's hot

PyCon 2017 예제로 살펴보는 PyQt by
PyCon 2017 예제로 살펴보는 PyQtPyCon 2017 예제로 살펴보는 PyQt
PyCon 2017 예제로 살펴보는 PyQt덕규 임
8.4K views29 slides
Game Development with Pygame by
Game Development with PygameGame Development with Pygame
Game Development with PygameFramgia Vietnam
2.4K views17 slides
Introduction to Git by
Introduction to GitIntroduction to Git
Introduction to GitColin Su
1.8K views32 slides
Introduction to game development by
Introduction to game developmentIntroduction to game development
Introduction to game developmentAbdelrahman Ahmed
535 views42 slides
HKG18-120 - Devicetree Schema Documentation and Validation by
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
1.3K views12 slides
Capacity Management for Web Operations by
Capacity Management for Web OperationsCapacity Management for Web Operations
Capacity Management for Web OperationsJohn Allspaw
27K views57 slides

What's hot(20)

PyCon 2017 예제로 살펴보는 PyQt by 덕규 임
PyCon 2017 예제로 살펴보는 PyQtPyCon 2017 예제로 살펴보는 PyQt
PyCon 2017 예제로 살펴보는 PyQt
덕규 임8.4K views
Introduction to Git by Colin Su
Introduction to GitIntroduction to Git
Introduction to Git
Colin Su1.8K views
HKG18-120 - Devicetree Schema Documentation and Validation by Linaro
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
Linaro1.3K views
Capacity Management for Web Operations by John Allspaw
Capacity Management for Web OperationsCapacity Management for Web Operations
Capacity Management for Web Operations
John Allspaw27K views
Gamification - Defining, Designing and Using it by Zac Fitz-Walter
Gamification - Defining, Designing and Using itGamification - Defining, Designing and Using it
Gamification - Defining, Designing and Using it
Zac Fitz-Walter26.9K views
Basic linux commands for bioinformatics by Bonnie Ng
Basic linux commands for bioinformaticsBasic linux commands for bioinformatics
Basic linux commands for bioinformatics
Bonnie Ng749 views
The impact of video games by SL Coder
The impact of video gamesThe impact of video games
The impact of video games
SL Coder4.5K views
Linux and windows file system by lin yucheng
Linux and windows  file systemLinux and windows  file system
Linux and windows file system
lin yucheng356 views
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson by harryvanhaaren
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce RichardsonThe 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
harryvanhaaren2.2K views
The Hand That Strikes, Also Blocks by Saumil Shah
The Hand That Strikes, Also BlocksThe Hand That Strikes, Also Blocks
The Hand That Strikes, Also Blocks
Saumil Shah97 views
Git and github - Verson Control for the Modern Developer by John Stevenson
Git and github - Verson Control for the Modern DeveloperGit and github - Verson Control for the Modern Developer
Git and github - Verson Control for the Modern Developer
John Stevenson1.7K views
Linux Device Driver parallelism using SMP and Kernel Pre-emption by Hemanth Venkatesh
Linux Device Driver parallelism using SMP and Kernel Pre-emptionLinux Device Driver parallelism using SMP and Kernel Pre-emption
Linux Device Driver parallelism using SMP and Kernel Pre-emption
Hemanth Venkatesh2.5K views
Git pour les (pas si) nuls by Malk Zameth
Git pour les (pas si) nulsGit pour les (pas si) nuls
Git pour les (pas si) nuls
Malk Zameth7.5K views
U Boot or Universal Bootloader by Satpal Parmar
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal Bootloader
Satpal Parmar21.1K views

Viewers also liked

Awareness, Feedback, Self-regulation by
Awareness, Feedback, Self-regulationAwareness, Feedback, Self-regulation
Awareness, Feedback, Self-regulationChristian Glahn
3.7K views35 slides
Go or No-Go: Operability and Contingency Planning at Etsy.com by
Go or No-Go: Operability and Contingency Planning at Etsy.comGo or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.comJohn Allspaw
30.5K views43 slides
Programação semana de extensão 2012 folia by
Programação semana de extensão 2012 foliaProgramação semana de extensão 2012 folia
Programação semana de extensão 2012 foliaAdriana Rocha
664 views13 slides
Yoga ocular by
Yoga ocularYoga ocular
Yoga ocularDiego Tondonia
593 views2 slides
Reconociendo habilidades by
Reconociendo habilidadesReconociendo habilidades
Reconociendo habilidadescarorubi1
389 views2 slides
Saludo protocolar al cuerpo diplomático acreditado en panamá by
Saludo protocolar al cuerpo diplomático acreditado en panamáSaludo protocolar al cuerpo diplomático acreditado en panamá
Saludo protocolar al cuerpo diplomático acreditado en panamámirepanama
620 views3 slides

Viewers also liked(20)

Awareness, Feedback, Self-regulation by Christian Glahn
Awareness, Feedback, Self-regulationAwareness, Feedback, Self-regulation
Awareness, Feedback, Self-regulation
Christian Glahn3.7K views
Go or No-Go: Operability and Contingency Planning at Etsy.com by John Allspaw
Go or No-Go: Operability and Contingency Planning at Etsy.comGo or No-Go: Operability and Contingency Planning at Etsy.com
Go or No-Go: Operability and Contingency Planning at Etsy.com
John Allspaw30.5K views
Programação semana de extensão 2012 folia by Adriana Rocha
Programação semana de extensão 2012 foliaProgramação semana de extensão 2012 folia
Programação semana de extensão 2012 folia
Adriana Rocha664 views
Reconociendo habilidades by carorubi1
Reconociendo habilidadesReconociendo habilidades
Reconociendo habilidades
carorubi1389 views
Saludo protocolar al cuerpo diplomático acreditado en panamá by mirepanama
Saludo protocolar al cuerpo diplomático acreditado en panamáSaludo protocolar al cuerpo diplomático acreditado en panamá
Saludo protocolar al cuerpo diplomático acreditado en panamá
mirepanama620 views
Strategies for Vacant Properties - John Mills, Camelot Property Management by fhanley
Strategies for Vacant Properties - John Mills, Camelot Property ManagementStrategies for Vacant Properties - John Mills, Camelot Property Management
Strategies for Vacant Properties - John Mills, Camelot Property Management
fhanley603 views
FLOW - Far and Large Offshore Wind (Summary) by NLandUSA
FLOW - Far and Large Offshore Wind (Summary)FLOW - Far and Large Offshore Wind (Summary)
FLOW - Far and Large Offshore Wind (Summary)
NLandUSA2.3K views
B2b newsletter_april by Rene Jack
B2b newsletter_aprilB2b newsletter_april
B2b newsletter_april
Rene Jack295 views
La gestión de la elección gd e e mail by msanchezm
La gestión de la elección gd e e mailLa gestión de la elección gd e e mail
La gestión de la elección gd e e mail
msanchezm501 views
Aguahidratacion cuso4b by J M
Aguahidratacion cuso4bAguahidratacion cuso4b
Aguahidratacion cuso4b
J M2K views

Similar to Operational Efficiency Hacks Web20 Expo2009

CPU Optimizations in the CERN Cloud - February 2016 by
CPU Optimizations in the CERN Cloud - February 2016CPU Optimizations in the CERN Cloud - February 2016
CPU Optimizations in the CERN Cloud - February 2016Belmiro Moreira
852 views20 slides
How to build a state-of-the-art rails cluster by
How to build a state-of-the-art rails clusterHow to build a state-of-the-art rails cluster
How to build a state-of-the-art rails clusterTim Lossen
2.6K views53 slides
Enterprise Search Summit - Speeding Up Search by
Enterprise Search Summit - Speeding Up SearchEnterprise Search Summit - Speeding Up Search
Enterprise Search Summit - Speeding Up SearchAzul Systems Inc.
936 views47 slides
MySQL Group Replication - Ready For Production? (2018-04) by
MySQL Group Replication - Ready For Production? (2018-04)MySQL Group Replication - Ready For Production? (2018-04)
MySQL Group Replication - Ready For Production? (2018-04)Kenny Gryp
1.7K views72 slides
Capacity Management from Flickr by
Capacity Management from FlickrCapacity Management from Flickr
Capacity Management from Flickrxlight
976 views57 slides
Capacity Planning For Web Operations Presentation by
Capacity Planning For Web Operations PresentationCapacity Planning For Web Operations Presentation
Capacity Planning For Web Operations Presentationjward5519
627 views52 slides

Similar to Operational Efficiency Hacks Web20 Expo2009(20)

CPU Optimizations in the CERN Cloud - February 2016 by Belmiro Moreira
CPU Optimizations in the CERN Cloud - February 2016CPU Optimizations in the CERN Cloud - February 2016
CPU Optimizations in the CERN Cloud - February 2016
Belmiro Moreira852 views
How to build a state-of-the-art rails cluster by Tim Lossen
How to build a state-of-the-art rails clusterHow to build a state-of-the-art rails cluster
How to build a state-of-the-art rails cluster
Tim Lossen2.6K views
Enterprise Search Summit - Speeding Up Search by Azul Systems Inc.
Enterprise Search Summit - Speeding Up SearchEnterprise Search Summit - Speeding Up Search
Enterprise Search Summit - Speeding Up Search
Azul Systems Inc.936 views
MySQL Group Replication - Ready For Production? (2018-04) by Kenny Gryp
MySQL Group Replication - Ready For Production? (2018-04)MySQL Group Replication - Ready For Production? (2018-04)
MySQL Group Replication - Ready For Production? (2018-04)
Kenny Gryp1.7K views
Capacity Management from Flickr by xlight
Capacity Management from FlickrCapacity Management from Flickr
Capacity Management from Flickr
xlight976 views
Capacity Planning For Web Operations Presentation by jward5519
Capacity Planning For Web Operations PresentationCapacity Planning For Web Operations Presentation
Capacity Planning For Web Operations Presentation
jward5519627 views
Capacity Planning For Web Operations Presentation by jward5519
Capacity Planning For Web Operations PresentationCapacity Planning For Web Operations Presentation
Capacity Planning For Web Operations Presentation
jward5519422 views
Apache Gearpump - Lightweight Real-time Streaming Engine by Tianlun Zhang
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming Engine
Tianlun Zhang3.5K views
Scylla Summit 2018: OLAP or OLTP? Why Not Both? by ScyllaDB
Scylla Summit 2018: OLAP or OLTP? Why Not Both?Scylla Summit 2018: OLAP or OLTP? Why Not Both?
Scylla Summit 2018: OLAP or OLTP? Why Not Both?
ScyllaDB4.8K views
2016-JAN-28 -- High Performance Production Databases on Ceph by Ceph Community
2016-JAN-28 -- High Performance Production Databases on Ceph2016-JAN-28 -- High Performance Production Databases on Ceph
2016-JAN-28 -- High Performance Production Databases on Ceph
Ceph Community 2.7K views
Cloudcon East Presentation by br7tt
Cloudcon East PresentationCloudcon East Presentation
Cloudcon East Presentation
br7tt429 views
Cloudcon East Presentation by br7tt
Cloudcon East PresentationCloudcon East Presentation
Cloudcon East Presentation
br7tt310 views
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ... by Amazon Web Services
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Weakly Supervised Whole Slide Image Analysis Using Cloud Computing by Sean Yu
Weakly Supervised Whole Slide Image Analysis Using Cloud ComputingWeakly Supervised Whole Slide Image Analysis Using Cloud Computing
Weakly Supervised Whole Slide Image Analysis Using Cloud Computing
Sean Yu7 views
Low latency in java 8 by Peter Lawrey by J On The Beach
Low latency in java 8 by Peter Lawrey Low latency in java 8 by Peter Lawrey
Low latency in java 8 by Peter Lawrey
J On The Beach9.3K views
Fisl - Deployment by Fabio Akita
Fisl - DeploymentFisl - Deployment
Fisl - Deployment
Fabio Akita934 views

More from John Allspaw

Resilience Engineering: A field of study, a community, and some perspective s... by
Resilience Engineering: A field of study, a community, and some perspective s...Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...John Allspaw
3.3K views63 slides
Considerations for Alert Design by
Considerations for Alert DesignConsiderations for Alert Design
Considerations for Alert DesignJohn Allspaw
9.8K views48 slides
Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls by
Velocity EU 2012 Escalating Scenarios: Outage Handling PitfallsVelocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls
Velocity EU 2012 Escalating Scenarios: Outage Handling PitfallsJohn Allspaw
5.4K views118 slides
Responding to Outages Maturely by
Responding to Outages MaturelyResponding to Outages Maturely
Responding to Outages MaturelyJohn Allspaw
3K views94 slides
Resilient Response In Complex Systems by
Resilient Response In Complex SystemsResilient Response In Complex Systems
Resilient Response In Complex SystemsJohn Allspaw
5.7K views92 slides
Outages, PostMortems, and Human Error by
Outages, PostMortems, and Human ErrorOutages, PostMortems, and Human Error
Outages, PostMortems, and Human ErrorJohn Allspaw
36K views244 slides

More from John Allspaw(14)

Resilience Engineering: A field of study, a community, and some perspective s... by John Allspaw
Resilience Engineering: A field of study, a community, and some perspective s...Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...
John Allspaw3.3K views
Considerations for Alert Design by John Allspaw
Considerations for Alert DesignConsiderations for Alert Design
Considerations for Alert Design
John Allspaw9.8K views
Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls by John Allspaw
Velocity EU 2012 Escalating Scenarios: Outage Handling PitfallsVelocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls
Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls
John Allspaw5.4K views
Responding to Outages Maturely by John Allspaw
Responding to Outages MaturelyResponding to Outages Maturely
Responding to Outages Maturely
John Allspaw3K views
Resilient Response In Complex Systems by John Allspaw
Resilient Response In Complex SystemsResilient Response In Complex Systems
Resilient Response In Complex Systems
John Allspaw5.7K views
Outages, PostMortems, and Human Error by John Allspaw
Outages, PostMortems, and Human ErrorOutages, PostMortems, and Human Error
Outages, PostMortems, and Human Error
John Allspaw36K views
Anticipation: What Could Possibly Go Wrong? by John Allspaw
Anticipation: What Could Possibly Go Wrong?Anticipation: What Could Possibly Go Wrong?
Anticipation: What Could Possibly Go Wrong?
John Allspaw3.2K views
Advanced PostMortem Fu and Human Error 101 (Velocity 2011) by John Allspaw
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
John Allspaw81.7K views
Dev and Ops Collaboration and Awareness at Etsy and Flickr by John Allspaw
Dev and Ops Collaboration and Awareness at Etsy and FlickrDev and Ops Collaboration and Awareness at Etsy and Flickr
Dev and Ops Collaboration and Awareness at Etsy and Flickr
John Allspaw35.6K views
Ops Meta-Metrics: The Currency You Pay For Change by John Allspaw
Ops Meta-Metrics: The Currency You Pay For ChangeOps Meta-Metrics: The Currency You Pay For Change
Ops Meta-Metrics: The Currency You Pay For Change
John Allspaw14.1K views
Ops Meta-Metrics: The Currency You Pay For Change by John Allspaw
Ops Meta-Metrics: The Currency You Pay For ChangeOps Meta-Metrics: The Currency You Pay For Change
Ops Meta-Metrics: The Currency You Pay For Change
John Allspaw8.4K views
Capacity Planning For LAMP by John Allspaw
Capacity Planning For LAMPCapacity Planning For LAMP
Capacity Planning For LAMP
John Allspaw3.6K views
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr by John Allspaw
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
10+ Deploys Per Day: Dev and Ops Cooperation at Flickr
John Allspaw1M views
Capacity Planning for Web Operations - Web20 Expo 2008 by John Allspaw
Capacity Planning for Web Operations - Web20 Expo 2008Capacity Planning for Web Operations - Web20 Expo 2008
Capacity Planning for Web Operations - Web20 Expo 2008
John Allspaw6.3K views

Recently uploaded

magicsheetv3.pdf by
magicsheetv3.pdfmagicsheetv3.pdf
magicsheetv3.pdfdavisclibby
7 views1 slide
Algorithms are the New Historians by
Algorithms are the New HistoriansAlgorithms are the New Historians
Algorithms are the New HistoriansGeorge Oates
10 views73 slides
Black and White Simple Elegant Creative Design Portfolio Presentation.pdf by
Black and White Simple Elegant Creative Design Portfolio Presentation.pdfBlack and White Simple Elegant Creative Design Portfolio Presentation.pdf
Black and White Simple Elegant Creative Design Portfolio Presentation.pdfJasonRuiz27
14 views9 slides
Christmas with Lisi Martin3 by
Christmas with Lisi Martin3Christmas with Lisi Martin3
Christmas with Lisi Martin3michaelasanda *
8 views54 slides
17 Famous Funny Poems: Laugh with Popular Fun Poems by
17 Famous Funny Poems: Laugh with Popular Fun Poems17 Famous Funny Poems: Laugh with Popular Fun Poems
17 Famous Funny Poems: Laugh with Popular Fun PoemsOZoFeTeam
5 views37 slides
Global_Placemaking_Presentation_V2.pdf by
Global_Placemaking_Presentation_V2.pdfGlobal_Placemaking_Presentation_V2.pdf
Global_Placemaking_Presentation_V2.pdfryann44
10 views9 slides

Recently uploaded(20)

Algorithms are the New Historians by George Oates
Algorithms are the New HistoriansAlgorithms are the New Historians
Algorithms are the New Historians
George Oates10 views
Black and White Simple Elegant Creative Design Portfolio Presentation.pdf by JasonRuiz27
Black and White Simple Elegant Creative Design Portfolio Presentation.pdfBlack and White Simple Elegant Creative Design Portfolio Presentation.pdf
Black and White Simple Elegant Creative Design Portfolio Presentation.pdf
JasonRuiz2714 views
17 Famous Funny Poems: Laugh with Popular Fun Poems by OZoFeTeam
17 Famous Funny Poems: Laugh with Popular Fun Poems17 Famous Funny Poems: Laugh with Popular Fun Poems
17 Famous Funny Poems: Laugh with Popular Fun Poems
OZoFeTeam5 views
Global_Placemaking_Presentation_V2.pdf by ryann44
Global_Placemaking_Presentation_V2.pdfGlobal_Placemaking_Presentation_V2.pdf
Global_Placemaking_Presentation_V2.pdf
ryann4410 views
E-Catalog-2023-July-Edit (TeckWrap).pdf by epifanioelias
E-Catalog-2023-July-Edit (TeckWrap).pdfE-Catalog-2023-July-Edit (TeckWrap).pdf
E-Catalog-2023-July-Edit (TeckWrap).pdf
epifanioelias11 views
Ballerina by mattenro
BallerinaBallerina
Ballerina
mattenro50 views
Channel Hookup - 10wayshookupv2.pdf by davisclibby
Channel Hookup - 10wayshookupv2.pdfChannel Hookup - 10wayshookupv2.pdf
Channel Hookup - 10wayshookupv2.pdf
davisclibby6 views
The beauty of a tear ….ppsx by guimera
The beauty of a tear ….ppsxThe beauty of a tear ….ppsx
The beauty of a tear ….ppsx
guimera 10 views
Joint Pain & Related Diseases.pdf by PhytoAtomy
Joint Pain & Related Diseases.pdfJoint Pain & Related Diseases.pdf
Joint Pain & Related Diseases.pdf
PhytoAtomy5 views
pinkman slideshow.pdf by will521124
pinkman slideshow.pdfpinkman slideshow.pdf
pinkman slideshow.pdf
will5211248 views
Natalie Fuchs-Photography + Design Portfolio by fuchsna
Natalie Fuchs-Photography + Design PortfolioNatalie Fuchs-Photography + Design Portfolio
Natalie Fuchs-Photography + Design Portfolio
fuchsna10 views

Operational Efficiency Hacks Web20 Expo2009

  • 1. Operational Efficiency Hacks John Allspaw Operations Engineering, Flickr
  • 2. who am I? Manage the Flickr Operations group Wrote a geeky book:
  • 4. “Efficiencies” Doing more with the robots you’ve got
  • 5. “Efficiencies” Doing more with the robots you’ve got Doing more with the humans you’ve got
  • 6. Some optimization “rules”
  • 7. Some optimization “rules” - Don’t rely on being able to tweak anything.
  • 8. Some optimization “rules” - Don’t rely on being able to tweak anything. - Don’t waste too much time tuning when you have no evidence it’ll matter.
  • 10. Optimization “rules” “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” Knuth, (or Hoare)
  • 13. Optimization “rules” That doesn’t give us an excuse to be lazy and inefficient.
  • 14. Optimization “rules” That doesn’t give us an excuse to be lazy and inefficient.
  • 15. Optimization “rules” That doesn’t give us an excuse to be lazy and inefficient. We lean on the experience of people in the community for evidence that tuning(s) might be a worthwhile thing to do.
  • 16. Optimization “rules” “Yet we should not pass up our opportunities in that critical 3 percent.” Knuth, (or Hoare)
  • 17. So... stop somewhere in here OMG obvious I'm wasting tuning !@#$ time wins for no reason
  • 19. Our Context - 24 TB of MySQL data
  • 20. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes
  • 21. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads
  • 22. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads - 6 PB of photos
  • 23. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads - 6 PB of photos - 10TB storage eaten per day
  • 24. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads - 6 PB of photos - 10TB storage eaten per day - 15,362 service monitors (alerts)
  • 26. Infrastructure Hacks - Examples of what changing software can do (plain old-fashioned performance tuning)
  • 27. Infrastructure Hacks - Examples of what changing software can do (plain old-fashioned performance tuning) - Examples of what changing hardware can do (yay for Mr. Moore!)
  • 28. Leaning on compilers (synthetic PHP benchmarks, not real-world) (http://sebastian-bergmann.de/archives/634-PHP-GCC- ICC-Benchmark.html)
  • 29. PHP (real-world) php 4.4.8 to php 5.2.8 migration
  • 30. Can now handle more with less same taste, less filling
  • 32. Image Processing - 2004, Flickr was using ImageMagick for image processing (version 6.1.9)
  • 33. Image Processing - 2004, Flickr was using ImageMagick for image processing (version 6.1.9) - Changed to GraphicsMagick, about 15% faster at the time (version 1.1.5)
  • 34. Image Processing - 2004, Flickr was using ImageMagick for image processing (version 6.1.9) - Changed to GraphicsMagick, about 15% faster at the time (version 1.1.5) - Only need a subset of ImageMagick features anyway for our purposes
  • 35. Image Processing - OpenMP support (http://en.wikipedia.org/wiki/Openmp) - Allows parallelization of processing jobs, using multiple cores working on the same image - Some algorithms have more parallelization than others
  • 36. Image Processing - Test script - 7 large-ish DSLR photos - Cascade resizing each to 6 smaller sizes, semi-typical for Flickr’s workload - Each resize processed serially
  • 37. Image Processing compiler differences (GM version 1.1.14, non-OpenMP)
  • 38. Image Processing OpenMP differences OpenMP advantage (gcc 4.1.2, on quad core Xeon L5335 @ 2.00GHz)
  • 39. Image Processing CPU differences
  • 41. Diagonal Scaling - Vertically scaling your already horizontally- scaled nodes
  • 42. Diagonal Scaling - Vertically scaling your already horizontally- scaled nodes - a.k.a. “tech refresh”
  • 43. Diagonal Scaling - Vertically scaling your already horizontally- scaled nodes - a.k.a. “tech refresh” - a.k.a. “Moore’s Law Surfing”
  • 45. Diagonal Scaling 67 “old” webservers with 18 “new” : We replaced
  • 46. Diagonal Scaling 67 “old” webservers with 18 “new” : We replaced CPUs RAM drives total power (W) servers @60% peak per server per server per server 67 2 8763.6 4GB 1x80GB 18 8 2332.8 4GB 1x146GB
  • 47. Diagonal Scaling 67 “old” webservers with 18 “new” : We replaced CPUs RAM drives total power (W) servers @60% peak ~70% LESS power per server per server per server 67 2 8763.6 4GB 1x80GB 49U LESS rack space 18 8 2332.8 4GB 1x146GB
  • 49. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced
  • 50. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) @60% peak 23 1035 23 3008.4 8 1120 8 1036.8
  • 51. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) @60% peak ~75% FASTER 23 1035 23 3008.4 15U LESS rack space 65% LESS power 8 8 1120 1036.8
  • 52. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) ~75% FASTER @60% peak 15U LESS rack space 23 1035 23 3008.4 65% LESS power 8 8 1120 1036.8
  • 53. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) ~75% FASTER @60% peak 15U LESS rack space 23 1035 23 3008.4 65% LESS power 8 8 1120 1036.8
  • 54. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) ~75% FASTER @60% peak 15U LESS rack space 23 1035 23 3008.4 65% LESS power 8 8 1120 1036.8 from this to this
  • 55. What do you do with old/slow machines?
  • 56. What do you do with old/slow machines? - Liquidate
  • 57. What do you do with old/slow machines? - Liquidate - Re-purpose as dev/staging/etc
  • 58. What do you do with old/slow machines? - Liquidate - Re-purpose as dev/staging/etc - “offline” tasks
  • 60. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks
  • 61. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here:
  • 62. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/
  • 63. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/ - See Myles Grant talk about it more here:
  • 64. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/ - See Myles Grant talk about it more here: http://en.oreilly.com/velocity2009/public/schedule/detail/7552
  • 65. Runbook Hacks “WTF HAPPENED LAST NIGHT?!”
  • 66. Why?
  • 67. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand
  • 68. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How:
  • 69. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves
  • 70. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves - teach machines to watch themselves
  • 71. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves - teach machines to watch themselves - teach machines to fix themselves
  • 72. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves - teach machines to watch themselves - teach machines to fix themselves - reduce MTTR by streamlining
  • 74. Automated Infrastructure - If there is only one thing you do, automatic configuration and deployment management should be it.
  • 75. Automated Infrastructure - If there is only one thing you do, automatic configuration and deployment management should be it. - See: - Opscode/Chef (http://opscode.com/) - Puppet (http://reductivelabs.com/products/puppet/) - System Imager/Configurator (http://wiki.systemimager.org)
  • 77. Time Machine time is cheaper than human time. If a failure results in some commands being run to ‘fix’ it, make the machines do it. (i.e., don’t wake people up for stupid things!)
  • 79. Aggregate Monitoring Don’t care about single nodes, only care about delta change of metrics/faults - Warn (email) on X % change - Page (wake up) on Y % change
  • 80. Aggregate Monitoring Don’t care about single nodes, only care about delta change of metrics/faults - Warn (email) on X % change - Page (wake up) on Y % change High and low water marks for some metrics
  • 82. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it.
  • 83. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it.
  • 84. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did.
  • 85. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did.
  • 86. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did. Can greatly reduce your mean time to recovery (MTTR)
  • 87. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did. Can greatly reduce your mean time to recovery (MTTR)
  • 89. Basic Apache Example 1. Webserver not running?
  • 90. Basic Apache Example 1. Webserver not running? 2. Under certain conditions, try to start it, and email that this happened. (I’ll read it tomorrow)
  • 91. Basic Apache Example 1. Webserver not running? 2. Under certain conditions, try to start it, and email that this happened. (I’ll read it tomorrow) 3. Won’t start? Assume something’s really wrong, so don’t keep trying (email that, too)
  • 93. MySQL Self-Healing Some MySQL Issues “fixed” by the machines
  • 94. MySQL Self-Healing Some MySQL Issues “fixed” by the machines
  • 95. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill)
  • 96. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill) - Queries not safe to kill are marked by the application as “NO KILL” in comments
  • 97. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill) - Queries not safe to kill are marked by the application as “NO KILL” in comments - Run EXPLAIN on killed queries, and report the results
  • 98. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill) - Queries not safe to kill are marked by the application as “NO KILL” in comments - Run EXPLAIN on killed queries, and report the results - Keep track of the query types and databases that need the most killing, produce a “DBs that Suck” report
  • 100. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error
  • 101. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error - Skip errors that can safely be skipped and restart slave threads
  • 102. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error - Skip errors that can safely be skipped and restart slave threads - Force refetch of replication binlogs on: - 1064 (ER_PARSE_ERROR)
  • 103. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error - Skip errors that can safely be skipped and restart slave threads - Force refetch of replication binlogs on: - 1064 (ER_PARSE_ERROR) - Re-run queries on: - 1205 (ER_LOCK_WAIT_TIMEOUT) - 1213 (ER_LOCK_DEADLOCK)
  • 105. Code and Config Deploy Logs
  • 106. Code and Config Deploy Logs 1. ESSENTIAL
  • 107. Code and Config Deploy Logs 1. ESSENTIAL 2. MANDATORY
  • 109. Communications • Internal IRC - For ongoing discussions - Logged, so “infinite” scrollback
  • 110. Communications • Internal IRC - For ongoing discussions - Logged, so “infinite” scrollback • IM Bot (built on libyahoo2.sf.net) - For production changes - Broadcasts all to all contacts - Logged, and injected into IRC - IM Status = who is in primary/secondary on-call
  • 111. Communications • Internal IRC - For ongoing discussions - Logged, so “infinite” scrollback • IM Bot (built on libyahoo2.sf.net) - For production changes - Broadcasts all to all contacts - Logged, and injected into IRC - IM Status = who is in primary/secondary on-call • All of IRC and IM Bot slurped into a search index
  • 113. when
  • 114. when what
  • 115. when what detailed what*
  • 116. when what detailed what* *also points to what commands should be used to back out the changes
  • 117. when what detailed what* who *also points to what commands should be used to back out the changes
  • 118. when what detailed what* who *also points to what commands should be used to back out the changes
  • 119. when what detailed what* who time of last deploy at top of ganglia *also points to what commands should be used to back out the changes
  • 122. IM Bot (timestamps help correlation)
  • 123. IM Bot (timestamps help correlation)
  • 124. IM Bot (timestamps help correlation) all IRC, IM bot into searchable history
  • 125. Morals of Our Stories
  • 126. Morals of Our Stories - Optimizations can be a Very Good Thing™
  • 127. Morals of Our Stories - Optimizations can be a Very Good Thing™ - Weigh time spent optimizing against expected gains
  • 128. Morals of Our Stories - Optimizations can be a Very Good Thing™ - Weigh time spent optimizing against expected gains - Lean on others for how much “expected gains” mean for different scenarios
  • 129. Morals of Our Stories - Optimizations can be a Very Good Thing™ - Weigh time spent optimizing against expected gains - Lean on others for how much “expected gains” mean for different scenarios - Plain old-fashioned intuition
  • 130. Some Wisdom Nuggets Jon Prall’s 85 WebOps Rules: http://jprall.vox.com/library/post/85- operations-rules-to-live-by.html

Editor's Notes

  1. Finding huge gains from tweaking gets harder as you turn the knobs and pull the levers.
  2. He had some evidence that suggested compilers and versions would make a difference for PHP.
  3. So, we did it. (not just for performance reasons, but still...) Will this happen to you, too? Maybe. Maybe not.
  4. Performance gains like this don’t come very often, for “free”.
  5. We don’t need 80% of what ImageMagick has.
  6. We don’t need 80% of what ImageMagick has.
  7. We don’t need 80% of what ImageMagick has.
  8. Cascading resizing: Original -> Large Large -> Medium Large -> Small Medium -> Thumb Medium -> Square
  9. This was done before OpenMP support was in. Compilers and optimization flags can make a difference!
  10. Enter OpenMP. Example of how
  11. These have been the revisions of our image processing hardware over time. 4x faster
  12. Examples are contact notifications, large photoset deletions, etc.
  13. Examples are contact notifications, large photoset deletions, etc.
  14. Examples are contact notifications, large photoset deletions, etc.
  15. Examples are contact notifications, large photoset deletions, etc.
  16. Examples are contact notifications, large photoset deletions, etc.
  17. Examples are contact notifications, large photoset deletions, etc.
  18. Runbook hacks: tuning the process of operations failure handling and mitigation.
  19. Low water marks as indirect trouble indicators.
  20. Low water marks as indirect trouble indicators.
  21. Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  22. Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  23. Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  24. Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  25. Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  26. Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  27. Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  28. Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  29. Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  30. Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  31. Simple tips and tricks that can help in fixing things when they break.
  32. This shouldn’t be considered optional.
  33. This shouldn’t be considered optional.
  34. Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.
  35. Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.
  36. Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.
  37. Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.