Operational
Efficiency
Hacks
                          John Allspaw
          Operations Engineering, Flickr
who am I?
Manage the Flickr Operations group
Wrote a geeky book:
“Efficiencies”
“Efficiencies”
   Doing more with the robots you’ve got
“Efficiencies”
   Doing more with the robots you’ve got
   Doing more with the humans you’ve got
Some optimization
    “rules”
Some optimization
         “rules”

-   Don’t rely on being able to tweak anything.
Some optimization
         “rules”

-   Don’t rely on being able to tweak anything.
-   Don’t waste too much time tuning w...
Optimization “rules”
performance tuning gains




                                 time spent tuning
Optimization “rules”




“We should forget about small efficiencies,
say about 97% of the time: premature
optimization is t...
however...
Optimization “rules”
Optimization “rules”

That doesn’t give us an excuse to be lazy and
inefficient.
Optimization “rules”

That doesn’t give us an excuse to be lazy and
inefficient.
Optimization “rules”

That doesn’t give us an excuse to be lazy and
inefficient.


We lean on the experience of people in t...
Optimization “rules”

“Yet we should not pass up our
opportunities in that critical 3 percent.”
                       Knu...
So...
          stop somewhere
               in here




                           OMG
obvious
                         ...
Our Context
Our Context

-   24 TB of MySQL data
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
-   6 PB of photos
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
-   6 PB of photos
-   10TB s...
Our Context

-   24 TB of MySQL data
-   32k/sec of MySQL writes
-   120k/sec of MySQL reads
-   6 PB of photos
-   10TB s...
Infrastructure Hacks
Infrastructure Hacks

-   Examples of what changing software can do
          (plain old-fashioned performance tuning)
Infrastructure Hacks

-   Examples of what changing software can do
          (plain old-fashioned performance tuning)
-  ...
Leaning on compilers
(synthetic PHP benchmarks, not real-world)




(http://sebastian-bergmann.de/archives/634-PHP-GCC-
  ...
PHP (real-world)




php 4.4.8 to php 5.2.8 migration
Can now handle more with less




same
taste,
 less
filling
Image Processing
Image Processing


-   2004, Flickr was using ImageMagick for
    image processing (version 6.1.9)
Image Processing


-   2004, Flickr was using ImageMagick for
    image processing (version 6.1.9)
-   Changed to Graphics...
Image Processing


-   2004, Flickr was using ImageMagick for
    image processing (version 6.1.9)
-   Changed to Graphics...
Image Processing

-   OpenMP support
    (http://en.wikipedia.org/wiki/Openmp)
    - Allows parallelization of processing ...
Image Processing
-   Test script
     -   7 large-ish DSLR photos
     -   Cascade resizing each to 6 smaller sizes,
     ...
Image Processing
   compiler differences




(GM version 1.1.14, non-OpenMP)
Image Processing
          OpenMP differences


                    OpenMP
                    advantage




(gcc 4.1.2, o...
Image Processing
   CPU differences
Diagonal Scaling
Diagonal Scaling
-   Vertically scaling your already horizontally-
    scaled nodes
Diagonal Scaling
-   Vertically scaling your already horizontally-
    scaled nodes
-   a.k.a. “tech refresh”
Diagonal Scaling
-   Vertically scaling your already horizontally-
    scaled nodes
-   a.k.a. “tech refresh”
-   a.k.a. “...
Diagonal Scaling
Diagonal Scaling
              67 “old” webservers with 18 “new” :
We replaced
Diagonal Scaling
              67 “old” webservers with 18 “new” :
We replaced


                CPUs         RAM         ...
Diagonal Scaling
              67 “old” webservers with 18 “new” :
We replaced


                CPUs         RAM         ...
Diagonal Scaling
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min     ...
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min     ...
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min     ...
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min     ...
Diagonal Scaling
              23 “old” image processing boxes with 8 “new”
We replaced


   server        photos/min     ...
What do you do with
old/slow machines?
What do you do with
    old/slow machines?

-   Liquidate
What do you do with
    old/slow machines?

-   Liquidate
-   Re-purpose as dev/staging/etc
What do you do with
    old/slow machines?

-   Liquidate
-   Re-purpose as dev/staging/etc
-   “offline” tasks
Offline Tasks
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
    http:...
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
    http:...
Offline Tasks
-   Out-of-band/asynchronous queuing and execution
    system, for non-realtime tasks
-   See here:
    http:...
Runbook Hacks




“WTF HAPPENED LAST NIGHT?!”
Why?
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  - teach ma...
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  - teach ma...
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  - teach ma...
Why?
As infrastructure grows, try to keep the
Humans:Machines ratio from getting out of
hand
Some of the How:
  -   teach ...
Automated
Infrastructure
Automated
          Infrastructure
-   If there is only one thing you do, automatic
    configuration and deployment manage...
Automated
              Infrastructure
-   If there is only one thing you do, automatic
    configuration and deployment ma...
Conguration Management
     Codeswarm
Time
Machine time is cheaper than human time.


If a failure results in some commands being
run to ‘fix’ it, make the machi...
Aggregate Monitoring
Aggregate Monitoring




Don’t care about single nodes, only care about delta
change of metrics/faults
   -   Warn (email)...
Aggregate Monitoring




Don’t care about single nodes, only care about delta
change of metrics/faults
   -   Warn (email)...
Self-Healing
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on mac...
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on mac...
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on mac...
Self-Healing
Make service monitoring fix common failure
scenarios, notify us later about it.


Daemons/processes run on mac...
Basic Apache Example
Basic Apache Example

1. Webserver not running?
Basic Apache Example

1. Webserver not running?
2. Under certain conditions, try to start it, and
   email that this happe...
Basic Apache Example

1. Webserver not running?
2. Under certain conditions, try to start it, and
   email that this happe...
MySQL Self-Healing
MySQL Self-Healing
Some MySQL Issues “fixed” by the machines
MySQL Self-Healing
Some MySQL Issues “fixed” by the machines
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
...
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
...
MySQL Self-Healing
   Some MySQL Issues “fixed” by the machines


- Kill long-running SELECT queries (marked safe to kill)
...
MySQL Self-Healing
MySQL Self-Healing
Some MySQL Replication issues “fixed” by the
machines, by error
MySQL Self-Healing
  Some MySQL Replication issues “fixed” by the
  machines, by error
- Skip errors that can safely be ski...
MySQL Self-Healing
  Some MySQL Replication issues “fixed” by the
  machines, by error
- Skip errors that can safely be ski...
MySQL Self-Healing
  Some MySQL Replication issues “fixed” by the
  machines, by error
- Skip errors that can safely be ski...
Troubleshooting
Code and Config
 Deploy Logs
Code and Config
 Deploy Logs

1. ESSENTIAL
Code and Config
 Deploy Logs

1. ESSENTIAL
2. MANDATORY
Communications
Communications
•   Internal IRC

     -   For ongoing discussions
     -   Logged, so “infinite” scrollback
Communications
•   Internal IRC

      -   For ongoing discussions
      -   Logged, so “infinite” scrollback
•   IM Bot (b...
Communications
•   Internal IRC

      -   For ongoing discussions
      -   Logged, so “infinite” scrollback
•   IM Bot (b...
when
when   what
when   what

              detailed
               what*
when                 what

                                                        detailed
                              ...
when                 what

                                                        detailed
                              ...
when                 what

                                                        detailed
                              ...
when                 what

                                                        detailed
                              ...
IM Bot (timestamps help correlation)
IM Bot (timestamps help correlation)
IM Bot (timestamps help correlation)




all IRC, IM
      bot
     into
searchable
  history
Morals of Our Stories
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
-   Weigh time spent optimizing against
    expected ga...
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
-   Weigh time spent optimizing against
    expected ga...
Morals of Our Stories

-   Optimizations can be a Very Good Thing™
-   Weigh time spent optimizing against
    expected ga...
Some Wisdom Nuggets


     Jon Prall’s 85 WebOps Rules:
  http://jprall.vox.com/library/post/85-
    operations-rules-to-l...
Questions?

http://www.flickr.com/photos/ebarney/3348965637/
http://www.flickr.com/photos/dgmiller/1606071911/
http://www.fli...
Operational Efficiency Hacks Web20 Expo2009
Operational Efficiency Hacks Web20 Expo2009
Operational Efficiency Hacks Web20 Expo2009
Upcoming SlideShare
Loading in...5
×

Operational Efficiency Hacks Web20 Expo2009

31,689

Published on

Published in: Art & Photos, Technology
6 Comments
163 Likes
Statistics
Notes
No Downloads
Views
Total Views
31,689
On Slideshare
0
From Embeds
0
Number of Embeds
19
Actions
Shares
0
Downloads
0
Comments
6
Likes
163
Embeds 0
No embeds

No notes for slide












  • Finding huge gains from tweaking gets harder as you turn the knobs and pull the levers.






























  • He had some evidence that suggested compilers and versions would make a difference for PHP.
  • So, we did it. (not just for performance reasons, but still...)
    Will this happen to you, too? Maybe. Maybe not.
  • Performance gains like this don’t come very often, for “free”.
  • We don’t need 80% of what ImageMagick has.
  • We don’t need 80% of what ImageMagick has.
  • We don’t need 80% of what ImageMagick has.


  • Cascading resizing:
    Original -> Large
    Large -> Medium
    Large -> Small
    Medium -> Thumb
    Medium -> Square
  • This was done before OpenMP support was in. Compilers and optimization flags can make a difference!
  • Enter OpenMP. Example of how
  • These have been the revisions of our image processing hardware over time. 4x faster




























  • Examples are contact notifications, large photoset deletions, etc.
  • Examples are contact notifications, large photoset deletions, etc.
  • Examples are contact notifications, large photoset deletions, etc.
  • Examples are contact notifications, large photoset deletions, etc.
  • Examples are contact notifications, large photoset deletions, etc.
  • Examples are contact notifications, large photoset deletions, etc.
  • Runbook hacks: tuning the process of operations failure handling and mitigation.




















  • Low water marks as indirect trouble indicators.





  • Low water marks as indirect trouble indicators.





  • Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  • Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  • Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  • Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  • Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  • Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.


















  • Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  • Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  • Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  • Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  • Simple tips and tricks that can help in fixing things when they break.
  • This shouldn’t be considered optional.
  • This shouldn’t be considered optional.




















  • Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.
  • Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.
  • Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.
  • Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.












  • Operational Efficiency Hacks Web20 Expo2009

    1. 1. Operational Efficiency Hacks John Allspaw Operations Engineering, Flickr
    2. 2. who am I? Manage the Flickr Operations group Wrote a geeky book:
    3. 3. “Efficiencies”
    4. 4. “Efficiencies” Doing more with the robots you’ve got
    5. 5. “Efficiencies” Doing more with the robots you’ve got Doing more with the humans you’ve got
    6. 6. Some optimization “rules”
    7. 7. Some optimization “rules” - Don’t rely on being able to tweak anything.
    8. 8. Some optimization “rules” - Don’t rely on being able to tweak anything. - Don’t waste too much time tuning when you have no evidence it’ll matter.
    9. 9. Optimization “rules” performance tuning gains time spent tuning
    10. 10. Optimization “rules” “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” Knuth, (or Hoare)
    11. 11. however...
    12. 12. Optimization “rules”
    13. 13. Optimization “rules” That doesn’t give us an excuse to be lazy and inefficient.
    14. 14. Optimization “rules” That doesn’t give us an excuse to be lazy and inefficient.
    15. 15. Optimization “rules” That doesn’t give us an excuse to be lazy and inefficient. We lean on the experience of people in the community for evidence that tuning(s) might be a worthwhile thing to do.
    16. 16. Optimization “rules” “Yet we should not pass up our opportunities in that critical 3 percent.” Knuth, (or Hoare)
    17. 17. So... stop somewhere in here OMG obvious I'm wasting tuning !@#$ time wins for no reason
    18. 18. Our Context
    19. 19. Our Context - 24 TB of MySQL data
    20. 20. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes
    21. 21. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads
    22. 22. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads - 6 PB of photos
    23. 23. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads - 6 PB of photos - 10TB storage eaten per day
    24. 24. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads - 6 PB of photos - 10TB storage eaten per day - 15,362 service monitors (alerts)
    25. 25. Infrastructure Hacks
    26. 26. Infrastructure Hacks - Examples of what changing software can do (plain old-fashioned performance tuning)
    27. 27. Infrastructure Hacks - Examples of what changing software can do (plain old-fashioned performance tuning) - Examples of what changing hardware can do (yay for Mr. Moore!)
    28. 28. Leaning on compilers (synthetic PHP benchmarks, not real-world) (http://sebastian-bergmann.de/archives/634-PHP-GCC- ICC-Benchmark.html)
    29. 29. PHP (real-world) php 4.4.8 to php 5.2.8 migration
    30. 30. Can now handle more with less same taste, less filling
    31. 31. Image Processing
    32. 32. Image Processing - 2004, Flickr was using ImageMagick for image processing (version 6.1.9)
    33. 33. Image Processing - 2004, Flickr was using ImageMagick for image processing (version 6.1.9) - Changed to GraphicsMagick, about 15% faster at the time (version 1.1.5)
    34. 34. Image Processing - 2004, Flickr was using ImageMagick for image processing (version 6.1.9) - Changed to GraphicsMagick, about 15% faster at the time (version 1.1.5) - Only need a subset of ImageMagick features anyway for our purposes
    35. 35. Image Processing - OpenMP support (http://en.wikipedia.org/wiki/Openmp) - Allows parallelization of processing jobs, using multiple cores working on the same image - Some algorithms have more parallelization than others
    36. 36. Image Processing - Test script - 7 large-ish DSLR photos - Cascade resizing each to 6 smaller sizes, semi-typical for Flickr’s workload - Each resize processed serially
    37. 37. Image Processing compiler differences (GM version 1.1.14, non-OpenMP)
    38. 38. Image Processing OpenMP differences OpenMP advantage (gcc 4.1.2, on quad core Xeon L5335 @ 2.00GHz)
    39. 39. Image Processing CPU differences
    40. 40. Diagonal Scaling
    41. 41. Diagonal Scaling - Vertically scaling your already horizontally- scaled nodes
    42. 42. Diagonal Scaling - Vertically scaling your already horizontally- scaled nodes - a.k.a. “tech refresh”
    43. 43. Diagonal Scaling - Vertically scaling your already horizontally- scaled nodes - a.k.a. “tech refresh” - a.k.a. “Moore’s Law Surfing”
    44. 44. Diagonal Scaling
    45. 45. Diagonal Scaling 67 “old” webservers with 18 “new” : We replaced
    46. 46. Diagonal Scaling 67 “old” webservers with 18 “new” : We replaced CPUs RAM drives total power (W) servers @60% peak per server per server per server 67 2 8763.6 4GB 1x80GB 18 8 2332.8 4GB 1x146GB
    47. 47. Diagonal Scaling 67 “old” webservers with 18 “new” : We replaced CPUs RAM drives total power (W) servers @60% peak ~70% LESS power per server per server per server 67 2 8763.6 4GB 1x80GB 49U LESS rack space 18 8 2332.8 4GB 1x146GB
    48. 48. Diagonal Scaling
    49. 49. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced
    50. 50. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) @60% peak 23 1035 23 3008.4 8 1120 8 1036.8
    51. 51. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) @60% peak ~75% FASTER 23 1035 23 3008.4 15U LESS rack space 65% LESS power 8 8 1120 1036.8
    52. 52. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) ~75% FASTER @60% peak 15U LESS rack space 23 1035 23 3008.4 65% LESS power 8 8 1120 1036.8
    53. 53. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) ~75% FASTER @60% peak 15U LESS rack space 23 1035 23 3008.4 65% LESS power 8 8 1120 1036.8
    54. 54. Diagonal Scaling 23 “old” image processing boxes with 8 “new” We replaced server photos/min rack total power (W) ~75% FASTER @60% peak 15U LESS rack space 23 1035 23 3008.4 65% LESS power 8 8 1120 1036.8 from this to this
    55. 55. What do you do with old/slow machines?
    56. 56. What do you do with old/slow machines? - Liquidate
    57. 57. What do you do with old/slow machines? - Liquidate - Re-purpose as dev/staging/etc
    58. 58. What do you do with old/slow machines? - Liquidate - Re-purpose as dev/staging/etc - “offline” tasks
    59. 59. Offline Tasks
    60. 60. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks
    61. 61. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here:
    62. 62. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/
    63. 63. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/ - See Myles Grant talk about it more here:
    64. 64. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/ - See Myles Grant talk about it more here: http://en.oreilly.com/velocity2009/public/schedule/detail/7552
    65. 65. Runbook Hacks “WTF HAPPENED LAST NIGHT?!”
    66. 66. Why?
    67. 67. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand
    68. 68. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How:
    69. 69. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves
    70. 70. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves - teach machines to watch themselves
    71. 71. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves - teach machines to watch themselves - teach machines to fix themselves
    72. 72. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves - teach machines to watch themselves - teach machines to fix themselves - reduce MTTR by streamlining
    73. 73. Automated Infrastructure
    74. 74. Automated Infrastructure - If there is only one thing you do, automatic configuration and deployment management should be it.
    75. 75. Automated Infrastructure - If there is only one thing you do, automatic configuration and deployment management should be it. - See: - Opscode/Chef (http://opscode.com/) - Puppet (http://reductivelabs.com/products/puppet/) - System Imager/Configurator (http://wiki.systemimager.org)
    76. 76. Conguration Management Codeswarm
    77. 77. Time Machine time is cheaper than human time. If a failure results in some commands being run to ‘fix’ it, make the machines do it. (i.e., don’t wake people up for stupid things!)
    78. 78. Aggregate Monitoring
    79. 79. Aggregate Monitoring Don’t care about single nodes, only care about delta change of metrics/faults - Warn (email) on X % change - Page (wake up) on Y % change
    80. 80. Aggregate Monitoring Don’t care about single nodes, only care about delta change of metrics/faults - Warn (email) on X % change - Page (wake up) on Y % change High and low water marks for some metrics
    81. 81. Self-Healing
    82. 82. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it.
    83. 83. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it.
    84. 84. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did.
    85. 85. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did.
    86. 86. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did. Can greatly reduce your mean time to recovery (MTTR)
    87. 87. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did. Can greatly reduce your mean time to recovery (MTTR)
    88. 88. Basic Apache Example
    89. 89. Basic Apache Example 1. Webserver not running?
    90. 90. Basic Apache Example 1. Webserver not running? 2. Under certain conditions, try to start it, and email that this happened. (I’ll read it tomorrow)
    91. 91. Basic Apache Example 1. Webserver not running? 2. Under certain conditions, try to start it, and email that this happened. (I’ll read it tomorrow) 3. Won’t start? Assume something’s really wrong, so don’t keep trying (email that, too)
    92. 92. MySQL Self-Healing
    93. 93. MySQL Self-Healing Some MySQL Issues “fixed” by the machines
    94. 94. MySQL Self-Healing Some MySQL Issues “fixed” by the machines
    95. 95. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill)
    96. 96. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill) - Queries not safe to kill are marked by the application as “NO KILL” in comments
    97. 97. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill) - Queries not safe to kill are marked by the application as “NO KILL” in comments - Run EXPLAIN on killed queries, and report the results
    98. 98. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill) - Queries not safe to kill are marked by the application as “NO KILL” in comments - Run EXPLAIN on killed queries, and report the results - Keep track of the query types and databases that need the most killing, produce a “DBs that Suck” report
    99. 99. MySQL Self-Healing
    100. 100. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error
    101. 101. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error - Skip errors that can safely be skipped and restart slave threads
    102. 102. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error - Skip errors that can safely be skipped and restart slave threads - Force refetch of replication binlogs on: - 1064 (ER_PARSE_ERROR)
    103. 103. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error - Skip errors that can safely be skipped and restart slave threads - Force refetch of replication binlogs on: - 1064 (ER_PARSE_ERROR) - Re-run queries on: - 1205 (ER_LOCK_WAIT_TIMEOUT) - 1213 (ER_LOCK_DEADLOCK)
    104. 104. Troubleshooting
    105. 105. Code and Config Deploy Logs
    106. 106. Code and Config Deploy Logs 1. ESSENTIAL
    107. 107. Code and Config Deploy Logs 1. ESSENTIAL 2. MANDATORY
    108. 108. Communications
    109. 109. Communications • Internal IRC - For ongoing discussions - Logged, so “infinite” scrollback
    110. 110. Communications • Internal IRC - For ongoing discussions - Logged, so “infinite” scrollback • IM Bot (built on libyahoo2.sf.net) - For production changes - Broadcasts all to all contacts - Logged, and injected into IRC - IM Status = who is in primary/secondary on-call
    111. 111. Communications • Internal IRC - For ongoing discussions - Logged, so “infinite” scrollback • IM Bot (built on libyahoo2.sf.net) - For production changes - Broadcasts all to all contacts - Logged, and injected into IRC - IM Status = who is in primary/secondary on-call • All of IRC and IM Bot slurped into a search index
    112. 112. when
    113. 113. when what
    114. 114. when what detailed what*
    115. 115. when what detailed what* *also points to what commands should be used to back out the changes
    116. 116. when what detailed what* who *also points to what commands should be used to back out the changes
    117. 117. when what detailed what* who *also points to what commands should be used to back out the changes
    118. 118. when what detailed what* who time of last deploy at top of ganglia *also points to what commands should be used to back out the changes
    119. 119. IM Bot (timestamps help correlation)
    120. 120. IM Bot (timestamps help correlation)
    121. 121. IM Bot (timestamps help correlation) all IRC, IM bot into searchable history
    122. 122. Morals of Our Stories
    123. 123. Morals of Our Stories - Optimizations can be a Very Good Thing™
    124. 124. Morals of Our Stories - Optimizations can be a Very Good Thing™ - Weigh time spent optimizing against expected gains
    125. 125. Morals of Our Stories - Optimizations can be a Very Good Thing™ - Weigh time spent optimizing against expected gains - Lean on others for how much “expected gains” mean for different scenarios
    126. 126. Morals of Our Stories - Optimizations can be a Very Good Thing™ - Weigh time spent optimizing against expected gains - Lean on others for how much “expected gains” mean for different scenarios - Plain old-fashioned intuition
    127. 127. Some Wisdom Nuggets Jon Prall’s 85 WebOps Rules: http://jprall.vox.com/library/post/85- operations-rules-to-live-by.html
    128. 128. Questions? http://www.flickr.com/photos/ebarney/3348965637/ http://www.flickr.com/photos/dgmiller/1606071911/ http://www.flickr.com/photos/dannyboyster/60371673/ http://www.flickr.com/photos/bright/189338394/ http://www.flickr.com/photos/nickwheeleroz/2475011402/ http://www.flickr.com/photos/dramaqueennorma/191063346/ http://www.flickr.com/photos/telstar/2861103147/ http://www.flickr.com/photos/norby/2309046043/ http://www.flickr.com/photos/allysonk/201008992/

    ×