
Operational Efficiency Hacks Web20 Expo2009

Operational Efficiency Hacks
John Allspaw
Operations Engineering, Flickr
who am I?
Manage the Flickr Operations group
Wrote a geeky book:
“Efficiencies”
   Doing more with the robots you’ve got
   Doing more with the humans you’ve got
Some optimization “rules”

Farm Stop by Argus Final Proposal Presentation
 
Q1 Contemporary Art Forms Filipino Artists’ Roles and their Contribution to C...
Q1 Contemporary Art Forms Filipino Artists’ Roles and their Contribution to C...Q1 Contemporary Art Forms Filipino Artists’ Roles and their Contribution to C...
Q1 Contemporary Art Forms Filipino Artists’ Roles and their Contribution to C...
 
After the Storm by Vaasudev Tallapragada
After the Storm by Vaasudev TallapragadaAfter the Storm by Vaasudev Tallapragada
After the Storm by Vaasudev Tallapragada
 
The PCC Newsletter January February 2024
The PCC Newsletter January February 2024The PCC Newsletter January February 2024
The PCC Newsletter January February 2024
 
Photography of Storytelling (A case study of Nigerian Advertising)
Photography of Storytelling (A case study of Nigerian Advertising)Photography of Storytelling (A case study of Nigerian Advertising)
Photography of Storytelling (A case study of Nigerian Advertising)
 
Ready check go stroyboard for Network Rail
Ready check go stroyboard for Network RailReady check go stroyboard for Network Rail
Ready check go stroyboard for Network Rail
 
Ms.Shala Design Proposal UA-TAB 1022024.pdf
Ms.Shala Design Proposal   UA-TAB 1022024.pdfMs.Shala Design Proposal   UA-TAB 1022024.pdf
Ms.Shala Design Proposal UA-TAB 1022024.pdf
 
Allstar casino game online. betting and slot
Allstar casino game online. betting and slotAllstar casino game online. betting and slot
Allstar casino game online. betting and slot
 
Gartstien Temperment Lab images for the slideshow
Gartstien Temperment Lab images for the slideshowGartstien Temperment Lab images for the slideshow
Gartstien Temperment Lab images for the slideshow
 
Ms.Shala Design Proposal UA-TAB 1022024.pdf
Ms.Shala Design Proposal   UA-TAB 1022024.pdfMs.Shala Design Proposal   UA-TAB 1022024.pdf
Ms.Shala Design Proposal UA-TAB 1022024.pdf
 
Choosing The Right Size Wall Art For Your Bedroom A Comprehensive Guide
Choosing The Right Size Wall Art For Your Bedroom A Comprehensive GuideChoosing The Right Size Wall Art For Your Bedroom A Comprehensive Guide
Choosing The Right Size Wall Art For Your Bedroom A Comprehensive Guide
 
Aardman Academy Storyboard - Wallace and Gromit in Dream Home
Aardman Academy Storyboard - Wallace and Gromit in Dream HomeAardman Academy Storyboard - Wallace and Gromit in Dream Home
Aardman Academy Storyboard - Wallace and Gromit in Dream Home
 
aesthetic aksdhakdhajkhdakshdahakhdjshdakhd
aesthetic aksdhakdhajkhdakshdahakhdjshdakhdaesthetic aksdhakdhajkhdakshdahakhdjshdakhd
aesthetic aksdhakdhajkhdakshdahakhdjshdakhd
 
“After the Storm” by Vaasudev Tallapragada
“After the Storm” by Vaasudev Tallapragada“After the Storm” by Vaasudev Tallapragada
“After the Storm” by Vaasudev Tallapragada
 

Operational Efficiency Hacks Web20 Expo2009

  • 1. Operational Efficiency Hacks John Allspaw Operations Engineering, Flickr
  • 2. who am I? Manage the Flickr Operations group Wrote a geeky book:
  • 5. “Efficiencies” Doing more with the robots you’ve got Doing more with the humans you’ve got
  • 8. Some optimization “rules” - Don’t rely on being able to tweak anything. - Don’t waste too much time tuning when you have no evidence it’ll matter.
  • 10. Optimization “rules” “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” Knuth, (or Hoare)
  • 15. Optimization “rules” That doesn’t give us an excuse to be lazy and inefficient. We lean on the experience of people in the community for evidence that tuning(s) might be a worthwhile thing to do.
  • 16. Optimization “rules” “Yet we should not pass up our opportunities in that critical 3 percent.” Knuth, (or Hoare)
  • 17. So... stop somewhere in here [diagram: a spectrum running from “obvious tuning wins” to “OMG I'm wasting !@#$ time for no reason”]
  • 24. Our Context - 24 TB of MySQL data - 32k/sec of MySQL writes - 120k/sec of MySQL reads - 6 PB of photos - 10TB storage eaten per day - 15,362 service monitors (alerts)
  • 27. Infrastructure Hacks - Examples of what changing software can do (plain old-fashioned performance tuning) - Examples of what changing hardware can do (yay for Mr. Moore!)
  • 28. Leaning on compilers (synthetic PHP benchmarks, not real-world) (http://sebastian-bergmann.de/archives/634-PHP-GCC-ICC-Benchmark.html)
  • 29. PHP (real-world) php 4.4.8 to php 5.2.8 migration
  • 30. Can now handle more with less same taste, less filling
  • 34. Image Processing - 2004, Flickr was using ImageMagick for image processing (version 6.1.9) - Changed to GraphicsMagick, about 15% faster at the time (version 1.1.5) - Only need a subset of ImageMagick features anyway for our purposes
  • 35. Image Processing - OpenMP support (http://en.wikipedia.org/wiki/Openmp) - Allows parallelization of processing jobs, using multiple cores working on the same image - Some algorithms have more parallelization than others
  • 36. Image Processing - Test script - 7 large-ish DSLR photos - Cascade resizing each to 6 smaller sizes, semi-typical for Flickr’s workload - Each resize processed serially
  • 37. Image Processing compiler differences (GM version 1.1.14, non-OpenMP)
  • 38. Image Processing OpenMP differences OpenMP advantage (gcc 4.1.2, on quad core Xeon L5335 @ 2.00GHz)
  • 39. Image Processing CPU differences
  • 43. Diagonal Scaling - Vertically scaling your already horizontally-scaled nodes - a.k.a. “tech refresh” - a.k.a. “Moore’s Law Surfing”
  • 47. Diagonal Scaling We replaced 67 “old” webservers with 18 “new”:
        servers | CPUs per server | RAM per server | drives per server | total power (W) @60% peak
        67      | 2               | 4GB            | 1x80GB            | 8763.6
        18      | 8               | 4GB            | 1x146GB           | 2332.8
        ~70% LESS power, 49U LESS rack space
  • 54. Diagonal Scaling We replaced 23 “old” image processing boxes with 8 “new”:
        servers | photos/min | rack (U) | total power (W) @60% peak
        23      | 1035       | 23       | 3008.4
        8       | 1120       | 8        | 1036.8
        ~75% FASTER, 15U LESS rack space, 65% LESS power (photos: from this to this)
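As a quick check on the power and rack-space figures in the two Diagonal Scaling slides above, the arithmetic can be sketched in a few lines of Python. The `Fleet` class and `savings` function are illustrative names, and the one-rack-unit-per-server figure is an assumption inferred from the "49U" and "15U" numbers; the slides do not state it.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    servers: int
    total_watts: float              # total power at 60% of peak, as quoted on the slides
    rack_units_per_server: int = 1  # assumption: every box is 1U


def savings(old: Fleet, new: Fleet) -> dict:
    """Compare two fleets: percentage of power saved and rack units freed."""
    old_units = old.servers * old.rack_units_per_server
    new_units = new.servers * new.rack_units_per_server
    return {
        "power_saved_pct": round(100 * (1 - new.total_watts / old.total_watts), 1),
        "rack_units_saved": old_units - new_units,
    }


# Figures taken from the slides.
web = savings(Fleet(67, 8763.6), Fleet(18, 2332.8))
img = savings(Fleet(23, 3008.4), Fleet(8, 1036.8))
print("webservers:", web)
print("image processing:", img)
```

The results should land close to the rounded "~70%" and "65%" power figures and match the rack-space numbers exactly, provided the 1U assumption holds.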
  • 58. What do you do with old/slow machines? - Liquidate - Re-purpose as dev/staging/etc - “offline” tasks
  • 64. Offline Tasks - Out-of-band/asynchronous queuing and execution system, for non-realtime tasks - See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/ - See Myles Grant talk about it more here: http://en.oreilly.com/velocity2009/public/schedule/detail/7552
  • 65. Runbook Hacks “WTF HAPPENED LAST NIGHT?!”
  • 72. Why? As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand Some of the How: - teach machines to build themselves - teach machines to watch themselves - teach machines to fix themselves - reduce MTTR by streamlining
  • 75. Automated Infrastructure - If there is only one thing you do, automatic configuration and deployment management should be it. - See: - Opscode/Chef (http://opscode.com/) - Puppet (http://reductivelabs.com/products/puppet/) - System Imager/Configurator (http://wiki.systemimager.org)
  • 77. Time Machine time is cheaper than human time. If a failure results in some commands being run to ‘fix’ it, make the machines do it. (i.e., don’t wake people up for stupid things!)
  • 80. Aggregate Monitoring Don’t care about single nodes, only care about delta change of metrics/faults - Warn (email) on X % change - Page (wake up) on Y % change High and low water marks for some metrics
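The warn/page rule on this slide can be sketched as a small function over an aggregate metric. This is only an illustration: the slide leaves X and Y unspecified, so the `warn_pct=10` and `page_pct=25` defaults are made-up values, and the function names are hypothetical.

```python
def percent_change(previous: float, current: float) -> float:
    """Absolute change between two aggregate readings, as a percentage."""
    if previous == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - previous) / abs(previous) * 100.0


def alert_level(previous: float, current: float,
                warn_pct: float = 10.0, page_pct: float = 25.0) -> str:
    """Return 'ok', 'warn' (email) or 'page' (wake someone up).

    warn_pct and page_pct stand in for the X and Y on the slide;
    the defaults are placeholders, not Flickr's thresholds.
    """
    change = percent_change(previous, current)
    if change >= page_pct:
        return "page"
    if change >= warn_pct:
        return "warn"
    return "ok"


def outside_water_marks(value: float, low: float, high: float) -> bool:
    """True when a metric falls below its low mark or rises above its high mark."""
    return value < low or value > high


# Example: requests/sec summed across the whole web tier, not per node.
print(alert_level(previous=1000, current=1050))
print(alert_level(previous=1000, current=880))
print(alert_level(previous=1000, current=700))
```

Comparing aggregates rather than single nodes means one dead box in a large pool barely moves the number, while a pool-wide problem moves it a lot.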
  • 87. Self-Healing Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did. Can greatly reduce your mean time to recovery (MTTR)
  • 91. Basic Apache Example 1. Webserver not running? 2. Under certain conditions, try to start it, and email that this happened. (I’ll read it tomorrow) 3. Won’t start? Assume something’s really wrong, so don’t keep trying (email that, too)
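The three steps above could look roughly like the sketch below. It is not Flickr's script: `heal_webserver` and its callable parameters are hypothetical, chosen so the logic can run without a real Apache. In practice `is_running` and `start` would wrap whatever process check and init command the host uses, and the "gave up" flag would be persisted somewhere such as a marker file.

```python
from typing import Callable


def heal_webserver(is_running: Callable[[], bool],
                   start: Callable[[], None],
                   notify: Callable[[str], None],
                   already_gave_up: bool) -> bool:
    """Run one check/repair pass. Returns the new value of the 'gave up' flag."""
    # 1. Webserver not running?
    if is_running():
        return False  # healthy, clear any earlier give-up state

    # 3. A previous start attempt failed: don't keep trying.
    if already_gave_up:
        return True

    # 2. Try to start it once, and email that this happened.
    start()
    if is_running():
        notify("webserver was down; restarted it automatically")
        return False

    notify("webserver would not start; not retrying until a human looks")
    return True


# Tiny simulation with a fake service that starts successfully.
state = {"up": False}
messages = []
gave_up = heal_webserver(lambda: state["up"],
                         lambda: state.update(up=True),
                         messages.append,
                         already_gave_up=False)
print(gave_up, messages)
```

Run from cron or a monitoring plugin, each pass feeds the returned flag into the next one, so a box that refuses to start produces one email and not an endless restart loop.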
  • 98. MySQL Self-Healing Some MySQL Issues “fixed” by the machines - Kill long-running SELECT queries (marked safe to kill) - Queries not safe to kill are marked by the application as “NO KILL” in comments - Run EXPLAIN on killed queries, and report the results - Keep track of the query types and databases that need the most killing, produce a “DBs that Suck” report
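A minimal sketch of the selection logic described above, assuming rows shaped like the output of MySQL's `SHOW FULL PROCESSLIST` (columns `Id`, `Command`, `Time`, `Info`). The 60-second threshold and the function names are assumptions; the slides give neither. The sketch only builds the statements and leaves executing them to the caller.

```python
import re

# Strips /* ... */ comments so a leading application comment doesn't hide the SELECT.
COMMENT = re.compile(r"/\*.*?\*/", re.S)


def killable(row: dict, max_seconds: int = 60) -> bool:
    """True for a long-running SELECT that the application has not marked NO KILL."""
    info = row.get("Info") or ""
    if row.get("Command") != "Query" or (row.get("Time") or 0) <= max_seconds:
        return False
    if "NO KILL" in info:
        return False
    return COMMENT.sub("", info).lstrip().upper().startswith("SELECT")


def plan_kills(processlist: list, max_seconds: int = 60) -> list:
    """Return (id, kill statement, explain statement) for each query to kill."""
    return [
        (row["Id"], f"KILL QUERY {row['Id']}", f"EXPLAIN {row['Info']}")
        for row in processlist
        if killable(row, max_seconds)
    ]


rows = [
    {"Id": 1, "Command": "Query", "Time": 300, "Info": "SELECT * FROM photos"},
    {"Id": 2, "Command": "Query", "Time": 300, "Info": "/* NO KILL */ SELECT * FROM photos"},
    {"Id": 3, "Command": "Query", "Time": 5, "Info": "SELECT 1"},
    {"Id": 4, "Command": "Query", "Time": 300, "Info": "UPDATE photos SET x = 1"},
    {"Id": 5, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 6, "Command": "Query", "Time": 300, "Info": "/* page:home */ select id from users"},
]
plan = plan_kills(rows)
for query_id, kill_sql, explain_sql in plan:
    print(query_id, kill_sql, "|", explain_sql)
```

Counting the planned kills per database and per query shape over time would give the raw data for the “DBs that Suck” report.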
  • 103. MySQL Self-Healing Some MySQL Replication issues “fixed” by the machines, by error - Skip errors that can safely be skipped and restart slave threads - Force refetch of replication binlogs on: - 1064 (ER_PARSE_ERROR) - Re-run queries on: - 1205 (ER_LOCK_WAIT_TIMEOUT) - 1213 (ER_LOCK_DEADLOCK)
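One way to express this "by error" handling is a lookup table from the replication error number to an action. The three codes named on the slide (1064, 1205, 1213) come from the talk. The "safe to skip" codes below (1050, 1061, 1091) are my assumption, picked because they are the standard MySQL codes for table-exists, duplicate-key-name and can't-drop errors; the default of handing unknown errors to a human is also an assumption.

```python
from enum import Enum


class Action(Enum):
    SKIP_AND_RESTART = "skip the error and restart slave threads"
    REFETCH_BINLOGS = "force refetch of replication binlogs"
    RERUN_QUERY = "re-run the query"
    NOTIFY_HUMAN = "leave replication stopped and notify a human"


POLICY = {
    # From the slide:
    1064: Action.REFETCH_BINLOGS,   # ER_PARSE_ERROR
    1205: Action.RERUN_QUERY,       # ER_LOCK_WAIT_TIMEOUT
    1213: Action.RERUN_QUERY,       # ER_LOCK_DEADLOCK
    # Assumed examples of skippable errors (not listed by number in the talk):
    1050: Action.SKIP_AND_RESTART,  # ER_TABLE_EXISTS_ERROR
    1061: Action.SKIP_AND_RESTART,  # ER_DUP_KEYNAME
    1091: Action.SKIP_AND_RESTART,  # ER_CANT_DROP_FIELD_OR_KEY
}


def action_for(errno: int) -> Action:
    """Pick the automatic response for a replication error number."""
    return POLICY.get(errno, Action.NOTIFY_HUMAN)


for code in (1064, 1213, 1061, 1146):
    print(code, "->", action_for(code).value)
```

Keeping the policy as data makes it easy to review which errors the machines may fix on their own and which ones still wake someone up.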
  • 107. Code and Config Deploy Logs 1. ESSENTIAL 2. MANDATORY
  • 111. Communications • Internal IRC - For ongoing discussions - Logged, so “infinite” scrollback • IM Bot (built on libyahoo2.sf.net) - For production changes - Broadcasts all to all contacts - Logged, and injected into IRC - IM Status = who is in primary/secondary on-call • All of IRC and IM Bot slurped into a search index
  • 119. [annotated deploy log screenshot] Each entry shows: when, what, detailed what*, who. Time of last deploy at top of ganglia. *also points to what commands should be used to back out the changes
  • 124. IM Bot (timestamps help correlation) all IRC, IM bot into searchable history
  • 129. Morals of Our Stories - Optimizations can be a Very Good Thing™ - Weigh time spent optimizing against expected gains - Lean on others for how much “expected gains” mean for different scenarios - Plain old-fashioned intuition
  • 130. Some Wisdom Nuggets Jon Prall’s 85 WebOps Rules: http://jprall.vox.com/library/post/85-operations-rules-to-live-by.html

Editor's Notes

  1. Finding huge gains from tweaking gets harder as you turn the knobs and pull the levers.
  2. He had some evidence that suggested compilers and versions would make a difference for PHP.
  3. So, we did it. (not just for performance reasons, but still...) Will this happen to you, too? Maybe. Maybe not.
  4. Performance gains like this don’t come very often, for “free”.
  5. We don’t need 80% of what ImageMagick has.
  6. Cascading resizing: Original -> Large, Large -> Medium, Large -> Small, Medium -> Thumb, Medium -> Square
  7. This was done before OpenMP support was in. Compilers and optimization flags can make a difference!
  8. Enter OpenMP. Example of how
  9. These have been the revisions of our image processing hardware over time. 4x faster
  10. Examples are contact notifications, large photoset deletions, etc.
  11. Runbook hacks: tuning the process of operations failure handling and mitigation.
  12. Low water marks as indirect trouble indicators.
  13. Can be as simple as cron jobs, or nagios plugins. NRPE or NSCA.
  14. Skippable errors: “can’t drop”, “already exists”, “duplicate key name”
  15. Simple tips and tricks that can help in fixing things when they break.
  16. This shouldn’t be considered optional.
  17. Our IRC and IM logs get injected into a search engine, almost exactly like Lucene.