"Why Did We Think Large Scale Distributed Systems Would be Easy?" by Gordon Rowell, Site Reliability Manager, Google.
Presentation Overview: Google's Corporate Engineering SRE team provides infrastructure services used by many of Google's desktops, laptops and servers. This talk gives an overview of the design philosophy, challenges, technologies and some interesting failures seen while implementing infrastructure at scale.
Speaker Bio: Gordon Rowell is a site reliability manager at Google, Sydney. His team focuses on delivering services to Googlers around the world. They have migrated major internal services to run on Google technology and are currently focused on removing dependencies on the corporate network.
He enjoys the challenges of building robust systems that scale and has a particular passion for configuration management.
Prior to joining Google in 2006, he worked as an independent systems developer with a focus on telecommunications infrastructure. He also worked at e-smith/Mitel building an open source Internet small business server/gateway. He lives in Sydney, but used to live in Ottawa, where he ice-skated to work.
Gordon earned a Bachelor of Science with Honours in Computer Science from the University of NSW.
1. Why did we think large scale distributed systems would be easy?
Gordon Rowell
PuppetConf San Francisco 2013
gordonr@google.com
2. Background
Site Reliability Engineering runs many services
The same rules always apply:
● Make the service scale
● Make the deployment consistent
● Understand all layers of the system
● Monitor everything
● Plan for failure
● Break things, under controlled conditions
3. Scaling is fun
We don't deploy "a server"
• Servers break, power fails
• Clients/DNS need to be reconfigured
We don't deploy "a cluster"
• Networks break, servers break, power fails
• Clients/DNS need to be reconfigured
We deploy redundant clusters
• Attempt to send clients to nearest serving cluster
• Anycast allows for unified client configuration
4. But client DoS is not
Poorly written code...
● on small numbers of clients...
● is annoying
Poorly written code...
● on a huge number of clients...
● can cause serious infrastructure pain
Write good code and stage your releases
● Work with the service owners
● Stage rollouts, allow soak time
● Have a rollback plan for clients and test it
● Have DoS limits for services, test them
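Much of the "serious infrastructure pain" above comes from a fleet of clients retrying in lockstep after a failure. A minimal sketch of client-side retry with exponential backoff and jitter — the function name and parameters are illustrative, not from the talk:

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry `call`, sleeping base_delay * 2**attempt plus random jitter.

    The jitter spreads retries out in time, so a huge number of clients
    does not hammer the service in synchronized waves after an outage.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))
```

Capping the delay (`max_delay`) matters as much as the jitter: without it, long outages push clients into multi-hour silences instead of a steady, manageable trickle of retries.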
5. Load balancing is fun
Do you have enough capacity?
• How many backends do you need?
• What happens if half of your backends lose power?
• What about when half are already out for repairs?
How do you send clients to the right cluster?
• Client configuration
• DNS round-robin (simple global load balancing)
• DNS views (give best answer for client IP)
• Anycast (portable IP, routed to "nearest" cluster)
• Consider: DNS views plus Anycast
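The DNS round-robin option above can be modeled as a rotating answer list: each query returns the full set of cluster addresses, but starting at a different one, so successive clients land on different clusters. A toy sketch (addresses are made up):

```python
from collections import deque


class RoundRobinResolver:
    """Toy model of DNS round-robin load balancing.

    Each query rotates the answer list, so successive clients that
    take the first answer are spread across the clusters.
    """

    def __init__(self, addresses):
        self._addrs = deque(addresses)

    def resolve(self):
        answer = list(self._addrs)
        self._addrs.rotate(-1)  # next query starts with the next address
        return answer
```

This is "simple" global load balancing precisely because it is blind: it knows nothing about client location, cluster health, or capacity, which is why the slide pairs it with DNS views and Anycast.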
6. But global outages are not
Monitor everything
● Health check failures bring down your service
● ...by design
Test everything
● You should expect (and test) data center outages
● A global outage can ruin your day
● Cascading failures are unpleasant
Learn from outages
● Write postmortems
● Focus on the facts!
● What went wrong and what can be better?
● A postmortem is not about blame
7. Thundering herds are not
For Puppet
• "Lots" of Mac desktops and laptops
• "Lots" of Ubuntu desktops, laptops and servers
• "Some" others
What if they all want to do a puppet run?
• What about every hour?
• What about every five minutes?
Randomize your cron jobs! (and test it)
How can you shed load on the server?
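One way to randomize the cron jobs is a deterministic per-host splay: derive a stable offset from the hostname, so every host waits a different amount before its run, but the same amount every time. (Puppet's own `splay` setting does something similar; this standalone sketch uses names of my own choosing.)

```python
import hashlib


def splay_seconds(hostname, interval=3600):
    """Stable pseudo-random offset in [0, interval) for this host.

    Hashing the hostname gives a uniform spread across the fleet,
    while staying deterministic: the same host always gets the same
    offset, which makes runs predictable and easy to debug.
    """
    digest = hashlib.sha256(hostname.encode()).digest()
    return int.from_bytes(digest[:8], "big") % interval
```

A cron wrapper can then `sleep` for this many seconds before invoking the agent, turning an on-the-hour thundering herd into a flat hour-long trickle.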
8. Anycast is fun
Anycast is "coarse-grain" load balancing
• Routes traffic to the “nearest”, “serving” cluster
Networks break
• Physical issues
• Routing issues
• Configuration issues
• Load balancer bugs
Anycast monitoring is hard
10. Anycast directed to one site is not fun
All clients could be sent to the same cluster
● Be ready for that
● Can a single cluster handle worldwide traffic?
● What do you do if it can't?
Have a mitigation strategy to shed load
● Include load calculations early in health checks
● Consider DNS views to redirect some traffic
● Drop traffic if you have to
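"Include load calculations early in health checks" can be sketched as a health check that reports unhealthy before the backend saturates, so the load balancer drains traffic away preemptively. The threshold and names below are illustrative assumptions, not the talk's actual values:

```python
def health_status(current_qps, capacity_qps, shed_fraction=0.8):
    """Report unhealthy once load passes a fraction of capacity.

    Failing the health check early lets the load balancer steer
    traffic elsewhere before this backend actually falls over --
    the check "brings down" the backend by design.
    """
    if capacity_qps <= 0:
        return "unhealthy"  # no known capacity: don't accept traffic
    utilization = current_qps / capacity_qps
    return "healthy" if utilization < shed_fraction else "unhealthy"
```

The trade-off is cascading failure: if every cluster sheds at once, the remaining ones see more load and shed too, which is why the slide also lists DNS views and outright dropping traffic as backstops.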
11. Diversity is good...for people
Be ruthless against platform diversity
If you can’t automate it, don’t do it
● “Could we bring up another 50 today, please?”
● “That backend was just a little different and...oops”
Anycast helps you be consistent
● Traffic could go anywhere
Every OS upgrade is a time to refactor and clean