
Enterprise Drupal Application & Hosting Infrastructure Level Monitoring

This talk shares the story of how SiteGround created an enterprise monitoring system for its Drupal VIP clients. As the person behind this SiteGround project, I'll cover the following topics in detail:

1. What is an enterprise level monitoring system for Drupal sites and the underlying hosting infrastructure.
2. Why big enterprise Drupal sites need such a system and what is the business value for the customer.
3. What is the best way to technically implement a system which monitors and solves issues with sites that are extremely complicated.
4. Why a migration from reactive monitoring to SRE best practices is the only option for such sites.

At the end of the talk people will know the following:

- Why big enterprise Drupal sites need custom monitoring.
- Why traditional monitoring is not suitable for sites that use the latest technologies: Elasticsearch, Solr, Nginx, Redis, Docker, LXC.
- The concepts of proactive system/site management: what site reliability engineers do, how a big part of this work has been automated at SiteGround, and why that is very important.

  1. Enterprise Drupal Application & Hosting Infrastructure Level Monitoring. Daniel Kanchev, Senior Site Reliability Engineer, @dvkanchev
  3. Enterprise Drupal Hosting Characteristics
     ○ Consists of multiple servers
     ○ Provides high availability
     ○ Offers auto scalability
     ○ Requires multiple services to work as expected
     ○ Really expensive
     ○ Nobody wants to manage this sh*t :)
  6. Hosting Types Complexity
     ○ Shared Hosting Service
     ○ Single Virtual Server
     ○ Single Dedicated Server
     ○ PaaS
     ○ Custom Private/Public Clouds
  7. ○ Elasticsearch/Solr
     ○ Redis/Memcached
     ○ GraphQL
     ○ MongoDB
     ○ Node.js
     ○ Gearman
     ○ CI systems
  8. One Monitoring To Rule Them All
     • Website Monitoring
     • Hosting Infrastructure Monitoring
  10. Website Monitoring Architecture (diagram: the website monitored from London, Amsterdam and Munich; the second build of the slide adds the labels "503" and "ISE")
  11. Incidents
      ○ Critical Incident - website is down from all locations
      ○ Major Incident - website is down from a single location; MySQL replication is broken; PHP fatal errors recorded in the logs; read-only file system issue
      ○ Minor Incident - Memcached/Redis on a single server is down
      ○ Notice Incident - web node X is running out of space; PHP warnings recorded in the logs
  12. Core Principles
      ○ Log all events and archive them. Write postmortem reports
      ○ Check every single incident, even minor ones and notices
      ○ Define performance limits and regularly check reports
      ○ Beware of cascading failures
      ○ Always strive to return to the pre-incident state
      ○ Check one thing at a time and return “OK” or “Failure”
  13. Examples
      ○ 1 of 5 app servers goes down
      ○ Load on the other 4 increases by 20%
      ○ Redis caches are invalidated - overload
      ○ Varnish is restarted by a system administrator to apply a configuration change
      ○ App servers start to return 503 errors
      ○ MySQL master goes down
      ○ MySQL slave 1 takes over and at this moment there is no downtime
      ○ MySQL slave 2 is behind the new master
      ○ The new MySQL master goes down too; the result is a broken or outdated DB
  14. KEY TAKEAWAYS
      1. Embrace Failure and Design for Failure
      2. Automate Recovery
      3. Log all incidents and analyse them
      4. Measure and graph the performance of all components
      5. Regularly break things on purpose in order to test
  15. RESOURCES
      ○ Injecting Failure at Netflix - goo.gl/YE1sEY
      ○ What is SRE - goo.gl/2lI8E0
      ○ SRE book - goo.gl/bfL2At
      ○ Netflix Open Source Software - https://netflix.github.io/
      ○ Etsy “Measure Everything” - goo.gl/CPVUT5
  16. JOIN US FOR CONTRIBUTION SPRINTS
      ○ First Time Sprinter Workshop - 9:00-12:00 - Room Wicklow2A
      ○ Mentored Core Sprint - 9:00-18:00 - Wicklow Hall 2B
      ○ General Sprints - 9:00-18:00 - Wicklow Hall 2A
  17. THANK YOU! WHAT DID YOU THINK? Evaluate This Session: events.drupal.org/dublin2016/schedule
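
The availability rules on slide 11 (down from all locations is critical, down from a single location is major) could be sketched roughly as below. This is an illustrative Python sketch, not SiteGround's implementation: the `Severity` enum and the `classify_availability` function are hypothetical names, and treating "down from several but not all locations" as major is an assumption the slides do not state.

```python
from __future__ import annotations

from enum import Enum


class Severity(Enum):
    OK = "ok"
    MAJOR = "major"
    CRITICAL = "critical"


def classify_availability(results: dict[str, bool]) -> Severity:
    """Map per-location probe results (True = site reachable) to a severity."""
    if not results:
        raise ValueError("at least one probe location is required")
    failed = [location for location, is_up in results.items() if not is_up]
    if not failed:
        return Severity.OK
    if len(failed) == len(results):
        # Down from every probe location: critical incident.
        return Severity.CRITICAL
    # Down from some, but not all, locations (assumption: treated as major).
    return Severity.MAJOR
```

For example, `classify_availability({"London": False, "Amsterdam": True, "Munich": True})` yields `Severity.MAJOR`, while three failed probes yield `Severity.CRITICAL`.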

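
The principle "check one thing at a time and return OK or Failure" from slide 12 might look like the following sketch, using the "running out of space" notice from slide 11 as the single thing being checked. The function name `check_disk_space` and the 10% default threshold are assumptions chosen for illustration; `shutil.disk_usage` is the standard-library call.

```python
import shutil


def check_disk_space(path: str = ".", min_free_ratio: float = 0.10) -> str:
    """Check exactly one thing: is enough of the filesystem at `path` free?"""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    # The check reports a plain verdict and leaves alerting to the caller.
    return "OK" if free_ratio >= min_free_ratio else "Failure"
```

Keeping each check this narrow means a "Failure" points at one cause, which makes it easier to attach an automated recovery step to it.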
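
The first example on slide 13 can be worked through numerically. The sketch below assumes traffic is balanced evenly across app servers, which the slides do not state. Under that assumption the slide's 20% figure matches the failed server's share of total traffic, while the relative increase seen by each of the four surviving servers is 25%. The function name is hypothetical.

```python
def load_increase_per_survivor(total_servers: int, failed_servers: int = 1) -> float:
    """Relative load increase on each surviving server, assuming even balancing."""
    if total_servers <= 0 or not 0 <= failed_servers < total_servers:
        raise ValueError("need at least one surviving server")
    survivors = total_servers - failed_servers
    # Each survivor goes from 1/total to 1/survivors of the overall traffic.
    return total_servers / survivors - 1.0


print(load_increase_per_survivor(5, 1))  # 0.25
```

The increase grows quickly as more servers fail (two of five down gives roughly 67% more load per survivor), which is one way a single failure turns into the cascade the slide warns about.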