Web 2.0 Performance and Reliability: How to Run Large Web Apps

Speaker: Artur Bergman

Published in: Business, Technology

  1. 1. Artur Bergman sky@crucially.net • Wikia Inc – We are hiring – Community/Bizdev in Germany – Engineers in Poland – http://www.wikia.com/wiki/hiring • O’Reilly Radar – http://radar.oreilly.com/artur/
  2. 2. The value of operations • Google • Orkut • Friendster • Myspace
  3. 3. Benefits • Users trust your brand • They rely on you • They spend more time on your site • Bad operations wastes R&D money • Fixed amount of time + faster site = more page views
  4. 4. Stepchild of Engineering • Product development • Engineering • Operations – Sysadmins? • Why?
  5. 5. Operations Engineering • It is engineering • Google terminology – Site Reliability Engineer • Sure there are sysadmins too, people managing NOCs and datacenters • Provide career growth
  6. 6. Good Engineers • Detail Oriented • Aspire to be operational engineers • Stubborn • Can steer their inner ADD – Interrupt driven • Not the same as good developers
  7. 7. Danger signs • Thinks operations is a path to development engineering – Fire them • Want people dedicated to the task • A good operations engineer should spend some time in development • A good development engineer MUST spend some time in operations
  8. 8. Debugging • 9 Rules of debugging • http://www.debuggingrules.com/Poster_download.html – Yes the font is horrible
  9. 9. Rule 1: Understand the system • Complexity Kills • No excuse • If you write it, you must know it • If you run it, you must know it • If you buy it, you must know it
  10. 10. Rule 3: Quit thinking and look • “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
  11. 11. Rule 3: Quit thinking and look • What do you look at? • The importance of monitoring • Monitoring • Monitoring • Monitoring
  12. 12. My my, confusing term • Monitoring • Alerting • Trending
  13. 13. Monitoring • Collects data • Puts into databases • Makes it available for you • Active collection • Passive interaction
  14. 14. Alerting • Acts on monitoring data • Severe alerts – Active – Needs action • Passive alerts – Things that need to be done but not right now • DO NOT OVER ALERT • DO NOT CRY WOLF
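
One way to read "do not cry wolf" is to page only after several failed checks in a row, and to page once per incident rather than on every failed check. The sketch below illustrates that idea only; the `AlertState` class, the threshold of three, and the return labels are assumptions made for illustration, not something the slides describe.

```python
from dataclasses import dataclass


@dataclass
class AlertState:
    """Tracks consecutive failed checks for one monitored service."""

    threshold: int = 3  # assumed: failed checks in a row before paging
    consecutive_failures: int = 0
    paged: bool = False

    def record(self, check_ok: bool) -> str:
        """Return 'page', 'recovered', or 'quiet' for one check result."""
        if check_ok:
            was_paged = self.paged
            self.consecutive_failures = 0
            self.paged = False
            return "recovered" if was_paged else "quiet"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.paged:
            self.paged = True  # page once per incident, not per failed check
            return "page"
        return "quiet"
```

A single blip followed by a good check stays quiet; only a sustained failure produces one page, followed by one recovery notice.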
  15. 15. Wikia alerting strategy • When the site is slow • Or down • We send emails and do phone calls • Europe and US West coast • Looking to hire in East Asia • No night time
  16. 16. Trending • Long term • Capacity planning
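
As a rough illustration of trending for capacity planning, the sketch below fits a straight line to usage samples and estimates how long until a capacity limit is reached. The function name, the choice of a linear fit, and the sample units are assumptions; real growth is often not linear.

```python
def days_until_full(samples, capacity):
    """Estimate days after the last sample until usage reaches capacity.

    samples: list of (day, used) pairs, for example daily disk usage.
    Returns None when the fitted trend is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx  # least-squares growth per day
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    last_day = max(x for x, _ in samples)
    return day_full - last_day
```

Feeding it the same series the graphs are drawn from turns a long-term graph into a date by which more capacity has to be ordered.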
  17. 17. Monitor Tools • Nagios • Cacti • MRTG • Hyperic • Cricket • Ganglia
  18. 18. External Monitoring • Use one, tells you what your clients see every x minutes • Keynote • Gomez • Websitepulse (cheap - easy - I like them; no annoying salesforce)
  19. 19. Nagios • Alerting • Hassle • C CGI?? • Doesn’t scale
  20. 20. Hyperic • Most exciting open source tool • Agent based - self configured • Baseline alerting
  21. 21. Cricket MRTG Cacti • Impossible to configure • You need to write tools to do it • Especially Cacti – Somewhat more pleasant than clawing out your eyes
  22. 22. Ganglia • We love ganglia • Automatically graphs everything you want - just works • Large scale clusters • Multicast • Zero config • RRD
  23. 23. http://ganglia.wikimedia.org/ • 270 hosts • 880 CPU • 2 clusters • 1.2 TB of Memory
  24. 24. http://ganglia.wikimedia.org
  25. 25. Custom Ganglia Gmetrics • Write your own gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo ' show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
  26. 26. Custom Ganglia Gmetrics • Or Learn Unix gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo ' show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
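
The same metric can be computed outside the shell. The sketch below parses tab-separated `SHOW PROCESSLIST` output and builds the `gmetric` argument list using the flags from the slide. Two points are assumptions rather than the slide's behaviour: the helper names are hypothetical, and the code takes the maximum `Time` over the remaining rows, whereas the slide's pipeline takes the first remaining row.

```python
def oldest_query_seconds(processlist_tsv: str) -> int:
    """Return the longest running time among active, non-replication threads.

    processlist_tsv: tab-separated SHOW PROCESSLIST output with a header row
    (columns: Id, User, Host, db, Command, Time, State, Info).
    """
    oldest = 0
    for line in processlist_tsv.strip().splitlines()[1:]:  # skip the header
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        user, command, seconds = fields[1], fields[4], fields[5]
        if command == "Sleep" or user == "system user":
            continue  # idle connections and replication threads
        if seconds.isdigit():
            oldest = max(oldest, int(seconds))
    return oldest


def gmetric_command(value: int) -> list:
    """Build the gmetric argument list with the flags used on the slide."""
    return [
        "gmetric",
        "--name=Oldest query",
        "--type=int32",
        "--units=sec",
        "--dmax=65",
        f"--value={value}",
    ]
```

The returned list can be handed to `subprocess.run` from a cron job or loop that runs more often than the 65 second `--dmax`, so the metric does not expire between reports.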
  30. 30. Something is wrong • Don’t worry, data warehouse
  31. 31. tcpdump / Wireshark • If you suspect the network • Don’t just suspect • LOOK AT IT • tcpdump / Wireshark will tell you – If your packets are lost, delayed or corrupted – Your windowing is wrong
  32. 32. Rule 4: Divide and Conquer • Look at the problems in turn • Split between people • Go in the order you suspect is the most likely
  33. 33. Rule 5: Change one thing at a time • I cannot stress this enough • IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM
  34. 34. Rule 6: Keep an audit trail • You might be making things worse • Good for the root cause analysis • Have your shell log all commands – Good practice anyway • Version control
  35. 35. Rule 9: If you didn’t fix it, it ain’t fixed • You must do something to fix a problem • Or it will bite you again • And again • And again • They don’t just appear and disappear • Except BGP route convergence :)
  36. 36. Process • You need a little • Don’t worry
  37. 37. Don’t forget
  38. 38. Complexity kills • Design against it • Reuse components • Define standards • Have a few images that all machines look like - reimage machines every now and then for the heck of it. – EC2 forces you to do this
  39. 39. MTBF Mean Time Between Failures • Actually mostly irrelevant • Dealing with failure is more important • Target the right uptime – Complexity scales exponentially with required uptime • Don’t kid yourself, you don’t need 5 nines
  40. 40. MTTR Mean Time To Recovery • Important • No one cares if you fail once a minute – If you recover in 50 ms • If you are down 1 minute a week, you are still going to hit 4 nines (99.99%) • Failures happen, plan how to deal with them
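
The uptime arithmetic on this slide can be checked with a few lines. This is a minimal sketch; the function and constant names are made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60    # 10080
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_WEEK):
    """Percentage of the period during which the site was up."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def downtime_budget_minutes(target_percent, period_minutes=MINUTES_PER_WEEK):
    """Minutes of downtime allowed in the period for a target availability."""
    return period_minutes * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    print(f"1 min/week down: {availability_percent(1):.4f}% available")
    print(f"99.99% budget: {downtime_budget_minutes(99.99):.3f} min/week")
    budget = downtime_budget_minutes(99.999, MINUTES_PER_YEAR)
    print(f"99.999% budget: {budget:.3f} min/year")
```

One minute of downtime per week works out to roughly 99.9901% availability, just inside four nines; the weekly budget at exactly 99.99% is about 1.008 minutes. Five nines leaves a little over five minutes for a whole year.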
  41. 41. Problem found • If it is critical, start a phone conversation • Use IRC to communicate technical data • One person liaises with non technical staff • One person specifically in command • Sleep scheduling ( audit log important )
  42. 42. Post crisis • Root cause analysis – Just find out what went wrong – And how to avoid it – Or fix it faster next time if you can’t • Keep track of your uptime
  43. 43. Automation • All machines are created equal • Seriously • If you manually make changes • You are wrong – Unless you know what you are doing
  44. 44. Best practices • Version control • Gold images • Centralised authentication • Time Sync ( NTP ) • Central logging • ( All of this applies for virtual machines too!)
  45. 45. cfengine • Standard automation tool • Written in C • Not much support • Very good • Very annoying
  46. 46. control: site = ( mysite ) domain = ( mysite.country ) sysadm = ( mark ) netmask = ( 255.255.255.0 ) actionsequence = ( mountall mountinfo addmounts mountall links ) mountpattern = ( /$(site)/$(host) ) homepattern = ( u? )
  47. 47. Puppet • New hip kid on the block • Written in ruby • Better support? • Much nicer syntax • Easier to extend
  48. 48. define yumrepo (enabled = true) { configfile { "/etc/yum.repos.d/$name.repo": mode => 644, source => "/yum/repos/$name.repo", ensure => $enabled ? { true => file, default => absent } } }
  49. 49. cobbler • Automatic PXE Installer – Uses kickstart files • Red Hat Enterprise • CentOS • Fedora • Some support for Debian
  50. 50. cobbler cobbler system add --name=xen8 --mac=00:19:B9:EE:6D:0A --ip=10.10.30.208 --profile=Centos-5-x86_64 --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'
  52. 52. koan • Client install tool – Xen – Or OS re-image koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs
  53. 53. Your datacenter • Keep it tidy – Label things, keep cables as short as possible – Have a switch in each rack • If you are small without dedicated DC staff you need – Remote control power switches – Remote console!
  54. 54. Virtualization • Please use it • Managing becomes much easier • Power consumption • Need a new test box – The requestor can have it in minutes
  55. 55. Power consumption • Maybe not as important in Europe • 8 core machines are more efficient than 1 core • But memcache uses 1 core and all RAM • Get more RAM and virtualise
  56. 56. Our network admin boxes • 1 Xen CPU for Vyatta • 1 Xen CPU for LVS • 1 Xen CPU for Squid - Carp • 1 Xen CPU for Squid • 1 Xen CPU for Monitoring • 1 Xen CPU for network tasks • We can have more of these and a loss of one affects us less
  57. 57. Vyatta • Open source router – Really like it – No need to use Cisco
  58. 58. LVS • Linux Virtual Server • Low level load balancer • HA • Fast • Doesn’t inspire people to put things in the only place that is hard to scale
  59. 59. Squid Carp • Squids configured to hash the URLs and send them to a specific backend • Very little configuration done • Logging over UDP - no disk IO
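
The idea of hashing each URL to a specific backend can be sketched as below. This uses a highest-score (rendezvous) hash, the same general approach CARP takes, but it is not Squid's actual hash function and it ignores load factors. The `pick_backend` name and the backend names in the usage example are hypothetical.

```python
import hashlib


def pick_backend(url, backends):
    """Choose one backend for a URL by highest combined hash score.

    Every front-end that knows the same backend list picks the same backend
    for a given URL, so each object is cached in one place only.
    """
    if not backends:
        raise ValueError("no backends configured")

    def score(backend):
        digest = hashlib.sha256(f"{backend}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


if __name__ == "__main__":
    squids = ["squid-a", "squid-b", "squid-c"]  # hypothetical backend names
    for path in ("/wiki/Main_Page", "/wiki/Special:Random"):
        print(path, "->", pick_backend(path, squids))
```

A useful property of this scheme is that removing one backend only remaps the URLs that lived on it; everything else keeps its cache.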
  60. 60. Squid • As a reverse web accelerator • 90 % of our hits served from RAM in less than 1 ms • Same as Wikipedia • We only use RAM cache ( unlike Wikipedia ) • Cached per user • If not cacheable - cache for a second to reduce the effect on the backend
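
Caching "for a second" means the backend sees at most about one request per second per object, however many clients ask. The sketch below shows that idea as an in-process cache; it is not how Squid implements it, and the `MicroCache` class and its parameters are assumptions made for illustration.

```python
import time


class MicroCache:
    """Caches each key for a very short time to shield a slow backend."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch      # function that asks the real backend
        self._ttl = ttl_seconds  # one second, as on the slide
        self._clock = clock      # injectable so the cache can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]      # still fresh: the backend is not touched
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a one second lifetime the content is never more than a second stale, which is usually acceptable even for pages that would otherwise be marked uncacheable.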
  61. 61. App servers • 1 xen cpu for memcache ( 5 GB Ram) • 1 xen cpu for squid ( 5GB Ram ) • 6 xen cpus for apache (6 GB Ram ) • More power efficient, less affected by loss • Applications can’t affect each other
  62. 62. Databases • Keep developers on short leash • Report bad queries • Fear object relational mappers
  63. 63. Outsourcing • As much as possible • The younger you are as a company the less risk – When you have no users, you have no value • VCs don’t like having their money go into Capex
  64. 64. What I want from Vendors • They do what they tell me • They do what I tell them • No annoying up sells, no premium services – I know more about what you are selling than you
  65. 65. Services we use • Amazon EC2 and S3 • Panther-Express
  66. 66. Panther Express • Fantastic Content Distribution Network • Cheap, simple price list – Take note Akamai • Cut delivery time to Europe by 70% • We let our images be cached 1 second to reduce load
  67. 67. EC2 and S3 • We save all our binlogs to S3 • We save database dumps to S3 • We have monitors running from EC2 • We plan to build a data warehouse cluster on EC2
  68. 68. EC2 Requires Automation • Machine is blank when you bring it up • Download database dump from S3 and replicate up - automatically • Use puppet • Amazon saves you hardware headaches – But complexity is still a problem
  69. 69. Thank you
