Artur Bergman
          sky@crucially.net
• Wikia Inc
  – We are hiring
  – Community/Bizdev in Germany
  – Engineers in P...
The value of operations
•   Google
•   Orkut
•   Friendster
•   Myspace
Benefits
•   Users trust your brand
•   They rely on you
•   They spend more time on your site
•   Bad operations wastes R...
Stepchild of Engineering
• Product development
• Engineering
• Operations
  – Sysadmins?
• Why?
Operations Engineering
• It is engineering
• Google terminology -
  – Site Reliability Engineer
• Sure there are sysadmins...
Good Engineers
•   Detail Oriented
•   Aspire to be operational engineers
•   Stubborn
•   Can steer their inner ADD
    –...
Danger signs
• Thinks operation is a path to
  development engineering
  – Fire them
• Want people dedicated to the task
•...
Debugging
• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_
  download.html
  – Yes the font is horrible
Rule 1:
       Understand the system
•   Complexity Kills
•   No excuse
•   If you write it, you must know it
•   If you r...
Rule 3:
      Quit thinking and look
• "It is a capital mistake to theorize before
  one has data. Insensibly one begi...
Rule 3:
        Quit thinking and look
•   What do you look at?
•   The importance of monitoring
•   Monitoring
•   Monito...
My my, confusing term
• Monitoring
• Alerting
• Trending
Monitoring
•   Collects data
•   Puts into databases
•   Makes it available for you
•   Active collection
•   Passive inte...
Alerting
• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be...
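One way to avoid crying wolf is to page someone only after several failed checks in a row, and to treat an isolated failure as a passive alert. The following is a minimal Python sketch of that idea. The class name, the threshold of three and the severity labels are illustrative assumptions, not a description of Wikia's setup.

```python
class AlertState:
    """Tracks consecutive failed checks for one monitored service."""

    def __init__(self, threshold=3):
        # threshold is a hypothetical value: failures in a row before paging
        self.threshold = threshold
        self.failures = 0

    def record(self, check_ok):
        """Feed one check result; return 'ok', 'passive' or 'severe'."""
        if check_ok:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.threshold:
            return "severe"   # needs action now (email, phone call)
        return "passive"      # worth recording, not worth waking anyone


state = AlertState(threshold=3)
results = [state.record(ok) for ok in (True, False, False, False, True)]
print(results)  # ['ok', 'passive', 'passive', 'severe', 'ok']
```

A single recovered check resets the counter, so a flapping service stays passive until it fails repeatedly.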
Wikia alerting strategy
•   When the site is slow
•   Or down
•   We send emails and do phone calls
•   Europe and US West...
Trending
• Long term
• Capacity planning
Monitor Tools
•   Nagios
•   Cacti
•   MRTG
•   Hyperic
•   Cricket
•   Ganglia
External Monitoring
• Use one, tells you what your clients see
  every x minutes
• Keynote
• Gomez
• Websitepulse (cheap -...
Nagios
•   Alerting
•   Hassle
•   C CGI??
•   Doesn’t
    scale
Hyperic
• Most exciting open source tool
• Agent based - self-configured
• Baseline alerting
Cricket MRTG Cacti
• Impossible to configure
• You need to write tools to do it
• Especially Cacti
  – Somewhat more pleas...
Ganglia
• We love ganglia
• Automatically graphs everything you
  want - just works
• Large scale clusters
• Multicast
• Z...
http://ganglia.wikimedia.org/
•   270 hosts
•   880 CPU
•   2 clusters
•   1.2 TB of Memory
http://ganglia.wikimedia.org
Custom Ganglia Gmetrics
• Write your own

gmetric --name='Oldest query' --type=int32
--units='sec' --dmax=65 --value=`echo...
Custom Ganglia Gmetrics
• Or Learn Unix

gmetric --name='Oldest query' --type=int32
--units='sec' --dmax=65 --value=`echo ...
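The shell pipeline above reports the Time column of the first row that is neither sleeping nor a system thread. The same parsing can be sketched in Python. Note one deliberate variation: this version takes the largest Time value among the remaining rows rather than the first one, and it filters on the User and Command columns instead of grepping the whole line. The sample rows are invented for illustration; the resulting number would still be handed to `gmetric --value=...` as on the slide.

```python
def oldest_query_seconds(processlist_tsv):
    """Return the largest Time value among active, non-system threads.

    Expects the tab-separated output of `SHOW PROCESSLIST` in mysql batch
    mode: a header row, then Id, User, Host, db, Command, Time, State, Info.
    """
    lines = processlist_tsv.strip().splitlines()
    oldest = 0
    for line in lines[1:]:                      # skip the header row
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        user, command, seconds = fields[1], fields[4], fields[5]
        if command == "Sleep" or user == "system user":
            continue                            # idle or replication thread
        if not seconds.isdigit():
            continue                            # e.g. a NULL time value
        oldest = max(oldest, int(seconds))
    return oldest


# Invented sample output, for illustration only.
SAMPLE = "\n".join([
    "Id\tUser\tHost\tdb\tCommand\tTime\tState\tInfo",
    "1\tsystem user\t\tNULL\tConnect\t86400\tWaiting for master\tNULL",
    "7\twiki\tapp1:5012\twikidb\tSleep\t120\t\tNULL",
    "9\twiki\tapp2:5100\twikidb\tQuery\t42\tSending data\tSELECT ...",
    "12\troot\tlocalhost\tNULL\tQuery\t0\tNULL\tshow processlist",
])
print(oldest_query_seconds(SAMPLE))  # 42
```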
Something is wrong

• Don’t worry, data warehouse




tcpdump / Wireshark
•   If you suspect the network
•   Don’t just suspect
•   LOOK AT IT
•   tcpdump / Wireshark will tell...
Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most
...
Rule 5:
 Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE
  FAILED TO IDENTIFY THE P...
Rule 6:
        Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shel...
Rule 9:
    If you didn’t fix it, it ain’t fixed
•   You must do something to fix a problem
•   Or it will bite you again
...
Process
• You need a little
• Don’t worry
Don’t forget
Complexity kills
•   Design against it
•   Reuse components
•   Define standards
•   Have a few images that all machines
 ...
MTBF
Mean Time Between Failures
• Actually mostly irrelevant
• Dealing with failure is more important
• Target the right ...
MTTR
  Mean Time To Recovery
• Important
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you ar...
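The "one minute a week still hits four nines" claim from this slide can be checked with a few lines of arithmetic. The sketch below treats availability as one minus the fraction of time spent down; the function and variable names are illustrative.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800 seconds in a week


def availability(downtime_seconds, period_seconds):
    """Fraction of the period during which the site was up."""
    return 1.0 - downtime_seconds / period_seconds


def weekly_budget_seconds(nines):
    """Downtime allowed per week for an availability of N nines."""
    return WEEK_SECONDS * 10 ** -nines


one_minute_a_week = availability(60, WEEK_SECONDS)
blip_every_minute = availability(0.050, 60)

print(f"{one_minute_a_week:.4%}")             # 99.9901%
print(f"{blip_every_minute:.4%}")             # 99.9167%
print(f"{weekly_budget_seconds(4):.1f} s")    # 60.5 s
```

Four nines allows roughly 60 seconds of downtime per week, so one minute a week just fits. By the same measure, a 50 ms outage every minute comes to about 99.92%, so that example is about what users notice rather than about the nines count.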
Problem found
• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises...
Post crisis
• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time...
Automation
•   All machines are created equal
•   Seriously
•   If you manually make changes
•   You are wrong
    – Unles...
Best practices
•   Version control
•   Gold images
•   Centralised authentication
•   Time Sync ( NTP )
•   Central loggin...
cfengine
•   Standard automation tool
•   Written in C
•   Not much support
•   Very good
•   Very annoying
control:
   site = ( mysite )
   domain = ( mysite.country )
 ...
Puppet
•   New hip kid on the block
•   Written in ruby
•   Better support?
•   Much nicer syntax
•   Easier to extend
define yumrepo (enabled = true)
{ configfile
{ "/etc/yum.repos.d/$...
cobbler
• Automatic PXE Installer
    – Uses kickstart files
•   Redhat Enterprise
•   Centos
• ...
cobbler
cobbler system add
  --name=xen8
  --mac=00:19:B9:EE:6D:0A
  --ip=10.10.30.208
  --profile=Centos-5-x86_64
  --kop...
koan
• Client install tool
  – Xen
  – Or OS re-image


koan --server=10.10.30.205 --virt
  --profile=virt_fc6 --virt-nam...
Your datacenter
• Keep it tidy
   – Label things, keep cables as short as possible
   – Have a switch in each rack
• If yo...
Virtualization
•   Please use it
•   Managing becomes much easier
•   Power consumption
•   Need a new test box
    – The ...
Power consumption
• Maybe not as important in Europe
• 8 core machines are more efficient than
  1 core
• But memcache use...
Our network admin boxes
•   1 Xen CPU for Vyatta
•   1 Xen CPU for LVS
•   1 Xen CPU for Squid - Carp
•   1 Xen CPU for Sq...
Vyatta
• Open source router
  – Really like it
  – No need to use Cisco
LVS
•   Linux Virtual Server
•   Low level load balancer
•   HA
•   Fast
•   Doesn’t inspire people to put things in
    t...
Squid Carp
• Squids configured to hash the urls and
  send them to specific backend
• Very little configuration done
• Log...
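Hashing URLs to specific backends can be sketched as "highest score wins" hashing: each backend is scored against the URL and the request goes to the top scorer. The Python below illustrates that general idea only. It uses MD5 for convenience and omits load factors, so it is not the hash function Squid's CARP implementation uses, and the host names and URL are hypothetical.

```python
import hashlib


def pick_backend(url, backends):
    """Choose the backend with the highest combined hash score for this URL."""
    def score(backend):
        digest = hashlib.md5((backend + "|" + url).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(backends, key=score)


BACKENDS = ["squid1", "squid2", "squid3"]   # hypothetical host names
url = "http://example.com/wiki/Main_Page"   # hypothetical URL

first = pick_backend(url, BACKENDS)
again = pick_backend(url, BACKENDS)
print(first == again)  # True: the same URL always maps to the same backend
```

A property of this scheme is that removing a backend only remaps the URLs that backend owned; every other URL keeps its existing cache.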
Squid
• As a reverse web accelerator
• 90 % of our hits served from RAM in less than
  1 ms
• Same as wikipedia
• We only ...
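Caching otherwise uncacheable pages for one second means a burst of identical requests costs the backend a single hit per second. Below is a minimal in-process Python sketch of that effect. Squid does this at the HTTP layer, so the class, the fake clock and the URL here are assumptions made purely for illustration.

```python
import time


class MicroCache:
    """Caches each response for a very short time-to-live."""

    def __init__(self, fetch, ttl=1.0, clock=time.monotonic):
        self.fetch = fetch      # function that asks the backend for a URL
        self.ttl = ttl          # seconds a response stays fresh
        self.clock = clock      # injectable so the example needs no sleeping
        self.entries = {}       # url -> (expires_at, body)

    def get(self, url):
        now = self.clock()
        hit = self.entries.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]                       # still fresh: no backend hit
        body = self.fetch(url)
        self.entries[url] = (now + self.ttl, body)
        return body


backend_calls = []

def slow_backend(url):
    backend_calls.append(url)
    return "page for " + url


fake_now = [0.0]
cache = MicroCache(slow_backend, ttl=1.0, clock=lambda: fake_now[0])
for _ in range(100):                # 100 requests inside the same second
    cache.get("/wiki/Recent_changes")
fake_now[0] = 1.5                   # the entry has expired by now
cache.get("/wiki/Recent_changes")
print(len(backend_calls))  # 2
```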
App servers
• 1 xen cpu for memcache ( 5 GB Ram)
• 1 xen cpu for squid ( 5GB Ram )
• 6 xen cpus for apache (6 GB Ram )

• ...
Databases
• Keep developers on short leash
• Report bad queries
• Fear object relational mappers
Outsourcing
• As much as possible
• The younger you are as a company the
  less risk
  – When you have no users, you have ...
What I want from Vendors
• They do what they tell me
• They do what I tell them

• No annoying up sells, no premium
  serv...
Services we use
• Amazon EC2 and S3
• Panther-Express
Panther Express
• Fantastic Content Distribution Network
• Cheap, simple price list
  – Take note akamai
• Cut delivery ti...
EC2 and S3
•   We save all our binlogs to S3
•   We save database dumps to S3
•   We have monitors running from EC2
•   We...
EC2 Requires Automation
• Machine is blank when you bring it up
• Download database dump from S3 and
  replicate up - auto...
Thank you
Web 2.0 Performance and Reliability: How to Run Large Web Apps
Speaker: Artur Bergman

Published in: Business, Technology
Web 2.0 Performance and Reliability: How to Run Large Web Apps

  1. 1. Artur Bergman sky@crucially.net • Wikia Inc – We are hiring – Community/Bizdev in Germany – Engineers in Poland – http://www.wikia.com/wiki/hiring • O’Reilly Radar – http://radar.oreilly.com/artur/
  2. 2. The value of operations • Google • Orkut • Friendster • Myspace
  3. 3. Benefits • Users trust your brand • They rely on you • They spend more time on your site • Bad operations wastes R&D money • Fixed amount of time + faster site = more page views
  4. 4. Stepchild of Engineering • Product development • Engineering • Operations – Sysadmins? • Why?
  5. 5. Operations Engineering • It is engineering • Google terminology - – Site Reliability Engineer • Sure there are sysadmins too, people managing NOCs and datacenters • Provide career growth
  6. 6. Good Engineers • Detail Oriented • Aspire to be operational engineers • Stubborn • Can steer their inner ADD – Interrupt driven • Not the same as good developers
  7. 7. Danger signs • Thinks operation is a path to development engineering – Fire them • Want people dedicated to the task • A good operations engineer should spend some time in development • A good development engineer MUST spend some time in operations
  8. 8. Debugging • 9 Rules of debugging • http://www.debuggingrules.com/Poster_ download.html – Yes the font is horrible
  9. 9. Rule 1: Understand the system • Complexity Kills • No excuse • If you write it, you must know it • If you run it, you must know it • If you buy it, you must know it
  10. 10. Rule 3: Quit thinking and look • "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
  11. 11. Rule 3: Quit thinking and look • What do you look at? • The importance of monitoring • Monitoring • Monitoring • Monitoring
  12. 12. My my, confusing term • Monitoring • Alerting • Trending
  13. 13. Monitoring • Collects data • Puts into databases • Makes it available for you • Active collection • Passive interaction
  14. 14. Alerting • Acts on monitoring data • Severe alerts – Active – Needs action • Passive alerts – Things that need to be done but not right now • DO NOT OVER ALERT • DO NOT CRY WOLF
  15. 15. Wikia alerting strategy • When the site is slow • Or down • We send emails and do phone calls • Europe and US West coast • Looking to hire in East Asia • No night time
  16. 16. Trending • Long term • Capacity planning
  17. 17. Monitor Tools • Nagios • Cacti • MRTG • Hyperic • Cricket • Ganglia
  18. 18. External Monitoring • Use one, tells you what your clients see every x minutes • Keynote • Gomez • Websitepulse (cheap - easy - I like them; no annoying salesforce)
  19. 19. Nagios • Alerting • Hassle • C CGI?? • Doesn’t scale
  20. 20. Hyperic • Most exciting open source tool • Agent based - self-configured • Baseline alerting
  21. 21. Cricket MRTG Cacti • Impossible to configure • You need to write tools to do it • Especially Cacti – Somewhat more pleasant than clawing out your eyes
  22. 22. Ganglia • We love ganglia • Automatically graphs everything you want - just works • Large scale clusters • Multicast • Zero config • RRD
  23. 23. http://ganglia.wikimedia.org/ • 270 hosts • 880 CPU • 2 clusters • 1.2 TB of Memory
  24. 24. http://ganglia.wikimedia.org
  25. 25. Custom Ganglia Gmetrics • Write your own gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo ' show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
  26. 26. Custom Ganglia Gmetrics • Or Learn Unix gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 --value=`echo ' show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`
  30. 30. Something is wrong • Don’t worry, data warehouse
  31. 31. tcpdump / Wireshark • If you suspect the network • Don’t just suspect • LOOK AT IT • tcpdump / Wireshark will tell you – If your packets are lost, delayed or corrupted – Your windowing is wrong
  32. 32. Rule 4: Divide and Conquer • Look at the problems in turn • Split between people • Go in the order you suspect is the most likely
  33. 33. Rule 5: Change one thing at a time • I cannot stress this enough • IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM
  34. 34. Rule 6: Keep an audit trail • You might be making things worse • Good for the root cause analysis • Have your shell log all commands – Good practice anyway • Version control
  35. 35. Rule 9: If you didn’t fix it, it ain’t fixed • You must do something to fix a problem • Or it will bite you again • And again • And again • They don’t just appear and disappear • Except BGP route convergence :)
  36. 36. Process • You need a little • Don’t worry
  37. 37. Don’t forget
  38. 38. Complexity kills • Design against it • Reuse components • Define standards • Have a few images that all machines look like - reimage machines every now and then for the heck of it. – EC2 forces you to do this
  39. 39. MTBF Mean Time Between Failures • Actually mostly irrelevant • Dealing with failure is more important • Target the right uptime – Complexity scales exponentially with required uptime • Don’t kid yourself, you don’t need 5 nines
  40. 40. MTTR Mean Time To Recovery • Important • No one cares if you fail once a minute – If you recover in 50 ms • If you are down 1 minute a week, you are still going to hit 4 nines (99.99%) • Failures happen, plan how to deal with them
  41. 41. Problem found • If it is critical, start a phone conversation • Use IRC to communicate technical data • One person liaises with non-technical staff • One person specifically in command • Sleep scheduling ( audit log important )
  42. 42. Post crisis • Root cause analysis – Just find out what went wrong – And how to avoid it – Or fix it faster next time if you can’t • Keep track of your uptime
  43. 43. Automation • All machines are created equal • Seriously • If you manually make changes • You are wrong – Unless you know what you are doing
  44. 44. Best practices • Version control • Gold images • Centralised authentication • Time Sync ( NTP ) • Central logging • ( All of this applies for virtual machines too!)
  45. 45. cfengine • Standard automation tool • Written in C • Not much support • Very good • Very annoying
  46. 46. control: site = ( mysite ) domain = ( mysite.country ) sysadm = ( mark ) netmask = ( 255.255.255.0 ) actionsequence = ( mountall mountinfo addmounts mountall links ) mountpattern = ( /$(site)/$(host) ) homepattern = ( u? )
  47. 47. Puppet • New hip kid on the block • Written in ruby • Better support? • Much nicer syntax • Easier to extend
  48. 48. define yumrepo (enabled = true) { configfile { "/etc/yum.repos.d/$name.repo": mode => 644, source => "/yum/repos/$name.repo", ensure => $enabled ? { true => file, default => absent } } }
  49. 49. cobbler • Automatic PXE Installer – Uses kickstart files • Redhat Enterprise • Centos • Fedora • Some support for debian
  50. 50. cobbler cobbler system add --name=xen8 --mac=00:19:B9:EE:6D:0A --ip=10.10.30.208 --profile=Centos-5-x86_64 --kopts='ksdevice=00:19:B9:EE:6D:0A console=ttyS1,57600 console=tty0'
  52. 52. koan • Client install tool – Xen – Or OS re-image koan --server=10.10.30.205 --virt --profile=virt_fc6 --virt-name=otrs
  53. 53. Your datacenter • Keep it tidy – Label things, keep cables as short as possible – Have a switch in each rack • If you are small without dedicated DC staff you need – Remote control power switches – Remote console!
  54. 54. Virtualization • Please use it • Managing becomes much easier • Power consumption • Need a new test box – The requestor can have it in minutes
  55. 55. Power consumption • Maybe not as important in Europe • 8 core machines are more efficient than 1 core • But memcache uses 1 core and all RAM • Get more RAM and virtualise
  56. 56. Our network admin boxes • 1 Xen CPU for Vyatta • 1 Xen CPU for LVS • 1 Xen CPU for Squid - Carp • 1 Xen CPU for Squid • 1 Xen CPU for Monitoring • 1 Xen CPU for network tasks • We can have more of these and a loss of one affects us less
  57. 57. Vyatta • Open source router – Really like it – No need to use Cisco
  58. 58. LVS • Linux Virtual Server • Low level load balancer • HA • Fast • Doesn’t inspire people to put things in the only place that is hard to scale
  59. 59. Squid Carp • Squids configured to hash the urls and send them to specific backend • Very little configuration done • Logging of UDP - no disk IO
  60. 60. Squid • As a reverse web accelerator • 90 % of our hits served from RAM in less than 1 ms • Same as Wikipedia • We only use RAM cache ( unlike Wikipedia) • Cached per user • If not cacheable - cache for a second to reduce backend effect
  61. 61. App servers • 1 xen cpu for memcache ( 5 GB Ram) • 1 xen cpu for squid ( 5GB Ram ) • 6 xen cpus for apache (6 GB Ram ) • More power efficient, less affected by loss • Applications can’t affect each other
  62. 62. Databases • Keep developers on short leash • Report bad queries • Fear object relational mappers
  63. 63. Outsourcing • As much as possible • The younger you are as a company the less risk – When you have no users, you have no value • VCs don’t like having their money go into Capex
  64. 64. What I want from Vendors • They do what they tell me • They do what I tell them • No annoying up sells, no premium services – I know more about what you are selling than you
  65. 65. Services we use • Amazon EC2 and S3 • Panther-Express
  66. 66. Panther Express • Fantastic Content Distribution Network • Cheap, simple price list – Take note Akamai • Cut delivery time to Europe by 70% • We let our images be cached 1 second to reduce load
  67. 67. EC2 and S3 • We save all our binlogs to S3 • We save database dumps to S3 • We have monitors running from EC2 • We plan to build a datawarehouse cluster on EC2
  68. 68. EC2 Requires Automation • Machine is blank when you bring it up • Download database dump from S3 and replicate up - automatically • Use puppet • Amazon saves you hardware headaches – But complexity is still a problem
  69. 69. Thank you