Datadog at NYCBUG
Published in: Technology, Business

  1. 1. Events & Metrics: The Lifeblood Of Webops. Alexis Lê-Quôc (Product Guy) at Datadog. NYCBUG, July 6th, 2011
  2. 2. I <3 BSD ‣OpenBSD user since 2.8 (pf) ‣Love the documentation ‣m0n0wall/pfSense ‣ZFS-envy
  3. 3. What I’m going to talk about ‣Briefly, what we do and for whom ‣Where we started ‣The kind of data we deal with ‣How it all fits together ‣A few things we learned along the way ‣Q+A
  4. 4. What we do? SaaS Platform for Dev & Ops ‣Aggregation ‣Correlation ‣Collaboration
  5. 5. Where We Started
  6. 6. The Mess: Too many data streams, too many silos. Too many choices to make, too often. Only getting worse as SaaS silos multiply. Separate Dev and Ops teams, looking at separate data streams. [Diagram: Dev team and Ops team surrounded by Usage Analytics, IAAS / PAAS, Issue Resolution, Servers and Devices, Applications, Cap. Planning, SDLC support, Monitoring, Hosting, Asset Mgmt, and CDNs, linked by flows labeled metrics, events, choices, changes, insights, feedback, and billing metrics]
  7. 7. WHERE WE STARTED Discourages exploration
  8. 8. Very Specific View
  9. 9. Different View Same Reality
  10. 10. Dev Interdiction Part 1
  11. 11. Where We Are
  12. 12. In Action: https://app.datad0g.com/dash/host/8#/date_range/1309383780732-1309988580732
  13. 13. TRENDS (visible through Datadog and others): Welcome developers ‣Graphite ‣statsd. Context matters ‣Ganglia Event API. Large datasets ‣OpenTSDB. Data exploration ‣d3, protovis
  14. 14. Sides Of A Coin
  15. 15. Events: User comments, Alert, Build, Batch job. Metrics: Unique visitors, Load, Transaction duration, etc.
  16. 16. etc. Aggregation
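A minimal sketch of the events/metrics split and of aggregation, for illustration only. The class names, fields, and the 60-second bucket are assumptions made here; the slides do not describe Datadog's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Event:
    """A discrete occurrence, e.g. a build, an alert, or a batch job (hypothetical shape)."""
    timestamp: float  # seconds since the epoch
    title: str
    tags: tuple = ()


@dataclass(frozen=True)
class MetricPoint:
    """One numeric sample, e.g. load or transaction duration (hypothetical shape)."""
    timestamp: float
    name: str
    value: float


def aggregate(points, bucket_seconds=60):
    """Average metric values per (metric name, start of time bucket)."""
    buckets = defaultdict(list)
    for p in points:
        bucket_start = int(p.timestamp // bucket_seconds) * bucket_seconds
        buckets[(p.name, bucket_start)].append(p.value)
    return {key: mean(values) for key, values in buckets.items()}


def events_in_window(events, start, end):
    """Events with start <= timestamp < end, e.g. to overlay on a metric graph."""
    return [e for e in events if start <= e.timestamp < end]
```

Keeping events and metrics as separate types but on a shared time axis is one way to line up "what happened" with "what was measured" in the same window.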
  17. 17. Taxonomy
  18. 18. CLASSICS: ACID = Atomicity, Consistency, Isolation, Durability (e.g. SQL DBs). http://en.wikipedia.org/wiki/Eventual_consistency
  19. 19. CLASSICS: ACID = Atomicity, Consistency, Isolation, Durability (e.g. SQL DBs). BASE = Basically Available, Soft-state, Eventual consistency (e.g. DNS). http://en.wikipedia.org/wiki/Eventual_consistency
  20. 20. NEWCOMER: DIRT = Data-Intensive Real-Time (e.g. real-time web). Bryan Cantrill: http://dtrace.org/resources/bmc/DIRT.pdf
  21. 21. Aggregation: Constant data influx, Large data sets. Correlation: On-demand visualization, Background data analysis. Collaboration: Real-time updates, On-the-fly data analysis.
  22. 22. Aggregation [BASE]: Constant data influx, Large data sets. Correlation: On-demand visualization, Background data analysis. Collaboration: Real-time updates, On-the-fly data analysis.
  23. 23. Aggregation [BASE, DIRT]: Constant data influx, Large data sets. Correlation: On-demand visualization, Background data analysis. Collaboration: Real-time updates, On-the-fly data analysis.
  24. 24. Aggregation [BASE, DIRT]: Constant data influx, Large data sets. Correlation [BASE]: On-demand visualization, Background data analysis. Collaboration: Real-time updates, On-the-fly data analysis.
  25. 25. Aggregation [BASE, DIRT]: Constant data influx, Large data sets. Correlation [BASE]: On-demand visualization, Background data analysis. Collaboration [DIRT]: Real-time updates, On-the-fly data analysis.
  26. 26. Aggregation [BASE, DIRT]: Constant data influx, Large data sets. Correlation [BASE]: On-demand visualization, Background data analysis. Collaboration [DIRT]: Real-time updates, On-the-fly data analysis. Datadog = DIRT + BASE + a tiny bit of ACID
  27. 27. How It All Fits Together http://www.flickr.com/photos/tom-margie/1253798184/
  28. 28. Architecture Simplified
  29. 29. Architecture Simplified [BASE]
  30. 30. Architecture Simplified [BASE, DIRT]
  31. 31. Architecture Simplified [BASE, DIRT, ACID]
  32. 32. The Environment
  33. 33. 4 Dimensions: Compute, Storage, Network, Management
  34. 34. ON-PREMISE TRAITS http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  35. 35. ON-PREMISE TRAITS. Compute: Fast, Inelastic. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  36. 36. ON-PREMISE TRAITS. Compute: Fast, Inelastic. Storage: Fast, Centralized, Redundant. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  37. 37. ON-PREMISE TRAITS. Compute: Fast, Inelastic. Network: Fast, Localized. Storage: Fast, Centralized, Redundant. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  38. 38. ON-PREMISE TRAITS. Compute: Fast, Inelastic. Network: Fast, Localized. Storage: Fast, Centralized, Redundant. Management: People-based, Full access. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  39. 39. CLOUD TRAITS
  40. 40. CLOUD TRAITS. Compute: Slow, Elastic.
  41. 41. CLOUD TRAITS. Compute: Slow, Elastic. Storage: Slow, Jittery, Maybe durable, Low memory.
  42. 42. CLOUD TRAITS. Compute: Slow, Elastic. Network: “Fast”, Geo-distributed. Storage: Slow, Jittery, Maybe durable, Low memory.
  43. 43. CLOUD TRAITS. Compute: Slow, Elastic. Network: “Fast”, Geo-distributed. Storage: Slow, Jittery, Maybe durable, Low memory. Management: No bare-metal, “Magic” API.
  44. 44. What We Have Found
  45. 45. Network
  46. 46. Network. Layer 2: Virtual Domain. Layer 3: Crude Edge Filtering. Layer 7: Crude Load Balancing. DNS. CDN.
  47. 47. Network. Layer 2: Virtual Domain. Layer 3: Crude Edge Filtering. Layer 7: Crude Load Balancing. DNS. CDN. It Works!
  48. 48. Storage
  49. 49. Storage [chart, axes: Latency and Capacity]: Amazon S3 (BASE), Apache Cassandra (BASE), PostgreSQL (ACID), Redis (DIRT)
  50. 50. Storage [chart, axes: Latency and Capacity]: Amazon S3 (BASE), Apache Cassandra (BASE), PostgreSQL (ACID), Redis (DIRT). Annotations: Latency, Limited throughput, Jittery, Low memory
  51. 51. Low Memory http://aws.amazon.com/ec2/#instance
  52. 52. Jittery, Limited Throughput Network Block Storage (EBS) https://app.datad0g.com/dash/dash/1032#/date_range/1308608717016-1309213517016
  53. 53. Limited Throughput In Numbers: RAID 0 EBS Volumes, m1.large instances
                       DEV       tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await   svctm  %util
          03:35:02 PM  dev8-80   375.95  23614.08  5.70      62.83     47.21     125.58  1.26   47.34
          03:35:02 PM  dev8-96   373.63  23749.65  5.64      63.58     45.55     121.91  1.22   45.72
          03:35:02 PM  dev8-112  375.28  23693.47  5.52      63.15     45.52     121.22  1.23   46.31
          03:35:02 PM  dev8-128  375.31  23721.57  7.19      63.22     56.00     148.96  1.34   50.35
          rd_sec/s: read throughput in sectors/s (Total: 368Mb/s). await: average wait in ms. svctm: average service time in ms.
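The total on slide 53 can be approximated from the four `rd_sec/s` figures. The sketch below assumes 512-byte sectors, which is what sysstat documents for its sector counters; the slide does not say which conversion it used. Under that assumption the sum comes to about 370 Mibit/s (about 388 Mbit/s in decimal units), in the same range as the 368Mb/s shown.

```python
SECTOR_BYTES = 512  # assumption: sysstat sector counters use 512-byte sectors

# rd_sec/s per striped EBS volume, copied from the slide
RD_SEC_PER_S = {
    "dev8-80": 23614.08,
    "dev8-96": 23749.65,
    "dev8-112": 23693.47,
    "dev8-128": 23721.57,
}


def read_throughput(sectors_per_s, sector_bytes=SECTOR_BYTES):
    """Sum per-device read rates and convert sectors/s to bit-rate units."""
    total_sectors = sum(sectors_per_s.values())
    bits_per_s = total_sectors * sector_bytes * 8
    return {
        "sectors_per_s": total_sectors,
        "mbit_per_s": bits_per_s / 1e6,       # decimal megabits
        "mibit_per_s": bits_per_s / 2 ** 20,  # binary mebibits
    }


if __name__ == "__main__":
    for unit, value in read_throughput(RD_SEC_PER_S).items():
        print(f"{unit}: {value:.2f}")
```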
  54. 54. Some Tricks
  55. 55. Some Tricks. Software RAID: RAID 0, Offsite backups.
  56. 56. Some Tricks. Software RAID: RAID 0, Offsite backups (limited by slowest volume).
  57. 57. Some Tricks. Software RAID: RAID 0, Offsite backups (limited by slowest volume). Streaming replication: S3 backups.
  58. 58. Some Tricks. Software RAID: RAID 0, Offsite backups (limited by slowest volume). Ephemeral volumes: And Offsite backups. Streaming replication: S3 backups.
  59. 59. Some Tricks. Software RAID: RAID 0, Offsite backups (limited by slowest volume). Ephemeral volumes: And Offsite backups. Streaming replication: S3 backups. Concerns: Complexity, Recovery Time Objective, Recovery Point Objective.
  60. 60. Some Tricks. Software RAID: RAID 0, Offsite backups (limited by slowest volume). Ephemeral volumes: And Offsite backups. Streaming replication: S3 backups. Concerns: Complexity, Recovery Time Objective, Recovery Point Objective. Database Service: MySQL/Oracle RDS.
  61. 61. Some Tricks. Software RAID: RAID 0, Offsite backups (limited by slowest volume). Ephemeral volumes: And Offsite backups. Streaming replication: S3 backups. Concerns: Complexity, Recovery Time Objective, Recovery Point Objective. Database Service: MySQL/Oracle RDS. Concern: Trust (RDS outage 2 months ago).
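"Limited by slowest volume" can be made concrete with a deliberately simplified model: if every request is striped evenly across all members, each full stripe completes only when its slowest member does, so the array delivers roughly the member count times the slowest member's rate. The function name and the per-volume figures below are hypothetical, not measurements from the talk.

```python
def raid0_throughput(volume_rates):
    """Estimate striped throughput as member count times the slowest member's rate.

    Simplified model only: assumes even striping and ignores caching,
    queueing, and request-size effects.
    """
    if not volume_rates:
        raise ValueError("need at least one volume")
    return len(volume_rates) * min(volume_rates)


if __name__ == "__main__":
    # Hypothetical per-volume rates in MB/s: three healthy volumes, one slow one.
    rates = [12.0, 12.0, 12.0, 4.0]
    print("sum of members:", sum(rates))
    print("striped estimate:", raid0_throughput(rates))
```

In this model one jittery volume drags the whole array down, which is why striping over network block storage does not simply add up the members' nominal rates.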
  62. 62. Network Block Storage Is The Dark Side
  63. 63. Network Block Storage Is The Dark Side. Bait For Enterprise Customers.
  64. 64. Network Block Storage Is The Dark Side. Bait For Enterprise Customers. Hard Problem For Cloud Providers.
  65. 65. Some Dos And Don’ts. Don’t rely on networked block storage: small data sets only, if you have to. Don’t trust data-at-rest: copy, replicate, back up. Do use S3 if you can: object semantics a limitation; slow but durable.
  66. 66. Compute
  67. 67. Compute [chart, axes: “Performance” and Number]: ACID nodes: Scale up, Shard. BASE nodes, DIRT nodes: Add more.
  68. 68. Some Dos And Don’ts. Don’t rely on scale-ups: low memory a hard limit for DBs; noisy neighbors; individual performance poor and jittery. Scale out: first scale up, then shard; parallelize across machines; vector-processing via GPUs.
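One common way to "shard" is to hash each key to a shard number, sketched below. This is a generic illustration, not the scheme used by Datadog; `shard_for`, `partition`, and the sample host names are assumptions. A stable digest is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they map to."""
    shards = defaultdict(list)
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return dict(shards)


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(10)]  # hypothetical keys
    for shard, members in sorted(partition(hosts, 4).items()):
        print(shard, members)
```

With plain modulo hashing, changing the number of shards remaps most keys; consistent hashing is one alternative when shards are added often.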
  69. 69. Management
  70. 70. An API for everything: Compute, Storage, Network, Management
  71. 71. Questions! http://datadoghq.com twitter: @alq
