A journey in the public clouds

Presented at the NYC IASA Chapter, 6/28/2011.


  1. A Journey In The Public Clouds With Datadog. Alexis Lê-Quôc (Product Guy) at Datadog. IASA New York Chapter, June 28th, 2011
  2. What I’m going to talk about: ‣ What we do and for whom ‣ The kind of data we deal with ‣ Our architecture ‣ Our architecture in a public cloud (AWS) ‣ What we learned ‣ Q+A
  3. What we do: SaaS platform for Aggregation, Correlation, Collaboration. For Dev & Ops.
  4. The Mess. Too many data streams, too many silos. Too many choices to make, too often. Only getting worse as SaaS silos multiply. Separate Dev and Ops teams, looking at separate data streams. Data-driven decision making in IT is rarely happening: too slow, too expensive, requires too much discipline. (Diagram labels: Usage Analytics, IaaS / PaaS, Issue Resolution, Servers and Devices, Applications, Dev team, Ops team, Cap. Planning, SDLC support, Monitoring, Hosting, Asset Mgmt, CDNs; arrows labelled metrics, insights, billing, changes, choices, events, feedback.)
  5. We Simplify: Datadog to the rescue. Aggregation, Correlation, Collaboration. (Diagram labels: system metrics, key metrics, quality metrics, SaaS data, capacity metrics, usage analytics, recommendations, cloud billing, code metrics, config changes, IaaS pricing, business metrics, perf. data, vendors info, curated metadata; visibility to Alice (Dev), to Bob (Ops), to Charlie (CEO).)
  6. Concretely
  7. Aggregation: etc.
  8. Aggregation
  9. Correlation: https://app.datad0g.com/dash/dash/1000#/date_range/1308057152698-1308143552698
  10. Collaboration
  11. What Architecture For What Kind Of Data?
  12. Events: user comments, alert, build, batch job. Metrics: unique visitors, load, transaction duration, etc.
  13. Taxonomy
  14. CLASSICS. ACID: Atomicity, Consistency, Isolation, Durability, e.g. SQL DBs. http://en.wikipedia.org/wiki/Eventual_consistency
  15. CLASSICS. ACID: Atomicity, Consistency, Isolation, Durability, e.g. SQL DBs. BASE: Basically Available, Soft-state, Eventual consistency, e.g. DNS. http://en.wikipedia.org/wiki/Eventual_consistency
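
The BASE side of slide 15 can be made concrete with a toy model. The sketch below is not taken from the talk: `Replica` and `anti_entropy` are hypothetical names, and last-write-wins on an integer version is only one of several ways to reconcile replicas. It shows two replicas that disagree for a while (soft state) and converge once they exchange data (eventual consistency).

```python
class Replica:
    """Toy key-value replica; each key stores a (version, value) pair."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, version):
        # Last-write-wins: keep the entry with the highest version.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Exchange entries both ways so the two replicas converge."""
    for key, (version, value) in list(a.store.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.store.items()):
        a.write(key, value, version)


east, west = Replica(), Replica()
east.write("ttl", "60", version=1)
stale = west.read("ttl")       # west has not seen the write yet
anti_entropy(east, west)
converged = west.read("ttl")   # both replicas now agree
```

Reads stay available on both replicas throughout; the price is that `west` may answer with stale data until the exchange runs.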
  16. NEWCOMER. DIRT: Data-Intensive Real-Time, e.g. real-time web. Bryan Cantrill: http://dtrace.org/resources/bmc/DIRT.pdf
  17. Aggregation: constant data influx, large data sets. Correlation: on-demand visualization, background data analysis. Collaboration: real-time updates, on-the-fly data analysis.
  18. Aggregation (BASE): constant data influx, large data sets. Correlation: on-demand visualization, background data analysis. Collaboration: real-time updates, on-the-fly data analysis.
  19. Aggregation (BASE, DIRT): constant data influx, large data sets. Correlation: on-demand visualization, background data analysis. Collaboration: real-time updates, on-the-fly data analysis.
  20. Aggregation (BASE, DIRT): constant data influx, large data sets. Correlation (BASE): on-demand visualization, background data analysis. Collaboration: real-time updates, on-the-fly data analysis.
  21. Aggregation (BASE, DIRT): constant data influx, large data sets. Correlation (BASE): on-demand visualization, background data analysis. Collaboration (DIRT): real-time updates, on-the-fly data analysis.
  22. Aggregation (BASE, DIRT): constant data influx, large data sets. Correlation (BASE): on-demand visualization, background data analysis. Collaboration (DIRT): real-time updates, on-the-fly data analysis. Datadog = DIRT + BASE + a tiny bit of ACID
  23. How It All Fits Together. http://www.flickr.com/photos/tom-margie/1253798184/
  24. Architecture Simplified
  25. Architecture Simplified (components marked BASE)
  26. Architecture Simplified (components marked BASE, DIRT)
  27. Architecture Simplified (components marked BASE, DIRT, ACID)
  28. The Environment
  29. 4 Dimensions: Compute, Storage, Network, Management
  30. ON-PREMISE TRAITS. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  31. ON-PREMISE TRAITS. Compute: fast, inelastic. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  32. ON-PREMISE TRAITS. Compute: fast, inelastic. Storage: fast, centralized, redundant. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  33. ON-PREMISE TRAITS. Compute: fast, inelastic. Storage: fast, centralized, redundant. Network: fast, localized. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  34. ON-PREMISE TRAITS. Compute: fast, inelastic. Storage: fast, centralized, redundant. Network: fast, localized. Management: people-based, full access. http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
  35. CLOUD TRAITS
  36. CLOUD TRAITS. Compute: slow, elastic.
  37. CLOUD TRAITS. Compute: slow, elastic. Storage: slow, jittery, maybe durable, low memory.
  38. CLOUD TRAITS. Compute: slow, elastic. Storage: slow, jittery, maybe durable, low memory. Network: “fast”, geo-distributed.
  39. CLOUD TRAITS. Compute: slow, elastic. Storage: slow, jittery, maybe durable, low memory. Network: “fast”, geo-distributed. Management: no bare-metal, “magic” API.
  40. What We Have Found
  41. Network
  42. Network. Layer 2: virtual domain. Layer 3: crude edge filtering. Layer 7: crude load balancing. DNS. CDN.
  43. Network. Layer 2: virtual domain. Layer 3: crude edge filtering. Layer 7: crude load balancing. DNS. CDN. It Works!
  44. Storage
  45. Storage (chart: latency vs. capacity). Amazon S3 (BASE), Apache Cassandra (BASE), PostgreSQL (ACID), Redis (DIRT).
  46. Storage (chart: latency vs. capacity). Amazon S3 (BASE), Apache Cassandra (BASE), PostgreSQL (ACID), Redis (DIRT). Annotations: latency, jittery, limited throughput, low memory.
  47. Low Memory. http://aws.amazon.com/ec2/#instance
  48. Jittery, Limited Throughput: Network Block Storage (EBS). https://app.datad0g.com/dash/dash/1032#/date_range/1308608717016-1309213517016
  49. Limited Throughput In Numbers (RAID 0 EBS volumes, m1.large instances). Total: 368Mb/s. Callouts: rd_sec/s is read throughput in sectors/s, await is average wait in ms, svctm is average service time in ms.

      Time         DEV       tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await   svctm  %util
      03:35:02 PM  dev8-80   375.95  23614.08  5.70      62.83     47.21     125.58  1.26   47.34
      03:35:02 PM  dev8-96   373.63  23749.65  5.64      63.58     45.55     121.91  1.22   45.72
      03:35:02 PM  dev8-112  375.28  23693.47  5.52      63.15     45.52     121.22  1.23   46.31
      03:35:02 PM  dev8-128  375.31  23721.57  7.19      63.22     56.00     148.96  1.34   50.35
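
The numbers on slide 49 can be cross-checked with a little arithmetic. The sketch below copies the four rows from the slide and assumes 512-byte sectors, which is the unit sysstat documents for `rd_sec/s`; the function names are illustrative only. Summing the read rates gives roughly 390 Mbit/s in decimal units (about 370 in binary units), the same ballpark as the slide's 368Mb/s total. The queue-depth check uses Little's law (items in the system = arrival rate × time in the system), which is a standard result rather than something stated in the talk.

```python
# Rows copied from slide 49: (device, tps, rd_sec/s, avgqu-sz, await in ms).
ROWS = [
    ("dev8-80", 375.95, 23614.08, 47.21, 125.58),
    ("dev8-96", 373.63, 23749.65, 45.55, 121.91),
    ("dev8-112", 375.28, 23693.47, 45.52, 121.22),
    ("dev8-128", 375.31, 23721.57, 56.00, 148.96),
]

SECTOR_BYTES = 512  # assumption: sysstat reports 512-byte sectors


def total_read_megabits_per_s(rows):
    """Sum read sectors/s over all volumes and convert to decimal Mbit/s."""
    sectors_per_s = sum(row[2] for row in rows)
    return sectors_per_s * SECTOR_BYTES * 8 / 1e6


def slowest_volume(rows):
    """Name of the device with the highest average wait (await)."""
    return max(rows, key=lambda row: row[4])[0]


def queue_depth_from_littles_law(tps, await_ms):
    """Little's law: average queue size = request rate * time in system."""
    return tps * await_ms / 1000.0


if __name__ == "__main__":
    print(f"total read: {total_read_megabits_per_s(ROWS):.1f} Mbit/s")
    print(f"slowest volume: {slowest_volume(ROWS)}")
    for dev, tps, _, avgqu, await_ms in ROWS:
        est = queue_depth_from_littles_law(tps, await_ms)
        print(f"{dev}: reported avgqu-sz {avgqu}, estimated {est:.2f}")
```

The service time (`svctm`) stays near 1.3 ms while `await` exceeds 120 ms, so nearly all of the wait is queueing rather than service.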
  50. Some Tricks
  51. Some Tricks. Software RAID: RAID 0, offsite backups.
  52. Some Tricks. Software RAID: RAID 0, offsite backups (limited by slowest volume).
  53. Some Tricks. Software RAID: RAID 0, offsite backups (limited by slowest volume). Streaming replication, S3 backups.
  54. Some Tricks. Software RAID: RAID 0, offsite backups (limited by slowest volume). Ephemeral volumes: and offsite backups, streaming replication, S3 backups.
  55. Some Tricks. Software RAID: RAID 0, offsite backups (limited by slowest volume). Ephemeral volumes: and offsite backups, streaming replication, S3 backups (complexity, Recovery Time Objective, Recovery Point Objective).
  56. Some Tricks. Software RAID: RAID 0, offsite backups (limited by slowest volume). Ephemeral volumes: and offsite backups, streaming replication, S3 backups (complexity, Recovery Time Objective, Recovery Point Objective). Database service: MySQL/Oracle RDS.
  57. Some Tricks. Software RAID: RAID 0, offsite backups (limited by slowest volume). Ephemeral volumes: and offsite backups, streaming replication, S3 backups (complexity, Recovery Time Objective, Recovery Point Objective). Database service: MySQL/Oracle RDS (trust: RDS outage 2 months ago).
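
The "limited by slowest volume" caveat for RAID 0 can be illustrated with a small simulation. This is a simplified model, not a measurement: it assumes a full-stripe request finishes only when every member volume has answered, and it draws per-volume latencies from an exponential distribution with a hypothetical 10 ms mean to stand in for jitter.

```python
import random


def stripe_latency_ms(volume_latencies_ms):
    """A full-stripe request completes when the slowest volume answers."""
    return max(volume_latencies_ms)


def mean_stripe_latency_ms(n_volumes, n_requests=10_000, mean_ms=10.0, seed=0):
    """Average full-stripe latency across n_volumes jittery volumes.

    mean_ms is a hypothetical per-volume mean, not a measured EBS figure.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_requests):
        latencies = [rng.expovariate(1.0 / mean_ms) for _ in range(n_volumes)]
        total += stripe_latency_ms(latencies)
    return total / n_requests


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} volumes: mean stripe latency {mean_stripe_latency_ms(n):.1f} ms")
```

Under this model, adding volumes raises aggregate bandwidth but also raises the chance that at least one member is having a slow moment, so the per-request latency of the array degrades as the stripe widens.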
  58. Network Block Storage Is The Dark Side
  59. Network Block Storage Is The Dark Side. Bait for enterprise customers.
  60. Network Block Storage Is The Dark Side. Bait for enterprise customers. Hard problem for cloud providers.
  61. Some Do’s And Don’ts. Don’t rely on networked block storage: small data sets only, if you have to. Don’t trust data-at-rest: copy, replicate, back up. Do use S3 if you can: object semantics a limitation; slow but durable.
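
One way to act on "don't trust data-at-rest: copy, replicate, back up" is to verify every copy against a checksum of the source. The sketch below is a minimal local-filesystem illustration using only the Python standard library; `backup_and_verify` and the destination paths are hypothetical, and a real setup would copy to other machines or to S3 rather than to local paths.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, destinations):
    """Copy source to each destination and re-read each copy to check it.

    Returns a dict mapping destination path to True when the copy's
    checksum matches the source's checksum.
    """
    expected = sha256_of(source)
    results = {}
    for dest in destinations:
        dest = Path(dest)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, dest)
        results[str(dest)] = sha256_of(dest) == expected
    return results
```

Storing the expected digest alongside each backup also lets a later job re-verify copies that have been sitting at rest.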
  62. Compute
  63. Compute (chart: “performance” vs. number of nodes). ACID: scale up, then shard nodes. BASE, DIRT: add more nodes.
  64. Some Do’s And Don’ts. Don’t rely on scale-ups: low memory a hard limit for DBs; noisy neighbors; individual performance poor and jittery. Scale out: first scale up, then shard; parallelize across machines; vector-processing via GPUs.
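
The "then shard" step on slide 64 amounts to routing each key to one of several nodes. The sketch below shows one common approach, hashing the key and taking it modulo the shard count; the talk does not say which scheme Datadog uses, and `shard_for`, `group_by_shard`, and the sample keys are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, n_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so would not route consistently.
    """
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def group_by_shard(keys, n_shards):
    """Bucket keys by the shard that would own them."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[shard_for(key, n_shards)].append(key)
    return dict(buckets)


if __name__ == "__main__":
    metrics = [f"host{i}.cpu.load" for i in range(12)]
    for shard, keys in sorted(group_by_shard(metrics, 4).items()):
        print(shard, keys)
```

A known limitation of plain modulo routing is that changing `n_shards` remaps most keys; consistent hashing is the usual alternative when nodes are added or removed often.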
  65. Management
  66. An API for everything: Compute, Storage, Network, Management
  67. Some Do’s And Don’ts. Do use the AWS APIs: almost like magic; rich libraries; ever expanding. Do use tools: e.g. Chef, Puppet, cfengine, etc.; Datadog. Do kill and respawn: low-level debugging impossible; instance creation is cheap.
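
The "kill and respawn" advice can be sketched as a small reconciliation loop. To keep the example self-contained it runs against `FakeCloud`, an in-memory stand-in invented for this illustration; it is not the AWS API, and its `launch` and `terminate` methods are hypothetical placeholders for whatever calls a real cloud library provides.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, not a real client)."""

    instances: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self, role):
        instance_id = f"i-{next(self._ids):04d}"
        self.instances[instance_id] = {"role": role, "healthy": True}
        return instance_id

    def terminate(self, instance_id):
        del self.instances[instance_id]


def kill_and_respawn(cloud):
    """Replace every unhealthy instance with a fresh one of the same role.

    Returns a dict mapping each terminated instance id to its replacement.
    """
    replaced = {}
    for instance_id, info in list(cloud.instances.items()):
        if not info["healthy"]:
            cloud.terminate(instance_id)
            replaced[instance_id] = cloud.launch(info["role"])
    return replaced


if __name__ == "__main__":
    cloud = FakeCloud()
    web = cloud.launch("web")
    cloud.launch("db")
    cloud.instances[web]["healthy"] = False  # simulate a misbehaving node
    print(kill_and_respawn(cloud))
```

The loop never tries to diagnose the sick instance, which matches the slide's reasoning: low-level debugging is not possible and a new instance is cheap.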
  68. New Rules, New Tools, New Playbook, Same Fundamentals
  69. Questions! http://datadoghq.com twitter: @alq
