Reducing Latency and Increasing Performance while Cutting Infrastructure Costs

A discussion of Datadog’s experiences, both successes and challenges, in building monitoring solutions on top of AWS Lambda and Amazon API Gateway, with the goal of reducing latency and increasing performance while cutting infrastructure costs.

Published in: Technology

  1. 1. From Pull to Push: Using Lambda to minimize latency and reduce operational costs (tristan@datadoghq.com)
  2. 2. Datadog gathers performance data from all your application components
  3. 3. What is Datadog?
  4. 4. What is Datadog? ELB
  5. 5. What is Datadog? ELB Web Servers
  6. 6. What is Datadog? ELB Web Servers Databases
  7. 7. What is Datadog? ELB Web Servers Databases Custom
  8. 8. What is Datadog? Datadog gathers performance data from all your application components. Detailed metrics about one of the components.
  9. 9. Agent-based data collection; crawler-based data collection
  10. 10. Agenda ● Pulling data via crawlers generates latency and operational cost ● Using Lambda to minimize latency and reduce operational costs
  11. 11. Agenda ● Pulling data via crawlers generates latency and operational cost ● Using Lambda to minimize latency and reduce operational costs
  12. 12. AWS is one of our most used integrations
  13. 13. What is Datadog?
  14. 14. What is Datadog?
  15. 15. Latency as seen by users. Upstream latency: data points are available after 2 to 11 minutes. Scheduling latency: we schedule crawlers every 1 to 10 minutes. Crawler latency: how much time it takes to fetch the data. Crawler-based metrics can’t be real time.
  16. 16. Latency as seen by users. Upstream latency: data points are available after 2 to 11 minutes. Scheduling latency: we schedule crawlers every 1 to 10 minutes. Crawler latency: how much time it takes to fetch the data. Crawler-based metrics can’t be real time. Other costs: throttling, API call charges, infrastructure cost.
  17. 17. How can we minimize metric latency?
  18. 18. Agenda ● Pulling data via crawlers generates latency and operational cost ● Using Lambda to minimize latency and reduce operational costs
  19. 19. AWS Lambda https://aws.amazon.com/lambda/
  20. 20. AWS Lambda - Serverless / Operation-less
  21. 21. AWS Lambda - Serverless / Operation-less - Event-driven architecture with many integrations
  22. 22. AWS Lambda - Serverless / Operation-less - Event-driven architecture with many integrations - API Gateway for custom integrations
  23. 23. Agenda ● Pulling data via crawlers generates latency and operational cost ● Using Lambda to minimize latency and reduce operational costs ○ RDS enhanced monitoring with Lambda
  24. 24. Amazon RDS enhanced monitoring CloudWatch Metrics
  25. 25. Amazon RDS enhanced monitoring CloudWatch Metrics CloudWatch Logs
  26. 26. Amazon RDS enhanced monitoring
  27. 27. How to sync data between CloudWatch Logs and Datadog? (Diagram: CloudWatch Logs, ?, Datadog)
  28. 28. How to sync data between CloudWatch Logs and Datadog? Crawler: pull data from CloudWatch Logs, submit data to Datadog.
  29. 29. How to sync data between CloudWatch Logs and Datadog? Crawler: pull data from CloudWatch Logs, submit data to Datadog. Alternative: push with Lambda.
  30. 30. Lambda function that subscribes to RDSOS Metrics
  31. 31. Lambda function to process RDSOS Metrics
  32. 32. Lambda function to process RDSOS Metrics
  33. 33. Lambda allows sub-minute latency for those metrics (chart: crawler-based vs. Lambda-based)
  34. 34. Lambda allows sub-minute latency for those metrics (chart: crawler-based vs. Lambda-based)
  35. 35. Amazon RDS enhanced monitoring: + Sub-minute latency + No crawler to run and maintain + No internal state to remember which points to process + No Ops
  36. 36. Amazon RDS enhanced monitoring: + Sub-minute latency + No crawler to run and maintain + No internal state to remember which points to process + No Ops - Not as easy to set up and troubleshoot - Not easy to update - No ad hoc replay
  37. 37. Agenda ● Pulling data via crawlers generates latency and operational cost ● Using Lambda to minimize latency and reduce operational costs ○ Agent Release Process with Lambda
  38. 38. Datadog’s Agent releases
  39. 39. Datadog’s Agent releases: Invalidate the cache. Components: Jenkins, Amazon S3, AWS Lambda, Amazon CloudFront. Steps: push package, trigger Lambda, invalidate cache.
  40. 40. Datadog’s Agent releases: Notify the security team. Components: Jenkins, Amazon S3, AWS Lambda, Amazon CloudFront. Steps: push package, trigger Lambda, notify security, invalidate cache.
  41. 41. AWS Lambda. Stateless, simple, event-based tasks: +++. Stateful, complex tasks: ???
  42. 42. Agenda ● Pulling data via crawlers generates latency and operational cost ● Using Lambda to minimize latency and reduce operational costs ○ Using Lambda to extract custom metrics from Lambda
  43. 43. Diagram: applications on a host call increment(metric, value) and push over UDP to the Agent (aggregation server), which sends to Datadog. A Lambda function is shown alongside the host.
  44. 44. Diagram: the same host-based pipeline (applications, push over UDP, Agent as aggregation server, Datadog), now next to many Lambda functions.
  45. 45. Diagram: the same host-based pipeline next to many Lambda functions. How do the Lambda functions send their metrics to Datadog?
  46. 46. Diagram: the Lambda functions write metrics to CloudWatch Logs, for example log(MONITORING|user.submit|count|14733433|1|#demo). Line format: MONITORING|name|type|timestamp|value|#tags. How does the data get from CloudWatch Logs to Datadog?
  47. 47. How to submit data from CloudWatch Logs to Datadog? (Diagram: CloudWatch Logs, Datadog Intake)
  48. 48. How to submit data from CloudWatch Logs to Datadog? CloudWatch Logs contains: user.submit|1|timestamp1, user.submit|1|timestamp2, user.submit|1|timestamp1. Datadog Intake expects: user.submit|1|timestamp2, user.submit|2|timestamp1.
  49. 49. We need to aggregate the data points. CloudWatch Logs contains: user.submit|1|timestamp1, user.submit|1|timestamp2, user.submit|1|timestamp1. Aggregation: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1]. Datadog Intake expects: user.submit|1|timestamp2, user.submit|2|timestamp1.
  50. 50. A crawler from CloudWatch Logs to Datadog. The crawler pulls the raw data points from CloudWatch Logs, aggregates them (user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1]), and submits user.submit|1|timestamp2, user.submit|2|timestamp1 to Datadog.
  51. 51. How can we build this with Lambda? CloudWatch Logs contains: user.submit|1|timestamp1, user.submit|1|timestamp2, user.submit|1|timestamp1. Push to aggregation: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1]. Push to Datadog Intake, which expects: user.submit|1|timestamp2, user.submit|2|timestamp1.
  52. 52. How can we build this with Lambda? CloudWatch Logs contains: user.submit|1|timestamp1, user.submit|1|timestamp2, user.submit|1|timestamp1. Push to an Aggregation Service: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1]. Push to Datadog Intake, which expects: user.submit|1|timestamp2, user.submit|2|timestamp1.
  53. 53. We need a stateful Lambda pipeline to aggregate metrics. CloudWatch Logs contains: user.submit|1|timestamp1, user.submit|1|timestamp2, user.submit|1|timestamp1. Push to an Aggregation Service: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1]. Push to Datadog Intake, which expects: user.submit|1|timestamp2, user.submit|2|timestamp1.
  54. 54. Building a stateful Lambda pipeline. Aggregation Service input: user.submit|1|ts1, user.submit|1|ts2, user.submit|1|ts1. State: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1]. Output: user.submit|1|ts2, user.submit|2|ts1.
  55. 55. A database stores the state. Aggregation Service store options: DynamoDB, Kinesis, ElastiCache. State: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1].
  56. 56. API Gateway forwards the data points to the store. Aggregation Service: API Gateway, Lambda, and the store (DynamoDB, Kinesis, ElastiCache). State: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1].
  57. 57. Lambda subscribes to state updates and forwards to Datadog. Aggregation Service: API Gateway, Lambda, and the store (DynamoDB, Kinesis, ElastiCache). State: user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1].
  58. 58. A serverless stateful pipeline to aggregate data points. Aggregation Service: API Gateway, Lambda, and the store (DynamoDB, Kinesis, ElastiCache). Input: user.submit|1|ts1, user.submit|1|ts2, user.submit|1|ts1. Output: user.submit|1|ts2, user.submit|2|ts1.
  59. 59. Takeaways
  60. 60. Takeaways - Lambda allows us to move towards a push system
  61. 61. Takeaways - Lambda allows us to move towards a push system - Lambda is great for small, stateless, event-based tasks
  62. 62. Takeaways - Lambda allows us to move towards a push system - Lambda is great for small, stateless, event-based tasks - We’re seeing adoption amongst our users
  63. 63. Thank you tristan@datadoghq.com
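
Slides 30 to 32 and slide 46 describe a Lambda function that receives CloudWatch Logs data and reads metric lines of the form MONITORING|name|type|timestamp|value|#tags. The deck does not show the function itself, so the following is only a minimal sketch. It relies on the standard CloudWatch Logs subscription payload (base64-encoded, gzip-compressed JSON under `awslogs.data`). The function names are hypothetical, comma-separated tags are an assumption, and the actual submission to Datadog is left out.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Return the log events carried by a CloudWatch Logs subscription payload."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return payload.get("logEvents", [])


def parse_metric_line(message):
    """Parse 'MONITORING|name|type|timestamp|value|#tags'; return None if malformed."""
    parts = message.strip().split("|")
    if len(parts) != 6 or parts[0] != "MONITORING":
        return None
    _, name, metric_type, timestamp, value, tags = parts
    try:
        return {
            "name": name,
            "type": metric_type,
            "timestamp": int(timestamp),
            "value": float(value),
            # Assumption: multiple tags are comma-separated after the '#'.
            "tags": [t for t in tags.lstrip("#").split(",") if t],
        }
    except ValueError:
        return None


def lambda_handler(event, context):
    """Collect the metric points found in one pushed batch of log events."""
    points = []
    for log_event in decode_subscription_event(event):
        point = parse_metric_line(log_event["message"])
        if point is not None:
            points.append(point)
    # Submitting the points to Datadog is intentionally omitted from this sketch.
    return points
```

Because CloudWatch Logs pushes each batch to the function, nothing here has to remember which points were already processed, which matches the "no internal state" advantage listed on slide 35.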

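Slides 48 to 53 explain that the raw points in CloudWatch Logs must be grouped per metric and timestamp before Datadog Intake can accept them. As an illustration only, the sketch below performs that grouping in memory. The function names are hypothetical, and summing each bucket assumes a count-type metric. In the pipeline on slides 55 to 58 the buckets would live in an external store (DynamoDB, Kinesis, or ElastiCache), because separate Lambda invocations do not share memory.

```python
from collections import defaultdict


def aggregate(points):
    """Group raw (name, value, timestamp) points into per-timestamp value lists."""
    buckets = defaultdict(list)
    for name, value, timestamp in points:
        buckets[(name, timestamp)].append(value)
    return dict(buckets)


def rollup_counts(buckets):
    """Sum each bucket so there is one value per (name, timestamp)."""
    return {key: sum(values) for key, values in buckets.items()}


# The three raw points shown on slide 48.
raw_points = [
    ("user.submit", 1, "timestamp1"),
    ("user.submit", 1, "timestamp2"),
    ("user.submit", 1, "timestamp1"),
]
buckets = aggregate(raw_points)
totals = rollup_counts(buckets)
```

With the slide's input, `buckets` holds [1, 1] for timestamp1 and [1] for timestamp2, and `totals` holds 2 and 1, which corresponds to the user.submit|2|timestamp1 and user.submit|1|timestamp2 points that the intake expects.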