
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401 - re:Invent 2017

Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you'll learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.


  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How Netflix Monitors Applications in Near Real-time with Amazon Kinesis. John Bennett, Sr. Software Engineer, Netflix. Roy Ben-Alta, Principal Business Development Manager, AWS. AWS re:Invent, ABD401, November 30, 2017
  2. 2. Session Topics • Amazon Kinesis • Log Analytics Use Case • Netflix Use Case • Questions
  3. 3. Kinesis at Amazon: Amazon CloudWatch Logs, Amazon S3 events, AWS Metering, Amazon.com’s catalog. Fact: we all use Amazon Kinesis
  4. 4. It’s all about the pace. Batch processing: hourly server logs, weekly or monthly bills, daily website clickstream, daily fraud reports. Stream processing: real-time metrics, real-time spending alerts/caps, real-time clickstream analysis, real-time detection
  5. 5. Amazon Kinesis: real-time analytics. Easily collect, process, and analyze video and data streams in real time. Kinesis Video Streams (new at re:Invent 2017): capture, process, and store video streams for analytics. Kinesis Data Streams: build custom applications that analyze data streams. Kinesis Data Firehose: load data streams into AWS data stores. Kinesis Data Analytics: analyze data streams with SQL
  6. 6. Log analytics use cases • Operational intelligence • Security intelligence and event management • Application performance and monitoring • Business analytics • Monitoring and operations are becoming a big data problem. Application logs, for example: [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
  7. 7. Analyzing CloudTrail Event Logs. Components: AWS CloudTrail, Amazon CloudWatch Events trigger, Amazon Kinesis Data Analytics, AWS Lambda function, Amazon S3 bucket for raw data, Amazon S3 bucket for processed data, Amazon DynamoDB table(s), Chart.js dashboard, Amazon Kinesis Data Firehose, Amazon Kinesis Data Firehose. Stages: ingest and deliver raw log data; compute operational metrics; deliver to real-time dashboards and archival • http://amzn.to/2ApHXKr
  8. 8. Netflix Uses Amazon Kinesis Data Streams to Analyze 7M+ Network Traffic Events per Second in Real-Time
  9. 9. Netflix Kinesis Use Case
  10. 10. Amazon Kinesis Customer Base Diversity Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real-Time
  11. 11. What is Netflix’s decision latency?
  12. 12. What's wrong with the network? Hint: It’s not the network
  13. 13. Why is the network so slow? Hint: Faraway places are far away
  14. 14. My service can’t connect to its dependencies. Hint: Distributed systems are hard
  15. 15. Background
  16. 16. Netflix is big. ● 104 million customers ● Over 190 countries ● 37% of U.S. Internet traffic ● 125 million hours of video
  17. 17. And complex. ● Dozens of accounts ● Multiple regions ● 100s of microservices ● 1,000s of deployments ● > 100,000 instances
  18. 18. Challenges ● No access to the underlying network ● Large traffic volume ● Billions of flows per day ● Gigabytes of logs per second ● Dynamic environment ● Logs are limited, e.g. IP-to-IP ● IP addresses are randomly assigned ● IP metadata varies over time and is unpredictable
  19. 19. AWS VPC Flow Logs ● Good: Wide coverage of network traffic ● Good: Consolidated ● Good: Core info (source and destination IP) ● Bad: 10-minute capture window ● Ugly: Stateless
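
A minimal sketch of what one of these log records carries, assuming the default (version 2) VPC Flow Logs layout of 14 space-separated fields. The sample line is illustrative rather than Netflix data, and `FlowRecord` and `parse_flow_record` are hypothetical names. The parsed record holds only addresses, ports, and counters, which is the "stateless" limitation the slide points at.

```python
from dataclasses import dataclass
from typing import Optional

# Field order of the default (version 2) VPC Flow Logs record.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


@dataclass(frozen=True)
class FlowRecord:
    account_id: str
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int
    packets: int
    byte_count: int
    start: int   # capture window start, Unix seconds
    end: int     # capture window end, Unix seconds
    action: str  # ACCEPT or REJECT


def parse_flow_record(line: str) -> Optional[FlowRecord]:
    """Parse one default-format record; return None for NODATA/SKIPDATA."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    raw = dict(zip(FIELDS, parts))
    if raw["log_status"] != "OK":
        # These records use "-" placeholders, so there is nothing to parse.
        return None
    return FlowRecord(
        account_id=raw["account_id"],
        interface_id=raw["interface_id"],
        srcaddr=raw["srcaddr"],
        dstaddr=raw["dstaddr"],
        srcport=int(raw["srcport"]),
        dstport=int(raw["dstport"]),
        protocol=int(raw["protocol"]),
        packets=int(raw["packets"]),
        byte_count=int(raw["bytes"]),
        start=int(raw["start"]),
        end=int(raw["end"]),
        action=raw["action"],
    )


# Illustrative record (made-up interface ID, ports, and counters).
SAMPLE = ("2 1234567890 eni-0abc123 172.31.16.139 10.13.67.49 "
          "49152 443 6 20 4249 1418530010 1418530070 ACCEPT OK")
record = parse_flow_record(SAMPLE)
```
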
  20. 20. At time t: Source IP 172.31.16.139, Destination IP 10.13.67.49
  21. 21. At time t: Source metadata: Service A, Account 1234567890, Zone us-east-1e. Destination metadata: Service B, Account 0987654321, Zone eu-west-1b
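
Because IP metadata varies over time, enrichment has to ask "who owned this IP at time t", not "who owns it now". The sketch below is one way to model that, assuming ownership changes arrive as (IP, valid-from time, metadata) tuples. `IpMetadataHistory` is a hypothetical helper, not the structure Dredge uses, and "Service C" is an invented later owner added for illustration.

```python
import bisect
from collections import defaultdict


class IpMetadataHistory:
    """Per-IP timeline of metadata, queried by timestamp."""

    def __init__(self):
        self._times = defaultdict(list)   # ip -> sorted valid-from times
        self._values = defaultdict(list)  # ip -> metadata, same order

    def record(self, ip, valid_from, metadata):
        """Register that `ip` carried `metadata` starting at `valid_from`."""
        idx = bisect.bisect_right(self._times[ip], valid_from)
        self._times[ip].insert(idx, valid_from)
        self._values[ip].insert(idx, metadata)

    def lookup(self, ip, at_time):
        """Return the metadata in effect for `ip` at `at_time`, or None."""
        times = self._times.get(ip)
        if not times:
            return None
        idx = bisect.bisect_right(times, at_time) - 1
        return self._values[ip][idx] if idx >= 0 else None


history = IpMetadataHistory()
history.record("172.31.16.139", 100,
               {"service": "Service A", "account": "1234567890",
                "zone": "us-east-1e"})
# Hypothetical reassignment of the same address to another service.
history.record("172.31.16.139", 200,
               {"service": "Service C", "account": "1234567890",
                "zone": "us-east-1e"})

source_at_t = history.lookup("172.31.16.139", 150)  # metadata valid at t=150
```
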
  22. 22. Goal ● Develop a new data source for network analytics ● Multiple dimensions (Netflix- and AWS-centric) ● Fast aggregations ● Enable ad-hoc OLAP-style queries ● Rollup, drill down, slicing and dicing ● Add observability to network ● Fill gap not addressed by existing tools
  23. 23. Dredge: enriched and aggregated multi-dimensional network traffic data
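
To make "rollup" and "drill down" concrete, here is a small sketch that aggregates enriched flows over a chosen set of dimensions. The dimension names and sample values are assumptions for illustration; the slides do not describe Dredge's actual schema or query engine.

```python
from collections import Counter


def rollup(flows, dimensions):
    """Sum bytes over `flows`, grouped by the given dimension names."""
    totals = Counter()
    for flow in flows:
        key = tuple(flow[d] for d in dimensions)
        totals[key] += flow["bytes"]
    return dict(totals)


# Hypothetical enriched flows.
flows = [
    {"src_service": "A", "dst_service": "B", "src_zone": "us-east-1e",
     "dst_zone": "eu-west-1b", "bytes": 500},
    {"src_service": "A", "dst_service": "B", "src_zone": "us-east-1d",
     "dst_zone": "eu-west-1b", "bytes": 300},
    {"src_service": "A", "dst_service": "C", "src_zone": "us-east-1e",
     "dst_zone": "us-east-1e", "bytes": 200},
]

# Rollup: coarse view by service pair.
by_service = rollup(flows, ["src_service", "dst_service"])
# Drill down: same data, split further by source zone.
by_service_zone = rollup(flows, ["src_service", "dst_service", "src_zone"])
```

Dropping a dimension from the list rolls the data up; adding one drills down.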
  24. 24. Amazon Kinesis
  25. 25. Why Amazon Kinesis? ● Integration with AWS services ● VPC Flow Logs ● S3 ● Elasticsearch ● Handles scale ● Kinesis Client Library (KCL) ● Total Cost of Ownership (TCO)
  26. 26. Strong AWS Integration ● Enables experimentation ● Load streaming data differently ● Batch with Kinesis Firehose ● Store in S3, Process with Lambda ● Elasticsearch as an intermediate store ● Stream with Kinesis Streams
  27. 27. Cross-Account Log Sharing
  28. 28. Handles event data at scale. [Chart: VPC Flow Logs IncomingBytes per hour, example account and region over 1 week]
  29. 29. Kinesis Client Library ● Worker per Amazon EC2 instance ○ Multiple record processors per worker ○ Record processor per shard ● Load balancing between workers ● Checkpointing (with Amazon DynamoDB) ● Stream- and shard-level metrics
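
The worker, record processor, and checkpoint roles listed on this slide can be modeled in a few lines. This is a plain-Python simulation for illustration only: it does not use the real KCL API, the round-robin assignment is a simplification of KCL's load balancing, and the `checkpoints` dict merely stands in for the DynamoDB table.

```python
def assign_shards(shard_ids, worker_ids):
    """Spread shards across workers round-robin (simplified balancing)."""
    assignment = {worker: [] for worker in worker_ids}
    for i, shard in enumerate(sorted(shard_ids)):
        assignment[worker_ids[i % len(worker_ids)]].append(shard)
    return assignment


class RecordProcessor:
    """One processor per shard; resumes after the last checkpoint."""

    def __init__(self, shard_id, checkpoints):
        self.shard_id = shard_id
        self.checkpoints = checkpoints  # shard_id -> last processed sequence

    def process(self, records):
        """Handle (sequence, data) pairs, skipping already-processed ones."""
        last_seen = self.checkpoints.get(self.shard_id, -1)
        handled = []
        for sequence, data in records:
            if sequence <= last_seen:
                continue  # processed before a restart; do not repeat
            handled.append(data)
            self.checkpoints[self.shard_id] = sequence
        return handled


checkpoints = {}
assignment = assign_shards(["shard-0", "shard-1", "shard-2", "shard-3"],
                           ["worker-a", "worker-b"])
processor = RecordProcessor("shard-0", checkpoints)
first_pass = processor.process([(1, "flow-1"), (2, "flow-2")])
# A replacement processor for the same shard picks up after sequence 2.
replacement = RecordProcessor("shard-0", checkpoints)
second_pass = replacement.process([(2, "flow-2"), (3, "flow-3")])
```
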
  30. 30. TCO ● Very little operational overhead ○ Monitor stream metrics and DynamoDB table ○ Leverage Auto-Scaling Utility for Kinesis Streams ● No overhead for Amazon Kinesis Firehose
  31. 31. Limitations ● Per-shard limits ○ Increase shard count or fan out to other streams ● No log compaction ○ Up to 7-day max retention ○ Manual snapshots, increased complexity ○ Not ideal for changelog joins
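
A rough way to reason about per-shard limits is to size the stream from the expected write load. The sketch below assumes the commonly documented per-shard write limits of about 1,000 records per second and about 1 MB per second; check the current Kinesis Data Streams quotas before relying on these defaults. The workload numbers in the example are made up and are not Netflix's.

```python
import math


def shards_needed(records_per_sec, avg_record_bytes,
                  max_records_per_shard=1_000,
                  max_bytes_per_shard=1_000_000):
    """Estimate the shard count needed to absorb a given write load."""
    by_count = math.ceil(records_per_sec / max_records_per_shard)
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / max_bytes_per_shard)
    # Whichever limit is hit first determines the shard count.
    return max(by_count, by_bytes, 1)


# Hypothetical workload: 50,000 records/s at 4 KB each is byte-bound.
estimate = shards_needed(records_per_sec=50_000, avg_record_bytes=4_000)
```

Batching many log events into each record lowers the record rate, which is one reason the byte limit, not the record limit, often ends up being the binding one.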
  32. 32. Stream Processing Patterns
  33. 33. Batch ● Delay: 24 hours (daily interval) ● Bounded, fixed-size input ● Measured by throughput (time to process input) ● Limitations ● Remote DB: Round-trip time, parallel queries could overload ● Local cache: Depends on distribution of data, how to handle invalidation ● Local DB: More effective, less contention and no network RTT
  34. 34. Patterns: Batch + DB
  35. 35. Patterns: Batch + DB
  36. 36. Stream ● Delay: 7 minutes, average case (capture window) ● Unbounded input as events happen ● Measured by how far consumer is behind ● Limitations (similar to batch) ● Remote DB: Round-trip time, parallel queries could overload ● Local cache: Depends on distribution of data, how to handle invalidation ● Local DB: More effective, less contention and no network RTT
  37. 37. Patterns: Stream + DB
  38. 38. Patterns: Stream + DB
  39. 39. Patterns: Stream + DB + Cache
  40. 40. Patterns: Stream + DB + Cache
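
A minimal sketch of the Stream + DB + Cache pattern: each lookup goes to a local cache first and falls back to the remote database only on a miss or an expired entry. The time-to-live is one simple answer to the invalidation question raised earlier; it is an assumption here, not necessarily what Netflix used. `TtlCache` and `fake_db` are hypothetical names.

```python
import time


class TtlCache:
    """Read-through cache whose entries expire after `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader        # called on a miss, e.g. a remote DB query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: no round trip
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def fake_db(ip):
    """Stand-in for the remote metadata store."""
    return {"service": "Service A"} if ip == "172.31.16.139" else None


cache = TtlCache(fake_db, ttl_seconds=60)
for _ in range(1000):
    cache.get("172.31.16.139")  # only the first call reaches fake_db
```

A short TTL bounds how stale the metadata can get, at the cost of more round trips; the skew of the IP distribution decides how much the cache actually saves.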
  41. 41. Derived Data ● e.g. database indexes, caches, materialized views ● Transformed from source of truth ● Optimized for read queries, improve performance ● Built from a changelog of events
  42. 42. Change Data Capture ● Log-based message broker to send change events ● Expose changelog stream as 1st class citizen ● Consume and join streams instead of querying DB ● Alternative view to query efficiently ● Update when data changes ● Removes network round-trip time, resource contention ● Pre-computed cache
  43. 43. Patterns: Stream + table join
  44. 44. Patterns: Stream + table join
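
The stream + table join can be sketched as a single consumer that reads two kinds of events in order: change events keep a local, read-optimized table up to date, and flow events are enriched from that table with no database query. The event shapes below are assumptions made for illustration; the slides do not specify Dredge's message format.

```python
def join_streams(events):
    """Enrich flow events using a table built from change events."""
    table = {}  # ip -> current metadata, derived from the changelog
    for event in events:
        if event["type"] == "change":
            if event["metadata"] is None:
                table.pop(event["ip"], None)  # IP released
            else:
                table[event["ip"]] = event["metadata"]
        elif event["type"] == "flow":
            yield {
                "src": table.get(event["src_ip"]),
                "dst": table.get(event["dst_ip"]),
                "bytes": event["bytes"],
            }


# Hypothetical interleaved input, already ordered by time.
events = [
    {"type": "change", "ip": "172.31.16.139", "metadata": "Service A"},
    {"type": "change", "ip": "10.13.67.49", "metadata": "Service B"},
    {"type": "flow", "src_ip": "172.31.16.139", "dst_ip": "10.13.67.49",
     "bytes": 4249},
    {"type": "change", "ip": "10.13.67.49", "metadata": None},
    {"type": "flow", "src_ip": "172.31.16.139", "dst_ip": "10.13.67.49",
     "bytes": 100},
]
enriched = list(join_streams(events))
```

The join is only as correct as the ordering of the two streams, which is why limited retention and the lack of log compaction make rebuilding the table from the changelog harder.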
  45. 45. Results
  46. 46. By the Numbers: 7 million network flows enriched per second. 5 minutes average delay from network flow occurrence. 1 Kinesis stream with 100s of shards.
  47. 47. What's wrong with the network? Dredge reduces mean-time-to-innocence.
  48. 48. Fault Domain 2 Account 0987654321, Zone eu-west-1a Fault Domain 1 Account 1234567890, Zone us-east-1e
  49. 49. Fault Domain 2 Account 0987654321, Zone eu-west-1a Fault Domain 1 Account 1234567890, Zone us-east-1e Bad code push?
  50. 50. Fault Domain 2 Account 0987654321, Zone eu-west-1a Fault Domain 1 Account 1234567890, Zone us-east-1e Network outage?
  51. 51. Why is the network so slow? Dredge identifies high-latency network flows.
  52. 52. Region us-east-1 Zone Affinity <1ms Zone us-east-1d Zone us-east-1e
  53. 53. Region us-east-1 Cross-zone < 2ms Zone us-east-1d Zone us-east-1e
  54. 54. Region us-east-1 Zone us-east-1d Zone us-east-1e Region us-west-2 Zone us-west-2a Zone us-west-2b Cross-region 30-300ms
  55. 55. Region us-east-1 Zone us-east-1d Region us-west-2 Zone us-west-2a Zone us-west-2b Cross-region fan-out 30-300ms
  56. 56. Initial Findings ● An estimated 23% of total traffic is cross-zone ● About 14% of total traffic is cross-region ● Some cross-zone and cross-region traffic is intentional
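
Once flows carry zone metadata, shares like these come from a simple classification. The sketch below assumes standard Availability Zone names, where the region is the zone name minus its trailing letter, and weights each class by bytes. The sample flows are invented and do not reproduce the percentages above.

```python
from collections import Counter


def region_of(zone):
    """Derive the region from a standard AZ name, e.g. us-east-1e."""
    return zone[:-1]


def classify(src_zone, dst_zone):
    """Label a flow as same-zone, cross-zone, or cross-region."""
    if src_zone == dst_zone:
        return "same-zone"
    if region_of(src_zone) == region_of(dst_zone):
        return "cross-zone"
    return "cross-region"


def traffic_shares(flows):
    """Return each class's fraction of total bytes."""
    totals = Counter()
    for flow in flows:
        totals[classify(flow["src_zone"], flow["dst_zone"])] += flow["bytes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: count / grand_total for label, count in totals.items()}


# Hypothetical enriched flows.
shares = traffic_shares([
    {"src_zone": "us-east-1d", "dst_zone": "us-east-1d", "bytes": 50},
    {"src_zone": "us-east-1d", "dst_zone": "us-east-1e", "bytes": 30},
    {"src_zone": "us-east-1d", "dst_zone": "us-west-2a", "bytes": 20},
])
```
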
  57. 57. My service can’t connect to its dependencies. Dredge classifies a service’s inbound and outbound dependencies.
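
A dependency map falls out of enriched flows by grouping on the service names at each end. The sketch below assumes every flow has already been oriented from caller to callee; raw flow logs record both directions of a connection, so a real pipeline would need an extra step to decide which side initiated it. Service names are placeholders.

```python
from collections import defaultdict


def dependency_map(flows):
    """Build outbound and inbound dependency sets from (src, dst) pairs."""
    outbound = defaultdict(set)  # service -> services it calls
    inbound = defaultdict(set)   # service -> services that call it
    for src_service, dst_service in flows:
        outbound[src_service].add(dst_service)
        inbound[dst_service].add(src_service)
    return dict(outbound), dict(inbound)


# Hypothetical caller -> callee flows after enrichment.
outbound, inbound = dependency_map([
    ("Service A", "Service B"),
    ("Service A", "Service C"),
    ("Service B", "Service C"),
    ("Service A", "Service B"),  # repeated flows collapse into one edge
])
```
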
  58. 58. Existing tools ● Distributed tracing via Salp ● Similar to Google’s Dapper ● Naive sampling ● JVM-centric ● Incomplete coverage ● Need to be a part of the main request path ● Difficult to capture startup dependencies ● Lacks support for protocols other than TCP IPv4
  59. 59. Outbound Dependencies using Tracing
  60. 60. Outbound Dependencies using Tracing Outbound Dependencies using Traffic Logs
  61. 61. Initial Findings ● Significant discrepancy between Dredge and Salp ● Sample of 100 services ● Dependencies from tracing are a subset ● Tracing is implemented inconsistently ● Higher coverage ● Connections to AWS services prove helpful
  62. 62. Security Use Cases ● Use network dependencies to audit security groups ● Reduce blast radius ● Only source of logs for Security Group rejected flows ● Reports communication with public Internet ● Threat detection, port scanning, etc. ● AWS resources (instances, load balancers) with increased exposure ● Risk profiles
  63. 63. Next
  64. 64. How can we do better? ● VPC Flow Logs give us a 10,000-ft view ● More detail and context ● Kernel-level metrics, eBPF ● Dynamic sampling rates ● Minimize variability ● Coordination
  65. 65. Conclusion
  66. 66. Enriched and aggregated traffic data can be a powerful source of information that adds visibility to the network.
  67. 67. Amazon Kinesis Streams and Firehose help us focus on processing events reliably and at scale.
  68. 68. We benefit from change data capture by consuming and joining streams using read-optimized data structures.
  69. 69. John Bennett Cloud Network Engineering bennett@netflix.com @yo_bennett
  70. 70. THANK YOU! benaltar@amazon.com bennett@netflix.com
