How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale Networks in Real-time


  1. 1. The Connection Game: How We Use Kinesis Streams to Analyze Billions of Network Traffic Flows in Real-Time. John Bennett, Senior Software Engineer, Cloud Network Engineering
  2. 2. ● 93 million customers ● Over 190 countries ● 37% of Internet traffic ● 125 million hours of video Netflix is big
  3. 3. ● 100s of microservices ● 1,000s of deployments ● More than 100,000 instances And complex
  4. 4. How do we optimize the design and use of the network at scale in a dynamic environment?
  5. 5. How is the network being used?
  6. 6. ● Immutable infrastructure ● Scaling events ● Internal (instances and containers) ● External (AWS S3, ELB, Internet, etc.) IPs in a Dynamic Environment
  7. 7. Metadata Changes Over Time
  8. 8. Metadata Changes Over Time
  9. 9. Metadata Changes Over Time
  10. 10. Metadata Changes Over Time
  11. 11. ● Slowly changing dimension ● Unpredictable ● Valid during a specific time interval Metadata in a Dynamic Environment
  12. 12. Source IP, Destination IP at time t
  13. 13. Source Metadata, Destination Metadata at time t
  14. 14. Dredge Transforms traffic logs into enriched and aggregated multi-dimensional data
  15. 15. ● Account ● Region ● Availability Zone ● VPC, Subnet ● Protocol (TCP, UDP) ● Accept or Reject ● Application ● Cluster ● Type (instance, container, AWS service) Metadata Dimensions
  16. 16. ● Bytes transferred ● Packets sent ● Number of flows ● Latency Aggregated Metrics
  17. 17. ● OLAP-style (Online Analytical Processing) ● Rollup, e.g. all apps deployed to the same region roll up to that region ● Drill down, e.g. which apps deployed to a region generate the most traffic? ● Slicing and dicing, e.g. which apps generate the most traffic in a region by day? Queries
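The three query styles on the Queries slide can be illustrated with a small in-memory sketch. The records, field names, and the `sum_bytes` helper are hypothetical and only stand in for the enriched data; the talk itself runs these queries against an OLAP store, not Python dictionaries.

```python
from collections import defaultdict

# Hypothetical enriched records; field names and values are illustrative only.
RECORDS = [
    {"region": "us-east-1", "app": "foo", "day": "2017-03-28", "bytes": 100},
    {"region": "us-east-1", "app": "bar", "day": "2017-03-28", "bytes": 300},
    {"region": "us-east-1", "app": "foo", "day": "2017-03-29", "bytes": 50},
    {"region": "eu-west-1", "app": "foo", "day": "2017-03-28", "bytes": 70},
]


def sum_bytes(records, dims, where=None):
    """Sum bytes grouped by the given dimensions, optionally filtered by exact match."""
    totals = defaultdict(int)
    for record in records:
        if where and any(record[key] != value for key, value in where.items()):
            continue
        totals[tuple(record[dim] for dim in dims)] += record["bytes"]
    return dict(totals)


# Rollup: every app in a region contributes to that region's total.
rollup = sum_bytes(RECORDS, ["region"])

# Drill down: within one region, which apps generate the most traffic?
drill_down = sum_bytes(RECORDS, ["app"], where={"region": "us-east-1"})

# Slice and dice: traffic per app per day within one region.
slice_and_dice = sum_bytes(RECORDS, ["app", "day"], where={"region": "us-east-1"})
```

Each query is the same group-by-and-sum over a different choice of dimensions and filter, which is why a store built for fast aggregations suits all three.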
  18. 18. ● Large dataset (billions of events per day) ● Multiple dimensions and metrics ● Ad-hoc OLAP queries ● Fast aggregations ● Real-time New source for network analytics
  19. 19. Dredge Ingest Network data from the entire system Enrich Traffic logs with application metadata Aggregate Multi-dimensional metrics
  20. 20. Flow Logs AWS API for network traffic flows
  21. 21. ● Good: Wide coverage ● Good: Consolidated ● Good: Core info (src and dst IP, timestamp) ● Bad: 10-minute capture window ● Ugly: Stateless Flow Logs Overview
  22. 22. Example
  23. 23. { accountID: 123456789010, eniID: eni-abc123de, srcIP: 172.31.16.139, srcPort: 12345, dstIP: 10.13.67.49, dstPort: 80, protocol: 6, packets: 123, bytes: 42, start: 1490746336, end: 1490746369, action: ACCEPT, ... }
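As a rough illustration of where such an event comes from, the sketch below parses one space-separated record in the default (version 2) VPC Flow Logs format into a typed structure. The `FlowLogRecord` type and `parse_flow_log` function are hypothetical names, not part of Dredge, and the sketch assumes every numeric field is present; records with a log status of NODATA or SKIPDATA carry `-` placeholders and are not handled here.

```python
from typing import NamedTuple


class FlowLogRecord(NamedTuple):
    """One record in the default (version 2) VPC Flow Logs format."""
    version: int
    account_id: str
    eni_id: str
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    packets: int
    num_bytes: int
    start: int
    end: int
    action: str
    log_status: str


def parse_flow_log(line: str) -> FlowLogRecord:
    """Parse a space-separated flow log line; assumes all numeric fields are present."""
    fields = line.split()
    if len(fields) != 14:
        raise ValueError(f"expected 14 fields, got {len(fields)}")
    return FlowLogRecord(
        version=int(fields[0]),
        account_id=fields[1],
        eni_id=fields[2],
        src_ip=fields[3],
        dst_ip=fields[4],
        src_port=int(fields[5]),
        dst_port=int(fields[6]),
        protocol=int(fields[7]),
        packets=int(fields[8]),
        num_bytes=int(fields[9]),
        start=int(fields[10]),
        end=int(fields[11]),
        action=fields[12],
        log_status=fields[13],
    )


record = parse_flow_log(
    "2 123456789010 eni-abc123de 172.31.16.139 10.13.67.49 "
    "12345 80 6 123 42 1490746336 1490746369 ACCEPT OK"
)
```

The account ID is kept as a string so that leading zeros survive.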
  24. 24. Given a VpcFlowLogEvent {srcIP: 172.31.16.139, dstIP: 10.13.67.49, …} Enriched with application metadata {srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar},…} Aggregated and indexed App foo sent 426718 bytes to app bar today
  25. 25. Given a VpcFlowLogEvent {srcIP: 172.31.16.139, dstIP: 10.13.67.49, …} Enriched with application metadata {srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar},…} Aggregated and indexed App bar received 8278392 bytes from apps foo and baz in the last week
  26. 26. Given a VpcFlowLogEvent {srcIP: 172.31.16.139, dstIP: 10.13.67.49, …} Enriched with application metadata {srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar},…} Aggregated and indexed App baz has outbound network dependencies on apps foo, bar, etc.
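The three examples above share one shape: attach application metadata to each event, then sum a metric by the enriched dimensions. A minimal sketch follows, assuming a static IP-to-application table; the table, the `enrich` and `bytes_between_apps` helpers, and the `unknown` fallback are all hypothetical. The real lookup is time-dependent, as the Enrich slides describe.

```python
from collections import Counter

# Hypothetical static lookup; the real metadata changes over time.
IP_TO_APP = {
    "172.31.16.139": "foo",
    "10.13.67.49": "bar",
    "10.13.67.50": "baz",
}


def enrich(event, ip_to_app):
    """Return a copy of the event with srcMetadata and dstMetadata attached."""
    enriched = dict(event)
    enriched["srcMetadata"] = {"app": ip_to_app.get(event["srcIP"], "unknown")}
    enriched["dstMetadata"] = {"app": ip_to_app.get(event["dstIP"], "unknown")}
    return enriched


def bytes_between_apps(events, ip_to_app):
    """Sum bytes per (source app, destination app) pair."""
    totals = Counter()
    for event in events:
        enriched = enrich(event, ip_to_app)
        key = (enriched["srcMetadata"]["app"], enriched["dstMetadata"]["app"])
        totals[key] += event["bytes"]
    return totals


events = [
    {"srcIP": "172.31.16.139", "dstIP": "10.13.67.49", "bytes": 42},
    {"srcIP": "172.31.16.139", "dstIP": "10.13.67.49", "bytes": 58},
    {"srcIP": "10.13.67.50", "dstIP": "10.13.67.49", "bytes": 10},
]
totals = bytes_between_apps(events, IP_TO_APP)
```

The keys of `totals` also give the dependency view: every pair with a non-zero total is an outbound dependency of the source app on the destination app.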
  27. 27. Patterns
  28. 28. Read This Book: The following diagrams are adapted from Kleppmann’s talks on patterns for real-time stream processing.
  29. 29. Streams of Immutable events
  30. 30. ● Database indexes and secondary indexes ● Materialized views ● Caching Derived Data, Read-Optimized
  31. 31. ● Separation of concerns for reading and writing ● Changelog stream is a 1st class citizen ● Consume and join streams instead of querying DB ● Maintain materialized views ● Pre-computed cache Unbundle the database
  32. 32. Ingest
  33. 33. ● Integration with AWS services ● Kinesis Client Library (KCL) ● Auto-scaling for elastic throughput ● Total Cost of Ownership (TCO) Kinesis Over Kafka
  34. 34. Cross-Account Log Sharing
  35. 35. ● Worker per EC2 instance ○ Multiple record processors per worker ○ Record processor per shard ● Load balancing between workers ● Checkpointing (with DynamoDB) ● Stream- and shard-level metrics Kinesis Client Library
  36. 36. VPC Flow Logs IncomingBytes per hour Example account and region over 1 week Elastic throughput
  37. 37. VPC Flow Logs IncomingBytes per minute Example account and region over 3 hours Elastic throughput
  38. 38. ● Very little operational overhead ○ Monitor stream metrics and DynamoDB table ○ Run and manage auto-scaling util ● No consultation from internal Kafka team ○ Capacity planning ○ Monitoring, failover, and replication TCO
  39. 39. ● Per-shard limits ○ Increase shard count or fan out to other streams ● No log compaction ○ Up to 7-day max retention ○ Manual snapshots, increased complexity ○ Not ideal for changelog joins Limitations
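To make the per-shard limit concrete, here is a back-of-the-envelope sizing sketch. It assumes the commonly documented per-shard ingest limits of roughly 1 MB/s and 1,000 records/s; both are passed as parameters because current quotas should be checked rather than taken from this example. The `shards_needed` function and its `headroom` parameter are hypothetical and not part of any AWS tooling.

```python
import math


def shards_needed(
    bytes_per_sec,
    records_per_sec,
    shard_bytes_per_sec=1_000_000,  # Assumed per-shard write limit (about 1 MB/s).
    shard_records_per_sec=1_000,    # Assumed per-shard write limit (records/s).
    headroom=0.25,                  # Hypothetical safety margin for traffic spikes.
):
    """Estimate the shard count that covers both the byte and record ingest rates."""
    scale = 1 + headroom
    by_bytes = bytes_per_sec * scale / shard_bytes_per_sec
    by_records = records_per_sec * scale / shard_records_per_sec
    # Whichever limit is hit first determines the shard count; never go below one.
    return max(1, math.ceil(max(by_bytes, by_records)))


# Example: 8 MB/s and 2,000 records/s with 25% headroom is byte-bound.
estimate = shards_needed(8_000_000, 2_000)
```

When the estimate exceeds what a single stream should carry, the slide's alternative applies: fan out to additional streams.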
  40. 40. ● Kinesis enables us to focus ● Cross-account log sharing simplifies the system ● KCL does the boring stuff ● Auto-scaling improves efficiency ● Lower TCO Ingest: Lessons
  41. 41. Enrich
  42. 42. Address metadata is temporal
  43. 43. ● Hash table of sorted lists ● Key is IP, Value is metadata sorted by timestamp ● Recent updates (within capture window) or last ● Join with flow log events stream Address Metadata Changelog
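The structure described on this slide, a hash table keyed by IP whose values are metadata entries sorted by timestamp, can be sketched as follows. The class and method names are hypothetical, and pruning of entries older than the capture window is omitted.

```python
import bisect
from collections import defaultdict


class AddressMetadataTable:
    """Hash table keyed by IP; each value is a timestamp-sorted list of metadata."""

    def __init__(self):
        # Parallel lists per IP so that bisect can search plain timestamps.
        self._timestamps = defaultdict(list)
        self._metadata = defaultdict(list)

    def apply(self, ip, timestamp, metadata):
        """Apply one changelog update, keeping the per-IP lists sorted."""
        index = bisect.bisect_right(self._timestamps[ip], timestamp)
        self._timestamps[ip].insert(index, timestamp)
        self._metadata[ip].insert(index, metadata)

    def lookup(self, ip, timestamp):
        """Return the metadata in effect for ip at timestamp, or None if unknown."""
        times = self._timestamps.get(ip)
        if not times:
            return None
        index = bisect.bisect_right(times, timestamp) - 1
        return self._metadata[ip][index] if index >= 0 else None


table = AddressMetadataTable()
table.apply("10.13.67.49", 100, {"app": "foo"})
table.apply("10.13.67.49", 200, {"app": "bar"})  # The IP was reassigned at t=200.
```

Joining a flow log event then means calling `lookup` with the event's IP and its timestamp, so that a flow captured before the reassignment is still attributed to `foo`.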
  44. 44. Kafka Log Compaction
  45. 45. ● Inbound: src port ephemeral, dst port non-ephemeral ● Outbound: src port ephemeral, dst port non-ephemeral ● Return: src port non-ephemeral, dst port ephemeral Derive TCP State
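One way to read this table is as a port-based heuristic. The sketch below is an interpretation, not the talk's implementation: the ephemeral threshold (the Linux default lower bound of 32768) and the use of the network interface's own IP to separate inbound from outbound are both assumptions, and the function names are hypothetical.

```python
# Assumption: ports at or above the Linux default ephemeral lower bound are ephemeral.
EPHEMERAL_PORT_MIN = 32768


def is_ephemeral(port, ephemeral_min=EPHEMERAL_PORT_MIN):
    """Return True if the port falls in the assumed ephemeral range."""
    return port >= ephemeral_min


def derive_direction(eni_ip, src_ip, src_port, dst_ip, dst_port):
    """Classify a flow record as inbound, outbound, return, or unknown."""
    src_ephemeral = is_ephemeral(src_port)
    dst_ephemeral = is_ephemeral(dst_port)
    if not src_ephemeral and dst_ephemeral:
        # A service port answering a client's ephemeral port.
        return "return"
    if src_ephemeral and not dst_ephemeral:
        # Assumption: the interface that logged the record decides the direction.
        if dst_ip == eni_ip:
            return "inbound"
        if src_ip == eni_ip:
            return "outbound"
    return "unknown"


direction = derive_direction("10.0.0.5", "10.0.0.9", 50000, "10.0.0.5", 80)
```

Flows where both ports or neither port look ephemeral stay `unknown` rather than being guessed.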
  46. 46. ● Stream table join with changelog ● Log compaction for cold starts, bootstrapping ● Derive state from stateless Enrich: Lessons
  47. 47. Aggregate
  48. 48. Bucket deadline reached
  49. 49. … dataSchema: { dataSource: flowlogs, parser: { dimensionsSpec: { dimensions: [ srcApp, srcAccount, srcRegion, …, dstApp, dstAccount, dstRegion, … ], } } metricsSpec: [ { type: longSum, fieldName: packets }, { type: longSum, fieldName: bytes }
      ● Column-oriented ● Google BigQuery and PowerDrill ● Ad-hoc OLAP queries ● Fast aggregations ● Multi-dimensional metrics ● Scales to trillions of events
  50. 50. ● Pre-aggregate into timestamp buckets ● Druid is a great fit for exploratory analytics ● Fast ad-hoc queries, < 1 second Aggregate: Lessons
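Pre-aggregating into timestamp buckets can be sketched as a group-by on a truncated timestamp plus the dimensions of interest. The field names, the 60-second bucket size, and the helper functions are assumptions for illustration; emitting a bucket once its deadline is reached is left out.

```python
from collections import defaultdict


def bucket_start(timestamp, bucket_seconds=60):
    """Truncate a Unix timestamp to the start of its bucket."""
    return timestamp - (timestamp % bucket_seconds)


def pre_aggregate(events, bucket_seconds=60):
    """Sum bytes, packets, and flow counts per (bucket, srcApp, dstApp)."""
    buckets = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0})
    for event in events:
        key = (
            bucket_start(event["start"], bucket_seconds),
            event["srcApp"],
            event["dstApp"],
        )
        buckets[key]["bytes"] += event["bytes"]
        buckets[key]["packets"] += event["packets"]
        buckets[key]["flows"] += 1
    return dict(buckets)


events = [
    {"start": 125, "srcApp": "foo", "dstApp": "bar", "bytes": 42, "packets": 3},
    {"start": 130, "srcApp": "foo", "dstApp": "bar", "bytes": 58, "packets": 2},
    {"start": 185, "srcApp": "foo", "dstApp": "bar", "bytes": 10, "packets": 1},
]
aggregated = pre_aggregate(events)
```

Three input events collapse into two rows here; at billions of events per day, that reduction is what keeps downstream ad-hoc queries fast.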
  51. 51. Results
  52. 52. Pivot / Swiv Demo Drag-and-drop UI
  53. 53. Pivot / Swiv Demo Contextual exploration
  54. 54. Pivot / Swiv Demo Comparison
  55. 55. Exploratory Analysis with Pivot / Swiv Demo Bytes sent per application, table
  56. 56. Exploratory Analysis with Pivot / Swiv Demo Bytes sent per application, split by hour, line chart
  57. 57. Exploratory Analysis with Pivot / Swiv Demo Bytes sent by example application, split by hour, line chart
  58. 58. Exploratory Analysis with Pivot / Swiv Demo Comparison of bytes, flows, and packets, split by day, line chart
  59. 59. ● Auditing AWS security groups (virtual firewalls) ● Anomaly and threat detection ● Deployment best practices ● Cost analysis Other Use Cases
  60. 60. Example application as a network graph
  61. 61. Example application as a network graph (You are here)
  62. 62. Enriched and aggregated traffic data is a powerful source of information for network design and optimization.
  63. 63. @yo_bennett
