BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis

Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this talk, we’ll first discuss why Netflix chose Amazon Kinesis Streams over other streaming data solutions like Kafka to address these challenges at scale. We’ll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you’ll take away techniques and processes that you can apply to your large-scale networks and derive real-time, actionable insights.

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. John Bennett Sr. Software Engineer, Netflix Damian Wylie Amazon Kinesis Product Management The Connection Game How Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real-Time April 19, 2017
  2. 2. What is your decision latency? OR
  3. 3. The value of data Recent data is valuable If you act on it in real time Capture all of the value from your data
  4. 4. Amazon Kinesis Streams Custom real-time processing Amazon Kinesis Firehose Load and transform your data Amazon Kinesis Analytics Easily analyze data streams using standard SQL queries Amazon Kinesis: Streaming data made easy
  5. 5. Capture all of the value from your data with Amazon Kinesis Amazon S3 Amazon Kinesis Analytics Amazon Kinesis-enabled app Amazon Kinesis Streams Ingest Process React Persist AWS Lambda 0ms 200ms 1-2 s Amazon QuickSight Amazon Kinesis Firehose Amazon Redshift
  6. 6. Amazon Kinesis customer base diversity 1 billion events/wk from connected devices | IoT 17 PB of game data per season | Entertainment 80 billion ad impressions/day, 30 ms response time | Ad Tech 100 GB/day click streams from 250+ sites | Enterprise 50 billion ad impressions/day sub-50 ms responses | Ad Tech 10 million events/day | Retail Amazon Kinesis as databus Migrated from Kafka to Kinesis | Enterprise Funnel all production events through Amazon Kinesis | High Tech
  7. 7. Why are these customers choosing Amazon Kinesis? Lower costs Performant without heavy lifting Scales elastically Increased agility Secure and visible Plug and play
  8. 8. “I don't know how we could have made our clickstream data pipeline work without Amazon Kinesis. It would have involved many weeks of engineering. Kinesis Streams and Firehose make the entire process extremely simple and reliable.” Peter Jaffe Data Scientist, Hearst Corporation
  9. 9. Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real-Time
  10. 10. What is Netflix’s decision latency?
  11. 11. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. John Bennett Sr. Software Engineer, Netflix The Connection Game How Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real-Time April 19, 2017
  12. 12. ● 93 million customers ● Over 190 countries ● 37% of Internet traffic ● 125 million hours of video Netflix is big
  13. 13. ● 100s of microservices ● 1,000s of deployments ● More than 100,000 instances And complex
  14. 14. How do we optimize the design and use of the network at scale in a dynamic environment?
  15. 15. How is the network being used?
  16. 16. ● Immutable infrastructure ● Scaling events ● Internal (instances and containers) ● External (AWS S3, ELB, Internet, etc.) IPs in a Dynamic Environment
  17. 17. Metadata Changes Over Time
  18. 18. Metadata Changes Over Time
  19. 19. Metadata Changes Over Time
  20. 20. Metadata Changes Over Time
  21. 21. ● Slowly changing dimension ● Unpredictable ● Valid during a specific time interval Metadata in a Dynamic Environment
  22. 22. Source IP Destination IPat time t
  23. 23. Source Metadata Destination Metadataat time t
  24. 24. Dredge Transforms traffic logs into enriched and aggregated multi-dimensional data
  25. 25. ● Account ● Region ● Availability Zone ● VPC, Subnet ● Protocol (TCP, UDP) ● Accept or Reject Metadata Dimensions ● Application ● Cluster ● Type • instance • container • AWS service
  26. 26. ● Bytes transferred ● Packets sent ● Number of flows ● Latency Aggregated Metrics
  27. 27. ● OLAP-style (Online Analytical Processing) ● Rollup ● ex. All apps deployed to the same region roll up to that region ● Drill down ● ex. Which apps deployed to a region generate the most traffic? ● Slicing and dicing ● ex. Which apps generate the most traffic in a region by day? Queries
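
To make the roll-up, drill-down, and slice-and-dice query shapes above concrete, here is a toy example over a few pre-aggregated rows; the applications, regions, and byte counts are made up for illustration.

    from collections import Counter

    # (app, region, day) -> bytes, the kind of pre-aggregated row later slides describe.
    rows = {
        ("foo", "us-east-1", "2017-04-18"): 500,
        ("bar", "us-east-1", "2017-04-18"): 300,
        ("foo", "eu-west-1", "2017-04-18"): 200,
    }

    # Roll up: apps deployed to the same region roll up to that region.
    per_region = Counter()
    for (app, region, day), b in rows.items():
        per_region[region] += b

    # Drill down: which apps generate the most traffic in us-east-1?
    top_apps = sorted(((b, app) for (app, region, _), b in rows.items()
                       if region == "us-east-1"), reverse=True)

    print(per_region)  # Counter({'us-east-1': 800, 'eu-west-1': 200})
    print(top_apps)    # [(500, 'foo'), (300, 'bar')]
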
  28. 28. ● Large dataset (billions of events per day) ● Multiple dimensions and metrics ● Ad-hoc OLAP queries ● Fast aggregations ● Real-time New source for network analytics
  29. 29. Dredge Ingest Network data from the entire system Enrich Traffic logs with application metadata Aggregate Multi-dimensional metrics
  30. 30. Flow Logs AWS API for network traffic flows
  31. 31. ● Good: Wide coverage ● Good: Consolidated ● Good: Core info (src and dst IP, timestamp) ● Bad: 10-minute capture window ● Ugly: Stateless Flow Logs Overview
  32. 32. Example
  33. 33. { accountID: 123456789010, eniID: eni-abc123de, srcIP: 172.31.16.139, srcPort: 12345, dstIP: 10.13.67.49, dstPort: 80, protocol: 6, packets: 123, bytes: 42, start: 1490746336, end: 1490746369, action: ACCEPT, ... }
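
Flow Logs are delivered as space-separated text records; the JSON above is the logical view of one of them. Below is a minimal parsing sketch assuming the default version-2 field order; the Python field names simply mirror the example above and are not an official schema.

    # Assumed default (version 2) field order:
    # version account-id interface-id srcaddr dstaddr srcport dstport protocol
    # packets bytes start end action log-status
    FIELDS = ["version", "accountID", "eniID", "srcIP", "dstIP", "srcPort", "dstPort",
              "protocol", "packets", "bytes", "start", "end", "action", "logStatus"]

    def parse_flow_log_line(line):
        record = dict(zip(FIELDS, line.split()))
        for key in ("srcPort", "dstPort", "protocol", "packets", "bytes", "start", "end"):
            record[key] = int(record[key])  # NODATA/SKIPDATA lines would need extra handling
        return record

    example = ("2 123456789010 eni-abc123de 172.31.16.139 10.13.67.49 "
               "12345 80 6 123 42 1490746336 1490746369 ACCEPT OK")
    print(parse_flow_log_line(example)["dstPort"])  # 80
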
  34. 34. Given a VpcFlowLogEvent {srcIP: 172.31.16.139, dstIP: 10.13.67.49, …} Enriched with application metadata {srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar},…} Aggregated and indexed App foo sent 426718 bytes to app bar today
  35. 35. Given a VpcFlowLogEvent {srcIP: 172.31.16.139, dstIP: 10.13.67.49, …} Enriched with application metadata {srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar},…} Aggregated and indexed App bar received 8278392 bytes from apps foo and baz in the last week
  36. 36. Given a VpcFlowLogEvent {srcIP: 172.31.16.139, dstIP: 10.13.67.49, …} Enriched with application metadata {srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata: {app: foo}, dstMetadata: {app: bar},…} Aggregated and indexed App baz has outbound network dependencies on apps foo, bar, etc.
  37. 37. Patterns
  38. 38. Read This Book The following diagrams are adapted from Martin Kleppmann’s talks on patterns for real-time stream processing.
  39. 39. Streams of Immutable events
  40. 40. ● Database indexes and secondary indexes ● Materialized views ● Caching Derived Data, Read-Optimized
  41. 41. ● Separation of concerns for reading and writing ● Changelog stream is a 1st class citizen ● Consume and join streams instead of querying DB ● Maintain materialized views ● Pre-computed cache Unbundle the database
  42. 42. Ingest
  43. 43. ● Integration with AWS services ● Kinesis Client Library (KCL) ● Auto scaling for elastic throughput ● Total Cost of Ownership (TCO) Kinesis Over Kafka
  44. 44. Cross-Account Log Sharing
  45. 45. ● Worker per EC2 instance ○ Multiple record processors per worker ○ Record processor per shard ● Load balancing between workers ● Checkpointing (with DynamoDB) ● Stream- and shard-level metrics Kinesis Client Library
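
Netflix consumes the stream with the Java Kinesis Client Library, which provides the worker/shard mapping, load balancing, and DynamoDB checkpointing listed above. To show what that abstracts away, here is a deliberately naive read loop using plain boto3; it is a sketch only, the stream name is hypothetical, and it does none of the balancing, checkpointing, or resharding handling the KCL gives you.

    import time
    import boto3  # assumes AWS credentials and region are configured

    kinesis = boto3.client("kinesis")
    STREAM = "vpc-flow-log-stream"  # hypothetical stream name

    def handle(data):
        print(len(data), "bytes")  # stand-in for real enrichment work

    # One iterator per shard; the KCL would instead assign shards across workers.
    shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]
    iterators = {
        s["ShardId"]: kinesis.get_shard_iterator(
            StreamName=STREAM, ShardId=s["ShardId"], ShardIteratorType="LATEST"
        )["ShardIterator"]
        for s in shards
    }

    while True:  # consume indefinitely
        for shard_id, it in list(iterators.items()):
            resp = kinesis.get_records(ShardIterator=it, Limit=1000)
            for record in resp["Records"]:
                handle(record["Data"])
            iterators[shard_id] = resp["NextShardIterator"]  # the KCL would also checkpoint here
        time.sleep(1)  # stay under per-shard read limits
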
  46. 46. VPC Flow Logs IncomingBytes per hour Example account and region over 1 week Elastic throughput
  47. 47. VPC Flow Logs IncomingBytes per minute Example account and region over 3 hours Elastic throughput
  48. 48. ● Very little operational overhead ○ Monitor stream metrics and DynamoDB table ○ Run and manage auto-scaling util ● No consultation from internal Kafka team ○ Capacity planning ○ Monitoring, failover, and replication TCO
  49. 49. ● Per-shard limits ○ Increase shard count or fan out to other streams ● No log compaction ○ Up to 7-day max retention ○ Manual snapshots, increased complexity ○ Not ideal for changelog joins Limitations
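
For a feel of the per-shard limits mentioned above (on the order of 1 MB/s and 1,000 records/s of writes per shard), a back-of-the-envelope shard count looks like this; the event rate and record size are hypothetical.

    import math

    events_per_second = 250_000   # hypothetical ingest rate for one account/region
    avg_event_bytes = 300         # hypothetical average record size

    SHARD_BYTES_PER_SEC = 1_000_000   # published per-shard write limits at the time of the talk
    SHARD_RECORDS_PER_SEC = 1_000

    shards_for_bytes = math.ceil(events_per_second * avg_event_bytes / SHARD_BYTES_PER_SEC)
    shards_for_records = math.ceil(events_per_second / SHARD_RECORDS_PER_SEC)

    print(max(shards_for_bytes, shards_for_records))  # 250 shards before any headroom

Batching several flow log events into one Kinesis record relaxes the records-per-second constraint, which tends to leave the byte limit as the binding one.
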
  50. 50. ● Kinesis enables us to focus ● Cross-account log sharing simplifies the system ● KCL does the boring stuff ● Auto scaling improves efficiency ● Lower TCO Ingest: Lessons
  51. 51. Enrich
  52. 52. Address metadata is temporal
  53. 53. ● Hash table of sorted lists ● Key is IP, Value is metadata sorted by timestamp ● Recent updates (within capture window) or last ● Join with flow log events stream Address Metadata Changelog
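
A minimal sketch of that lookup structure: a map keyed by IP whose value is a time-ordered list of metadata changes, with a binary search to find the entry that was valid when the flow was captured. The function and field names are illustrative, not Netflix's actual code.

    import bisect
    from collections import defaultdict

    # IP -> parallel lists: change timestamps (sorted) and the metadata valid from that time.
    change_times = defaultdict(list)
    change_values = defaultdict(list)

    def record_change(ip, timestamp, metadata):
        """Apply one event from the address-metadata changelog stream."""
        i = bisect.bisect_right(change_times[ip], timestamp)
        change_times[ip].insert(i, timestamp)
        change_values[ip].insert(i, metadata)

    def lookup(ip, at_time):
        """Metadata valid for `ip` at `at_time`: the most recent change at or before it."""
        i = bisect.bisect_right(change_times[ip], at_time)
        return change_values[ip][i - 1] if i else None

    def enrich(flow):
        """Join one flow log event with the changelog, as in the earlier slides."""
        return dict(flow,
                    srcMetadata=lookup(flow["srcIP"], flow["start"]),
                    dstMetadata=lookup(flow["dstIP"], flow["start"]))

    record_change("172.31.16.139", 1490740000, {"app": "foo"})
    record_change("10.13.67.49", 1490740000, {"app": "bar"})
    print(enrich({"srcIP": "172.31.16.139", "dstIP": "10.13.67.49", "start": 1490746336}))
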
  54. 54. Kafka Log Compaction
  55. 55. Derive TCP State. Direction (Src Port, Dst Port): Inbound (Ephemeral, Non-Ephemeral); Outbound (Ephemeral, Non-Ephemeral); Return (Non-Ephemeral, Ephemeral)
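
A sketch of that classification. Two assumptions that are not in the slide: ports at or above 32768 are treated as ephemeral (the real threshold depends on the client OS), and the set of the ENI's own addresses is used to tell the inbound and outbound request legs apart.

    EPHEMERAL_MIN = 32768  # assumption: typical Linux ephemeral range starts here

    def is_ephemeral(port):
        return port >= EPHEMERAL_MIN

    def flow_direction(flow, local_ips):
        """Classify a stateless flow record, mirroring the table above."""
        src_eph = is_ephemeral(flow["srcPort"])
        dst_eph = is_ephemeral(flow["dstPort"])
        if src_eph and not dst_eph:
            # Request leg: outbound if this ENI originated it, inbound otherwise.
            return "outbound" if flow["srcIP"] in local_ips else "inbound"
        if dst_eph and not src_eph:
            return "return"
        return "unknown"

    local = {"10.13.67.49"}
    print(flow_direction({"srcIP": "172.31.16.139", "srcPort": 12345,
                          "dstIP": "10.13.67.49", "dstPort": 80}, local))      # inbound
    print(flow_direction({"srcIP": "10.13.67.49", "srcPort": 80,
                          "dstIP": "172.31.16.139", "dstPort": 12345}, local)) # return
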
  56. 56. ● Stream table join with changelog ● Log compaction for cold starts, bootstrapping ● Derive state from stateless Enrich: Lessons
  57. 57. Aggregate
  58. 58. Bucket deadline reached
  59. 59. … dataSchema: { dataSource: flowlogs, parser: { dimensionsSpec: { dimensions: [ srcApp, srcAccount, srcRegion, …, dstApp, dstAccount, dstRegion, … ] } }, metricsSpec: [ { type: longSum, fieldName: packets }, { type: longSum, fieldName: bytes } ] } … ● Column-oriented ● Google BigQuery and PowerDrill ● Ad-hoc OLAP queries ● Fast aggregations ● Multi-dimensional metrics ● Scales to trillions of events
  60. 60. ● Pre-aggregate into timestamp buckets ● Druid is a great fit for exploratory analytics ● Fast ad-hoc queries, < 1 second Aggregate: Lessons
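
A sketch of the pre-aggregation step: enriched events are folded into fixed time buckets per (srcApp, dstApp), summing bytes and packets, which matches the longSum metrics in the ingestion spec above. The one-hour bucket and the field names are illustrative.

    from collections import Counter

    BUCKET_SECONDS = 3600  # assumption: hourly buckets; the real granularity may differ

    bytes_sum = Counter()
    packets_sum = Counter()

    def aggregate(event):
        """Fold one enriched flow event into the multi-dimensional roll-up."""
        bucket = event["start"] - event["start"] % BUCKET_SECONDS
        key = (bucket, event["srcMetadata"]["app"], event["dstMetadata"]["app"])
        bytes_sum[key] += event["bytes"]
        packets_sum[key] += event["packets"]

    aggregate({"start": 1490746336, "bytes": 42, "packets": 123,
               "srcMetadata": {"app": "foo"}, "dstMetadata": {"app": "bar"}})
    aggregate({"start": 1490746369, "bytes": 100, "packets": 7,
               "srcMetadata": {"app": "foo"}, "dstMetadata": {"app": "bar"}})
    print(bytes_sum)  # Counter({(1490745600, 'foo', 'bar'): 142})
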
  61. 61. Results
  62. 62. Pivot / Swiv Demo Drag-and-drop UI
  63. 63. Pivot / Swiv Demo Contextual exploration
  64. 64. Pivot / Swiv Demo Comparison
  65. 65. Exploratory Analysis with Pivot / Swiv Demo Bytes sent per application, table
  66. 66. Exploratory Analysis with Pivot / Swiv Demo Bytes sent per application, split by hour, line chart
  67. 67. Exploratory Analysis with Pivot / Swiv Demo Bytes sent by example application, split by hour, line chart
  68. 68. Exploratory Analysis with Pivot / Swiv Demo Comparison of bytes, flows, and packets, split by day, line chart
  69. 69. ● Auditing AWS security groups (virtual firewalls) ● Anomaly and threat detection ● Deployment best practices ● Cost analysis Other Use Cases
  70. 70. Example application as a network graph
  71. 71. Example application as a network graph You are here
  72. 72. Enriched and aggregated traffic data is a powerful source of information for network design and optimization.
  73. 73. @yo_bennett
  74. 74. Thank you!
