Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you'll learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
17. ● 104 million customers
● Over 190 countries
● 37% of U.S. Internet traffic
● 125 million hours of video
Netflix is big.
18. ● Dozens of accounts
● Multiple regions
● 100s of microservices
● 1,000s of deployments
● > 100,000 instances
And complex.
19. ● No access to the underlying network
● Large traffic volume
○ Billions of flows per day
○ Gigabytes of logs per second
● Dynamic environment
○ Logs are limited, e.g., IP-to-IP only
○ IP addresses are randomly assigned
○ IP metadata varies over time and is unpredictable
Challenges
23. ● Develop a new data source for network analytics
○ Multiple dimensions (Netflix- and AWS-centric)
○ Fast aggregations
● Enable ad hoc OLAP-style queries
○ Rollup, drill-down, slicing and dicing
● Add observability to the network
○ Fill a gap not addressed by existing tools
Goal
27. ● Enables experimentation
● Load streaming data differently
○ Batch with Kinesis Firehose: store in S3, process with Lambda (see the sketch below)
○ Elasticsearch as an intermediate store
○ Stream with Kinesis Streams
Strong AWS Integration
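As a rough illustration of the Firehose batch path, the sketch below puts a single flow log line on a delivery stream that buffers and delivers to S3. This is a minimal sketch assuming the AWS SDK for Java 1.x; the delivery stream name and record contents are hypothetical, not Netflix's actual pipeline code.

    // Minimal Firehose producer sketch: the delivery stream (hypothetical name below)
    // batches records server-side and writes them to S3 for later processing.
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
    import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
    import com.amazonaws.services.kinesisfirehose.model.Record;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class FirehosePut {
        public static void main(String[] args) {
            AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
            ByteBuffer data = ByteBuffer.wrap(
                    "...flow log line...".getBytes(StandardCharsets.UTF_8));
            firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName("flow-logs-to-s3")  // hypothetical delivery stream
                    .withRecord(new Record().withData(data)));
        }
    }

Buffering, batching, and retries happen inside the managed service, which is why the TCO slide below can claim no operational overhead for Firehose.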
29. [Chart: VPC Flow Logs IncomingBytes per hour for an example account and region over 1 week]
Handles event data at scale
30. ● Worker per Amazon EC2 instance
○ Multiple record processors per worker
○ Record processor per shard
● Load balancing between workers
● Checkpointing (with Amazon DynamoDB)
● Stream- and shard-level metrics
Kinesis Client Library
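To make the worker and record-processor model concrete, here is a minimal record processor against the KCL 1.x interface that was current at the time of this talk. The class name and the enrichment step are placeholders and error handling is elided; this is a sketch, not Dredge's actual code.

    // One record processor instance per shard; the KCL worker on each EC2 instance
    // creates these, balances shard leases, and checkpoints progress to DynamoDB.
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
    import com.amazonaws.services.kinesis.model.Record;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class FlowLogProcessor implements IRecordProcessor {
        private String shardId;

        @Override
        public void initialize(String shardId) {
            this.shardId = shardId;  // this processor owns exactly one shard
        }

        @Override
        public void processRecords(List<Record> records,
                                   IRecordProcessorCheckpointer checkpointer) {
            for (Record r : records) {
                String line = StandardCharsets.UTF_8.decode(r.getData()).toString();
                // enrich(line);  // hypothetical enrichment step
            }
            try {
                checkpointer.checkpoint();  // persisted to the KCL's DynamoDB lease table
            } catch (Exception e) {
                // a failed checkpoint only means records may be reprocessed after failover
            }
        }

        @Override
        public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
            if (reason == ShutdownReason.TERMINATE) {  // shard split/merge: checkpoint first
                try { checkpointer.checkpoint(); } catch (Exception ignored) { }
            }
        }
    }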
31. ● Very little operational overhead
○ Monitor stream metrics and DynamoDB table
○ Leverage Auto-Scaling Utility for Kinesis Streams
● No overhead for Amazon Kinesis Firehose
TCO
32. ● Per-shard limits
○ Increase shard count (see the resharding sketch below) or fan out to other streams
● No log compaction
○ 7-day maximum retention
○ Manual snapshots add complexity
○ Not ideal for changelog joins
Limitations
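When a stream approaches its per-shard limits, the shard count can be raised with the UpdateShardCount API. A minimal sketch, assuming the AWS SDK for Java 1.x and a hypothetical stream name and target:

    // Uniformly reshard a stream; each shard accepts ~1 MB/s or 1,000 records/s in
    // and serves ~2 MB/s out, so throughput scales with the shard count.
    import com.amazonaws.services.kinesis.AmazonKinesis;
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
    import com.amazonaws.services.kinesis.model.ScalingType;
    import com.amazonaws.services.kinesis.model.UpdateShardCountRequest;

    public class Reshard {
        public static void main(String[] args) {
            AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();
            kinesis.updateShardCount(new UpdateShardCountRequest()
                    .withStreamName("dredge-flow-logs")  // hypothetical stream name
                    .withTargetShardCount(200)           // e.g., doubling from 100 shards
                    .withScalingType(ScalingType.UNIFORM_SCALING));
        }
    }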
35. ● Delay: 24 hours (daily interval)
● Bounded, fixed-size input
● Measured by throughput (time to process the input)
● Limitations of enrichment lookups:
○ Remote DB: network round-trip time per query; parallel queries could overload it
○ Local cache: effectiveness depends on the data distribution; invalidation is hard
○ Local DB: more effective; less contention and no network RTT
Batch
38. ● Delay: 7 minutes in the average case (capture window)
● Unbounded input, processed as events happen
● Measured by how far the consumer is behind
● Limitations (same as batch):
○ Remote DB: network round-trip time per query; parallel queries could overload it
○ Local cache: effectiveness depends on the data distribution; invalidation is hard (see the cache sketch below)
○ Local DB: more effective; less contention and no network RTT
Stream
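To illustrate the local-cache option and its invalidation problem, here is a sketch using a Guava LoadingCache with a time-to-live; the cache sizing, TTL, and remote lookup are assumptions for illustration, not Dredge's actual design.

    // Local cache in front of the remote metadata DB: hits avoid the network RTT,
    // and the TTL serves as a coarse invalidation strategy for changing IP metadata.
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import java.util.concurrent.TimeUnit;

    public class IpMetadataCache {
        private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                .maximumSize(1_000_000)                  // bound memory; hot IPs stay resident
                .expireAfterWrite(10, TimeUnit.MINUTES)  // stale entries age out via TTL
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String ip) {
                        return loadFromRemoteDb(ip);     // remote RTT paid only on a miss
                    }
                });

        public String appFor(String ip) {
            return cache.getUnchecked(ip);
        }

        private String loadFromRemoteDb(String ip) {
            return "unknown";  // placeholder for the real metadata lookup
        }
    }

How well this works depends on the access distribution: a heavy-tailed mix of IPs keeps the hit rate high, while uniformly random IPs degrade it to the remote-DB case.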
43. ● e.g., database indexes, caches, materialized views
● Transformed from the source of truth
● Optimized for read queries to improve performance
● Built from a changelog of events
Derived Data
44. ● Log-based message broker to send change events
● Expose the changelog stream as a first-class citizen
● Consume and join streams instead of querying the DB
● Alternative view to query efficiently (see the sketch below)
○ Updated when the data changes
○ Removes network round-trip time and resource contention
○ Acts as a pre-computed cache
Change Data Capture
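A minimal sketch of the changelog-join idea, assuming hypothetical event shapes: consume change events to maintain a local materialized view of IP-to-application metadata, then enrich flow records with plain map lookups instead of remote queries.

    // Materialized view built from a CDC stream: every metadata change updates the
    // local map, so flow enrichment is a local read with no network round trip.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ChangelogJoin {
        // IP address -> application metadata, kept current by the changelog consumer
        private final Map<String, String> ipToApp = new ConcurrentHashMap<>();

        // called for each record on the metadata changelog stream
        public void onChangeEvent(String ip, String app, boolean deleted) {
            if (deleted) {
                ipToApp.remove(ip);    // tombstone: the IP was released
            } else {
                ipToApp.put(ip, app);  // upsert: the IP was (re)assigned
            }
        }

        // called for each flow log record; a local lookup replaces a DB query
        public String enrich(String srcIp, String dstIp) {
            return ipToApp.getOrDefault(srcIp, "unknown")
                    + " -> " + ipToApp.getOrDefault(dstIp, "unknown");
        }
    }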
51. ● 7 million network flows enriched per second
● 5 minutes average delay from network flow occurrence
● 1 Kinesis stream with 100s of shards
By the Numbers
53. What's wrong with the network?
Dredge reduces mean-time-to-innocence.
54. [Diagram: two fault domains, Account 1234567890 (zone us-east-1e) and Account 0987654321 (zone eu-west-1a)]
55. [Same diagram] Bad code push?
56. [Same diagram] Network outage?
57. Why is the network so slow?
Dredge identifies high-latency network flows.
62. ● An estimated 23% of total traffic is cross-zone
● About 14% of total traffic is cross-region
● Some cross-zone and cross-region traffic is intentional
Initial Findings
63. My service can't connect to its dependencies.
Dredge classifies a service's inbound and outbound dependencies.
64. Existing tools
● Distributed tracing via Salp
○ Similar to Google's Dapper
○ Naive sampling
○ JVM-centric
● Incomplete coverage
○ Must be part of the main request path
○ Difficult to capture startup dependencies
○ Lacks support for protocols other than TCP over IPv4
67. Initial Findings
● Significant discrepancy between Dredge and Salp
○ In a sample of 100 services, dependencies from tracing are a subset of Dredge's
○ Tracing is implemented inconsistently
● Dredge provides higher coverage
○ Connections to AWS services prove helpful
68. Security Use Cases
● Use network dependencies to audit security groups
○ Reduce blast radius
● Only source of logs for security-group-rejected flows (see the sketch below)
● Reports communication with the public Internet
○ Threat detection, port scanning, etc.
● Flags AWS resources (instances, load balancers) with increased exposure
○ Risk profiles
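Because VPC Flow Logs are the only log source for security-group-rejected flows, a consumer can flag them directly. The sketch below parses the default (version 2) flow log format, in which "action" is the 13th of 14 space-separated fields; the sample record and the alerting step are placeholders.

    // Flag REJECTed flows from the default VPC Flow Log format:
    // version account-id interface-id srcaddr dstaddr srcport dstport protocol
    // packets bytes start end action log-status
    public class RejectFilter {
        public static void main(String[] args) {
            String record = "2 123456789010 eni-abc123de 172.31.16.139 172.31.16.21 "
                          + "20641 22 6 20 4249 1418530010 1418530070 REJECT OK";
            String[] f = record.split(" ");
            if ("REJECT".equals(f[12])) {
                // a rejected flow: a misconfigured security group, or a probe/port scan
                System.out.printf("rejected flow %s:%s -> %s:%s (account %s)%n",
                        f[3], f[5], f[4], f[6], f[1]);
            }
        }
    }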
70. How can we do better?
● VPC Flow Logs give us a 10,000-foot view; we want more detail and context
○ Kernel-level metrics via eBPF
● Dynamic sampling rates
○ Minimize variability
○ Requires coordination