Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you'll learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
17. ● 104 million customers
● Over 190 countries
● 37% of U.S. Internet traffic
● 125 million hours of video
Netflix is big.
18. ● Dozens of accounts
● Multiple regions
● 100s of microservices
● 1,000s of deployments
● > 100,000 instances
And complex.
19. ● No access to the underlying network
● Large traffic volume
○ Billions of flows per day
○ Gigabytes of logs per second
● Dynamic environment
○ Logs are limited, e.g., IP-to-IP only
○ IP addresses are randomly assigned
○ IP metadata varies over time and is unpredictable
Challenges
23. ● Develop a new data source for network analytics
○ Multiple dimensions (Netflix- and AWS-centric)
○ Fast aggregations
● Enable ad hoc OLAP-style queries
○ Rollup, drill-down, slicing and dicing
● Add observability to the network
○ Fill a gap not addressed by existing tools
Goal
27. ● Enables experimentation
● Load streaming data differently
○ Batch with Kinesis Firehose: store in S3, process with Lambda (see the sketch below)
○ Elasticsearch as an intermediate store
○ Stream with Kinesis Streams
Strong AWS Integration
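As a rough illustration of the Firehose batch path, the sketch below puts a single flow log line on a delivery stream that buffers and delivers to S3. This is a minimal sketch assuming the AWS SDK for Java 1.x; the delivery stream name and record contents are hypothetical, not Netflix's actual pipeline code.

    // Minimal Firehose producer sketch: the delivery stream (hypothetical name below)
    // batches records server-side and writes them to S3 for later processing.
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
    import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
    import com.amazonaws.services.kinesisfirehose.model.Record;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class FirehosePut {
        public static void main(String[] args) {
            AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
            ByteBuffer data = ByteBuffer.wrap(
                    "...flow log line...".getBytes(StandardCharsets.UTF_8));
            firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName("flow-logs-to-s3")  // hypothetical delivery stream
                    .withRecord(new Record().withData(data)));
        }
    }

Buffering, batching, and retries happen inside the managed service, which is why the TCO slide below can claim no operational overhead for Firehose.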
29. [Chart: VPC Flow Logs IncomingBytes per hour for an example account and region over 1 week]
Handles event data at scale
30. ● Worker per Amazon EC2 instance
○ Multiple record processors per worker
○ Record processor per shard
● Load balancing between workers
● Checkpointing (with Amazon DynamoDB)
● Stream- and shard-level metrics
Kinesis Client Library
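To make the worker and record-processor model concrete, here is a minimal record processor against the KCL 1.x interface that was current at the time of this talk. The class name and the enrichment step are placeholders and error handling is elided; this is a sketch, not Dredge's actual code.

    // One record processor instance per shard; the KCL worker on each EC2 instance
    // creates these, balances shard leases, and checkpoints progress to DynamoDB.
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
    import com.amazonaws.services.kinesis.model.Record;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class FlowLogProcessor implements IRecordProcessor {
        private String shardId;

        @Override
        public void initialize(String shardId) {
            this.shardId = shardId;  // this processor owns exactly one shard
        }

        @Override
        public void processRecords(List<Record> records,
                                   IRecordProcessorCheckpointer checkpointer) {
            for (Record r : records) {
                String line = StandardCharsets.UTF_8.decode(r.getData()).toString();
                // enrich(line);  // hypothetical enrichment step
            }
            try {
                checkpointer.checkpoint();  // persisted to the KCL's DynamoDB lease table
            } catch (Exception e) {
                // a failed checkpoint only means records may be reprocessed after failover
            }
        }

        @Override
        public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
            if (reason == ShutdownReason.TERMINATE) {  // shard split/merge: checkpoint first
                try { checkpointer.checkpoint(); } catch (Exception ignored) { }
            }
        }
    }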
31. ● Very little operational overhead
○ Monitor stream metrics and DynamoDB table
○ Leverage Auto-Scaling Utility for Kinesis Streams
● No overhead for Amazon Kinesis Firehose
TCO
32. ● Per-shard limits
○ Increase shard count (see the resharding sketch below) or fan out to other streams
● No log compaction
○ 7-day maximum retention
○ Manual snapshots add complexity
○ Not ideal for changelog joins
Limitations
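When a stream approaches its per-shard limits, the shard count can be raised with the UpdateShardCount API. A minimal sketch, assuming the AWS SDK for Java 1.x and a hypothetical stream name and target:

    // Uniformly reshard a stream; each shard accepts ~1 MB/s or 1,000 records/s in
    // and serves ~2 MB/s out, so throughput scales with the shard count.
    import com.amazonaws.services.kinesis.AmazonKinesis;
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
    import com.amazonaws.services.kinesis.model.ScalingType;
    import com.amazonaws.services.kinesis.model.UpdateShardCountRequest;

    public class Reshard {
        public static void main(String[] args) {
            AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();
            kinesis.updateShardCount(new UpdateShardCountRequest()
                    .withStreamName("dredge-flow-logs")  // hypothetical stream name
                    .withTargetShardCount(200)           // e.g., doubling from 100 shards
                    .withScalingType(ScalingType.UNIFORM_SCALING));
        }
    }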
35. ● Delay: 24 hours (daily interval)
● Bounded, fixed-size input
● Measured by throughput (time to process the input)
● Limitations of enrichment lookups:
○ Remote DB: network round-trip time per query; parallel queries could overload it
○ Local cache: effectiveness depends on the data distribution; invalidation is hard
○ Local DB: more effective; less contention and no network RTT
Batch
38. ● Delay: 7 minutes in the average case (capture window)
● Unbounded input, processed as events happen
● Measured by how far the consumer is behind
● Limitations (same as batch):
○ Remote DB: network round-trip time per query; parallel queries could overload it
○ Local cache: effectiveness depends on the data distribution; invalidation is hard (see the cache sketch below)
○ Local DB: more effective; less contention and no network RTT
Stream
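To illustrate the local-cache option and its invalidation problem, here is a sketch using a Guava LoadingCache with a time-to-live; the cache sizing, TTL, and remote lookup are assumptions for illustration, not Dredge's actual design.

    // Local cache in front of the remote metadata DB: hits avoid the network RTT,
    // and the TTL serves as a coarse invalidation strategy for changing IP metadata.
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import java.util.concurrent.TimeUnit;

    public class IpMetadataCache {
        private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                .maximumSize(1_000_000)                  // bound memory; hot IPs stay resident
                .expireAfterWrite(10, TimeUnit.MINUTES)  // stale entries age out via TTL
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String ip) {
                        return loadFromRemoteDb(ip);     // remote RTT paid only on a miss
                    }
                });

        public String appFor(String ip) {
            return cache.getUnchecked(ip);
        }

        private String loadFromRemoteDb(String ip) {
            return "unknown";  // placeholder for the real metadata lookup
        }
    }

How well this works depends on the access distribution: a heavy-tailed mix of IPs keeps the hit rate high, while uniformly random IPs degrade it to the remote-DB case.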
43. ● e.g., database indexes, caches, materialized views
● Transformed from the source of truth
● Optimized for read queries to improve performance
● Built from a changelog of events
Derived Data
44. ● Log-based message broker to send change events
● Expose the changelog stream as a first-class citizen
● Consume and join streams instead of querying the DB
● Alternative view to query efficiently (see the sketch below)
○ Updated when the data changes
○ Removes network round-trip time and resource contention
○ Acts as a pre-computed cache
Change Data Capture
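A minimal sketch of the changelog-join idea, assuming hypothetical event shapes: consume change events to maintain a local materialized view of IP-to-application metadata, then enrich flow records with plain map lookups instead of remote queries.

    // Materialized view built from a CDC stream: every metadata change updates the
    // local map, so flow enrichment is a local read with no network round trip.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ChangelogJoin {
        // IP address -> application metadata, kept current by the changelog consumer
        private final Map<String, String> ipToApp = new ConcurrentHashMap<>();

        // called for each record on the metadata changelog stream
        public void onChangeEvent(String ip, String app, boolean deleted) {
            if (deleted) {
                ipToApp.remove(ip);    // tombstone: the IP was released
            } else {
                ipToApp.put(ip, app);  // upsert: the IP was (re)assigned
            }
        }

        // called for each flow log record; a local lookup replaces a DB query
        public String enrich(String srcIp, String dstIp) {
            return ipToApp.getOrDefault(srcIp, "unknown")
                    + " -> " + ipToApp.getOrDefault(dstIp, "unknown");
        }
    }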
51. ● 7 million network flows enriched per second
● 5 minutes average delay from network flow occurrence
● 1 Kinesis stream with 100s of shards
By the Numbers
53. What's wrong with the network?
Dredge reduces mean-time-to-innocence.
54. [Diagram: two fault domains, Account 1234567890 (zone us-east-1e) and Account 0987654321 (zone eu-west-1a)]
55. [Same diagram] Bad code push?
56. [Same diagram] Network outage?
57. Why is the network so slow?
Dredge identifies high-latency network flows.
62. ● An estimated 23% of total traffic is cross-zone
● About 14% of total traffic is cross-region
● Some cross-zone and cross-region traffic is intentional
Initial Findings
63. My service can't connect to its dependencies.
Dredge classifies a service's inbound and outbound dependencies.
64. Existing tools
● Distributed tracing via Salp
○ Similar to Google's Dapper
○ Naive sampling
○ JVM-centric
● Incomplete coverage
○ Must be part of the main request path
○ Difficult to capture startup dependencies
○ Lacks support for protocols other than TCP over IPv4
67. Initial Findings
● Significant discrepancy between Dredge and Salp
○ In a sample of 100 services, dependencies from tracing are a subset of Dredge's
○ Tracing is implemented inconsistently
● Dredge provides higher coverage
○ Connections to AWS services prove helpful
68. Security Use Cases
● Use network dependencies to audit security groups
○ Reduce blast radius
● Only source of logs for security-group-rejected flows (see the sketch below)
● Reports communication with the public Internet
○ Threat detection, port scanning, etc.
● Flags AWS resources (instances, load balancers) with increased exposure
○ Risk profiles
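Because VPC Flow Logs are the only log source for security-group-rejected flows, a consumer can flag them directly. The sketch below parses the default (version 2) flow log format, in which "action" is the 13th of 14 space-separated fields; the sample record and the alerting step are placeholders.

    // Flag REJECTed flows from the default VPC Flow Log format:
    // version account-id interface-id srcaddr dstaddr srcport dstport protocol
    // packets bytes start end action log-status
    public class RejectFilter {
        public static void main(String[] args) {
            String record = "2 123456789010 eni-abc123de 172.31.16.139 172.31.16.21 "
                          + "20641 22 6 20 4249 1418530010 1418530070 REJECT OK";
            String[] f = record.split(" ");
            if ("REJECT".equals(f[12])) {
                // a rejected flow: a misconfigured security group, or a probe/port scan
                System.out.printf("rejected flow %s:%s -> %s:%s (account %s)%n",
                        f[3], f[5], f[4], f[6], f[1]);
            }
        }
    }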
70. How can we do better?
● VPC Flow Logs give us a 10,000-foot view; we want more detail and context
○ Kernel-level metrics via eBPF
● Dynamic sampling rates
○ Minimize variability
○ Requires coordination