Netflix Data Pipeline
Sudhir Tonse (@stonse)
Danny Yuan (@g9yuayon)
photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/!
Netflix is a log generating company
that also happens to stream movies
- Adrian Cockcroft
Data is the most important asset
at Netflix
If all the data is easily available to all
teams, it can be leveraged in new and
exciting ways
Dashboard
~1000 Device Types
~500 Apps/Web Services
~100 Billion Events/Day
3.2M messages per second at peak time
3GB per second at peak time
Dashboard
Type of Events
• User Interface Events
• Search Event (‘Matrix’ using PS3 …)
• Star Rating Event (HoC: 5 stars, Xbox, US, …)
• Infrastructural Events
• RPC Call (API -> Billing Service, ‘/bill/..’, 200, …)
• Log Errors (NPE, “Movie is null”, …, …)
• Other Events …
Making Sense of Billions of Events
http://netflix.github.io
+
ElasticSearch
Druid
A Humble Beginning
Evolution …Scale!
[Diagram: many applications, each producing events for the pipeline]
We Want to Process
App Data in Hadoop
Our Hadoop Ecosystem
@NetflixOSS Big Data Tools
Hadoop as a Service
Pig Scripting on Steroids
Pig Married to Clojure
S3MPER
S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
Efficient ETL with Cassandra
Cassandra
Offline Analysis
Evolution … Speed!
hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3://bucket
We Want to Aggregate, Index, and
Query Data in Real Time
Interactive Exploration
Let’s walk through some use cases
client activity event
*
/name = “movieStarts”
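The filter shown above (`/name = “movieStarts”`) selects client activity events by name. A minimal sketch of such a routing predicate over a map-shaped event; the event layout and class name here are illustrative assumptions, not the pipeline's actual types:

```java
import java.util.Map;
import java.util.function.Predicate;

// Sketch: a routing predicate that selects client activity events
// whose "name" field equals "movieStarts". The Map-based event shape
// is an assumption for illustration.
public class EventFilter {
    public static Predicate<Map<String, Object>> nameEquals(String expected) {
        return event -> expected.equals(event.get("name"));
    }

    public static void main(String[] args) {
        Predicate<Map<String, Object>> filter = nameEquals("movieStarts");
        Map<String, Object> play = Map.of("name", "movieStarts", "device", "PS3");
        Map<String, Object> search = Map.of("name", "search");
        System.out.println(filter.test(play) + " " + filter.test(search)); // true false
    }
}
```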
Pipeline Challenges
• App owners: send and forget
• Data scientists: validation, ETL, batch
processing
• DevOps: stream processing, targeted search
Message Routing
We Want to Consume Data
Selectively in Different Ways
• Message broker
• High-throughput
• Persistent and replicated
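Selective consumption can be sketched as fan-out with per-subscriber filters: the broker delivers every event, and each consumer (app owner, data scientist, DevOps) keeps only what it cares about. This toy in-memory broker is an assumption for illustration; the real pipeline uses a persistent, replicated broker:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Toy broker: delivers each published event to every subscriber,
// each of which applies its own filter. In-memory and illustrative
// only; no persistence or replication.
public class ToyBroker {
    private record Sub(Predicate<String> filter, Consumer<String> handler) {}
    private final List<Sub> subs = new ArrayList<>();

    public void subscribe(Predicate<String> filter, Consumer<String> handler) {
        subs.add(new Sub(filter, handler));
    }

    public void publish(String event) {
        for (Sub s : subs) {
            if (s.filter().test(event)) s.handler().accept(event);
        }
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        List<String> errors = new ArrayList<>();
        List<String> all = new ArrayList<>();
        broker.subscribe(e -> e.startsWith("ERROR"), errors::add); // DevOps: targeted search
        broker.subscribe(e -> true, all::add);                     // batch processing: everything
        broker.publish("ERROR NPE: Movie is null");
        broker.publish("movieStarts on PS3");
        System.out.println(errors.size() + " " + all.size()); // 1 2
    }
}
```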
There Is More
Intelligent Alerts
Guided Debugging in the Right Context
What We Need
•Ad-hoc query with different dimensions
•Quick aggregations and Top-N queries
•Time series with flexible filters
•Quick access to raw data using boolean
queries
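The "quick aggregations and Top-N queries" requirement above can be sketched in a few lines. In production this is served by Druid; here is a plain in-memory version for illustration only:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: Top-N most frequent event names over a batch of events.
// An in-memory stand-in for the kind of aggregation query the
// pipeline needs to answer quickly.
public class TopN {
    public static List<Map.Entry<String, Long>> topN(List<String> eventNames, int n) {
        return eventNames.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> events = List.of("movieStarts", "search", "movieStarts",
                                      "rating", "movieStarts", "search");
        System.out.println(topN(events, 2)); // [movieStarts=3, search=2]
    }
}
```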
What We Need
Druid
• Rapid exploration of high dimensional data
• Fast ingestion and querying
• Time series
• Real-time indexing of event streams
• Killer feature: boolean search
• Great UI: Kibana
The Old Pipeline
The New Pipeline
There Is More
It’s Not All About Counters and Time Series
RequestId   ParentId   NodeId   Service Name   Status
4965-4a74   0          123      Edge Service   200
4965-4a74   123        456      Gateway        200
4965-4a74   456        789      Service A      200
4965-4a74   456        abc      Service B      200
Status: 200
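From rows like these, the call tree for a request can be rebuilt by linking each span's parentId to a nodeId. A minimal sketch under that assumption; this is illustrative, not Netflix's tracing implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: reconstruct a call tree for one request from trace rows
// (parentId -> nodeId links), following the table's layout.
public class TraceTree {
    public record Span(String parentId, String nodeId, String service) {}

    // Map each parent node to the services it called, in row order.
    public static Map<String, List<String>> childrenByParent(List<Span> spans) {
        Map<String, List<String>> tree = new HashMap<>();
        for (Span s : spans) {
            tree.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s.service());
        }
        return tree;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("0", "123", "Edge Service"),
                new Span("123", "456", "Gateway"),
                new Span("456", "789", "Service A"),
                new Span("456", "abc", "Service B"));
        // The Gateway's node (456) fanned out to both downstream services.
        System.out.println(childrenByParent(spans).get("456")); // [Service A, Service B]
    }
}
```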
Distributed Tracing
A System that Supports All These
A Data Pipeline To Glue
Them All
Make It Simple
Message Producing
• Simple and Uniform API
• messageBus.publish(event)
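The slide's `messageBus.publish(event)` call is send-and-forget: it returns immediately and delivery happens in the background. A minimal sketch of that API surface; the class body here is an illustrative assumption, not Suro's actual implementation:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the uniform API: publish() accepts any event, enqueues it
// in an in-process buffer, and returns to the caller at once. A
// background sender (not shown) would drain the buffer.
public class MessageBus {
    private final BlockingQueue<Object> buffer = new LinkedBlockingQueue<>();

    // Send and forget: never blocks the application thread.
    public void publish(Object event) {
        buffer.offer(event);
    }

    // Number of events waiting for the background sender.
    public int pending() {
        return buffer.size();
    }

    public static void main(String[] args) {
        MessageBus messageBus = new MessageBus();
        messageBus.publish("movieStarts on PS3");
        messageBus.publish("ERROR: Movie is null");
        System.out.println(messageBus.pending()); // 2
    }
}
```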
Consumption Is Simple Too
consumer.observe().subscribe(new Subscriber<>() {
    @Override
    public void onNext(Ackable<IncomingMessage> ackable) {
        process(ackable.getEntity(MyEventType.class));
        ackable.ack();
    }
});

consumer.pause();
consumer.resume();
RxJava
• Functional reactive programming model
• Powerful streaming API
• Separation of logic and threading model
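The consumer snippet uses RxJava. As a library-free sketch of the same push-based model, here is the JDK's built-in Flow API: the subscriber receives items via onNext, and threading is handled by the publisher's executor rather than by consumer logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Sketch: reactive consumption with java.util.concurrent.Flow,
// standing in for RxJava's Observable/Subscriber pattern.
public class ReactiveSketch {
    public static List<String> collect(List<String> events) throws InterruptedException {
        List<String> seen = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        SubmissionPublisher<String> publisher = new SubmissionPublisher<>();
        publisher.subscribe(new Flow.Subscriber<String>() {
            @Override public void onSubscribe(Flow.Subscription s) { s.request(Long.MAX_VALUE); }
            @Override public void onNext(String item) { seen.add(item); } // like onNext(ackable)
            @Override public void onError(Throwable t) { done.countDown(); }
            @Override public void onComplete() { done.countDown(); }
        });
        events.forEach(publisher::submit);
        publisher.close();   // triggers onComplete downstream
        done.await();        // wait for the publisher's executor thread
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(collect(List.of("event-1", "event-2"))); // [event-1, event-2]
    }
}
```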
Design Decisions
• Top Priority: app stability and throughput
• Asynchronous operations
• Aggressive buffering
• Drops messages if necessary
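The last two decisions combine into one mechanism: a bounded buffer whose enqueue never blocks, dropping (and counting) messages when full so the application stays healthy. A minimal sketch of that policy, illustrative only:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: aggressive buffering with a drop-on-full policy. offer()
// is non-blocking; when the buffer is full the message is dropped
// and counted rather than stalling the application thread.
public class DroppingBuffer {
    private final BlockingQueue<String> queue;
    private long dropped = 0;

    public DroppingBuffer(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    public boolean offer(String msg) {
        if (queue.offer(msg)) return true; // enqueued without blocking
        dropped++;                          // buffer full: drop it
        return false;
    }

    public long droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        DroppingBuffer buf = new DroppingBuffer(2);
        buf.offer("a"); buf.offer("b"); buf.offer("c");
        System.out.println(buf.droppedCount()); // 1
    }
}
```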
Anything Can Fail
Cloud Resiliency
Fault Tolerance Features
• Write and forward with auto-reattached EBS
(Amazon’s Elastic Block Storage)
• disk-backed queue: big-queue
• Customized scaling down
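The write-and-forward idea can be sketched with a file standing in for the EBS-backed volume / big-queue: events are appended to disk so a forwarder can replay them after a crash. Illustrative only; big-queue itself is a memory-mapped implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Sketch: a disk-backed queue. append() persists each message; after
// a restart, replay() recovers everything not yet forwarded.
public class DiskBackedQueue {
    private final Path file;

    public DiskBackedQueue(Path file) {
        this.file = file;
    }

    public void append(String msg) throws IOException {
        Files.writeString(file, msg + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public List<String> replay() throws IOException {
        return Files.exists(file) ? Files.readAllLines(file) : List.of();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("queue", ".log");
        DiskBackedQueue q = new DiskBackedQueue(tmp);
        q.append("event-1");
        q.append("event-2");
        System.out.println(q.replay()); // [event-1, event-2]
    }
}
```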
There’s More to Do
• Contribute to @NetflixOSS !
• Join us :-)
Summary
http://netflix.github.io
+
ElasticSearch
Druid
You can build your own web-scale data
pipeline using open source components
Thank You!
Sudhir Tonse

http://www.linkedin.com/in/sudhirtonse

Twitter: @stonse
Danny Yuan 

http://www.linkedin.com/pub/danny-yuan/4/374/862
Twitter: @g9yuayon
