Netflix Data Pipeline
Sudhir Tonse (@stonse)
Danny Yuan (@g9yuayon)
photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/!
Netflix is a log generating company
that also happens to stream movies
- Adrian Cockcroft
Data is the most important asset
at Netflix
If all the data is easily available to all
teams, it can be leveraged in new and
exciting ways
Dashboard
~1000 Device Types
~500 Apps/Web Services
~100 Billion Events/Day
3.2M messages per second at peak time
3GB per second at peak time
Dashboard
Type of Events
• User Interface Events
• Search Event (‘Matrix’ using PS3 …)
• Star Rating Event (HoC: 5 stars, Xbox, US, …)
• Infrastructural Events
• RPC Call (API -> Billing Service, ‘/bill/..’, 200, …)
• Log Errors (NPE, “Movie is null”, …, …)
• Other Events …
Making Sense of Billions of Events
http://netflix.github.io
+
ElasticSearch
Druid
A Humble Beginning
Evolution …Scale!
[Diagram: many applications, each producing events for the pipeline]
We Want to Process
App Data in Hadoop
Our Hadoop Ecosystem
@NetflixOSS Big Data Tools
Hadoop as a Service
Pig Scripting on Steroids
Pig Married to Clojure
S3MPER
S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
Efficient ETL with Cassandra
Cassandra
Offline Analysis
Evolution … Speed!
hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3://bucket
We Want to Aggregate, Index, and
Query Data in Real Time
Interactive Exploration
Let’s walk through some use cases
client activity event
*
/name = “movieStarts”
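The filter shown above (`/name = “movieStarts”`) selects client activity events by name. A minimal sketch of such a routing predicate over a map-shaped event; the event layout and class name here are illustrative assumptions, not the pipeline's actual types:

```java
import java.util.Map;
import java.util.function.Predicate;

// Sketch: a routing predicate that selects client activity events
// whose "name" field equals "movieStarts". The Map-based event shape
// is an assumption for illustration.
public class EventFilter {
    public static Predicate<Map<String, Object>> nameEquals(String expected) {
        return event -> expected.equals(event.get("name"));
    }

    public static void main(String[] args) {
        Predicate<Map<String, Object>> filter = nameEquals("movieStarts");
        Map<String, Object> play = Map.of("name", "movieStarts", "device", "PS3");
        Map<String, Object> search = Map.of("name", "search");
        System.out.println(filter.test(play) + " " + filter.test(search)); // true false
    }
}
```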
Pipeline Challenges
• App owners: send and forget
• Data scientists: validation, ETL, batch
processing
• DevOps: stream processing, targeted search
Message Routing
We Want to Consume Data
Selectively in Different Ways
• Message broker
• High-throughput
• Persistent and replicated
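Selective consumption can be sketched as fan-out with per-subscriber filters: the broker delivers every event, and each consumer (app owner, data scientist, DevOps) keeps only what it cares about. This toy in-memory broker is an assumption for illustration; the real pipeline uses a persistent, replicated broker:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Toy broker: delivers each published event to every subscriber,
// each of which applies its own filter. In-memory and illustrative
// only; no persistence or replication.
public class ToyBroker {
    private record Sub(Predicate<String> filter, Consumer<String> handler) {}
    private final List<Sub> subs = new ArrayList<>();

    public void subscribe(Predicate<String> filter, Consumer<String> handler) {
        subs.add(new Sub(filter, handler));
    }

    public void publish(String event) {
        for (Sub s : subs) {
            if (s.filter().test(event)) s.handler().accept(event);
        }
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        List<String> errors = new ArrayList<>();
        List<String> all = new ArrayList<>();
        broker.subscribe(e -> e.startsWith("ERROR"), errors::add); // DevOps: targeted search
        broker.subscribe(e -> true, all::add);                     // batch processing: everything
        broker.publish("ERROR NPE: Movie is null");
        broker.publish("movieStarts on PS3");
        System.out.println(errors.size() + " " + all.size()); // 1 2
    }
}
```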
There Is More
Intelligent Alerts
Guided Debugging in the Right Context
What We Need
•Ad-hoc query with different dimensions
•Quick aggregations and Top-N queries
•Time series with flexible filters
•Quick access to raw data using boolean
queries
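The "quick aggregations and Top-N queries" requirement above can be sketched in a few lines. In production this is served by Druid; here is a plain in-memory version for illustration only:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: Top-N most frequent event names over a batch of events.
// An in-memory stand-in for the kind of aggregation query the
// pipeline needs to answer quickly.
public class TopN {
    public static List<Map.Entry<String, Long>> topN(List<String> eventNames, int n) {
        return eventNames.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> events = List.of("movieStarts", "search", "movieStarts",
                                      "rating", "movieStarts", "search");
        System.out.println(topN(events, 2)); // [movieStarts=3, search=2]
    }
}
```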
What We Need
Druid
• Rapid exploration of high dimensional data
• Fast ingestion and querying
• Time series
• Real-time indexing of event streams
• Killer feature: boolean search
• Great UI: Kibana
The Old Pipeline
The New Pipeline
There Is More
It’s Not All About Counters and Time Series
RequestId   ParentId   NodeId   Service Name   Status
4965-4a74   0          123      Edge Service   200
4965-4a74   123        456      Gateway        200
4965-4a74   456        789      Service A      200
4965-4a74   456        abc      Service B      200
Status: 200
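From rows like these, the call tree for a request can be rebuilt by linking each span's parentId to a nodeId. A minimal sketch under that assumption; this is illustrative, not Netflix's tracing implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: reconstruct a call tree for one request from trace rows
// (parentId -> nodeId links), following the table's layout.
public class TraceTree {
    public record Span(String parentId, String nodeId, String service) {}

    // Map each parent node to the services it called, in row order.
    public static Map<String, List<String>> childrenByParent(List<Span> spans) {
        Map<String, List<String>> tree = new HashMap<>();
        for (Span s : spans) {
            tree.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s.service());
        }
        return tree;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("0", "123", "Edge Service"),
                new Span("123", "456", "Gateway"),
                new Span("456", "789", "Service A"),
                new Span("456", "abc", "Service B"));
        // The Gateway's node (456) fanned out to both downstream services.
        System.out.println(childrenByParent(spans).get("456")); // [Service A, Service B]
    }
}
```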
Distributed Tracing
A System that Supports All These
A Data Pipeline To Glue
Them All
Make It Simple
Message Producing
• Simple and Uniform API
• messageBus.publish(event)
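The slide's `messageBus.publish(event)` call is send-and-forget: it returns immediately and delivery happens in the background. A minimal sketch of that API surface; the class body here is an illustrative assumption, not Suro's actual implementation:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the uniform API: publish() accepts any event, enqueues it
// in an in-process buffer, and returns to the caller at once. A
// background sender (not shown) would drain the buffer.
public class MessageBus {
    private final BlockingQueue<Object> buffer = new LinkedBlockingQueue<>();

    // Send and forget: never blocks the application thread.
    public void publish(Object event) {
        buffer.offer(event);
    }

    // Number of events waiting for the background sender.
    public int pending() {
        return buffer.size();
    }

    public static void main(String[] args) {
        MessageBus messageBus = new MessageBus();
        messageBus.publish("movieStarts on PS3");
        messageBus.publish("ERROR: Movie is null");
        System.out.println(messageBus.pending()); // 2
    }
}
```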
Consumption Is Simple Too
consumer.observe().subscribe(new Subscriber<>() {
    @Override
    public void onNext(Ackable<IncomingMessage> ackable) {
        process(ackable.getEntity(MyEventType.class));
        ackable.ack();
    }
});

consumer.pause();
consumer.resume();
RxJava
• Functional reactive programming model
• Powerful streaming API
• Separation of logic and threading model
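The consumer snippet uses RxJava. As a library-free sketch of the same push-based model, here is the JDK's built-in Flow API: the subscriber receives items via onNext, and threading is handled by the publisher's executor rather than by consumer logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Sketch: reactive consumption with java.util.concurrent.Flow,
// standing in for RxJava's Observable/Subscriber pattern.
public class ReactiveSketch {
    public static List<String> collect(List<String> events) throws InterruptedException {
        List<String> seen = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        SubmissionPublisher<String> publisher = new SubmissionPublisher<>();
        publisher.subscribe(new Flow.Subscriber<String>() {
            @Override public void onSubscribe(Flow.Subscription s) { s.request(Long.MAX_VALUE); }
            @Override public void onNext(String item) { seen.add(item); } // like onNext(ackable)
            @Override public void onError(Throwable t) { done.countDown(); }
            @Override public void onComplete() { done.countDown(); }
        });
        events.forEach(publisher::submit);
        publisher.close();   // triggers onComplete downstream
        done.await();        // wait for the publisher's executor thread
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(collect(List.of("event-1", "event-2"))); // [event-1, event-2]
    }
}
```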
Design Decisions
• Top Priority: app stability and throughput
• Asynchronous operations
• Aggressive buffering
• Drops messages if necessary
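The last two decisions combine into one mechanism: a bounded buffer whose enqueue never blocks, dropping (and counting) messages when full so the application stays healthy. A minimal sketch of that policy, illustrative only:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: aggressive buffering with a drop-on-full policy. offer()
// is non-blocking; when the buffer is full the message is dropped
// and counted rather than stalling the application thread.
public class DroppingBuffer {
    private final BlockingQueue<String> queue;
    private long dropped = 0;

    public DroppingBuffer(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    public boolean offer(String msg) {
        if (queue.offer(msg)) return true; // enqueued without blocking
        dropped++;                          // buffer full: drop it
        return false;
    }

    public long droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        DroppingBuffer buf = new DroppingBuffer(2);
        buf.offer("a"); buf.offer("b"); buf.offer("c");
        System.out.println(buf.droppedCount()); // 1
    }
}
```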
Anything Can Fail
Cloud Resiliency
Fault Tolerance Features
• Write and forward with auto-reattached EBS
(Amazon’s Elastic Block Storage)
• disk-backed queue: big-queue
• Customized scaling down
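The write-and-forward idea can be sketched with a file standing in for the EBS-backed volume / big-queue: events are appended to disk so a forwarder can replay them after a crash. Illustrative only; big-queue itself is a memory-mapped implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Sketch: a disk-backed queue. append() persists each message; after
// a restart, replay() recovers everything not yet forwarded.
public class DiskBackedQueue {
    private final Path file;

    public DiskBackedQueue(Path file) {
        this.file = file;
    }

    public void append(String msg) throws IOException {
        Files.writeString(file, msg + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public List<String> replay() throws IOException {
        return Files.exists(file) ? Files.readAllLines(file) : List.of();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("queue", ".log");
        DiskBackedQueue q = new DiskBackedQueue(tmp);
        q.append("event-1");
        q.append("event-2");
        System.out.println(q.replay()); // [event-1, event-2]
    }
}
```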
There’s More to Do
• Contribute to @NetflixOSS !
• Join us :-)
Summary
http://netflix.github.io
+
ElasticSearch
Druid
You can build your own web-scale data
pipeline using open source components
Thank You!
Sudhir Tonse

http://www.linkedin.com/in/sudhirtonse

Twitter: @stonse
Danny Yuan 

http://www.linkedin.com/pub/danny-yuan/4/374/862
Twitter: @g9yuayon
