Hadoop & Cloud @ Netflix:
Taming the Social Data Firehose

06/13/2012

Mohammad Sabah
Senior Data Scientist (@mohammad_sabah)
Algorithms

Everything is personalized
Data / User
§ Plays
§ Behavior
§ Geo-Information
§ Time
§ Ratings
§ Searches
Big Data @ Netflix
§ 25M+ subscribers
§ Ratings: 4M/day
§ Searches: 3M/day
§ Plays: 30M/day
§ Impressions
§ Device info
§ Metadata
§ Social
Interesting Tidbit
§ 2B hours streamed in Q4 2011
§ 75% select movies based on recommendations
§ Moral: We need to scale algorithms.
Technology
Modeling
§ Markov Chains
§ Collaborative Filtering
§ Large-scale Matching
§ LSA
§ Clustering
§ Row Selection
§ Query Categorization
§ Auto-tagging
§ Sentiment Analysis
Markov Chain: Example I

[State transition diagram over three market states; transition probabilities as shown in the original figure, each row summing to 1:]

                Bull Market   Bear Market   Recession
  Bull Market       0.90          0.08         0.02
  Bear Market       0.15          0.80         0.05
  Recession         0.25          0.25         0.50
Markov Chain: Example II

[State transition diagram over three states shown as images in the original slide, with transition probabilities 0.2, 0.3, 0.3, 0.3, 0.4, 0.7, 0.8]
Markov Chain: Formal Definition

§ A Markov chain describes a discrete-time stochastic process over a set of states S = {s1, s2, …, sn} according to a transition probability matrix P = {Pij}
  § Pij = probability of moving to state j when at state i
§ Uses temporal ordering to estimate relatedness
§ The future depends only on today, not on the past
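To make the definition concrete, here is a minimal sketch (illustrative only, not the Netflix implementation) of the Markov property using the Example I matrix: the next state is sampled from the current state's row of P alone.

```python
# Minimal sketch: one step of a Markov chain. The next state depends
# only on the current state's row of P (each row sums to 1).
import random

P = {
    "bull":      {"bull": 0.90, "bear": 0.08, "recession": 0.02},
    "bear":      {"bull": 0.15, "bear": 0.80, "recession": 0.05},
    "recession": {"bull": 0.25, "bear": 0.25, "recession": 0.50},
}

def step(state):
    """Sample the next state from P[state] alone -- the Markov property."""
    next_states, probs = zip(*P[state].items())
    return random.choices(next_states, weights=probs, k=1)[0]

state = "bull"
for _ in range(5):
    state = step(state)
    print(state)
```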
The Math
§ Time Series Aggregation
   <u1, m1, t1>, <u1, m2, t2>, <u2, m3, t3>, …
   <u1> => <m1, t1>, <m2, t2>, <m3, t3>, …
§ Co-occurrence (titles shown as images in the original slide)
   n( · , · ) = 24,000    n( · ) = 30,000
§ Transition Probability
   p( · | · ) = 0.8
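A minimal single-machine sketch of these three steps (illustrative only; the events and field names are made up): aggregate events per user in time order, count consecutive co-occurrences, and divide to get transition probabilities.

```python
# Sketch of: Time Series Aggregation -> Co-occurrence -> Transition Probability.
from collections import defaultdict

# <user, movie, timestamp> tuples (toy data).
events = [("u1", "m1", 1), ("u1", "m2", 2), ("u2", "m3", 3), ("u1", "m3", 4)]

# Time Series Aggregation: <u1> => <m1, t1>, <m2, t2>, ...
series = defaultdict(list)
for user, movie, ts in sorted(events, key=lambda e: e[2]):
    series[user].append(movie)

# Co-occurrence: count consecutive (movie_i -> movie_j) pairs per user,
# plus how often movie_i appears as the source of a transition.
pair_counts, source_counts = defaultdict(int), defaultdict(int)
for movies in series.values():
    for a, b in zip(movies, movies[1:]):
        pair_counts[(a, b)] += 1
        source_counts[a] += 1

# Transition Probability: p(j | i) = n(i, j) / n(i)
p = {(a, b): n / source_counts[a] for (a, b), n in pair_counts.items()}
print(p)   # e.g. {('m1', 'm2'): 1.0, ('m2', 'm3'): 1.0}
```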
Baseline Implementation & Inefficiencies

Baseline:
§ RDBMS/DW-based
§ Stored procedures
§ Once a week (weekend)

Inefficiencies:
§ SQL limitations
§ Expensive copy
§ Does not exploit inherent parallelism
§ Does not scale well (regions, models)
§ 4B+ rows – runs out of memory/space
§ Convoluted joins (maintenance nightmare!)
MapReduce Implementation - I

§ Exploits the inherent parallelism in the algorithm
§ Scale: 25M * 50K (* 50K) ~ 100B+ keys
§ Time Series Aggregation

[Dataflow diagram: Input → Split → Map → Shuffle → Reduce → Result]
  Input:   U1,T1,M1   U2,T2,M3   U3,T3,M1   U1,T3,M5   ...
  Map:     U1,T1,M1 → U1 => <T1,M1>;   U1,T3,M5 → U1 => <T3,M5>;   U2,T2,M3 → U2 => <T2,M3>
  Shuffle: group by user, e.g. U1 => <T3,M5>, U1 => <T1,M1>, ...
  Reduce:  U1 => <T1,M1>, <T3,M5>, ...;   U2 => <T2,M3>, ...
  Result:  U1 => <T1,M1>, ...   U2 => <T2,M3>, ...   U3 => <T3,M4>, ...
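A minimal local simulation of this job's Map → Shuffle → Reduce flow (a sketch for illustration; the slides do not specify whether the production job was written as Java MapReduce or Pig):

```python
# Local simulation of time-series aggregation as Map -> Shuffle -> Reduce.
from collections import defaultdict

records = [("U1", "T1", "M1"), ("U2", "T2", "M3"),
           ("U3", "T3", "M1"), ("U1", "T3", "M5")]

# Map: emit one (user, (timestamp, movie)) pair per input record.
mapped = [(user, (ts, movie)) for user, ts, movie in records]

# Shuffle: group values by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for user, value in mapped:
    grouped[user].append(value)

# Reduce: sort each user's values by timestamp and emit the series.
result = {user: sorted(values) for user, values in grouped.items()}
print(result)   # {'U1': [('T1', 'M1'), ('T3', 'M5')], ...}
```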
MapReduce Implementation - II

§ Transition Probability Matrix

[Dataflow diagram: Input → Split → Map → Shuffle → Reduce → Result]
  Input:   U1 => T1,M1,...   U2 => T2,M1,...   U3 => T3,M3,...
  Map:     emit a count of 1 per consecutive title pair, e.g. M1,M2 => 1;  M1,M3 => 1;  M2,M3 => 1
  Shuffle: group counts by pair, e.g. M1,M3 => 1, 1, 1;  M2,M3 => 1, 1
  Reduce:  sum counts per pair: M1,M3 => 3;  M2,M3 => 2
  Result:  normalize into probabilities: M1,M2 => .2;  M1,M3 => .3;  M2,M3 => .5
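A matching local sketch of this second job: emit a count of 1 per consecutive title pair, sum per pair, then normalize into probabilities (the per-source normalization below is one reasonable choice; the slides do not spell out the exact normalization used):

```python
# Local simulation of the pair-counting job that builds the transition matrix.
from collections import defaultdict

series = {"U1": ["M1", "M2", "M3"], "U2": ["M1", "M3"], "U3": ["M3", "M1", "M3"]}

# Map: ((Mi, Mj), 1) for every consecutive pair in each user's series.
mapped = [((a, b), 1) for movies in series.values()
          for a, b in zip(movies, movies[1:])]

# Shuffle + Reduce: sum the 1s per pair.
pair_counts = defaultdict(int)
for pair, one in mapped:
    pair_counts[pair] += one

# Normalize each source title's outgoing counts into probabilities Pij.
source_totals = defaultdict(int)
for (a, _), n in pair_counts.items():
    source_totals[a] += n
P = {(a, b): n / source_totals[a] for (a, b), n in pair_counts.items()}
print(P)
```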
In a Nutshell
§ You end up with an N * N matrix

    [ 0     0.3   …   0.7 ]
    [ 0.3   0     …   0.7 ]
    [ …                   ]
    [ 0.2   0.1   …   0   ]
But… there is a catch!
Solution!

§ Odds Ratio (see the sketch below)
§ Optimizations
  § Decay
  § Reward
  § In-Window
  § Noise
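The slides name the odds ratio but do not give a formula; the sketch below shows one common formulation, purely to illustrate how an odds ratio can discount globally popular titles that co-occur with everything.

```python
# Hedged sketch of an odds-ratio correction (an assumed formulation,
# not the one used at Netflix): odds of seeing B after A versus the
# odds of seeing B overall, so blockbusters don't dominate.
def odds_ratio(n_ab, n_a, n_b, n_total):
    """n_ab: A->B co-occurrences, n_a: transitions out of A,
    n_b: transitions into B (popularity), n_total: all transitions."""
    p_b_given_a = n_ab / n_a              # transition probability
    p_b = n_b / n_total                   # baseline popularity of B
    odds_given_a = p_b_given_a / (1 - p_b_given_a)
    odds_baseline = p_b / (1 - p_b)
    return odds_given_a / odds_baseline

# A hugely popular title with a modest lift scores lower than a niche
# title watched almost exclusively after A (toy numbers).
print(odds_ratio(n_ab=24_000, n_a=30_000, n_b=5_000_000, n_total=10_000_000))
print(odds_ratio(n_ab=2_400,  n_a=30_000, n_b=10_000,    n_total=10_000_000))
```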
Markov Chain Migration Summary (RDBMS/DW vs. Hadoop)
§ Limited by SQL syntax and semantics vs. can be arbitrarily complex
§ Expensive data copy from data source to data center vs. data copy avoided
§ Does not scale to new models and regions vs. scales beautifully
§ Maintenance nightmare (stored procedures + convoluted joins) vs. easy to maintain (written in a high-level language, e.g. Java, Pig)
§ Resource constraints vs. no special handling needed
Other Algorithms & Challenges

  Entity         Forms
  Star Trek      strtrek, startrek, start trek, star trek, star treck
  South Park     southpark, sothpark, south parl, souh park
  Doctor Who     docter who, doctor wh, docot who, doctor who:
  Prison Break   prision break, prison brake, prison breal
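One hedged sketch of how such noisy query forms could be normalized (the slide presents the challenge, not a method): simple fuzzy string matching against the canonical catalog titles, here using Python's standard-library difflib.

```python
# Illustrative query normalization: map noisy search forms to the
# closest canonical title via fuzzy matching (not the Netflix approach,
# just a minimal example of the idea).
import difflib

catalog = ["Star Trek", "South Park", "Doctor Who", "Prison Break"]
lowered = {title.lower(): title for title in catalog}

def normalize(query, cutoff=0.6):
    """Return the closest canonical title, or None if nothing is close."""
    match = difflib.get_close_matches(query.lower(), lowered.keys(),
                                      n=1, cutoff=cutoff)
    return lowered[match[0]] if match else None

for q in ["strtrek", "sothpark", "docter who", "prision break"]:
    print(q, "->", normalize(q))
```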
§  Think Parallel!
§  Optimize
§  ML + Hadoop
§  Visualize
§  Experiment
§  Bucket Test
§  Iterative Processing


Big Data + Hadoop + Machine Learning
=> Great Customer Experience!
I HAD AN IDEA
I BUILT IT
I PUSHED IT TO TEST
THE TEST WAS POSITIVE
I PUSHED IT LIVE!

We’re hiring!
@mohammad_sabah
msabah@netflix.com
