Hadoop and Cloud at Netflix
Published in: Technology, Education

  1. Hadoop & Cloud @ Netflix: Taming the Social Data Firehose. 06/13/2012. Mohammad Sabah, Senior Data Scientist (@mohammad_sabah)
  2. Algorithms: Everything is personalized
  3. Data / User § Plays § Behavior § Geo-Information § Time § Ratings § Searches
  4. Big Data § 25M+ subscribers @ Netflix § Ratings: 4M/day § Searches: 3M/day § Plays: 30M/day § Impressions § Device info § Metadata § Social
  5. Interesting Tidbit § 2B hours streamed in Q4 2011 § 75% select movies based on recommendations § Moral: We need to scale algorithms.
  6. [Slide without text]
  7. Technology
  8. Modeling § Markov Chains § Collaborative Filtering § Large-scale Matching § LSA § Clustering § Row Selection § Query Categorization § Auto-tagging § Sentiment Analysis
  9. Markov Chain: Example I [state diagram with states Bull Market, Bear Market, Recession; edge probabilities 0.90, 0.08, 0.80, 0.15, 0.02, 0.25, 0.25, 0.05, 0.50]
  10. Markov Chain: Example II [state diagram; edge probabilities 0.8, 0.3, 0.3, 0.4, 0.3, 0.2, 0.7]
  11. Markov Chain: Formal Definition § A Markov chain describes a discrete-time stochastic process over a set of states S = {s1, s2, …, sn} according to a transition probability matrix P = {Pij} § Pij = probability of moving to state j when at state i § Uses temporal ordering to estimate relatedness § The future depends only on today and not on the past
  12. The Math § Time Series Aggregation: <u1, m1, t1>, <u1, m2, t2>, <u2, m3, t3>, … are grouped per user as <u1> => <m1, t1>, <m2, t2>, <m3, t3>, … § Co-occurrence: n( ) = 24,000 and n( ) = 30,000 § Transition Probability: p( ) = 24,000 / 30,000 = 0.8
  13. Baseline Implementation & Inefficiencies. Baseline: § RDBMS/DW-based § Stored procedures § Once a week (weekend). Inefficiencies: § SQL limitation § Expensive copy § Does not exploit inherent parallelism § Does not scale well (region, models) § 4B+ rows: run out of memory/space § Convoluted joins (maintenance nightmare!)
  14. MapReduce Implementation - I § Exploits the inherent parallelism in the algorithm. § Scale: 25M * 50K (* 50K) ~ 100B+ keys § Time Series Aggregation [dataflow diagram: Input, Split, Map, Shuffle, Reduce, Result; input records such as U1, T1, M1 and U2, T2, M3 and U1, T3, M5 are mapped to U1=><T1,M1>, U2=><T2,M3>, U1=><T3,M5> and then grouped per user, e.g. U1 => <T1, M1>, <T3, M5>, … and U2 => <T2, M3>, …]
  15. MapReduce Implementation - II § Transition Probability Matrix [dataflow diagram: Input, Split, Map, Shuffle, Reduce, Result; per-user sequences such as U1 => T1,M1,… are mapped to movie-pair counts such as M1,M2=>1 and M1,M3=>1, which are summed per pair (e.g. M1,M3=>3 and M2,M3=>2); the result lists probabilities such as M1,M2=>.2, M1,M3=>.3, M2,M3=>.5]
  16. In a Nutshell § You end up with an N * N matrix with rows (0, 0.3, …, 0.7), (0.3, 0, …, 0.7), …, (0.2, 0.1, …, 0)
  17. But… there is a catch!
  18. Solution! § Odds Ratio § Optimizations: Decay; Reward; In-Window; Noise
  19. Markov Chain Migration Summary (RDBMS/DW vs. Hadoop)
      § RDBMS/DW: Limited by SQL syntax and semantics. Hadoop: Can be arbitrarily complex.
      § RDBMS/DW: Expensive data copy from data source to data center. Hadoop: Data copy avoided.
      § RDBMS/DW: Does not scale to new models and regions. Hadoop: Scales beautifully.
      § RDBMS/DW: Maintenance nightmare (stored procedures + convoluted joins). Hadoop: Easy to maintain (written in a high-level language, e.g. Java, Pig).
      § RDBMS/DW: Resource constraints. Hadoop: No special handling needed.
  20. Other Algorithms & Challenges (Entity: Forms)
      § Star Trek: strtrek, startrek, start trek, star trek, star treck
      § South Park: southpark, sothpark, south parl, souh park
      § Doctor Who: docter who, doctor wh, docot who, doctor who:
      § Prison Break: prision break, prison brake, prison breal
  21. § Think Parallel! § Optimize § ML + Hadoop § Visualize § Experiment § Bucket Test § Iterative Processing
  22. Big Data + Hadoop + Machine Learning => Great Customer Experience!
  23. I HAD AN IDEA. I BUILT IT. I PUSHED IT TO TEST. THE TEST WAS POSITIVE. I PUSHED IT LIVE! We’re hiring!
  24. @mohammad_sabah § msabah@netflix.com
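
The formal definition on slide 11 can be made concrete with a small sketch. It assumes one particular reading of the nine edge probabilities listed on slide 9 (each row below sums to 1, but which edge carries which number is a guess from the flattened diagram), and the helper names `is_row_stochastic` and `step` are hypothetical.

```python
import math

# Assumed reading of the slide 9 diagram: row i holds the probabilities of
# moving from state i to each state j. The assignment of numbers to edges
# is an assumption, not something the transcript states.
STATES = ["bull", "bear", "recession"]
P = [
    [0.90, 0.08, 0.02],  # from bull market
    [0.15, 0.80, 0.05],  # from bear market
    [0.25, 0.25, 0.50],  # from recession
]


def is_row_stochastic(matrix):
    """Check that every row is a probability distribution."""
    return all(
        all(p >= 0 for p in row) and math.isclose(sum(row), 1.0)
        for row in matrix
    )


def step(dist, matrix):
    """Advance a distribution over states by one time step: dist * P."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Start in a bull market with certainty and look one and two steps ahead.
start = [1.0, 0.0, 0.0]
after_one = step(start, P)
after_two = step(after_one, P)
```

Because the next distribution depends only on the current one and on P, this is the "future depends only on today" property in code form.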
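
The two MapReduce stages on slides 14 and 15 (time series aggregation, then the transition probability matrix) can be sketched in plain Python without a cluster. This is a single-process illustration, not the Hadoop job from the talk. It assumes that only consecutive plays by the same user count as a transition and that each count is normalized by the total number of transitions leaving the first movie; the slides do not spell out either choice. All function names and the sample events are hypothetical.

```python
from collections import defaultdict


def aggregate_by_user(events):
    """Stage 1: group (user, movie, time) events per user, ordered by time."""
    plays = defaultdict(list)
    for user, movie, ts in events:        # "map": key each event by user
        plays[user].append((ts, movie))   # "shuffle": collect values per key
    # "reduce": sort each user's plays by time and keep only the movie ids
    return {user: [m for _, m in sorted(p)] for user, p in plays.items()}


def count_transitions(sequences):
    """Stage 2a: count (previous movie, next movie) pairs across all users."""
    counts = defaultdict(int)
    for movies in sequences.values():
        for prev, nxt in zip(movies, movies[1:]):
            counts[(prev, nxt)] += 1
    return dict(counts)


def transition_probabilities(counts):
    """Stage 2b: normalize each count by the total leaving the first movie."""
    row_totals = defaultdict(int)
    for (prev, _), n in counts.items():
        row_totals[prev] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}


# Made-up sample data in arbitrary arrival order.
events = [
    ("u1", "m1", 1), ("u2", "m1", 5), ("u1", "m2", 2), ("u3", "m2", 1),
    ("u1", "m3", 3), ("u2", "m3", 6), ("u3", "m3", 4),
]
sequences = aggregate_by_user(events)
counts = count_transitions(sequences)
probs = transition_probabilities(counts)
```

Each stage touches one key at a time (a user in stage 1, a movie pair in stage 2), which is what lets the real job spread the work across mappers and reducers.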
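
Slide 20 shows misspelled queries that must resolve to a canonical title. The talk does not say how the matching was done, so the sketch below swaps in the standard library's `difflib.get_close_matches` as a stand-in; the four-title catalog, the 0.6 cutoff, and the `resolve_title` helper are illustrative assumptions.

```python
from difflib import get_close_matches

# Hypothetical catalog holding only the four entities named on slide 20.
CATALOG = ["star trek", "south park", "doctor who", "prison break"]


def resolve_title(query, catalog=CATALOG, cutoff=0.6):
    """Return the closest catalog title for a query, or None if none is close.

    Uses difflib's similarity ratio as a stand-in for whatever matching
    technique the production system used.
    """
    normalized = query.lower().strip()
    matches = get_close_matches(normalized, catalog, n=1, cutoff=cutoff)
    return matches[0] if matches else None


best_guess = resolve_title("strtrek")
no_match = resolve_title("zzzz")
```

Comparing every query against every title is quadratic, so at catalog scale this would need blocking or indexing before it fits the "large-scale matching" label on slide 8.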
