Hadoop & Cloud @ Netflix:
Taming the Social Data Firehose

06/13/2012

Mohammad Sabah
Senior Data Scientist (@mohammad_sabah)
Algorithms

Everything is personalized
Data / User
§ Plays
§ Behavior
§ Geo-Information
§ Time
§ Ratings
§ Searches
Big Data @ Netflix
§ 25M+ subscribers
§ Ratings: 4M/day
§ Searches: 3M/day
§ Plays: 30M/day
§ Impressions
§ Device info
§ Metadata
§ Social
Interesting Tidbit
§ 2B hours streamed in Q4 2011
§ 75% select movies based on recommendations
§ Moral: We need to scale algorithms.
Technology
Modeling
§ Markov Chains
§ Collaborative Filtering
§ Large-scale Matching
§ LSA
§ Clustering
§ Row Selection
§ Query Categorization
§ Auto-tagging
§ Sentiment Analysis
Markov Chain: Example I

[State transition diagram over three market states; transition probabilities as shown in the original figure, each row summing to 1:]

                Bull Market   Bear Market   Recession
  Bull Market       0.90          0.08         0.02
  Bear Market       0.15          0.80         0.05
  Recession         0.25          0.25         0.50
Markov Chain: Example II

[State transition diagram over three states shown as images in the original slide, with transition probabilities 0.2, 0.3, 0.3, 0.3, 0.4, 0.7, 0.8]
Markov Chain: Formal Definition

§ A Markov chain describes a discrete-time stochastic process over a set of states S = {s1, s2, …, sn} according to a transition probability matrix P = {Pij}
  § Pij = probability of moving to state j when at state i
§ Uses temporal ordering to estimate relatedness
§ The future depends only on today, not on the past
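To make the definition concrete, here is a minimal sketch (illustrative only, not the Netflix implementation) of the Markov property using the Example I matrix: the next state is sampled from the current state's row of P alone.

```python
# Minimal sketch: one step of a Markov chain. The next state depends
# only on the current state's row of P (each row sums to 1).
import random

P = {
    "bull":      {"bull": 0.90, "bear": 0.08, "recession": 0.02},
    "bear":      {"bull": 0.15, "bear": 0.80, "recession": 0.05},
    "recession": {"bull": 0.25, "bear": 0.25, "recession": 0.50},
}

def step(state):
    """Sample the next state from P[state] alone -- the Markov property."""
    next_states, probs = zip(*P[state].items())
    return random.choices(next_states, weights=probs, k=1)[0]

state = "bull"
for _ in range(5):
    state = step(state)
    print(state)
```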
The Math
§ Time Series Aggregation
   <u1, m1, t1>, <u1, m2, t2>, <u2, m3, t3>, …
   <u1> => <m1, t1>, <m2, t2>, <m3, t3>, …
§ Co-occurrence (titles shown as images in the original slide)
   n( · , · ) = 24,000    n( · ) = 30,000
§ Transition Probability
   p( · | · ) = 0.8
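A minimal single-machine sketch of these three steps (illustrative only; the events and field names are made up): aggregate events per user in time order, count consecutive co-occurrences, and divide to get transition probabilities.

```python
# Sketch of: Time Series Aggregation -> Co-occurrence -> Transition Probability.
from collections import defaultdict

# <user, movie, timestamp> tuples (toy data).
events = [("u1", "m1", 1), ("u1", "m2", 2), ("u2", "m3", 3), ("u1", "m3", 4)]

# Time Series Aggregation: <u1> => <m1, t1>, <m2, t2>, ...
series = defaultdict(list)
for user, movie, ts in sorted(events, key=lambda e: e[2]):
    series[user].append(movie)

# Co-occurrence: count consecutive (movie_i -> movie_j) pairs per user,
# plus how often movie_i appears as the source of a transition.
pair_counts, source_counts = defaultdict(int), defaultdict(int)
for movies in series.values():
    for a, b in zip(movies, movies[1:]):
        pair_counts[(a, b)] += 1
        source_counts[a] += 1

# Transition Probability: p(j | i) = n(i, j) / n(i)
p = {(a, b): n / source_counts[a] for (a, b), n in pair_counts.items()}
print(p)   # e.g. {('m1', 'm2'): 1.0, ('m2', 'm3'): 1.0}
```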
Baseline Implementation & Inefficiencies

Baseline:
§ RDBMS/DW-based
§ Stored procedures
§ Once a week (weekend)

Inefficiencies:
§ SQL limitations
§ Expensive copy
§ Does not exploit inherent parallelism
§ Does not scale well (regions, models)
§ 4B+ rows – runs out of memory/space
§ Convoluted joins (maintenance nightmare!)
MapReduce Implementation - I

§ Exploits the inherent parallelism in the algorithm
§ Scale: 25M * 50K (* 50K) ~ 100B+ keys
§ Time Series Aggregation

[Dataflow diagram: Input → Split → Map → Shuffle → Reduce → Result]
  Input:   U1,T1,M1   U2,T2,M3   U3,T3,M1   U1,T3,M5   ...
  Map:     U1,T1,M1 → U1 => <T1,M1>;   U1,T3,M5 → U1 => <T3,M5>;   U2,T2,M3 → U2 => <T2,M3>
  Shuffle: group by user, e.g. U1 => <T3,M5>, U1 => <T1,M1>, ...
  Reduce:  U1 => <T1,M1>, <T3,M5>, ...;   U2 => <T2,M3>, ...
  Result:  U1 => <T1,M1>, ...   U2 => <T2,M3>, ...   U3 => <T3,M4>, ...
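A minimal local simulation of this job's Map → Shuffle → Reduce flow (a sketch for illustration; the slides do not specify whether the production job was written as Java MapReduce or Pig):

```python
# Local simulation of time-series aggregation as Map -> Shuffle -> Reduce.
from collections import defaultdict

records = [("U1", "T1", "M1"), ("U2", "T2", "M3"),
           ("U3", "T3", "M1"), ("U1", "T3", "M5")]

# Map: emit one (user, (timestamp, movie)) pair per input record.
mapped = [(user, (ts, movie)) for user, ts, movie in records]

# Shuffle: group values by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for user, value in mapped:
    grouped[user].append(value)

# Reduce: sort each user's values by timestamp and emit the series.
result = {user: sorted(values) for user, values in grouped.items()}
print(result)   # {'U1': [('T1', 'M1'), ('T3', 'M5')], ...}
```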
MapReduce Implementation - II

§ Transition Probability Matrix

[Dataflow diagram: Input → Split → Map → Shuffle → Reduce → Result]
  Input:   U1 => T1,M1,...   U2 => T2,M1,...   U3 => T3,M3,...
  Map:     emit a count of 1 per consecutive title pair, e.g. M1,M2 => 1;  M1,M3 => 1;  M2,M3 => 1
  Shuffle: group counts by pair, e.g. M1,M3 => 1, 1, 1;  M2,M3 => 1, 1
  Reduce:  sum counts per pair: M1,M3 => 3;  M2,M3 => 2
  Result:  normalize into probabilities: M1,M2 => .2;  M1,M3 => .3;  M2,M3 => .5
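A matching local sketch of this second job: emit a count of 1 per consecutive title pair, sum per pair, then normalize into probabilities (the per-source normalization below is one reasonable choice; the slides do not spell out the exact normalization used):

```python
# Local simulation of the pair-counting job that builds the transition matrix.
from collections import defaultdict

series = {"U1": ["M1", "M2", "M3"], "U2": ["M1", "M3"], "U3": ["M3", "M1", "M3"]}

# Map: ((Mi, Mj), 1) for every consecutive pair in each user's series.
mapped = [((a, b), 1) for movies in series.values()
          for a, b in zip(movies, movies[1:])]

# Shuffle + Reduce: sum the 1s per pair.
pair_counts = defaultdict(int)
for pair, one in mapped:
    pair_counts[pair] += one

# Normalize each source title's outgoing counts into probabilities Pij.
source_totals = defaultdict(int)
for (a, _), n in pair_counts.items():
    source_totals[a] += n
P = {(a, b): n / source_totals[a] for (a, b), n in pair_counts.items()}
print(P)
```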
In a Nutshell
§ You end up with an N * N matrix

    [ 0     0.3   …   0.7 ]
    [ 0.3   0     …   0.7 ]
    [ …                   ]
    [ 0.2   0.1   …   0   ]
But… there is a catch!
Solution!

§ Odds Ratio (see the sketch below)
§ Optimizations
  § Decay
  § Reward
  § In-Window
  § Noise
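The slides name the odds ratio but do not give a formula; the sketch below shows one common formulation, purely to illustrate how an odds ratio can discount globally popular titles that co-occur with everything.

```python
# Hedged sketch of an odds-ratio correction (an assumed formulation,
# not the one used at Netflix): odds of seeing B after A versus the
# odds of seeing B overall, so blockbusters don't dominate.
def odds_ratio(n_ab, n_a, n_b, n_total):
    """n_ab: A->B co-occurrences, n_a: transitions out of A,
    n_b: transitions into B (popularity), n_total: all transitions."""
    p_b_given_a = n_ab / n_a              # transition probability
    p_b = n_b / n_total                   # baseline popularity of B
    odds_given_a = p_b_given_a / (1 - p_b_given_a)
    odds_baseline = p_b / (1 - p_b)
    return odds_given_a / odds_baseline

# A hugely popular title with a modest lift scores lower than a niche
# title watched almost exclusively after A (toy numbers).
print(odds_ratio(n_ab=24_000, n_a=30_000, n_b=5_000_000, n_total=10_000_000))
print(odds_ratio(n_ab=2_400,  n_a=30_000, n_b=10_000,    n_total=10_000_000))
```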
Markov Chain Migration Summary (RDBMS/DW vs. Hadoop)
§ Limited by SQL syntax and semantics vs. can be arbitrarily complex
§ Expensive data copy from data source to data center vs. data copy avoided
§ Does not scale to new models and regions vs. scales beautifully
§ Maintenance nightmare (stored procedures + convoluted joins) vs. easy to maintain (written in a high-level language, e.g. Java, Pig)
§ Resource constraints vs. no special handling needed
Other Algorithms & Challenges

  Entity         Forms
  Star Trek      strtrek, startrek, start trek, star trek, star treck
  South Park     southpark, sothpark, south parl, souh park
  Doctor Who     docter who, doctor wh, docot who, doctor who:
  Prison Break   prision break, prison brake, prison breal
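One hedged sketch of how such noisy query forms could be normalized (the slide presents the challenge, not a method): simple fuzzy string matching against the canonical catalog titles, here using Python's standard-library difflib.

```python
# Illustrative query normalization: map noisy search forms to the
# closest canonical title via fuzzy matching (not the Netflix approach,
# just a minimal example of the idea).
import difflib

catalog = ["Star Trek", "South Park", "Doctor Who", "Prison Break"]
lowered = {title.lower(): title for title in catalog}

def normalize(query, cutoff=0.6):
    """Return the closest canonical title, or None if nothing is close."""
    match = difflib.get_close_matches(query.lower(), lowered.keys(),
                                      n=1, cutoff=cutoff)
    return lowered[match[0]] if match else None

for q in ["strtrek", "sothpark", "docter who", "prision break"]:
    print(q, "->", normalize(q))
```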
§  Think Parallel!
§  Optimize
§  ML + Hadoop
§  Visualize
§  Experiment
§  Bucket Test
§  Iterative Processing


Big Data + Hadoop + Machine Learning
=> Great Customer Experience!
I HAD AN IDEA
I BUILT IT
I PUSHED IT TO TEST
THE TEST WAS POSITIVE
I PUSHED IT LIVE!

We’re hiring!
@mohammad_sabah
msabah@netflix.com
