Large-scale Data Mining: MapReduce and Beyond
Part 2: Algorithms

Spiros Papadimitriou, IBM Research
Jimeng Sun, IBM Research
Rong Yan, Facebook
Part 2: Mining using MapReduce
       Mining algorithms using MapReduce
         Information retrieval
         Graph algorithms: PageRank
         Clustering: Canopy clustering, KMeans
         Classification: kNN, Naïve Bayes
       MapReduce Mining Summary
2
MapReduce Interface and Data Flow
       Map: (K1, V1) → list(K2, V2)
       Combine: (K2, list(V2)) → list(K2, V2)
       Partition: (K2, V2) → reducer_id
       Reduce: (K2, list(V2)) → list(K3, V3)

[Figure: inverted-index data flow — mappers and combiners on Hosts 1 and 2 turn (id, doc) into list(w, id) and then list(unique_w, id); the partitioner routes w1 to a reducer on Host 3 and w2 to a reducer on Host 4, which emit (w, list(unique_id)).]
3
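To make the four signatures concrete, here is a minimal single-process sketch (not from the deck) that simulates Map, Combine, Partition and Reduce on a word-count job; the hash partitioner and the two-reducer split are illustrative assumptions.

```python
# Minimal single-process sketch of the Map / Combine / Partition / Reduce stages,
# illustrated with word count. The two-"reducer" split only demonstrates partitioning.
from collections import defaultdict

def map_fn(doc_id, doc):                 # (K1, V1) -> list(K2, V2)
    return [(word, 1) for word in doc.split()]

def combine_fn(word, counts):            # (K2, list(V2)) -> list(K2, V2)
    return [(word, sum(counts))]         # local pre-aggregation on mapper output

def partition_fn(word, num_reducers):    # (K2, V2) -> reducer_id (illustrative hash partitioner)
    return hash(word) % num_reducers

def reduce_fn(word, counts):             # (K2, list(V2)) -> list(K3, V3)
    return [(word, sum(counts))]

def run_job(inputs, num_reducers=2):
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, doc in inputs:                       # one "mapper" per input split
        local = defaultdict(list)
        for k, v in map_fn(doc_id, doc):
            local[k].append(v)
        for k, vs in local.items():                  # combiner runs on each mapper's output
            for k2, v2 in combine_fn(k, vs):
                shuffled[partition_fn(k2, num_reducers)][k2].append(v2)
    out = []
    for groups in shuffled:                          # each reducer sees only its partition
        for k, vs in sorted(groups.items()):
            out.extend(reduce_fn(k, vs))
    return out

print(run_job([(1, "data mining with map reduce"), (2, "mining data at scale")]))
```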
Information retrieval using MapReduce



4
IR: Distributed Grep
       Find the doc_id and line# of a matching pattern
       Map: (id, doc) → list(id, line#)
       Reduce: None

    Grep "data mining"
[Figure: docs 1–6 are split across three mappers; the matches emitted are Map1: <1, 123>; Map2: <3, 717>, <5, 1231>; Map3: <6, 1012>.]
5
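A sketch of the map-only grep job: the pattern, documents and line numbers below are made-up examples, and each call to grep_map stands in for one mapper working on its split.

```python
# Map-only distributed grep sketch: each mapper scans its documents and emits
# (doc_id, line_number) for lines containing the pattern; no reducer is needed.
def grep_map(doc_id, doc_text, pattern="data mining"):
    matches = []
    for line_no, line in enumerate(doc_text.splitlines(), start=1):
        if pattern in line:
            matches.append((doc_id, line_no))
    return matches

docs = {1: "intro\nlarge-scale data mining", 3: "data mining with MapReduce"}
for doc_id, text in docs.items():
    for key_value in grep_map(doc_id, text):
        print(key_value)          # e.g. (1, 2) means doc 1, line 2
```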
IR: URL Access Frequency
       Map: (null, log) → (URL, 1)
       Reduce: (URL, list(1)) → (URL, total_count)

    Logs
[Figure: Map1 emits <u1,1>, <u1,1>, <u2,1>; Map2 and Map3 each emit <u3,1>; the reducer outputs <u1,2>, <u2,1>, <u3,2>.]
                      Also described in Part 1
6
IR: Reverse Web-Link Graph
       Map: (null, page) → (target, source)
       Reduce: (target, list(source)) → (target, list(source))

    Pages
[Figure: Map1 emits <t1,s2>; Map2 emits <t2,s3>, <t2,s5>; Map3 emits <t3,s5>; the reducer outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.]
                       This is the same as a matrix transpose.
7
IR: Inverted Index
       Map: (id, doc) → list(word, id)
       Reduce: (word, list(id)) → (word, list(id))

    Docs
[Figure: Map1 emits <w1,1>; Map2 emits <w2,2>, <w3,3>; Map3 emits <w1,5>; the reducer outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
8
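A small sketch of the inverted-index job, with the shuffle simulated by a dictionary; the documents and words are invented to mirror the figure.

```python
# Inverted-index sketch: mappers emit (word, doc_id); the shuffle groups by word,
# and the reducer simply materializes the grouped posting list.
from collections import defaultdict

def index_map(doc_id, doc):
    return [(word, doc_id) for word in set(doc.split())]   # unique words per doc

def index_reduce(word, doc_ids):
    return (word, sorted(doc_ids))

postings = defaultdict(list)                 # simulated shuffle / group-by-key
for doc_id, doc in [(1, "w1 w4"), (2, "w2"), (3, "w3"), (5, "w1")]:
    for word, d in index_map(doc_id, doc):
        postings[word].append(d)

for word in sorted(postings):
    print(index_reduce(word, postings[word]))   # ('w1', [1, 5]) etc.
```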
Graph mining using MapReduce




9
PageRank
        The PageRank vector q is defined as
             q = c·Aᵀq + ((1−c)/N)·e
        where the first term models browsing and the second teleporting:
           A is the source-by-destination adjacency matrix; for the 4-node example graph
                 A = | 0 1 1 1 |
                     | 0 0 1 1 |
                     | 0 0 0 1 |
                     | 0 0 1 0 |
           e is the all-one vector
           N is the number of nodes
           c is a weight between 0 and 1 (e.g. 0.85)
        PageRank indicates the importance of a page.
        Algorithm: iterative powering for finding the first eigenvector
10
MapReduce: PageRank
  Map: distribute PageRank qx along out-links; Reduce: update the new PageRank.

 PageRank Map()
  Input: key = page x, value = (PageRank qx, links [y1…ym])
  Output: key = page x, value = partial
  1. Emit(x, 0)            // guarantee all pages will be emitted
  2. For each outgoing link yi:
         Emit(yi, qx/m)

 PageRank Reduce()
  Input: key = page x, value = the list of partial values
  Output: key = page x, value = PageRank qx
  1. qx = 0
  2. For each partial value d in the list: qx += d
  3. qx = c·qx + (1−c)/N
  4. Emit(x, qx)

 Check out Kang et al. ICDM'09
11
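A possible single-machine rendering of one PageRank iteration per MapReduce job, using the 4-page example graph from the previous slide; the damping weight c = 0.85 and the 20 iterations are illustrative choices.

```python
# One PageRank iteration expressed as map + reduce, following the slide's pseudocode.
from collections import defaultdict

links = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [3]}      # page -> out-links (example graph)
q = {x: 0.25 for x in links}                            # initial PageRank
c, N = 0.85, len(links)

def pagerank_map(x, qx, out_links):
    yield (x, 0.0)                                      # guarantee every page is emitted
    m = len(out_links)
    for y in out_links:
        yield (y, qx / m)                               # distribute qx along out-links

def pagerank_reduce(x, partials):
    qx = sum(partials)
    return (x, c * qx + (1 - c) / N)

for _ in range(20):                                     # one MR job per power iteration
    grouped = defaultdict(list)                         # simulated shuffle
    for x in links:
        for y, partial in pagerank_map(x, q[x], links[x]):
            grouped[y].append(partial)
    q = dict(pagerank_reduce(x, ps) for x, ps in grouped.items())

print(q)
```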
Clustering using MapReduce




12
Canopy: single-pass clustering
 Canopy creation
  Construct overlapping clusters – canopies
  Make sure no two canopies overlap too much
          Key: no two canopy centers are too close to each other

[Figure: left, overlapping clusters C1–C4; middle, canopies with too much overlap; right, the two thresholds T1 > T2 drawn as concentric circles.]

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
13
Canopy creation
 Input: 1) points, 2) thresholds T1, T2 where T1 > T2
 Output: cluster centroids
  Put all points into a queue Q
  While (Q is not empty)
    – p = dequeue(Q)
    – For each canopy c:
          if dist(p,c) < T1: c.add(p)
          if dist(p,c) < T2: strongBound = true
    – If not strongBound: create canopy at p
  For each canopy c:
    – Set centroid to the mean of all points in c

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
14
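A runnable sketch of the sequential canopy-creation pseudocode above, using 1-D points and absolute distance purely for illustration; the thresholds and data are made up.

```python
# Sequential canopy creation sketch (1-D points, absolute distance). T1 > T2;
# points within T2 of an existing canopy are "strongly bound" and never start a new one.
def canopy_create(points, T1, T2):
    canopies = []                                  # each canopy keeps its member points
    centers = []                                   # the point that founded each canopy
    for p in points:
        strong_bound = False
        for i, center in enumerate(centers):
            d = abs(p - center)
            if d < T1:
                canopies[i].append(p)              # p joins this (possibly overlapping) canopy
            if d < T2:
                strong_bound = True
        if not strong_bound:
            centers.append(p)                      # p becomes a new canopy center
            canopies.append([p])
    return [sum(members) / len(members) for members in canopies]   # centroids

print(canopy_create([1.0, 1.2, 5.0, 5.3, 9.0], T1=2.0, T2=0.5))
```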
Canopy creation (cont.)
[Figure: the T1 and T2 radii around canopy centers C1 and C2; points within T2 are strongly marked and cannot start new canopies.]
 (Same pseudocode as above.)

      McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
15
Canopy creation (cont.)
[Figure: a completed canopy — the canopy center, the strongly marked points near it, and the other points in the cluster that fall inside the canopy.]
 (Same pseudocode as above.)

      McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
16
MapReduce - Canopy Map()
 Canopy creation Map()
  Input: a set of points P, thresholds T1, T2
  Output: key = null; value = a list of local canopies (total, count)
  For each p in P:
        For each canopy c:
           • if dist(p,c) < T1 then c.total += p, c.count++
           • if dist(p,c) < T2 then strongBound = true
        If not strongBound then create canopy at p
 Close()
  For each canopy c:
     • Emit(null, (total, count))

[Figure: Map1 and Map2 each form local canopies over their own split of the points.]
     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
19
MapReduce - Canopy Reduce()
 Reduce()
  Input: key = null; input values (total, count)
  Output: key = null; value = cluster centroids
  For each intermediate value:
         p = total/count
         For each canopy c:
            • if dist(p,c) < T1 then c.total += p, c.count++
            • if dist(p,c) < T2 then strongBound = true
         If not strongBound then create canopy at p
 Close()
  For each canopy c: emit(null, c.total/c.count)

 For simplicity we assume only one reducer.
[Figure: the reducer merges the local canopies produced by Map1 and Map2.]
     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
20
MapReduce - Canopy Reduce() (cont.)
[Figure: the final reducer results — the merged canopy centroids.]
 (Same Reduce() pseudocode as above.)
 Remark: This assumes only one reducer.

      McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
21
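A compact sketch of the canopy Map()/Reduce() pair above, again with 1-D points and absolute distance; the two mapper splits and the thresholds are illustrative, and the single reducer mirrors the remark on the slide.

```python
# Canopy MapReduce sketch: each mapper builds local canopies and emits (total, count);
# one reducer re-clusters the local centroids the same way and emits final centers.
def local_canopies(points, T1, T2):
    canopies = []                           # list of [center, total, count]
    for p in points:
        strong_bound = False
        for c in canopies:
            d = abs(p - c[0])
            if d < T1:
                c[1] += p; c[2] += 1
            if d < T2:
                strong_bound = True
        if not strong_bound:
            canopies.append([p, p, 1])
    return [(c[1], c[2]) for c in canopies]          # Emit(null, (total, count))

def canopy_reduce(values, T1, T2):
    centers = [total / count for total, count in values]   # p = total/count
    return local_canopies(centers, T1, T2)                  # re-cluster the local centroids

map1 = local_canopies([1.0, 1.2, 5.0], T1=2.0, T2=0.5)      # one mapper's split
map2 = local_canopies([5.3, 9.0], T1=2.0, T2=0.5)            # another mapper's split
finals = canopy_reduce(map1 + map2, T1=2.0, T2=0.5)          # single reducer
print([total / count for total, count in finals])            # final cluster centroids
```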
Clustering Assignment
 Clustering assignment
  For each point p:
     Assign p to the closest canopy center

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
22
MapReduce: Cluster Assignment
 Cluster assignment Map()
  Input: point p; cluster centroids
  Output:
         key = cluster id
         value = point id
  currentDist = inf
  For each cluster centroid c:
         If dist(p,c) < currentDist
          then bestCluster = c, currentDist = dist(p,c)
  Emit(bestCluster, p)

  Results can be directly written back to HDFS without a reducer, or an
  identity reducer can be applied to have the output sorted on cluster id.

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
23
KMeans: Multi-pass clustering
 Traditional KMeans():
  While not converged:
       AssignCluster()
       UpdateCentroids()

 AssignCluster():
  For each point p:
       Assign p to the closest centroid c

 UpdateCentroids():
  For each cluster:
       Update the cluster center to the mean of its points
24
MapReduce – KMeans
 KmeansIter()
 Map(p)   // Assign Cluster
  For c in clusters:
        If dist(p,c) < minDist,
         then minC = c, minDist = dist(p,c)
  Emit(minC.id, (p, 1))
 Reduce()   // Update Centroids
  For all values (p, c):
        total += p; count += c
  Emit(key, (total, count))

[Figure: Map1 and Map2 assign their points to the initial centroids.]
25
MapReduce – KMeans (cont.)
 Map: assign each point p to the closest centroid.
 Reduce: update each centroid with its new location (total, count).
 (Same KmeansIter() pseudocode as above.)

[Figure: Map1/Map2 assign points to centroids 1–4; Reduce1/Reduce2 recompute each centroid from the initial positions.]
26
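A sketch of one KmeansIter() pass: the map step assigns each point to its nearest centroid and the reduce step averages the assigned points; the 2-D toy data and the initial centroids are invented.

```python
# One KMeans iteration as map + reduce: map emits (centroid_id, (p, 1)); reduce sums the
# points and counts per centroid, and the new centroid is total/count.
from collections import defaultdict
import math

def kmeans_iter(points, centroids):
    grouped = defaultdict(lambda: [(0.0, 0.0), 0])            # centroid_id -> [total, count]
    for p in points:                                          # Map: assign cluster
        cid = min(centroids, key=lambda c: math.dist(p, centroids[c]))
        total, count = grouped[cid]
        grouped[cid] = [(total[0] + p[0], total[1] + p[1]), count + 1]
    return {cid: (total[0] / count, total[1] / count)          # Reduce: update centroids
            for cid, (total, count) in grouped.items()}

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}                   # initial centroids
for _ in range(5):                                             # one MR job per iteration
    centroids = kmeans_iter(points, centroids)
print(centroids)
```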
Classification using MapReduce




27
MapReduce kNN
 K = 3
[Figure: labeled points (0s and 1s) scattered around a query point, split between Map1 and Map2.]

 Map()
  Input:
      – all points (local split)
      – query point p
  Output: k nearest neighbors (local)
  Emit the k closest points to p

 Reduce()
  Input:
      – key: null; values: local neighbors
      – query point p
  Output: k nearest neighbors (global)
  Emit the k closest points to p among all local neighbors
28
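A minimal sketch of the kNN job: each mapper keeps its local k closest points and a single reducer merges them; the shards, labels and query value are made up.

```python
# kNN sketch: mappers return the k closest labeled points to the query from their shard;
# the reducer merges the local candidate sets and keeps the global k closest.
import heapq

def knn_map(local_points, query, k=3):
    return heapq.nsmallest(k, local_points, key=lambda pt: abs(pt[0] - query))

def knn_reduce(local_neighbor_lists, query, k=3):
    merged = [pt for lst in local_neighbor_lists for pt in lst]
    return heapq.nsmallest(k, merged, key=lambda pt: abs(pt[0] - query))

shard1 = [(0.5, 1), (2.0, 0), (3.1, 1)]          # (value, label) pairs on mapper 1
shard2 = [(1.1, 0), (4.0, 1), (0.9, 0)]          # (value, label) pairs on mapper 2
query = 1.0
local_results = [knn_map(shard1, query), knn_map(shard2, query)]
print(knn_reduce(local_results, query))          # global 3 nearest neighbors of the query
```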
Naïve Bayes
        Formulation: P(c|d) ∝ P(c) · ∏_{w∈d} P(w|c),
         where c is a class label, d is a doc, w is a word
        Parameter estimation
          Class prior: P̂(c) = Nc / N, where Nc is #docs in c and N is #docs
          Conditional probability: P̂(w|c) = Tcw / Σ_{w'} Tcw',
           where Tcw is the # of occurrences of w in class c

[Figure: docs d1–d3 (term vectors) belong to class c1 with N1 = 3 and word counts T1w; docs d4–d7 belong to class c2 with N2 = 4 and word counts T2w.]

           Goals:
           1.   total number of docs N
           2.   number of docs in c: Nc
           3.   word count histogram in c: Tcw
           4.   total word count in c: Σ_{w'} Tcw'
30
MapReduce: Naïve Bayes
  Naïve Bayes can be implemented using MapReduce jobs of histogram computation.

        Goals:
        1.   total number of docs N
        2.   number of docs in c: Nc
        3.   word count histogram in c: Tcw
        4.   total word count in c: Σ_{w'} Tcw'

 ClassPrior()
 Map(doc):
   Emit(class_id, (doc_id, doc_length))
 Combine()/Reduce()
  Nc = 0; sTcw = 0
  For each doc_id:
      Nc++; sTcw += doc_length
  Emit(c, (Nc, sTcw))

 ConditionalProbability()
 Map(doc):
  For each word w in doc:
      Emit(pair(c, w), 1)
 Combine()/Reduce()
  Tcw = 0
  For each value v: Tcw += v
  Emit(pair(c, w), Tcw)
31
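A sketch of the two Naïve Bayes counting jobs above, with the shuffle simulated by dictionaries; the tiny labeled corpus is invented and smoothing is omitted for brevity.

```python
# Job 1 (ClassPrior) counts docs per class; job 2 (ConditionalProbability) counts
# (class, word) occurrences. Dictionaries stand in for the shuffle/group-by-key.
from collections import Counter, defaultdict

docs = [("c1", "w1 w2"), ("c1", "w1"), ("c1", "w3"),          # labeled training docs
        ("c2", "w2 w2 w3"), ("c2", "w3"), ("c2", "w1 w3"), ("c2", "w2")]

# Job 1: class prior P(c) = Nc / N
Nc = Counter(c for c, _ in docs)
N = len(docs)
prior = {c: Nc[c] / N for c in Nc}

# Job 2: conditional P(w|c) = Tcw / sum_w' Tcw'
Tcw = defaultdict(Counter)
for c, doc in docs:                          # Map: Emit((c, w), 1); Reduce: sum per key
    for w in doc.split():
        Tcw[c][w] += 1
cond = {c: {w: Tcw[c][w] / sum(Tcw[c].values()) for w in Tcw[c]} for c in Tcw}

print(prior)
print(cond)
```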
MapReduce Mining Summary




32
Taxonomy of MapReduce algorithms

                          One Iteration         Multiple Iterations    Not good for MapReduce
      Clustering          Canopy                KMeans
      Classification      Naïve Bayes, kNN      Gaussian Mixture       SVM
      Graphs                                    PageRank
      Information         Inverted Index        Topic modeling
      Retrieval                                  (PLSI, LDA)

         One-iteration algorithms are perfect fits.
         Multiple-iteration algorithms are OK fits,
              but small shared info has to be synchronized across iterations (typically through the
               filesystem).
         Some algorithms are not a good fit for the MapReduce framework.
              Those algorithms typically require large shared info with a lot of synchronization.
              Traditional parallel frameworks like MPI are better suited for those.
33
MapReduce for machine learning algorithms

         The key is to convert the algorithm into summation form
          (Statistical Query model [Kearns'94]):
             y = Σᵢ f(xᵢ), where f(x) corresponds to map()
              and the summation Σ corresponds to reduce().
         Naïve Bayes
           MR job: P(c)
           MR job: P(w|c)
         KMeans
           MR job: split data into subgroups and compute partial
              sums in Map(), then sum up in Reduce()

34   Map-Reduce for Machine Learning on Multicore [NIPS'06]
Machine learning algorithms using MapReduce

           Linear Regression:   θ̂ = (XᵀX)⁻¹ Xᵀy,
                       where X ∈ ℝ^(m×n) and y ∈ ℝ^m
             MR job1:   A = XᵀX = Σᵢ xᵢxᵢᵀ
             MR job2:   b = Xᵀy = Σᵢ xᵢ y⁽ⁱ⁾
           Finally, solve A·θ̂ = b

           Locally Weighted Linear Regression:
             MR job1:   A = Σᵢ wᵢ xᵢxᵢᵀ
             MR job2:   b = Σᵢ wᵢ xᵢ y⁽ⁱ⁾
35   Map-Reduce for Machine Learning on Multicore [NIPS'06]
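A sketch of the two summation jobs for linear regression using numpy: each "mapper" shard contributes partial sums of xᵢxᵢᵀ and xᵢy⁽ⁱ⁾, the sums play the role of reduce, and the driver solves Aθ = b; the data and split boundaries are made up for illustration.

```python
# Linear regression via summation-form MapReduce: partial sums per data split, then solve.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # m x n design matrix
y = np.array([1.0, 3.0, 5.0, 7.0])                                # targets (exactly y = 1 + 2x)

splits = [(X[:2], y[:2]), (X[2:], y[2:])]                         # two "mapper" shards

# MR job 1: A = X^T X = sum_i x_i x_i^T  (each shard emits a partial sum; the sum is the reduce)
A = sum(Xs.T @ Xs for Xs, _ in splits)
# MR job 2: b = X^T y = sum_i x_i y_i
b = sum(Xs.T @ ys for Xs, ys in splits)

theta = np.linalg.solve(A, b)                                     # finally, solve A theta = b
print(theta)                                                      # approximately [1.0, 2.0]
```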
Machine learning algorithms using MapReduce
     (cont.)
      Logistic Regression
      Neural Networks
      PCA
      ICA
      EM for Gaussian Mixture Model




36   Map-Reduce for Machine Learning on Multicore [NIPS’06]
MapReduce Mining Resources




37
Mahout: Hadoop data mining library
         Mahout: http://lucene.apache.org/mahout/
            scalable data mining libraries, mostly implemented on Hadoop
         Data structures for vectors and matrices
            Vectors
                Dense vectors as double[]
                Sparse vectors as HashMap<Integer, Double>
                Operations: assign, cardinality, copy, divide, dot, get,
                  haveSharedCells, like, minus, normalize, plus, set, size, times,
                  toArray, viewPart, zSum and cross
            Matrices
                Dense matrix as a double[][]
                SparseRowMatrix or SparseColumnMatrix as a Vector[] holding the
                  rows or columns of the matrix as SparseVectors
                SparseMatrix as a HashMap<Integer, Vector>
                Operations: assign, assignColumn, assignRow, cardinality, copy,
                  divide, get, haveSharedCells, like, minus, plus, set, size, times,
                  transpose, toArray, viewPart and zSum
38
MapReduce Mining Papers
         [Chu et al. NIPS'06] Map-Reduce for Machine Learning on Multicore
             General framework under MapReduce
         [Papadimitriou et al. ICDM'08] DisCo: Distributed Co-clustering with Map-Reduce
             Co-clustering
         [Kang et al. ICDM'09] PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
             Graph algorithms
         [Das et al. WWW'07] Google news personalization: scalable online collaborative filtering
             PLSI EM
         [Grossman+Gu KDD'08] Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
             An alternative to Hadoop which supports wide-area data collection and distribution.
39
Summary: algorithms
         Best for MapReduce:
           Single pass; keys are uniformly distributed.
         OK for MapReduce:
           Multiple passes; intermediate state is small.
         Bad for MapReduce:
           Key distribution is skewed.
           Fine-grained synchronization is required.
40
Large-scale Data Mining: MapReduce and Beyond
Part 2: Algorithms

Spiros Papadimitriou, IBM Research
Jimeng Sun, IBM Research
Rong Yan, Facebook