Matrix Methods
with Hadoop
Slides bit.ly/10SIe1A
Code github.com/dgleich/matrix-hadoop-tutorial


DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE
PURDUE UNIVERSITY





                                         David Gleich · Purdue
   bit.ly/10SIe1A
A bit of philosophy …




    Image from rockysprings, deviantart, CC share-alike




Matrix computations
\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]

Operations: $Ax$
Linear systems: $Ax = b$
Least squares: $\min \|Ax - b\|$
Eigenvalues: $Ax = \lambda x$

Outcomes
Recognize relationships between matrix methods and things you’ve already been doing
   Example SQL queries as matrix computations

Understand how to use Hadoop to compute these matrix methods at scale for BigData
   Example Recommenders with social network info

Understand some of the issues that could arise.

Ideal outcomes



                  How to use techniques from
                  matrix computations in order
                  to solve your problems quickly!




     1986




Taking the red pill …




    Image from rockysprings, deviantart, CC share-alike




Matrix computations
                Physics
            Databases
               Statistics
       Machine learning
              Engineering
     Information retrieval
               Graphics
         Computer vision
             Bioinformatics
     Social networks
                    
                   


matrix computations ≠ linear algebra

A SQL statement as a matrix computation

http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql

How do I find the average rating for each product?

    SELECT
         p.product_id,
         p.name,
         AVG(pr.rating) AS rating_average
    FROM products p
    INNER JOIN product_ratings pr
    ON pr.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY rating_average DESC

This SQL statement is a matrix computation!

    Image from rockysprings, deviantart, CC share-alike
    SELECT
         ...
         AVG(pr.rating)
    ...
    GROUP BY p.product_id

product_ratings

pid8 uid2 4
pid9 uid9 1
pid2 uid9 5
pid9 uid5 5
pid6 uid8 4
pid1 uid2 4
pid3 uid4 4
pid5 uid9 2
pid9 uid8 4
pid9 uid9 1

Is a matrix!

[Figure: the same data drawn as a matrix with rows pid1, pid2, ..., pid9]

But it’s a weird matrix

[Figure: the product_ratings table drawn as a matrix with rows pid1, pid2, ..., pid9]

Missing entries!

But it’s a weird matrix

[Figure: the product_ratings matrix (rows pid1, pid2, ..., pid9) is mapped to the vector "Average of ratings"]

Matrix to Vector: SELECT AVG(r) ... GROUP BY pid

But it’s a weird matrix and not a linear operator

product_ratings is a matrix!

\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]

\[
\operatorname{avg}(A) = \begin{bmatrix}
\sum_j A_{1,j} \,/\, \sum_j \text{``}A_{1,j} \neq 0\text{''} \\
\sum_j A_{2,j} \,/\, \sum_j \text{``}A_{2,j} \neq 0\text{''} \\
\vdots \\
\sum_j A_{m,j} \,/\, \sum_j \text{``}A_{m,j} \neq 0\text{''}
\end{bmatrix}
\]

matrix computations ≠ linear algebra

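… but there is a linear operator hiding …

\[
P = \begin{bmatrix}
A_{1,1} \,/\, \sum_j \text{``}A_{1,j} \neq 0\text{''} & A_{1,2} \,/\, \sum_j \text{``}A_{1,j} \neq 0\text{''} & \cdots \\
A_{2,1} \,/\, \sum_j \text{``}A_{2,j} \neq 0\text{''} & A_{2,2} \,/\, \sum_j \text{``}A_{2,j} \neq 0\text{''} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
\]

\[
\operatorname{avg}(A) = Pe
\]

e is the vector of all ones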
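
A minimal sketch of avg(A) = Pe in plain Python, assuming the ratings are stored as (product, user, rating) triples and that every stored rating is nonzero. The data is the product_ratings table with the repeated (pid9, uid9) row listed once; the helper names build_rows, row_scaled and times_ones are illustrative and do not come from the talk's code.

```python
from collections import defaultdict

# (product, user, rating) triples from the product_ratings table,
# with the repeated (pid9, uid9) row listed once.
ratings = [
    ("pid8", "uid2", 4), ("pid9", "uid9", 1), ("pid2", "uid9", 5),
    ("pid9", "uid5", 5), ("pid6", "uid8", 4), ("pid1", "uid2", 4),
    ("pid3", "uid4", 4), ("pid5", "uid9", 2), ("pid9", "uid8", 4),
]

def build_rows(triples):
    """Store the sparse matrix A by row: {pid: {uid: rating}}."""
    rows = defaultdict(dict)
    for pid, uid, rating in triples:
        rows[pid][uid] = float(rating)
    return dict(rows)

def row_scaled(rows):
    """Form P: divide each row of A by its number of nonzero entries."""
    P = {}
    for pid, entries in rows.items():
        nnz = sum(1 for v in entries.values() if v != 0)
        P[pid] = {uid: v / nnz for uid, v in entries.items()}
    return P

def times_ones(rows):
    """Multiply by e, the vector of all ones, which is a plain row sum."""
    return {pid: sum(entries.values()) for pid, entries in rows.items()}

A = build_rows(ratings)
P = row_scaled(A)          # the linear operator hiding in AVG ... GROUP BY
avg = times_ones(P)        # avg(A) = P e
for pid in sorted(avg):
    print(pid, round(avg[pid], 3))
```

The scaling that builds P depends on which entries of A are present, which is why avg is not linear in A even though the final step is an ordinary matrix-vector product.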




Hadoop, MapReduce,
and Matrix Methods




MapReduce




The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.

Express algorithms in “data-local operations”.

Implement one type of communication: shuffle.

Shuffle moves all data with the same key to the same reducer.

Data scalable
[Figure: map tasks (M) feed a shuffle that delivers their output to reduce tasks (R)]

Fault-tolerance by design
   Input stored in triplicate
   Map output persisted to disk before shuffle
   Reduce input/output on disk

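
To make the map, shuffle, reduce sequence concrete, here is a small in-memory sketch in plain Python. It is not Hadoop: run_mapreduce is a hypothetical helper written for this illustration, and the toy job (counting words in three made-up documents) is only there to have something to run.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round in memory."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: all values with the same key end up together.
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        # Reduce: each key is processed independently of the others.
        output.extend(reducer(key, groups[key]))
    return output

def count_mapper(document):
    for word in document.split():
        yield (word.lower(), 1)

def count_reducer(word, counts):
    yield (word, sum(counts))

docs = ["matrix bigdata hadoop", "matrix hadoop", "bigdata hadoop hadoop"]
result = run_mapreduce(docs, count_mapper, count_reducer)
print(result)
```

Because the reducer only ever sees one key at a time, the reduce loop could be split across machines without changing the result, which is the "data-local" property the slide refers to.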
wordcount is a matrix computation too

    map(document) :
        for word in document
            emit (word, 1)

[Figure: documents D1 to D5 each emit pairs such as (matrix,1), (bigdata,1), (hadoop,1)]

    reduce(word, counts) :
        emit (word, sum(counts))

wordcount is a matrix computation too

\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]

The rows of A are labeled doc1, doc2, ..., docm.

\[
\text{word count} = \operatorname{colsum}(A) = A^T e
\]

e is the vector of all ones

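
A short sketch of word count as colsum(A) = A^T e, using plain Python lists as a dense matrix. The three documents and the helper transpose_times are made up for illustration; rows of A are documents and columns are distinct words.

```python
docs = ["matrix bigdata hadoop", "matrix hadoop", "bigdata hadoop hadoop"]

# One column of A per distinct word.
vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

# A[i][j] = number of times word j occurs in document i.
A = [[0] * len(vocab) for _ in docs]
for i, d in enumerate(docs):
    for w in d.split():
        A[i][col[w]] += 1

def transpose_times(A, x):
    """Compute A^T x for a dense list-of-lists matrix A."""
    y = [0] * len(A[0])
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            y[j] += a * x[i]
    return y

e = [1] * len(docs)      # vector of all ones, one entry per document
word_count = dict(zip(vocab, transpose_times(A, e)))
print(word_count)
```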
inverted index is a matrix computation too

\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]

The rows of A are labeled doc1, doc2, ..., docm.

inverted index is a matrix computation too

\[
A^T = \begin{bmatrix}
A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\
A_{1,2} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m,n-1} \\
A_{1,n} & \cdots & A_{m-1,n} & A_{m,n}
\end{bmatrix}
\]

The rows of A^T are labeled term1, term2, ..., termn.

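
As a sketch of the inverted index as a transpose, the snippet below stores a document-by-term matrix by row and swaps rows with columns. The document names, texts, and the transpose helper are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

docs = {
    "doc1": "matrix bigdata hadoop",
    "doc2": "matrix hadoop",
    "doc3": "bigdata hadoop hadoop",
}

# A stored by row: {doc: {term: count}}
A = {}
for doc, text in docs.items():
    row = defaultdict(int)
    for term in text.split():
        row[term] += 1
    A[doc] = dict(row)

def transpose(rows):
    """Swap rows and columns of a sparse matrix stored by row."""
    cols = defaultdict(dict)
    for i, entries in rows.items():
        for j, value in entries.items():
            cols[j][i] = value
    return dict(cols)

index = transpose(A)     # A^T stored by row: {term: {doc: count}}
print(sorted(index["hadoop"].items()))
```

Looking up a term in index returns the documents that contain it, which is what an inverted index provides.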
A recommender system with social info

product_ratings     friends_links

pid8 uid2 4         uid6 uid1
pid9 uid9 1         uid8 uid9
pid2 uid9 5         uid7 uid7
pid9 uid5 5         uid7 uid4
pid6 uid8 4         uid6 uid2
pid1 uid2 4         uid7 uid1
pid3 uid4 4         uid3 uid1
pid5 uid9 2         uid1 uid8
pid9 uid8 4         uid7 uid3
pid9 uid9 1         uid9 uid1

A recommender system with social info

[Figure: product_ratings drawn as a matrix with rows pid1, pid2, ...; friends_links drawn as a matrix with rows uid1, uid2, ...]

A recommender system with social info

product_ratings is the matrix R; friends_links is the matrix S

A recommender system with social info

Recommend each item based on the average rating of all trusted users

“X = S R^T” with something that is almost a matrix-matrix product

R has rows pid1, pid2, ...; S has rows uid1, uid2, ...

\[
X_{\mathrm{uid},\mathrm{pid}} =
\Big( \sum_{\mathrm{uid2}} S_{\mathrm{uid},\mathrm{uid2}} \, R_{\mathrm{uid2},\mathrm{pid}} \Big)
\cdot
\Big( \sum_{\mathrm{uid2}} \text{``}S_{\mathrm{uid},\mathrm{uid2}} \text{ and } R_{\mathrm{uid2},\mathrm{pid}} \neq 0\text{''} \Big)^{-1}
\]

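
A minimal sketch of the "almost a matrix-matrix product" above, in plain Python on hypothetical toy data (not the talk's data set). It assumes a link (uid, uid2) means that uid trusts uid2, that S is 0/1, and that each link and each rating appears once; the function name recommend is illustrative.

```python
from collections import defaultdict

# Hypothetical toy data.
# trust: (uid, uid2) means uid trusts uid2, i.e. S[uid][uid2] = 1.
trust = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]
# ratings: (pid, uid, rating), the same layout as product_ratings.
ratings = [("p1", "u2", 4), ("p1", "u3", 2), ("p2", "u3", 5)]

def recommend(trust, ratings):
    """Return X[uid][pid]: average rating of pid over users trusted by uid."""
    by_user = defaultdict(dict)          # {uid2: {pid: rating}}
    for pid, uid, rating in ratings:
        by_user[uid][pid] = float(rating)

    sums = defaultdict(lambda: defaultdict(float))   # numerator of X
    counts = defaultdict(lambda: defaultdict(int))   # number of nonzero terms
    for uid, uid2 in trust:
        for pid, rating in by_user.get(uid2, {}).items():
            sums[uid][pid] += rating
            counts[uid][pid] += 1

    return {uid: {pid: sums[uid][pid] / counts[uid][pid] for pid in sums[uid]}
            for uid in sums}

X = recommend(trust, ratings)
print(X)
```

The numerator is an ordinary sparse matrix product; only the per-entry count in the denominator makes the whole computation differ from S R^T.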
Tools I like


      hadoop streaming
        dumbo
        mrjob
        hadoopy
        C++




Tools I don’t use but other people seem to like …

      pig
      java
      hbase
      mahout
        Mahout is the closest thing to a library for matrix computations in Hadoop.
        If you like Java, you should probably start there.
      Eclipse
      Cassandra

      I’m a low-level guy

hadoop streaming

     the map function is a program
     (key,value) pairs are sent via stdin
     output (key,value) pairs go to stdout

     the reduce function is a program
     (key,value) pairs are sent via stdin
     keys are grouped
     output (key,value) pairs go to stdout

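
The description above can be sketched as a single Python script that acts as either the map program or the reduce program. This is an illustration only: it assumes tab-separated key and value on each line (the Hadoop streaming default), and the convention of passing "map" or "reduce" as the only command-line argument is invented for this example.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Map program: emit a tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reduce_lines(lines):
    """Reduce program: input arrives sorted, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(v) for _, v in group))

if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical convention: "map" or "reduce" as the only argument.
    step = {"map": map_lines, "reduce": reduce_lines}.get(sys.argv[1])
    if step is not None:
        for out in step(sys.stdin):
            print(out)
```

Assuming the script is saved under a name such as wc.py, piping a text file through `python wc.py map`, then `sort`, then `python wc.py reduce` imitates the shuffle locally before running anything on a cluster.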
mrjob from 

     a wrapper around hadoop streaming for
     map and reduce functions in python

     from mrjob.job import MRJob

     class MRWordFreqCount(MRJob):
         # emit (word, 1) for every word on the input line
         def mapper(self, _, line):
             for word in line.split():
                 yield (word.lower(), 1)

         # all counts for one word arrive together
         def reducer(self, word, counts):
             yield (word, sum(counts))

     if __name__ == '__main__':
         MRWordFreqCount.run()

How can Hadoop streaming possibly be fast?

500 GB matrix. Computing the R in a QR factorization. See my next talk!

 Synthetic data test 100,000,000-by-500 matrix (~500GB)
 Codes implemented in MapReduce streaming
 Matrix stored as TypedBytes lists of doubles
 Python frameworks use Numpy+Atlas
 Custom C++ TypedBytes reader/writer with Atlas
 New non-streaming Java implementation too

             Iter 1        Iter 1          Iter 2          Overall
             QR (secs.)    Total (secs.)   Total (secs.)   Total (secs.)
Dumbo        67725         960             217             1177
Hadoopy      70909         612             118             730
C++          15809         350             37              387
Java                       436             66              502

C++ in streaming beats a native Java implementation.
All timing results from the Hadoop job tracker
David Gleich (Sandia) · MapReduce 2011 · 16/22

mrjob could be faster if it used typedbytes for intermediate storage; see
https://github.com/Yelp/mrjob/pull/447

Example available from github.com/dgleich/mrtsqr for verification

Matrix-vector product    Ax = y

\[
y_i = \sum_k A_{ik} x_k
\]

Follow along!
matrix-hadoop/codes/smatvec.py

[Figure: the matrix A next to the vector x]

Where do matrix-vector
products arise?
Google’s PageRank 

Computing cosine-similarity between one
document and all other documents

Predictions from kernel methods

Computing averages (the example above)




Matrix-vector product    Ax = y

\[
y_i = \sum_k A_{ik} x_k
\]

Follow along!
matrix-hadoop/codes/smatvec.py

A is stored by row

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
    3 4 -1.45
    4 2 0.061

x is stored entry-wise

    $ head samples/vec_5.txt
    0 0.241
    1 -0.98
    2 0.237
    3 -0.32
    4 0.080

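
As a reference point before moving to MapReduce, here is a sketch that reads the two storage formats shown above and computes y = Ax directly in memory. The parser names are illustrative; the format is read as "row col val col val ..." for A and "index value" for x, which matches the sample lines.

```python
def parse_matrix_row(line):
    """Parse 'row col val col val ...' into (row, [(col, val), ...])."""
    vals = line.split()
    row = int(vals[0])
    entries = [(int(vals[i]), float(vals[i + 1]))
               for i in range(1, len(vals), 2)]
    return row, entries

def parse_vector_entry(line):
    """Parse 'index value' into (index, value)."""
    index, value = line.split()
    return int(index), float(value)

smat = """0 0 0.125 3 1.024 4 0.121
1 0 0.597
2 2 1.247
3 4 -1.45
4 2 0.061""".splitlines()

vec = """0 0.241
1 -0.98
2 0.237
3 -0.32
4 0.080""".splitlines()

x = dict(parse_vector_entry(line) for line in vec)
y = {}
for line in smat:
    i, entries = parse_matrix_row(line)
    y[i] = sum(a * x[k] for k, a in entries)   # y_i = sum_k A_ik x_k
for i in sorted(y):
    print(i, y[i])
```

This version needs all of x in memory for every row of A, which is exactly what stops working once x no longer fits on one machine.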
Matrix-vector product (in pictures)    Ax = y

\[
y_i = \sum_k A_{ik} x_k
\]

Input: A and x
Map 1: Align on columns
Reduce 1: Output A_ik x_k keyed on row i
Reduce 2: Output sum(A_ik x_k)

Matrix-vector product (in pictures)    Ax = y

\[
y_i = \sum_k A_{ik} x_k
\]

Input, Map 1: Align on columns

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],               # row
                   (float(vals[1]),))     # xi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (vals[i],           # column
                       (row,              # i,Aij
                        float(vals[i+1])))

Matrix-vector product (in pictures)    Ax = y

\[
y_i = \sum_k A_{ik} x_k
\]

Input, Map 1: Align on columns
Reduce 1: Output A_ik x_k keyed on row i

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], val[1]*vecval)

Note that you should use a secondary sort to avoid reading both in memory

Matrix-vector product (in pictures)    Ax = y

\[
y_i = \sum_k A_{ik} x_k
\]

Input, Map 1: Align on columns
Reduce 1: Output A_ik x_k keyed on row i
Reduce 2: Output sum(A_ik x_k)

    def sumred(self, key, vals):
        yield (key, sum(vals))

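
The three steps can be exercised end to end without Hadoop. The sketch below rewrites joinmap, joinred and sumred as plain functions (no self, no mrjob) and replaces the shuffle with a hypothetical in-memory helper; the input lines are the sample matrix and vector from the slides.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                      # vector entry, keyed on its index
        yield (vals[0], (float(vals[1]),))
    else:                                   # matrix row, each entry keyed on its column
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (vals[i], (row, float(vals[i + 1])))

def joinred(key, vals):
    vecval = 0.0
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for row, a in matvals:
        yield (row, a * vecval)             # A_ik x_k keyed on row i

def sumred(key, vals):
    yield (key, sum(vals))

lines = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
         "3 4 -1.45", "4 2 0.061",                                # rows of A
         "0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]   # entries of x

stage1 = shuffle(kv for line in lines for kv in joinmap(line))
products = [kv for k, vs in stage1.items() for kv in joinred(k, vs)]
stage2 = shuffle(products)
y = dict(kv for k, vs in stage2.items() for kv in sumred(k, vs))
print(sorted(y.items()))
```

Keys stay strings here because they come straight from line.split(), as in the slide code. Each reducer call only ever needs one entry of x and one column of A, never the whole vector.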
Matrix-matrix product    AB = C

\[
C_{ij} = \sum_k A_{ik} B_{kj}
\]

Follow along!
matrix-hadoop/codes/matmat.py

[Figure: the matrix A next to the matrix B]

Matrix-matrix product    AB = C

\[
C_{ij} = \sum_k A_{ik} B_{kj}
\]

Follow along!
matrix-hadoop/codes/matmat.py

A is stored by row

    $ head samples/smat_10_5_A.txt
    0 0 0.599 4 -1.53
    1
    2 2 0.260
    3
    4 0 0.267 1 0.839

B is stored by row

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247

Matrix-matrix product (in pictures)    AB = C

\[
C_{ij} = \sum_k A_{ik} B_{kj}
\]

Map 1: Align on columns
Reduce 1: Output A_ik B_kj keyed on (i,j)
Reduce 2: Output sum(A_ik B_kj)

Matrix-matrix product (in code)    AB = C

\[
C_{ij} = \sum_k A_{ik} B_{kj}
\]

Map 1: Align on columns

    def joinmap(self, key, line):
        mtype = self.parsemat()
        vals = line.split()
        row = vals[0]
        rowvals = [(vals[i], float(vals[i+1]))
                   for i in range(1, len(vals), 2)]
        if mtype == 1:
            # matrix A, output by col
            for val in rowvals:
                yield (val[0], (row, val[1]))
        else:
            # matrix B, output the whole row
            yield (row, (rowvals,))

Matrix-matrix product (in pictures)

$AB = C, \quad C_{ij} = \sum_k A_{ik} B_{kj}$

[Figure: A and B aligned on the shared index k]

    def joinred(self, key, vals):
      # load the data into memory
      brow = []
      acol = []
      for val in vals:
        if len(val) == 1:
          # a 1-tuple holds a row of B
          brow.extend(val[0])
        else:
          # a 2-tuple holds one entry of A
          acol.append(val)

      for (bcol, bval) in brow:
        for (arow, aval) in acol:
          yield ((arow, bcol), aval*bval)

Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)




Matrix-matrix product (in pictures)

$AB = C, \quad C_{ij} = \sum_k A_{ik} B_{kj}$

[Figure: A and B aligned on the shared index k, producing C]

    def sumred(self, key, vals):
      yield (key, sum(vals))

Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: Output sum($A_{ik} B_{kj}$)
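
The map and two reduce steps from these slides can be tried without a cluster. What follows is a minimal sketch, not the tutorial's job runner: `run_stage` is a hypothetical in-memory stand-in for Hadoop's shuffle, and the record layout (`"A"`/`"B"` tag, row id, list of `(column, value)` pairs) is an assumption made for illustration.

```python
from collections import defaultdict

def run_stage(records, mapper, reducer):
    """Hypothetical stand-in for one MapReduce stage: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out

def joinmap(key, value):
    # value is (matrix_name, row_id, [(col, val), ...]); an assumed layout
    name, row, rowvals = value
    if name == "A":
        # entries of A are keyed by their column index k
        for col, val in rowvals:
            yield col, (row, val)
    else:
        # rows of B are keyed by their row index k, wrapped in a 1-tuple
        yield row, (rowvals,)

def joinred(key, vals):
    brow, acol = [], []
    for val in vals:
        if len(val) == 1:
            brow.extend(val[0])
        else:
            acol.append(val)
    for bcol, bval in brow:
        for arow, aval in acol:
            yield (arow, bcol), aval * bval

def identity_map(key, value):
    yield key, value

def sumred(key, vals):
    yield key, sum(vals)

def matmul(A, B):
    """A and B are dicts mapping row_id -> list of (col_id, value)."""
    records = [(None, ("A", i, row)) for i, row in A.items()]
    records += [(None, ("B", k, row)) for k, row in B.items()]
    partial = run_stage(records, joinmap, joinred)          # Map 1 + Reduce 1
    return dict(run_stage(partial, identity_map, sumred))   # Reduce 2

A = {0: [(0, 1.0), (1, 2.0)], 1: [(0, 3.0), (1, 4.0)]}
B = {0: [(0, 5.0), (1, 6.0)], 1: [(0, 7.0), (1, 8.0)]}
C = matmul(A, B)
print(C)
```

The first stage emits one partial product per (i, k, j) triple, so the intermediate data can be much larger than either input; the second stage only adds them up.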




Our social recommender
Follow along! matrix-hadoop/recsys/recsys.py

[Figure: the matrices S and R^T]

R is stored entry-wise
Columns: Object ID, User ID, Rating

    $ gunzip -c data/rating.txt.gz
    139431556    591156        5
    139431556    1312460676    5
    139431556    204358        4
    139431556    368725        5

S is stored entry-wise
Columns: My ID, Other ID, Trust

    $ gunzip -c data/rating.txt.gz
    3287060356    232085    -1
    3288305540    709420    1
    3290337156    204418    -1
    3294138244    269243    -1




Social recommender (in code)

Conceptually, the first step is the same as the matrix-matrix product.

We reorganize the data by user-id to be able to map the trust relationships.

    def joinmap(self, key, line):
      parts = line.split('\t')
      if len(parts) == 8: # ratings
        objid = parts[0].strip()
        uid = parts[1].strip()
        rat = int(parts[2])
        yield (uid, (objid, rat))
      elif len(parts) == 4: # trust
        myid = parts[0].strip()
        otherid = parts[1].strip()
        value = int(parts[2])
        if value > 0:
          # keep only positive trust, keyed by the trusted user
          yield (otherid, (myid,))

Map 1: Align on columns
                              




Matrix-matrix product (in pictures)

Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.

    def joinred(self, key, vals):
      tusers = [] # uids that trust key
      ratobjs = [] # objs rated by uid=key
      for val in vals:
        if len(val) == 1:
          tusers.append(val[0])
        else:
          ratobjs.append(val)

      # emit only after all values for this key are loaded
      for (objid, rat) in ratobjs:
        for uid in tusers:
          yield ((uid, objid), rat)

Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)




Matrix-matrix product (in pictures)

$AB = C, \quad C_{ij} = \sum_k A_{ik} B_{kj}$

[Figure: A and B aligned on the shared index k, producing C]

    def avgred(self, key, vals):
      s = 0.
      n = 0
      for val in vals:
        s += val
        n += 1
      # the smoothed average of ratings
      yield key, (s+self.options.avg)/float(n+1)

Map 1: Align on columns
Reduce 1: Output $A_{ik} B_{kj}$ keyed on (i,j)
Reduce 2: Output sum($A_{ik} B_{kj}$)
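
To see what the recommender's three steps compute, here is a minimal single-machine sketch. It assumes small in-memory lists in place of the gzipped files, invented user and object names, and a `prior_avg` parameter standing in for `self.options.avg` (its default of 3.0 is an assumption, not a value from the tutorial).

```python
from collections import defaultdict

def smoothed_trust_ratings(ratings, trust, prior_avg=3.0):
    """ratings: list of (objid, uid, rating); trust: list of (myid, otherid, value).
    prior_avg plays the role of self.options.avg (assumed default)."""
    # Map 1 + shuffle: group both inputs by the user who did the rating
    rated_by = defaultdict(list)    # uid -> [(objid, rating)]
    trusted_by = defaultdict(list)  # uid -> [users who trust uid]
    for objid, uid, rat in ratings:
        rated_by[uid].append((objid, rat))
    for myid, otherid, value in trust:
        if value > 0:
            trusted_by[otherid].append(myid)

    # Reduce 1: send each rating back to every user who trusts the rater
    received = defaultdict(list)    # (uid, objid) -> [ratings]
    for rater, objs in rated_by.items():
        for objid, rat in objs:
            for uid in trusted_by[rater]:
                received[(uid, objid)].append(rat)

    # Reduce 2: smoothed average, (sum + prior) / (count + 1)
    return {key: (sum(rs) + prior_avg) / (len(rs) + 1)
            for key, rs in received.items()}

# Hypothetical data: alice trusts bob and carol; dave distrusts carol.
ratings = [("o1", "bob", 5), ("o1", "carol", 4), ("o2", "carol", 1)]
trust = [("alice", "bob", 1), ("alice", "carol", 1), ("dave", "carol", -1)]
scores = smoothed_trust_ratings(ratings, trust)
print(scores)
```

Adding the prior to the sum and one to the count pulls scores built from only one or two trusted ratings toward the prior value.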




Better ways to store matrices in Hadoop

No need for “integer” keys that fall between 1 and n!

Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.

[Figure: block partitions of A and B]




Tall-and-skinny matrices are common in BigData

A : m x n, m ≫ n

Key is an arbitrary row-id
Value is the 1 x n array for a row

Each submatrix Ai is the input to a map task.

[Figure: A split by rows into blocks A1, A2, A3, A4]
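
As an illustration of this storage layout, the sketch below treats each block of rows as the input to one simulated map task. It assumes the quantity wanted is the small n x n matrix A^T A, chosen here only because it equals the sum of the per-block products Ai^T Ai; the helper names and the block size are hypothetical.

```python
import uuid

def gram_of_block(rows, n):
    """Return Ai^T Ai (n x n, as nested lists) for one block of rows."""
    G = [[0.0] * n for _ in range(n)]
    for row in rows:
        for p in range(n):
            for q in range(n):
                G[p][q] += row[p] * row[q]
    return G

def add(G, H):
    """Entry-wise sum of two n x n nested lists."""
    return [[g + h for g, h in zip(grow, hrow)] for grow, hrow in zip(G, H)]

n = 2
# key = arbitrary row id, value = the 1 x n row
records = [(uuid.uuid4().hex, [float(i), 1.0]) for i in range(8)]

# split the records into 4 blocks A1..A4, one per simulated map task
blocks = [records[i:i + 2] for i in range(0, len(records), 2)]
partials = [gram_of_block([row for _, row in block], n) for block in blocks]

# a single reducer adds the small n x n partial results
G = partials[0]
for H in partials[1:]:
    G = add(G, H)
print(G)
```

The row ids never enter the computation, which is why arbitrary keys are enough for this layout.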




Double-precision floating point
was designed for the era
where “big” was 1000-10000




Error analysis of summation

s = 0; for i=1 to n: s = s + x[i]

$\mathrm{fl}(x + y) = (x + y)(1 + \delta)$

$\left| \mathrm{fl}\left(\sum_i x_i\right) - \sum_i x_i \right| \le n \mu \sum_i |x_i|, \qquad \mu \approx 10^{-16}$

A simple summation formula has 
error that is not always small if n is a billion
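
The bound can be checked numerically on a small case. This is a sketch under stated assumptions: n is one million rather than a billion so it runs quickly, `math.fsum` serves as the "exact" reference, and µ is taken as half of `sys.float_info.epsilon`. The loop is written out by hand because the built-in `sum` in recent Python versions already compensates for rounding on floats.

```python
import math
import sys

def naive_sum(xs):
    """Left-to-right summation, exactly as in the pseudocode on the slide."""
    s = 0.0
    for x in xs:
        s = s + x
    return s

n = 10**6
xs = [0.1] * n
mu = sys.float_info.epsilon / 2      # unit roundoff, about 1.1e-16
reference = math.fsum(xs)            # correctly rounded sum, used as "exact"
error = abs(naive_sum(xs) - reference)
bound = n * mu * math.fsum(abs(x) for x in xs)
print(error, bound)
```

The bound grows in proportion to n when the sum of |x_i| is held fixed, so a thousandfold increase in the number of terms costs about three more digits in the worst case.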




If your application matters
then watch out for this issue.

Use quad-precision arithmetic
or compensated summation
instead.




Compensated Summation
“Kahan summation algorithm” on Wikipedia

    s = 0.; c = 0.;
    for i=1 to n:
        y = x[i] - c
        t = s + y
        c = (t - s) - y
        s = t

Mathematically, c is always zero.
On a computer, c can be non-zero.
The parentheses matter!

$\left| \mathrm{fl}(\mathrm{csum}(x)) - \sum_i x_i \right| \le (\mu + n\mu^2) \sum_i |x_i|, \qquad \mu \approx 10^{-16}$
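
The pseudocode translates directly into Python. The following is a minimal sketch that compares it with a plain loop on one million copies of 0.1, using `math.fsum` as the reference; the test data and the helper name `naive_sum` are choices made for this illustration.

```python
import math

def naive_sum(xs):
    """Plain left-to-right summation, for comparison."""
    s = 0.0
    for x in xs:
        s = s + x
    return s

def kahan_sum(xs):
    """Compensated (Kahan) summation."""
    s = 0.0
    c = 0.0  # running estimate of the low-order bits lost so far
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y  # the parentheses matter
        s = t
    return s

xs = [0.1] * 10**6
reference = math.fsum(xs)
naive_error = abs(naive_sum(xs) - reference)
kahan_error = abs(kahan_sum(xs) - reference)
print(naive_error, kahan_error)
```

In a reducer such as `sumred`, the same loop could replace `sum(vals)` at the cost of a few extra floating-point operations per value.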




Collaborators, Friends, and
People who have taught me

MRTSQR
Paul Constantine (Stanford)
Austin Benson (Stanford)
James Demmel (Berkeley)

Sandia MapReduce
Todd Plantenga
Tammy Kolda
Justin Basilico (now Netflix)

Simform
Jeremy Templeton (Sandia)
Joe Ruthruff (Sandia)
Yangyang Hou (Purdue)
Joe Nichols (Stanford)

Others
Margot Gerritsen (Stanford)

Grants
Sandia CSAR




Questions?




Image from rockysprings, deviantart, CC share-alike

Powerful Google developer tools for immediate impact! (2023-24 C)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Matrix methods for Hadoop

  • 1. Matrix Methods with Hadoop. Slides: bit.ly/10SIe1A. Code: github.com/dgleich/matrix-hadoop-tutorial. David F. Gleich, Assistant Professor, Computer Science, Purdue University. David Gleich · Purdue · bit.ly/10SIe1A
  • 3. A bit of philosophy … (Image from rockysprings, deviantart, CC share-alike)
  • 5. Matrix computations
      $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
      Operations: $Ax$. Linear systems: $Ax = b$. Least squares: $\min \|Ax - b\|$. Eigenvalues: $Ax = \lambda x$.
  • 6. Outcomes: (1) Recognize relationships between matrix methods and things you’ve already been doing. Example: SQL queries as matrix computations. (2) Understand how to use Hadoop to compute these matrix methods at scale for BigData. Example: recommenders with social network info. (3) Understand some of the issues that could arise.
  • 7. Ideal outcomes: how to use techniques from matrix computations in order to solve your problems quickly! 1986
  • 8. Taking the red pill … (Image from rockysprings, deviantart, CC share-alike)
  • 9. Matrix computations: Physics, Databases, Statistics, Machine learning, Engineering, Information retrieval, Graphics, Computer vision, Bioinformatics, Social networks
  • 10. matrix computations ≠ linear algebra
  • 11. A SQL statement as a matrix computation. How do I find the average rating for each product? http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
  • 12. A SQL statement as a matrix computation. How do I find the average rating for each product? http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
      SELECT
        p.product_id,
        p.name,
        AVG(pr.rating) AS rating_average
      FROM products p
      INNER JOIN product_ratings pr
        ON pr.product_id = p.product_id
      GROUP BY p.product_id
      ORDER BY rating_average DESC
  • 13. This SQL statement is a matrix computation! (Image from rockysprings, deviantart, CC share-alike)
  • 14. SELECT ... AVG(pr.rating) ... GROUP BY p.product_id. The product_ratings table is a matrix, with one row per product (pid1 through pid9):
      product_ratings
      pid8  uid2  4
      pid9  uid9  1
      pid2  uid9  5
      pid9  uid5  5
      pid6  uid8  4
      pid1  uid2  4
      pid3  uid4  4
      pid5  uid9  2
      pid9  uid8  4
      pid9  uid9  1
  • 15. But it’s a weird matrix. The same product_ratings table is a matrix with rows pid1 through pid9, but with missing entries!
  • 16. But it’s a weird matrix. SELECT AVG(r) ... GROUP BY pid takes the product_ratings matrix (rows pid1 through pid9) to the vector of average ratings, one entry per product. Matrix → Vector.
  • 17. But it’s a weird matrix and not a linear operator. product_ratings is a matrix!
      $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}, \qquad \operatorname{avg}(A) = \begin{bmatrix} \sum_j A_{1,j} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} \\ \sum_j A_{2,j} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} \\ \vdots \\ \sum_j A_{m,j} \,/\, \sum_j \text{“}A_{m,j} \neq 0\text{”} \end{bmatrix}$$
  • 18. matrix computations ≠ linear algebra
  • 19. … but there is a linear operator hiding …
      $$P = \begin{bmatrix} A_{1,1} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} & A_{1,2} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} & \cdots \\ A_{2,1} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} & A_{2,2} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}, \qquad \operatorname{avg}(A) = Pe$$
      where $e$ is the vector of all ones.
  • 20. Hadoop, MapReduce, and Matrix Methods
  • 21. MapReduce
  • 22. The MapReduce Framework. Originated at Google for indexing web pages and computing PageRank. Express algorithms in “data-local operations”. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer. Data scalable. Fault-tolerance by design: input stored in triplicate, map output persisted to disk before shuffle, reduce input/output on disk. (Diagram: map tasks M feed a shuffle, which feeds reduce tasks R.)
  • 23. wordcount is a matrix computation too
      map(document):
        for word in document:
          emit (word, 1)
      reduce(word, counts):
        emit (word, sum(counts))
      (Diagram: documents D1 through D5 emit (matrix,1), (bigdata,1), and (hadoop,1) pairs.)
  • 24. wordcount is a matrix computation too. With one row per document (doc_1, doc_2, …, doc_m):
      $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
      word count = colsum(A) = $A^T e$, where $e$ is the vector of all ones.
  • 25. inverted index is a matrix computation too. Start from the same matrix $A$, with one row per document (doc_1, doc_2, …, doc_m).
  • 26. inverted index is a matrix computation too. With one row per term (term_1, term_2, …, term_n):
      $$A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & A_{m-1,n} & A_{m,n} \end{bmatrix}$$
  • 27. A recommender system with social info
      product_ratings      friends_links
      pid8  uid2  4        uid6  uid1
      pid9  uid9  1        uid8  uid9
      pid2  uid9  5        uid7  uid7
      pid9  uid5  5        uid7  uid4
      pid6  uid8  4        uid6  uid2
      pid1  uid2  4        uid7  uid1
      pid3  uid4  4        uid3  uid1
      pid5  uid9  2        uid1  uid8
      pid9  uid8  4        uid7  uid3
      pid9  uid9  1        uid9  uid1
  • 28. A recommender system with social info. Each of the two tables is a matrix: product_ratings has rows pid1, pid2, … and friends_links has rows uid1, uid2, …, each of the form $\begin{bmatrix} A_{1,1} & A_{2,1} & \cdots \\ A_{1,2} & A_{2,2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$.
  • 29. A recommender system with social info. Call the product_ratings matrix $R$ and the friends_links matrix $S$.
  • 30. A recommender system with social info. Recommend each item based on the average rating of all trusted users, with something that is almost a matrix-matrix product: “$X = S R^T$”. $R$ has rows pid1, pid2, … and $S$ has rows uid1, uid2, …
      $$X_{\text{uid},\text{pid}} = \Bigl( \sum_{\text{uid2}} S_{\text{uid},\text{uid2}} \, R_{\text{uid2},\text{pid}} \Bigr) \cdot \Bigl( \sum_{\text{uid2}} \text{“}S_{\text{uid},\text{uid2}} \text{ and } R_{\text{uid2},\text{pid}} \neq 0\text{”} \Bigr)^{-1}$$
  • 31. Tools I like: hadoop streaming, dumbo, mrjob, hadoopy, C++
  • 32. Tools I don’t use but other people seem to like …: pig, java, hbase, mahout, Eclipse, Cassandra. Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.
  • 33. hadoop streaming. The map function is a program: (key,value) pairs are sent via stdin, and output (key,value) pairs go to stdout. The reduce function is a program: (key,value) pairs are sent via stdin, keys are grouped, and output (key,value) pairs go to stdout.
  • 34. mrjob (from Yelp): a wrapper around hadoop streaming for map and reduce functions in Python
      from mrjob.job import MRJob

      class MRWordFreqCount(MRJob):
          def mapper(self, _, line):
              # emit (word, 1) for every word on the input line
              for word in line.split():
                  yield (word.lower(), 1)

          def reducer(self, word, counts):
              # all counts for the same word arrive together
              yield (word, sum(counts))

      if __name__ == '__main__':
          MRWordFreqCount.run()
  • 35. How can Hadoop streaming possibly be fast? Synthetic data test: 100,000,000-by-500 matrix (~500GB). Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use Numpy+Atlas. Custom C++ TypedBytes reader/writer with Atlas. New non-streaming Java implementation too. 500 GB matrix. Computing the R in a QR factorization. See my next talk!
      Framework   Iter 1 QR (secs.)   Iter 1 Total (secs.)   Iter 2 Total (secs.)   Overall Total (secs.)
      Dumbo       67725               960                    217                    1177
      Hadoopy     70909               612                    118                    730
      C++         15809               350                    37                     387
      Java                            436                    66                     502
      C++ in streaming beats a native Java implementation. All timing results are from the Hadoop job tracker. mrjob could be faster if it used typedbytes for intermediate storage; see https://github.com/Yelp/mrjob/pull/447. Example available from github.com/dgleich/mrtsqr for verification.
  • 36. Matrix-vector product. $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Follow along! matrix-hadoop/codes/smatvec.py
  • 37. Where do matrix-vector products arise? Google’s PageRank. Computing cosine-similarity between one document and all other documents. Predictions from kernel methods. Computing averages (the example above).
  • 38. Matrix-vector product. $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Follow along! matrix-hadoop/codes/smatvec.py
      A is stored by row:
      $ head samples/smat_5_5.txt
      0 0 0.125 3 1.024 4 0.121
      1 0 0.597
      2 2 1.247
      3 4 -1.45
      4 2 0.061
      x is stored entry-wise:
      $ head samples/vec_5.txt
      0 0.241
      1 -0.98
      2 0.237
      3 -0.32
      4 0.080
  • 39. Matrix-vector product (in pictures). $Ax = y$, $y_i = \sum_k A_{ik} x_k$. Input (A and x) → Map 1: align on columns → Reduce 1: output $A_{ik} x_k$ keyed on row i → Reduce 2: output sum($A_{ik} x_k$), which is y.
  • 40. Matrix-vector product (in pictures). Input → Map 1: align on columns.
      def joinmap(self, key, line):
          vals = line.split()
          if len(vals) == 2:
              # the vector
              yield (vals[0],              # row
                     (float(vals[1]),))    # xi
          else:
              # the matrix
              row = vals[0]
              for i in range(1, len(vals), 2):
                  yield (vals[i],               # column
                         (row,                  # i, Aij
                          float(vals[i+1])))
  • 41. Matrix-vector product (in pictures). Input → Map 1: align on columns → Reduce 1: output $A_{ik} x_k$ keyed on row i. Note that you should use a secondary sort to avoid reading both in memory.
      def joinred(self, key, vals):
          vecval = 0.
          matvals = []
          for val in vals:
              if len(val) == 1:
                  vecval += val[0]
              else:
                  matvals.append(val)
          for val in matvals:
              yield (val[0], val[1]*vecval)
  • 42. Matrix-vector product (in pictures). Input → Map 1: align on columns → Reduce 1: output $A_{ik} x_k$ keyed on row i → Reduce 2: output sum($A_{ik} x_k$).
      def sumred(self, key, vals):
          yield (key, sum(vals))
  • 43. Matrix-matrix product. $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Follow along! matrix-hadoop/codes/matmat.py
  • 44. Matrix-matrix product. $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Follow along! matrix-hadoop/codes/matmat.py
      A is stored by row:
      $ head samples/smat_10_5_A.txt
      0 0 0.599 4 -1.53
      1
      2 3
      4 2 0.260
      0 0.267 1 0.839
      B is stored by row:
      $ head samples/smat_5_5.txt
      0 0 0.125 3 1.024 4 0.121
      1 0 0.597
      2 2 1.247
  • 45. Matrix-matrix product (in pictures). $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns → Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j) → Reduce 2: output sum($A_{ik} B_{kj}$), which is C.
  • 46. Matrix-matrix product (in code). $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns.
      def joinmap(self, key, line):
          mtype = self.parsemat()
          vals = line.split()
          row = vals[0]
          rowvals = [(vals[i], float(vals[i+1]))
                     for i in range(1, len(vals), 2)]
          if mtype == 1:
              # matrix A, output by col
              for val in rowvals:
                  yield (val[0], (row, val[1]))
          else:
              # matrix B, output the whole row keyed on its row index
              yield (row, (rowvals,))
  • 47. Matrix-matrix product (in pictures). Map 1: align on columns → Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j).
      def joinred(self, key, vals):
          # load the data into memory
          brow = []
          acol = []
          for val in vals:
              if len(val) == 1:
                  brow.extend(val[0])
              else:
                  acol.append(val)

          for (bcol, bval) in brow:
              for (arow, aval) in acol:
                  yield ((arow, bcol), aval*bval)
  • 48. Matrix-matrix product (in pictures). Map 1: align on columns → Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j) → Reduce 2: output sum($A_{ik} B_{kj}$).
      def sumred(self, key, vals):
          yield (key, sum(vals))
  • 49. Our social recommender: $S R^T$. Follow along! matrix-hadoop/recsys/recsys.py
      R is stored entry-wise (Object ID, User ID, Rating):
      $ gunzip -c data/rating.txt.gz
      139431556  591156      5
      139431556  1312460676  5
      139431556  204358      4
      368725     139431556   5
      S is stored entry-wise (My ID, Other ID, Trust):
      $ gunzip -c data/rating.txt.gz
      3287060356  232085  -1
      3288305540  709420   1
      3290337156  204418  -1
      3294138244  269243  -1
  • 50. Social recommender (in code). Conceptually, the first step is the same as the matrix-matrix product. We reorganize the data by user-id to be able to map the trust relationships. Map 1: align on columns.
      def joinmap(self, key, line):
          parts = line.split('\t')
          if len(parts) == 8:  # ratings
              objid = parts[0].strip()
              uid = parts[1].strip()
              rat = int(parts[2])
              yield (uid, (objid, rat))
          elif len(parts) == 4:  # trust
              myid = parts[0].strip()
              otherid = parts[1].strip()
              value = int(parts[2])
              if value > 0:
                  yield (otherid, (myid,))
  • 51. Matrix-matrix product (in pictures). Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source. Map 1: align on columns → Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j).
      def joinred(self, key, vals):
          tusers = []   # uids that trust key
          ratobjs = []  # objs rated by uid=key
          for val in vals:
              if len(val) == 1:
                  tusers.append(val[0])
              else:
                  ratobjs.append(val)

          for (objid, rat) in ratobjs:
              for uid in tusers:
                  yield ((uid, objid), rat)
  • 52. Matrix-matrix product (in pictures). $AB = C$, $C_{ij} = \sum_k A_{ik} B_{kj}$. Map 1: align on columns → Reduce 1: output $A_{ik} B_{kj}$ keyed on (i,j) → Reduce 2: output sum($A_{ik} B_{kj}$).
      def avgred(self, key, vals):
          s = 0.
          n = 0
          for val in vals:
              s += val
              n += 1
          # the smoothed average of ratings
          yield key, (s + self.options.avg)/float(n+1)
  • 53. Better ways to store matrices in Hadoop. Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce. No need for “integer” keys that fall between 1 and n!
  • 54. Tall-and-skinny matrices are common in BigData. $A$ is $m \times n$ with $m \gg n$. The key is an arbitrary row-id. The value is the $1 \times n$ array for a row. Each submatrix $A_i$ (of $A_1$, $A_2$, $A_3$, $A_4$) is the input to a map task.
  • 55. Double-precision floating point was designed for the era where “big” was 1000-10000
  • 56. Error analysis of summation
      s = 0; for i=1 to n: s = s + x[i]
      $$\operatorname{fl}(x + y) = (x + y)(1 + \delta), \qquad \Bigl| \operatorname{fl}\Bigl(\sum_i x_i\Bigr) - \sum_i x_i \Bigr| \le n\mu \sum_i |x_i|, \qquad \mu \approx 10^{-16}$$
      A simple summation formula has error that is not always small if n is a billion.
  • 57. If your application matters then watch out for this issue. Use quad-precision arithmetic or compensated summation instead.
  • 58. Compensated Summation (“Kahan summation algorithm” on Wikipedia)
      s = 0.; c = 0.;
      for i=1 to n:
        y = x[i] - c
        t = s + y
        c = (t - s) - y
        s = t
      Mathematically, c is always zero. On a computer, c can be non-zero. The parentheses matter!
      $$\Bigl| \operatorname{fl}(\operatorname{csum}(x)) - \sum_i x_i \Bigr| \le (\mu + n\mu^2) \sum_i |x_i|, \qquad \mu \approx 10^{-16}$$
  • 59. Collaborators, Friends, and People who have taught me. MRTSQR: Paul Constantine (Stanford), Austin Benson (Stanford), James Demmel (Berkeley). Sandia MapReduce: Todd Plantenga, Tammy Kolda, Justin Basilico (now Netflix). Simform: Jeremy Templeton (Sandia), Joe Ruthruff (Sandia). Others: Margot Gerritsen (Stanford), Yangyang Hou (Purdue), Joe Nichols (Stanford). Grants: Sandia CSAR.
  • 60. Questions? (Image from rockysprings, deviantart, CC share-alike)
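
Slides 14 to 19 argue that the per-product average is a nonlinear function of the ratings matrix, yet equals Pe for a row-normalized matrix P. The following is a minimal sketch of that identity in plain Python, assuming a dict-of-dicts sparse layout; the ratings data and the helper names (`avg_nonlinear`, `row_normalize`, `times_ones`) are made up for illustration and are not from the tutorial code.

```python
# Sparse ratings matrix A: rows are products, columns are users.
# Missing entries are simply absent; they are not zeros to be averaged.
# This data is made up for illustration.
ratings = {
    "pid1": {"uid2": 4.0},
    "pid2": {"uid9": 5.0},
    "pid9": {"uid5": 5.0, "uid8": 4.0, "uid9": 1.0},
}

def avg_nonlinear(A):
    """avg(A)_i = (sum of stored entries in row i) / (number of stored entries)."""
    return {i: sum(row.values()) / len(row) for i, row in A.items() if row}

def row_normalize(A):
    """Build P by dividing each stored entry by the entry count of its row."""
    return {i: {j: v / len(row) for j, v in row.items()}
            for i, row in A.items() if row}

def times_ones(P):
    """Compute P e, where e is the vector of all ones (a row sum)."""
    return {i: sum(row.values()) for i, row in P.items()}

direct = avg_nonlinear(ratings)
via_P = times_ones(row_normalize(ratings))
```

Both dictionaries hold the same averages up to floating-point rounding; the second route only uses a matrix-vector product once P has been formed.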
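
The matrix-vector pipeline on slides 39 to 42 (map, reduce, reduce) can be tried without a cluster. This sketch reuses the three functions from the slides as plain functions and replaces Hadoop's shuffle with an in-memory grouping step; the `shuffle` and `run_matvec` helpers are hypothetical stand-ins written for this example, and the input is the sample data shown on slide 38.

```python
from collections import defaultdict

MATRIX_LINES = [
    "0 0 0.125 3 1.024 4 0.121",
    "1 0 0.597",
    "2 2 1.247",
    "3 4 -1.45",
    "4 2 0.061",
]
VECTOR_LINES = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]

def joinmap(line):
    """Map 1: key every matrix entry and vector entry on its column index."""
    vals = line.split()
    if len(vals) == 2:                       # a vector entry: index, value
        yield vals[0], (float(vals[1]),)
    else:                                    # a matrix row: row, (col, value)...
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield vals[i], (row, float(vals[i + 1]))

def joinred(key, vals):
    """Reduce 1: multiply each A_ik in column k by x_k, keyed on row i."""
    vecval = 0.0
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for row, aik in matvals:
        yield row, aik * vecval

def sumred(key, vals):
    """Reduce 2: sum the partial products for one row."""
    yield key, sum(vals)

def shuffle(pairs):
    """Group values by key, standing in for Hadoop's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_matvec(matrix_lines, vector_lines):
    mapped = [kv for line in matrix_lines + vector_lines for kv in joinmap(line)]
    partial = [kv for k, vs in shuffle(mapped).items() for kv in joinred(k, vs)]
    return dict(kv for k, vs in shuffle(partial).items() for kv in sumred(k, vs))

y = run_matvec(MATRIX_LINES, VECTOR_LINES)
```

Here `y` maps each row index (as a string) to its entry of Ax. The in-memory `joinred` holds a whole column at once, which is the cost the secondary-sort note on slide 41 is meant to avoid.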
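
The summation pseudocode on slides 56 and 58 translates directly to Python. The sketch below compares the simple loop against compensated summation on a made-up input (one million copies of 0.1), using the standard library's `math.fsum` as the reference value; the function names and the test input are choices made for this example, not part of the slides.

```python
import math

def naive_sum(xs):
    """The simple loop from slide 56: s = s + x[i]."""
    s = 0.0
    for x in xs:
        s = s + x
    return s

def compensated_sum(xs):
    """Compensated (Kahan) summation from slide 58."""
    s = 0.0
    c = 0.0                  # running estimate of the rounding error
    for x in xs:
        y = x - c            # apply the correction from the last step
        t = s + y
        c = (t - s) - y      # the parentheses matter
        s = t
    return s

xs = [0.1] * 10**6           # illustrative input, not from the slides
reference = math.fsum(xs)    # correctly rounded sum of the stored doubles
naive_error = abs(naive_sum(xs) - reference)
kahan_error = abs(compensated_sum(xs) - reference)
```

On this input the compensated error is far smaller than the naive one, in line with the nμ versus μ + nμ² bounds on the slides.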