MapReduce: distributed computing on large commodity clusters


Given at the University College London MediaFeatures group http://mediafutures.cs.ucl.ac.uk/reading_group/

Published in: Technology

Transcript of "MapReduce: distributed computing on large commodity clusters"

1. MapReduce
   Distributed computing on large commodity clusters
   Dr. Spiros Denaxas, Epidemiology & Public Health, UCL, 18 Feb 2010
2. Hello
    Introduction
    Who am I
    Structure of presentation:
      Distributed computing
      MapReduce examples
      Amazon Web Services
      Live demo
3. Data and some more data
    Google processes > 20PB daily
    Facebook processes 15TB daily, more than 4TB of new data per day
    Archive.org has 2PB of content
    Baidu: 3TB/week
    The CERN LHC will generate 20GB/sec
4. Data driven applications
    Fraud detection
    Web indexing
    Risk management
    Service personalization
    Spam detection
    Document clustering
5. Fruits of some sort
    Consider a very simple example
    fruit_diary.txt: a text file with fruit names
    “cat fruit_diary.txt | sort | uniq -c”
      cat prints all lines from your eating diary
      sort orders all lines in memory
      uniq -c counts the unique occurrences
    What if the fruit diary was 500GB? What if it was 500TB? 500PB?
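The shell pipeline above can be sketched in Python; the diary contents here are made up for illustration:

```python
from collections import Counter

def count_fruits(lines):
    """Equivalent of `sort | uniq -c`: count how often each fruit appears."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted(counts.items())  # alphabetical, like the sorted pipeline output

# A tiny made-up fruit diary; the real file could be 500GB and would no
# longer fit in one machine's memory, which is the point of the slide.
diary = ["apple", "banana", "apple", "coconut", "banana", "apple"]
print(count_fruits(diary))  # → [('apple', 3), ('banana', 2), ('coconut', 1)]
```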
6. Big Data problem
    1. Iterate over a large number of records
    2. Extract something of interest
    3. Shuffle and sort the intermediate results
    4. Aggregate the intermediate results
    5. Generate final output
7. Big Data problem
    The majority of “big data” players rely heavily on data analysis, be it commercial or scientific.
    Ad hoc investigation, trends, patterns, reporting.
    Timeliness matters: questions must be answered in hours, not weeks.
8. Scalability of single entities
    The single disk/memory model does not scale on a single processing entity.
    Now what? Let's add many disk/memory/processing entities!
    Parallel vs. distributed computing:
      Parallel: multiple CPUs in a single computer
      Distributed: multiple CPUs across multiple computers over the network
9. What is wrong with this?
    Worker 1:
      void work() {
        var2++;
        var1 = var2 + 5;
      }
    Worker 2:
      void work() {
        var1++;
        var2 = var1;
      }
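The snippet's bug is a data race: both workers read and write shared state with no synchronization, so the final values depend on how the two interleave. A minimal Python sketch of the same hazard and its usual fix, a lock; the counter and iteration count are illustrative:

```python
import threading

counter = 0
lock = threading.Lock()
N = 100_000

def work():
    global counter
    for _ in range(N):
        with lock:  # remove this lock and increments can be lost
            counter += 1

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # → 200000, guaranteed only because of the lock
```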
10. Parallel computing pitfalls
     “Parallel computing is a black art”
     Very hard to program, expensive, complicated; how does it scale?
     How do we know when a worker has finished? When a worker has failed?
     How do we synchronize?
11. Now what?
     Data needs to be processed on a massive scale in a distributed fashion, as it does not fit on a single node.
     The solution must be scalable
     The solution must be cheap: low-cost hardware with redundancy
     Don't worry about concurrency; focus on more serious problems.
12. Distributed systems
     Fault tolerant
     Highly available
     Recoverable
     Consistent
     Scalable
13. Data storage
     Google FS (GFS) and the Hadoop Distributed File System (HDFS)
     Data must be available to all processing nodes
     Don't move data to workers; move workers to the data:
       Store data on the local disks of nodes
       Start workers on the node that holds the data locally
     Minimize metadata by using large blocks
14. MapReduce
     A framework for processing data using a cluster of computer nodes
     Created by Google in C++
     Two steps: map and reduce
     Automatic parallelization, distribution, failover, synchronization, and more
     A clean abstraction layer for programmers
     Processes are isolated
15. [Diagram: input pairs (k1 v1 … k6 v6) flow into parallel map tasks, which emit intermediate pairs (a 1, b 2, c 3, c 6, a 5, c 2, b 7, c 8). “Shuffle and Sort: aggregate values by keys” groups them (a → 1 5; b → 2 7; c → 2 3 6 8). Parallel reduce tasks then emit the final results (r1 s1, r2 s2, r3 s3).]
16. MapReduce map()
     map(in_key, in_value) => list of (out_key, intermediate_value)
     Data (lines from files, database rows, etc.) are read, recorded, and emitted as key/value pairs, for example “coconut,1”
     map() produces one or more intermediate values along with an output key from the input data.
     map() runs in parallel and independently.
17. MapReduce map() [figure]
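A minimal Python sketch of such a map() for word counting; the input line is illustrative:

```python
def map_words(line):
    """map(): read one line of input, emit a (word, 1) pair for every word."""
    for word in line.split():
        yield (word, 1)

print(list(map_words("sea reach Thames sea")))
# → [('sea', 1), ('reach', 1), ('Thames', 1), ('sea', 1)]
```

Note that the mapper emits a pair per occurrence; it is the reduce phase, not the mapper, that combines duplicates.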
18. MapReduce reduce()
     A reducer is given a key and all the values for that specific key.
     Once the map phase is over, the intermediate values for a given output key are collapsed into a list.
     reduce() combines the intermediate values into one or more final values for the same output key.
     Bottleneck: the reduce() stage cannot start until the map() phase is done.
19. MapReduce reduce() [figure]
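And a matching reduce() sketch: given one key and the full list of intermediate values emitted for it, collapse them into a final pair (word counting again, following the slides):

```python
def reduce_counts(word, frequencies):
    """reduce(): combine all intermediate values for one key into a final value."""
    return (word, sum(frequencies))

# After the shuffle phase, all values for 'sea' arrive together:
print(reduce_counts("sea", [1, 1]))  # → ('sea', 2)
```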
20. Big Data problem
     1. Iterate over a large number of records
     2. map()
     3. Shuffle and sort the intermediate results
     4. reduce()
     5. Generate final output
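The five steps above, wired together in a single-process Python sketch; the in-memory shuffle here stands in for what the framework does across nodes:

```python
from collections import defaultdict

def map_words(line):
    for word in line.split():
        yield (word, 1)                      # step 2: map()

def shuffle(pairs):
    groups = defaultdict(list)               # step 3: group values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def word_count(lines):
    intermediate = [p for line in lines for p in map_words(line)]  # step 1
    grouped = shuffle(intermediate)
    return {word: sum(vals) for word, vals in grouped.items()}     # steps 4-5

print(word_count(["sea reach Thames", "sea sky"]))
# → {'sea': 2, 'reach': 1, 'Thames': 1, 'sky': 1}
```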
21. Term Frequency (TF) calculation
     The TF of a given term is the number of times it appears within a document collection.
     “The sea-reach of the Thames stretched before us like the beginning of an interminable waterway. In the offing the sea and the sky were welded together without a joint, and in the luminous space the tanned sails of the barges drifting up with the tide seemed to stand still in red clusters of canvas sharply peaked, with gleams of varnished sprits.”
22. Term Frequency (TF) calculation
     Stopword elimination:
     sea reach Thames stretched before like beginning interminable waterway offing sea sky welded together joint luminous space tanned sails barges drifting tide seemed stand still red clusters canvas sharply peaked gleams varnished sprits
23. Term Frequency (TF) calculation
     A generic map() function
     Input: a single line
     Output: <word,frequency> pairs

       map(line) {
         @words = split / /, line
         foreach word ( @words ) {
           print word, 1
         }
       }
24. Term Frequency (TF) calculation
     Input: sea reach Thames stretched before like beginning
     Output would look like:
       sea,1
       reach,1
       Thames,1
       stretched,1
       before,1
       like,1
       beginning,1
25. Term Frequency (TF) calculation
     A generic reduce() function
     Sums up the values, which are the occurrences of each word.
     Input: a series of <word,frequency> pairs
     Output: a series of <word,sum> pairs

       reducer( word, frequency ) {
         datastructure[ word ] += frequency
       }
       foreach word ( datastructure ) {
         print word, datastructure[ word ]
       }
26. Term Frequency (TF) calculation
     Output from the reduce stage would look like:
       sea,2
       reach,1
       Thames,1
       stretched,1
       before,1
       like,1
       beginning,1
       [...]
27. MapReduce indexing
     Map over all documents:
       Emit term as key, (docno, tf) as value
       Emit other information as necessary (e.g., term position)
     Sort/shuffle: group by term
     Reduce:
       Gather and sort (e.g., by docno or tf)
       Write to disk
     MapReduce does all the heavy lifting!
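A single-process sketch of the indexing pattern above: map over documents emitting (term, (docno, tf)), then group the postings per term. The document texts are illustrative:

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map phase emits (term, (docno, tf)); shuffle/reduce groups postings by term."""
    index = defaultdict(list)
    for docno, text in docs:
        tf = Counter(text.split())            # term frequencies within this document
        for term, count in tf.items():
            index[term].append((docno, count))
    # Reduce phase: sort each postings list by document number.
    return {term: sorted(postings) for term, postings in index.items()}

docs = [(1, "one fish two fish"), (2, "red fish blue fish")]
print(build_inverted_index(docs)["fish"])  # → [(1, 2), (2, 2)]
```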
28. Inverted index (Boolean)
     Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat   Doc 4: green eggs and ham

       blue  → 2
       cat   → 3
       egg   → 4
       fish  → 1, 2
       green → 4
       ham   → 4
       hat   → 3
       one   → 1
       red   → 2
       two   → 1
29. Inverted index (ranked)
     Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat   Doc 4: green eggs and ham

       term   df  postings (docno, tf)
       blue    1  (2,1)
       cat     1  (3,1)
       egg     1  (4,1)
       fish    2  (1,2), (2,2)
       green   1  (4,1)
       ham     1  (4,1)
       hat     1  (3,1)
       one     1  (1,1)
       red     1  (2,1)
       two     1  (1,1)
30. [Diagram: indexing Docs 1–3 with MapReduce. Map emits (term, docno, tf) pairs: one → (1,1), two → (1,1), fish → (1,2) from Doc 1; red → (2,1), blue → (2,1), fish → (2,2) from Doc 2; cat → (3,1), hat → (3,1) from Doc 3. “Shuffle and Sort: aggregate values by keys” groups them; Reduce emits the postings lists, e.g. fish → (1,2), (2,2).]
31. Inverted index (positional)
     Doc 1: one fish, two fish   Doc 2: red fish, blue fish   Doc 3: cat in the hat   Doc 4: green eggs and ham

       term   df  postings (docno, tf, [positions])
       blue    1  (2,1,[3])
       cat     1  (3,1,[1])
       egg     1  (4,1,[2])
       fish    2  (1,2,[2,4]), (2,2,[2,4])
       green   1  (4,1,[1])
       ham     1  (4,1,[3])
       hat     1  (3,1,[2])
       one     1  (1,1,[1])
       red     1  (2,1,[1])
       two     1  (1,1,[3])
32. [Diagram: positional indexing of Docs 1–3 with MapReduce. Map emits (term, docno, tf, [positions]), e.g. one → (1,1,[1]), two → (1,1,[3]), fish → (1,2,[2,4]); “Shuffle and Sort: aggregate values by keys” groups them; Reduce emits postings such as fish → (1,2,[2,4]), (2,2,[2,4]).]
33. PageRank
     Named after Larry Page at Google
     Essentially a link analysis algorithm
     Measures the relative importance of a web page
     Algorithmically assesses and quantifies that “importance”
34. PageRank
     How can we define how important page X is?
     One solution: quantify the incoming links from other pages to that page
     Surely, more incoming links would mean a more authoritative status?
35. PageRank
     Imagine your typical web surfer browsing page X
     Only two things can happen:
       A) A random link from X is clicked (probability 1 − α)
       B) The user teleports away to a random page (probability α)
36. PageRank defined
     Given page x with in-bound links t1…tn, where
       C(t) is the out-degree of t
       α is the probability of a random jump
       N is the total number of nodes in the graph

       PR(x) = α (1/N) + (1 − α) Σ_{i=1..n} PR(t_i) / C(t_i)
37. PageRank defined
     [Diagram: pages t1, t2, …, tn each with an out-link pointing to page X]
38. Computing PageRank
     Properties of PageRank:
       Can be computed iteratively
       The effects of each iteration are local
     Sketch of the algorithm:
       Start with seed PRi values
       Each page distributes its PRi “credit” to all pages it links to
       Each target page adds up the “credit” from multiple in-bound links to compute PRi+1
       Iterate until the values converge (100 times?)
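A sequential Python sketch of the iteration described above, using the slides' formula with α as the random-jump probability. The graph, α, and iteration count are made up for illustration, and dangling nodes (pages with no out-links) are not handled:

```python
def pagerank(links, alpha=0.15, iterations=50):
    """Iterate PR(x) = alpha/N + (1 - alpha) * sum over in-links of PR(t)/C(t)."""
    nodes = list(links)
    n = len(nodes)
    pr = {node: 1.0 / n for node in nodes}            # seed PR values
    for _ in range(iterations):
        new_pr = {node: alpha / n for node in nodes}  # random-jump share
        for node, targets in links.items():
            for target in targets:                    # distribute "credit" to link targets
                new_pr[target] += (1 - alpha) * pr[node] / len(targets)
        pr = new_pr
    return pr

# Made-up three-page graph; every page links somewhere (no dangling nodes).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # → 'c', which receives credit from both a and b
```

In the MapReduce formulation of the next slide, the inner loop (distributing credit) is the map, and the summation into `new_pr` is the reduce.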
39. Map: distribute PageRank “credit” to link targets
     Reduce: gather up PageRank “credit” from multiple sources to compute the new PageRank value
     Iterate until convergence
40. Graph algorithms in MapReduce
     General approach:
       Store graphs as adjacency lists
       Each map task receives a node and its adjacency list
       Map tasks compute some function of the link structure and emit values with the target as the key
       Reduce tasks collect keys (target nodes) and aggregate
       Perform multiple MapReduce iterations until some termination condition is met
41. Amazon Web Services (AWS)
     “A collection of remote computing services offered over the Internet by Amazon”
     Accessed over HTTP using Representational State Transfer (REST) or the Simple Object Access Protocol (SOAP)
     Cheap
     Scalable
     API implementations
42. Amazon Simple Storage Service (S3)
     “An online persistent data storage service offered by Amazon Web Services”
     Charged on data stored and transferred
     Block-based filesystem
     Data is organized in buckets
     Buckets are accessed using HTTP REST
43. Amazon Simple Storage Service (S3)
     http://<bucket>.s3.amazonaws.com/<key>
     Like HDFS, data is replicated across nodes, enabling the storage of very large files
     Several big players, such as Twitter and HP, use S3
44. Amazon Elastic Compute Cloud (EC2)
     Scalable deployment of virtual servers for large-scale data processing
     Billed by hour of processing and the magnitude of resources needed
     No persistent storage (that's what S3 is for!)
     Automatic scalability
45. Amazon Elastic Compute Cloud (EC2)
     Amazon Machine Images (AMI):
       Sun, Oracle, IBM
       Windows, Linux
     Several sizes for all tastes:
       From 1.5 to 65GB RAM
       From 1 core to 2 quad-core CPUs
46. Amazon Elastic MapReduce
     Hadoop-ready virtual servers on EC2
     HDFS-esque input from S3
     Amazon Web Services Management Console
     Hadoop in < 5 minutes!
47. Live Demo
     Word counting:
       Project Gutenberg: 1,500 books in many languages
       Approx. half a million lines of text
       1,500 files stored on S3
     8 EC2 instances deployed
     14 minutes from start to finish
     Less than 2 USD
48. Thank you