A brief talk introducing MapReduce and Hadoop and describing part of how we use Hadoop MapReduce at MyLife.com.

  1. MapReduce with Hadoop at MyLife
     June 6, 2013
     Speaker: Jeff Meister
  2. Topics of Talk
     • What are MapReduce and Hadoop?
     • When would you want to use them?
     • How do they work?
     • What does Hadoop do for you?
     • How do you write MapReduce programs to take advantage of that?
     • What do we use them for at MyLife?
  3. What are MapReduce and Hadoop?
     • MapReduce is a programming model for parallel processing of large datasets
       • An idea for how to write programs under certain constraints
     • Hadoop is an open-source implementation of MapReduce
       • Designed for clusters of commodity machines
  4. Motivation: Why would you use MapReduce?
  5. Background: Disk vs. Memory
     • Memory
       • Where the computer keeps data it’s currently working on
       • Fast response time, random access supported
       • Expensive: typical size in tens of GB
     • Hard disk
       • More permanent storage of data for future tasks
       • Slow response time, sequential access only
       • Cheap: typical size in hundreds or thousands of GB
  6. Example Task on Small Datasets
     Public records (size: 8 MB):
     ID     Public record
     R1     Steve Jones, 36, 12 Main St, 10001
     R2     John Brown, 72, 625 8th Ave, 90210
     R3     James Davis, 23, 10 Broadway, 20202
     R4     Tom Lewis, 45, 95 Park Pl, 90024
     R5     Tim Harris, 33, PO Box 256, 33514
     ...    ...
     R2000  Adam Parker, 59, 82 F St, 45454

     Phone records (size: 3.5 MB):
     ID     Phone number
     P1     Robert White, 45121, (654) 321-4702
     P2     David Johnson, 07470, (973) 602-2519
     P3     Scott Lee, 23910, (602) 412-2255
     P4     Steve Jones, 10001, (212) 347-3380
     P5     John Wayne, 13284, (312) 446-8878
     ...    ...
     P1000  Tom Lewis, 90024, (650) 945-2319
  7. Real World: Large Datasets
     • 290 million public records = 380 GB
     • 228 million phone records = 252 GB
     • We could improve the previous algorithm, but...
       • The machine doesn’t have enough memory
       • Would spend lots of time moving pieces of data between disk and memory
       • Disk is so slow, the task is now impractical
     • What to do? Use Hadoop MapReduce!
       • Divide into smaller tasks, run them in parallel
  8. Hadoop: What does it do? How do you work with it?
  9. Components of the Hadoop System
     • Hadoop Distributed File System (HDFS)
       • Splits up files into blocks, stores them on multiple computers
       • Knows which blocks are on each machine
       • Transfers blocks between machines over the network
       • Replicates blocks, designed to tolerate frequent machine failures
     • MapReduce engine
       • Supports distributed computation
       • Programmer writes Map and Reduce functions
       • Engine takes care of parallelization, so you can focus on your work
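The block-splitting and replication behavior described above can be sketched in a few lines of Python. This is a toy simulation, not the real HDFS code: the round-robin placement policy and 3x replication default are illustrative (actual HDFS placement is rack-aware).

```python
def split_into_blocks(data, block_size):
    # Carve a file's bytes into fixed-size blocks, as HDFS does.
    # (Real block sizes are large, e.g. 64 MB; tiny here for illustration.)
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, machines, replication=3):
    # Toy round-robin placement: each block is stored on `replication`
    # distinct machines, so losing any one machine loses no data.
    return {b: [machines[(b + r) % len(machines)] for r in range(replication)]
            for b in range(num_blocks)}
```

For example, an 8-byte "file" with a block size of 3 splits into three blocks, and each block lands on three different machines of a four-machine cluster.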
  10. The Map and Reduce Functions
     • map : (K1, V1) → List(K2, V2)
       • Take an input record and produce (emit) a list of intermediate (key, value) pairs
     • reduce : (K2, List(V2)) → List(K3, V3)
       • Examine the values for each intermediate key, produce a list of output records
     • Critical observation: output type of map ≠ input type of reduce!
     • What’s going on in between?
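These two signatures can be sketched in plain Python. This is a toy single-machine simulation of the model, not the Hadoop Java API; the word-count mapper and reducer are illustrative examples, not from the talk.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # map : (K1, V1) -> List(K2, V2)
    # Word count: emit (word, 1) for each word in the input line.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # reduce : (K2, List(V2)) -> List(K3, V3)
    # Sum the counts collected for one word.
    return [(key, sum(values))]

def run_mapreduce(records, map_fn, reduce_fn):
    # Mapper phase: apply map_fn to every input record.
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]
    # The "in between" step: sort and group by intermediate key
    # (this is exactly what Hadoop's engine does for you).
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output
```

Running `run_mapreduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn)` groups the mappers' pairs by word before any reducer sees them, which is the answer to the "what's going on in between?" question on this slide.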
  11. The “Magic”: A Fast Parallel Sort
     • The core of Hadoop MapReduce is a distributed parallel sorting algorithm
     • Hadoop guarantees that the input to each reducer is sorted by key (K2)
       • All the (K2, V2) pairs from the mappers are grouped by key
       • The reducer gets a list of values corresponding to each key
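The grouping guarantee can be sketched on its own. A minimal Python simulation of the shuffle step, assuming each mapper's output arrives as a list of (key, value) pairs:

```python
from collections import defaultdict

def shuffle(mapper_outputs):
    # Collect every (K2, V2) pair from all mappers into per-key lists,
    # then hand keys to the reducers in sorted order, which is the
    # guarantee Hadoop makes about reducer input.
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return [(key, groups[key]) for key in sorted(groups)]
```

Note that values emitted by different mappers for the same key end up in one list, and the keys come out sorted, matching the two bullet points above.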
  12. Why Is It Fast?
     • Imagine how you might sort a deck of cards
     • The most intuitive procedure for humans is very inefficient for computers
     • Turns out the best algorithm, merge sort, is less straightforward
       • Split the data up into smaller pieces, sort the pieces individually, then merge them
     • Hadoop uses HDFS to do a giant parallel merge sort over its cluster
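The split/sort/merge procedure described above is classic merge sort; a short Python version makes the structure explicit (sequential here, but the two recursive halves are exactly what Hadoop runs on different machines):

```python
def merge(left, right):
    # Merge two already-sorted lists into one sorted list.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def merge_sort(items):
    # Split the data into smaller pieces, sort the pieces
    # individually, then merge them.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

The key property for parallelism is that the two halves are sorted completely independently; only the final merge needs to see both.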
  13. Example Task with MapReduce
     • map : (source_id, record) → List(match_key, source_id)
       • For each input record, select the fields to match by, make a key out of them
       • Use the record’s unique identifier as the value
     • reduce : (match_key, List(source_id)) → List(public_record_id, phone_id)
       • For each match key, look through the list of unique IDs
       • If we find both a public record ID and a phone ID in the same list, match!
         • The profiles with these IDs share all fields in the key
       • Generate the output pair of matched IDs
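A minimal Python sketch of this pair of functions, with assumptions made explicit: the record layout and the choice of (name, ZIP) as the match key are hypothetical, inferred from the slide-6 examples (R1/P4 share "Steve Jones" and 10001), and the R/P prefixes follow the slide's ID naming.

```python
def match_map(source_id, record):
    # Hypothetical record layout: select name and ZIP as the fields to
    # match by, and use the record's unique ID as the value.
    return [((record["name"], record["zip"]), source_id)]

def match_reduce(match_key, source_ids):
    # IDs starting with "R" are public records, "P" are phone records
    # (following the slide's naming). Emit every cross pair: these
    # profiles share all fields in the match key.
    r_ids = [s for s in source_ids if s.startswith("R")]
    p_ids = [s for s in source_ids if s.startswith("P")]
    return [(r, p) for r in r_ids for p in p_ids]
```

A key with only public-record IDs (or only phone IDs) produces no output, so unmatched records simply drop out.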
  14. Example Task on Small Datasets
     • Revisiting the two tables from slide 6: the match keys pair R1 with P4 (Steve Jones, 10001) and R4 with P1000 (Tom Lewis, 90024)
  15. When is MapReduce Appropriate?
     • To benefit from using Hadoop:
       • The data must be decomposable into many (key, value) pairs
       • Each mapper runs the same operation, independently of other mappers
       • Map output keys should sort values into groups of similar size
     • Sequential algorithms that are more straightforward may need redesign for the MapReduce model
  16. Common Applications of MapReduce
     • Many common distributed tasks are easily expressible with MapReduce. A few examples:
       • Term frequency counting
       • Pattern searching
       • Of course, sorting
       • Graph algorithms, such as reversal (Web links)
       • Inverted index generation
       • Data mining (clustering, statistics)
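One of the listed applications, inverted index generation, fits the model especially neatly; a toy Python sketch (illustrative, not from the talk):

```python
def index_map(doc_id, text):
    # Emit (term, doc_id) once per distinct term in the document.
    return [(term, doc_id) for term in sorted(set(text.split()))]

def index_reduce(term, doc_ids):
    # The posting list for one term: every document containing it,
    # deduplicated and sorted.
    return [(term, sorted(set(doc_ids)))]
```

The shuffle step gathers all the documents mentioning a given term onto one reducer, which is exactly the posting list an inverted index needs.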
  17. MapReduce at MyLife
  18. Applications of MapReduce at MyLife
     • We regularly run computations over large sets of people data
       • Who’s Searching For You
       • Content-based aggregation pipeline (1.5 TB)
       • Deltas of licensed data updates (300 GB)
       • Generating search indexes for old platform
       • Various ad hoc jobs involving matching, searching, extraction, counting, de-duplication, and more
  19. Hadoop Cluster Specifications
     • Currently 63 machines, each configured to run 4 or 6 map or reduce tasks at once (total capacity 296)
     • CPU: each machine has 2x quad-core Opteron @ 2.2 GHz
     • Memory: 32 GB per machine, 2 TB cluster total
     • Hard disk: between 3 and 9 TB per machine, 345 TB total HDFS capacity
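The stated figures are internally consistent, which a few lines of arithmetic confirm. The talk gives only the totals, so the particular split of 4-slot versus 6-slot machines below is an inferred example, not a stated fact:

```python
machines = 63
# Hypothetical split of 4-slot vs 6-slot machines; one of the splits
# consistent with the slide's totals (not given in the talk).
four_slot, six_slot = 41, 22
assert four_slot + six_slot == machines

total_task_slots = 4 * four_slot + 6 * six_slot  # matches the stated 296
total_memory_gb = 32 * machines                  # 2016 GB, the ~2 TB quoted
```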
  20. Other Companies Using Hadoop
     • Yahoo! - Index calculations for Web search
     • Facebook - Analytics and machine learning
       • World’s largest Hadoop cluster!
     • Amazon - Supports Hadoop on EC2/S3 cloud services
     • LinkedIn
       • People You May Know
       • Viewers of This Profile Also Viewed
     • Apple - Used in iAds platform
     • Twitter - Data warehousing and analytics
     • Lots more... http://wiki.apache.org/hadoop/PoweredBy
  21. Further Reading
     • Google research papers
       • Google File System, SOSP 2003
       • MapReduce, OSDI 2004
       • BigTable, OSDI 2006
     • Hadoop manual: http://hadoop.apache.org/
     • Other Hadoop-related projects from Apache: Cassandra, HBase, Hive, Pig