Tuple MapReduce: Beyond classic MapReduce

This paper presents Tuple MapReduce, a new foundational model that
extends MapReduce with the notion of tuples. Tuple MapReduce bridges
the gap between the low-level constructs provided by MapReduce and the
higher-level needs of programmers, such as compound records, sorting,
or joins. The paper also presents Pangool, an open-source framework
implementing Tuple MapReduce. Pangool eases the design and
implementation of applications based on MapReduce and increases their
flexibility while maintaining Hadoop’s performance.

Published in: Technology


  1. Tuple MapReduce: Beyond classic MapReduce
     Pedro Ferrera, Ivan de Prado, Eric Palacios
       DataSalt, Barcelona, SPAIN
       pere,ivan,epalacios@datasalt.com
     Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo
       University of Geneva, CUI, Geneva, SWITZERLAND
       joseluis.fernandez@unige.ch
  2. Outline
     ● Introduction
     ● Related Work
     ● Classic MapReduce
       – The problems of MapReduce
     ● Tuple MapReduce
       – The basic Tuple MapReduce
       – Joins
       – Generalization of MapReduce
     ● Pangool
     ● Conclusions and Future work
  3. Introduction
     ● A huge amount of information → need for new processing technologies.
     ● MapReduce → major contribution ...
       – … but involves a steep learning curve.
     ● Most design patterns found in real-world problems are not well covered.
     ● We propose Tuple MapReduce as a better foundational model.
     ● Tuple MapReduce on Hadoop → Pangool
       – No key architectural changes needed.
  4. Related work
     ● MapReduce: Google paper, 2004
     ● Hadoop
     ● Higher level tools
       – Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
     ● Higher level abstractions are very popular
       – Supports the idea that MapReduce is too low-level a paradigm
     ● Merge MapReduce
       – Targets the problem of relational operations (joins)
       – Implies changes in the architecture and a new merge step
  5. Classic MapReduce
     ● Jobs
       – Input file, output file
       – Developer provides two functions: map and reduce
     ● Distributed execution of work
       – First the map function, in the map phase
       – Then the reduce function, in the reduce phase
  6. The problems of MapReduce
     ● Compound records
       – Real-world problems include multi-field records. They don’t fit the key/value schema well.
     ● Sorting
       – No inherent sorting of the records within each reduce group.
       – “Secondary sorting trick” in implementations (Hadoop)
     ● Join
       – A quite common operation
       – Not directly possible in MapReduce without using “tricks”:
         ● secondary sorting
         ● compound records
  7. Tuple MapReduce
     ● Idea: replace key/value pairs with tuples
     ● group-by and sort-by clauses
  8. Tuple MapReduce (II)
     ● group-by and sort-by constraint
       – group-by must be a prefix of sort-by
       – Needed if you want to be able to implement Tuple MapReduce over a MapReduce architecture
     ● Contrary to MapReduce, Tuple MapReduce:
       – provides compound records → tuple
       – provides intra-reduce sorting
  9. Example: cumulative visits
     ● Cumulative # of visits up to each single date
       – Input → URL, date, visits
       – Expected output → URL, date, cumulative visits
  10. Join-Tuple MapReduce
     ● Joins among heterogeneous datasets
       – Tuples are associated with a source-id.
         ● Tuples reach the reducer sorted by source-id (enabling memoryless reduce joins) and grouped by some common fields
  11. Example: join between clients and payments
     [Figure: clients inner join payments; columns shown: name, client_id, payment_id, amount]
  12. Generalization of MapReduce
     ● MapReduce is a Tuple MapReduce with...
       – tuples of two values and
       – group-by and sort-by set to the first value
     ● The opposite is also possible → implementing Tuple MapReduce on top of existing MapReduce implementations.
       – Architectural changes are not needed.
       – Pangool is a proof of that.
  13. Pangool (pangool.net)
     ● Tuple MapReduce implementation on top of Hadoop.
       – On top of an existing MapReduce implementation.
         ● It is just a library. No architecture change was needed.
     ● Used in real-world applications
       – Banking
       – Searching
       – Social networks
  14. Pangool benchmark – secondary sort
  15. Pangool benchmark – join
  16. Pangool performance
     ● Only between 5% and 8% worse than Hadoop
       – Pretty good considering that Pangool is built on top of the Hadoop API
         ● The difference would probably disappear with a native implementation
     ● Much better than higher level APIs
       – Probably because Pangool is a low level API
  17. Conclusions and Future work
     ● The MapReduce key/value model has been shown to be too strict.
     ● Tuple MapReduce keeps MapReduce features
       – Enhancing them with
         ● compound records,
         ● joins and
         ● intra-reduce sorting.
     ● Pangool is a proof of its viability,
       – including in existing implementations like Hadoop without changing the architecture
     ● Future work would involve abstractions for flow creation
       – Simplifying job chaining and data flow.
  18. Thanks!
     ● Any questions, or doubts?
       – ivan@datasalt.com
       – @ivanprado
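
Sketch for slide 5 (Classic MapReduce). The two-function contract described there (a map function, then a reduce function applied per key) can be illustrated with a single-process stand-in. This is only a sketch under stated assumptions: `run_mapreduce`, `word_count_map` and `word_count_reduce` are hypothetical names, the word-count job and the sample lines are made up, and nothing here is distributed or taken from Hadoop's API.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for a classic MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    # Map phase: every input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: one call per key. Keys are visited in sorted order,
    # but the values inside each group arrive in no guaranteed order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def word_count_map(line):
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    yield word, sum(counts)


lines = ["map reduce", "tuple map reduce"]  # made-up sample input
result = run_mapreduce(lines, word_count_map, word_count_reduce)
print(result)
```

Note that the reducer only ever sees a key and a bag of values, which is the limitation slide 6 goes on to discuss.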
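
Sketch for slides 7 to 9 (Tuple MapReduce and the cumulative visits example). The slides describe tuples with group-by and sort-by clauses, where group-by must be a prefix of sort-by. A minimal in-memory simulation of that model might look as follows. It is not Pangool code: `run_tuple_mapreduce` and the other function names are hypothetical, tuples are modeled as plain dicts, and the visit counts are invented sample data.

```python
from itertools import groupby


def run_tuple_mapreduce(records, map_fn, reduce_fn, group_by, sort_by):
    """In-memory simulation of a Tuple MapReduce job (hypothetical helper)."""
    # Constraint from slide 8: group-by must be a prefix of sort-by.
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")

    def key_on(fields):
        return lambda t: tuple(t[f] for f in fields)

    tuples = [t for record in records for t in map_fn(record)]
    # Sorting by sort_by also makes each group contiguous, because
    # group_by is a prefix of sort_by.
    tuples.sort(key=key_on(sort_by))
    output = []
    for group_key, group in groupby(tuples, key=key_on(group_by)):
        output.extend(reduce_fn(group_key, group))
    return output


def identity_map(record):
    yield record


def cumulative_reduce(group_key, tuples):
    # Tuples arrive sorted by date within the URL, so a running sum suffices.
    total = 0
    for t in tuples:
        total += t["visits"]
        yield {"url": t["url"], "date": t["date"], "cumulative_visits": total}


visits = [  # made-up sample data
    {"url": "a.com", "date": "2012-01-02", "visits": 5},
    {"url": "b.com", "date": "2012-01-01", "visits": 7},
    {"url": "a.com", "date": "2012-01-01", "visits": 10},
    {"url": "a.com", "date": "2012-01-03", "visits": 1},
]
cumulative = run_tuple_mapreduce(
    visits, identity_map, cumulative_reduce,
    group_by=["url"], sort_by=["url", "date"],
)
for row in cumulative:
    print(row)
```

Because the reducer receives each URL's tuples already ordered by date, it needs no buffering or in-reducer sort, which is the intra-reduce sorting benefit named on slide 8.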
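
Sketch for slides 10 and 11 (join between clients and payments). The slides say tuples carry a source-id, reach the reducer sorted by source-id, and are grouped by common fields. One way to picture this is the reduce-side inner join below. The names (`CLIENTS`, `PAYMENTS`, `tag`, `join_reduce`, `reduce_side_join`), the sample rows, and the assumption of at most one client row per `client_id` are all illustrative and not taken from the deck.

```python
from itertools import groupby

CLIENTS, PAYMENTS = 0, 1  # hypothetical source-ids; clients sort first


def tag(records, source_id):
    """Attach a source-id to every record of one dataset."""
    for record in records:
        yield dict(record, source_id=source_id)


def join_reduce(tuples):
    # Within a group the client tuple (source-id 0) arrives before any
    # payment, so only one client record is held while payments stream by.
    client = None
    for t in tuples:
        if t["source_id"] == CLIENTS:
            client = t
        elif client is not None:  # inner join: skip payments without a client
            yield {
                "name": client["name"],
                "client_id": t["client_id"],
                "payment_id": t["payment_id"],
                "amount": t["amount"],
            }


def reduce_side_join(clients, payments):
    tuples = list(tag(clients, CLIENTS)) + list(tag(payments, PAYMENTS))
    # Group by the common field, sort by (common field, source-id).
    tuples.sort(key=lambda t: (t["client_id"], t["source_id"]))
    joined = []
    for _, group in groupby(tuples, key=lambda t: t["client_id"]):
        joined.extend(join_reduce(group))
    return joined


clients = [  # made-up sample data
    {"client_id": 1, "name": "Alice"},
    {"client_id": 2, "name": "Bob"},
    {"client_id": 3, "name": "Carol"},
]
payments = [
    {"payment_id": 100, "client_id": 1, "amount": 20.0},
    {"payment_id": 101, "client_id": 2, "amount": 5.5},
    {"payment_id": 102, "client_id": 1, "amount": 7.25},
    {"payment_id": 103, "client_id": 4, "amount": 9.0},
]
joined = reduce_side_join(clients, payments)
for row in joined:
    print(row)
```

In this sample, Carol has no payments and payment 103 has no matching client, so neither appears in the inner-join output.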
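
Sketch for slides 8 and 12 (Tuple MapReduce over a MapReduce architecture). Slide 8 states that the group-by-as-prefix constraint is needed to run Tuple MapReduce on a MapReduce architecture. The usual secondary sort technique on Hadoop uses a composite key, partitions on only part of it, sorts on the whole key, and groups on the same part again. The simulation below follows that idea. It is an assumption about how such a layering can work, not a description of Pangool's internals; `shuffle`, `partition_for` and `key_on` are hypothetical names and the data is invented.

```python
import zlib
from itertools import groupby


def key_on(fields):
    return lambda t: tuple(t[f] for f in fields)


def partition_for(group_key, num_reducers):
    # Deterministic stand-in for a partitioner: only the group-by fields
    # decide which reducer a tuple goes to.
    return zlib.crc32(repr(group_key).encode("utf-8")) % num_reducers


def shuffle(tuples, group_by, sort_by, num_reducers):
    """Simulate the partition / sort / group steps between map and reduce."""
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    group_key = key_on(group_by)
    partitions = [[] for _ in range(num_reducers)]
    for t in tuples:
        partitions[partition_for(group_key(t), num_reducers)].append(t)
    groups = {}
    for part in partitions:
        # Sort on the full sort-by key, then cut groups on the group-by
        # prefix. Groups are contiguous only because of the prefix rule.
        part.sort(key=key_on(sort_by))
        for gk, members in groupby(part, key=group_key):
            groups[gk] = list(members)
    return groups


visits = [  # made-up sample data
    {"url": "a.com", "date": "2012-01-02", "visits": 5},
    {"url": "b.com", "date": "2012-01-01", "visits": 7},
    {"url": "a.com", "date": "2012-01-01", "visits": 10},
]
groups = shuffle(visits, group_by=["url"], sort_by=["url", "date"], num_reducers=2)
for gk in sorted(groups):
    print(gk, [t["date"] for t in groups[gk]])
```

If group-by were not a prefix of sort-by, tuples of one group could be interleaved with other groups after sorting, and a single pass over the sorted partition could no longer hand each group to the reducer in one piece.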
