
Tuple MapReduce: beyond classic MapReduce

This paper introduces Tuple MapReduce, a new foundational model
that extends MapReduce with the notion of tuples. Tuple MapReduce
bridges the gap between the low-level constructs provided by
MapReduce and the higher-level needs of programmers, such as
compound records, sorting, and joins. The paper also presents
Pangool, an open-source framework implementing Tuple MapReduce.
Pangool eases the design and implementation of MapReduce-based
applications and increases their flexibility while maintaining
Hadoop’s performance.

Published in: Technology

  1. Tuple MapReduce: Beyond classic MapReduce
     Pedro Ferrera, Ivan de Prado, Eric Palacios (DataSalt, Barcelona, Spain; pere,ivan,epalacios@datasalt.com)
     Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo (University of Geneva, CUI, Geneva, Switzerland; joseluis.fernandez@unige.ch)
  2. Outline
     ● Introduction
     ● Related Work
     ● Classic MapReduce
       – The problems of MapReduce
     ● Tuple MapReduce
       – The basic Tuple MapReduce
       – Joins
       – Generalization of MapReduce
     ● Pangool
     ● Conclusions and Future work
  3. Introduction
     ● A huge amount of information → need for new processing technologies.
     ● MapReduce → major contribution ...
       – … but it involves a steep learning curve.
     ● Most design patterns found in real-world problems are not well covered.
     ● We propose Tuple MapReduce as a better foundational model.
     ● Tuple MapReduce on Hadoop → Pangool
       – No key architectural changes needed.
  4. Related work
     ● MapReduce: Google paper from 2004
     ● Hadoop
     ● Higher-level tools
       – Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
     ● Higher-level abstractions are very popular
       – This supports the idea that MapReduce is too low-level a paradigm
     ● Merge MapReduce
       – Targets the problem of relational operations (joins)
       – Implies changes in the architecture and a new merge step
  5. Classic MapReduce
     ● Jobs
       – Input file, output file
       – The developer provides two functions: map and reduce
     ● Distributed execution of work
       – First, the map function runs in the map phase
       – Then the reduce function runs in the reduce phase
  6. The problems of MapReduce
     ● Compound records
       – Real-world problems include multi-field records. They don’t fit well into the key/value schema
     ● Sorting
       – No inherent sorting of the records within a reduce group.
       – “Secondary sorting trick” in implementations (Hadoop)
     ● Join
       – Quite a common operation
       – Not directly possible in MapReduce without using “tricks”:
         ● secondary sorting
         ● compound records
  7. Tuple MapReduce
     ● Idea: replace key/value by tuples
     ● group-by and sort-by clauses
  8. Tuple MapReduce (II)
     ● group-by and sort-by constraint
       – group-by must be a prefix of sort-by
       – Needed in order to implement Tuple MapReduce over a MapReduce architecture
     ● Contrary to MapReduce, Tuple MapReduce:
       – provides compound records → tuple
       – provides intra-reduce sorting
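The group-by and sort-by clauses can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the function name `tuple_mapreduce`, its parameters, and the sample records are hypothetical and are not Pangool's API. Tuples are modelled as plain Python dicts, and a single sort on the sort-by fields stands in for the distributed shuffle.

```python
from itertools import groupby


def _key(fields):
    # Build a key function that extracts the given fields from a tuple.
    return lambda t: tuple(t[f] for f in fields)


def tuple_mapreduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Hypothetical in-memory simulation of one Tuple MapReduce job."""
    # Constraint from the slide: group-by must be a prefix of sort-by.
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    # Map phase: each input record may emit any number of tuples.
    tuples = [t for record in records for t in map_fn(record)]
    # Shuffle: sorting by the sort-by fields also makes each group contiguous,
    # because the group-by fields are the leading sort fields.
    tuples.sort(key=_key(sort_by))
    # Reduce phase: one call per group; tuples arrive already sorted.
    output = []
    for key, group in groupby(tuples, key=_key(group_by)):
        output.extend(reduce_fn(key, group))
    return output


# Illustrative data: collect the dates of each URL in chronological order.
records = [
    {"url": "b.com", "date": "2012-01-01"},
    {"url": "a.com", "date": "2012-01-02"},
    {"url": "a.com", "date": "2012-01-01"},
]
result = tuple_mapreduce(
    records,
    map_fn=lambda r: [r],
    reduce_fn=lambda key, group: [(key[0], [t["date"] for t in group])],
    group_by=["url"],
    sort_by=["url", "date"],
)
```

The prefix check is what allows one sort to serve both clauses: if group-by were not a prefix of sort-by, tuples of the same group would not be guaranteed to be adjacent after sorting.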
  9. Example: cumulative visits
     ● Cumulative number of visits up to each single date
       – Input → URL, date, visits
       – Expected output → URL, date, cumulative visits
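A minimal sketch of the cumulative visits example, assuming group-by = URL and sort-by = (URL, date). The function name `cumulative_visits` and the sample rows are made up for illustration, and dates are assumed to be ISO-formatted strings so that string order matches chronological order.

```python
from itertools import groupby


def cumulative_visits(rows):
    """rows: iterable of (url, date, visits) -> list of (url, date, cumulative visits)."""
    # sort-by: url, date (this stands in for the intra-reduce sorting).
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    output = []
    # group-by: url (each group plays the role of one reduce call).
    for url, group in groupby(ordered, key=lambda r: r[0]):
        running_total = 0
        for _, date, visits in group:
            # Dates arrive in order, so a single running sum is enough.
            running_total += visits
            output.append((url, date, running_total))
    return output


# Hypothetical input rows.
rows = [
    ("a.com", "2012-01-02", 5),
    ("b.com", "2012-01-01", 1),
    ("a.com", "2012-01-01", 10),
    ("a.com", "2012-01-03", 2),
]
totals = cumulative_visits(rows)
```

Because each group is already sorted by date, the reducer keeps only one counter in memory instead of buffering and sorting the whole group itself.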
  10. Join-Tuple MapReduce
     ● Joins among heterogeneous datasets
       – Tuples are associated with a source-id.
         ● Tuples reach the reducer sorted by source-id, enabling memoryless reduce joins,
         ● and grouped by some common fields
  11. Example: join between clients and payments
     ● clients inner join payments
     ● Fields shown: name, client_id, payment_id, amount
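The reduce-side join described above can be sketched as follows. This is an illustration under assumptions, not Pangool code: the source-id constants, the function `inner_join`, the column layout of the two inputs, and the sample data are all hypothetical, and `client_id` is assumed to be unique among clients.

```python
from itertools import groupby

# Hypothetical source ids: clients sort before payments within each group.
CLIENTS, PAYMENTS = 0, 1


def inner_join(clients, payments):
    """clients: (client_id, name); payments: (payment_id, client_id, amount)."""
    # Tag every tuple with the id of the dataset it comes from.
    tagged = [(cid, CLIENTS, (name,)) for cid, name in clients]
    tagged += [(cid, PAYMENTS, (pid, amount)) for pid, cid, amount in payments]
    # sort-by: client_id, source-id; group-by: client_id.
    tagged.sort(key=lambda t: (t[0], t[1]))
    output = []
    for cid, group in groupby(tagged, key=lambda t: t[0]):
        name, has_client = None, False
        for _, source, fields in group:
            if source == CLIENTS:
                # The client tuple arrives first; only this one record is held.
                name, has_client = fields[0], True
            elif has_client:
                # Payments are streamed through without being buffered.
                pid, amount = fields
                output.append((name, cid, pid, amount))
    return output


# Hypothetical data: client 3 has no payments, payment 103 has no client.
clients = [(1, "Ana"), (2, "Bob"), (3, "Eve")]
payments = [(100, 2, 5.0), (101, 1, 7.5), (102, 2, 1.0), (103, 9, 3.0)]
joined = inner_join(clients, payments)
```

Sorting by source-id inside each group is what keeps the reducer close to memoryless: it remembers a single client record and emits one joined row per payment as the payments stream by.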
  12. Generalization of MapReduce
     ● MapReduce is a Tuple MapReduce with...
       – tuples of two values and
       – group-by and sort-by set to the first value
     ● The opposite is also possible → implementing Tuple MapReduce on top of existing MapReduce implementations.
       – Architectural changes are not needed.
       – Pangool is a proof of that.
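The first direction of this equivalence can be sketched in a few lines: classic key/value MapReduce expressed as a Tuple MapReduce over two-field tuples, with group-by and sort-by both set to the first field. The helper names `tuple_mapreduce` and `classic_mapreduce` and the word-count example are hypothetical, and the whole job runs in a single process for illustration.

```python
from itertools import groupby


def tuple_mapreduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Hypothetical in-memory Tuple MapReduce; fields are tuple positions."""
    if tuple(sort_by[:len(group_by)]) != tuple(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    tuples = sorted(
        (t for record in records for t in map_fn(record)),
        key=lambda t: tuple(t[i] for i in sort_by),
    )
    output = []
    for key, group in groupby(tuples, key=lambda t: tuple(t[i] for i in group_by)):
        output.extend(reduce_fn(key, group))
    return output


def classic_mapreduce(records, map_fn, reduce_fn):
    """Classic MapReduce: (key, value) pairs, grouped and sorted by field 0."""
    return tuple_mapreduce(
        records,
        map_fn,
        # Unwrap the one-field group key and hand the reducer only the values.
        lambda key, group: reduce_fn(key[0], (value for _, value in group)),
        group_by=(0,),
        sort_by=(0,),
    )


def wc_map(line):
    # Emit one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    # Sum the counts of one word.
    yield (word, sum(counts))


counts = classic_mapreduce(["a b a", "b c"], wc_map, wc_reduce)
```

The wrapper adds nothing beyond fixing the tuple width and the two clauses, which is the sense in which classic MapReduce is a special case of Tuple MapReduce.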
  13. Pangool (pangool.net)
     ● Tuple MapReduce implementation on top of Hadoop.
       – On top of an existing MapReduce implementation.
         ● It is just a library. No architecture change was needed.
     ● Used in real-world applications
       – Banking
       – Searching
       – Social networks
  14. Pangool benchmark – secondary sort
  15. Pangool benchmark – join
  16. Pangool performance
     ● Only 5% to 8% slower than Hadoop
       – Pretty good considering that Pangool is built on top of the Hadoop API
         ● The difference would probably disappear with a native implementation
     ● Much better than higher-level APIs
       – Probably because Pangool is a low-level API
  17. Conclusions and Future work
     ● The MapReduce key/value model has been shown to be too strict.
     ● Tuple MapReduce keeps MapReduce's features
       – Enhancing it with
         ● compound records,
         ● joins and
         ● intra-reduce sorting.
     ● Pangool is a proof of its viability,
       – including in existing implementations like Hadoop, without changing the architecture
     ● Future work would involve abstractions for flow creation
       – Simplifying job chaining and data flow.
  18. Thanks!
     ● Any questions or doubts?
       – ivan@datasalt.com
       – @ivanprado