Background Main Points Conclusion References
Flexible and Efficient
Data Flow for MapReduce
Adam Martini
University of Oregon
martini@cs.uoregon.edu
Thursday 5th June, 2014
Flexible and Efficient Data Flow for MapReduce UO
What is MapReduce?
A framework popularized by Google circa
2004
Provides transparent scalability and fault
tolerance for “embarrassingly parallel”
tasks
Why do we need it?
Initially designed to solve PageRank
scalability issues
Big Data is everywhere!
Programmers who are not parallel-programming experts need a
way to efficiently manipulate very large
datasets
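
As a minimal single-process sketch of the model described above (not Google's or Hadoop's implementation), word counting can be written as a map function, a shuffle that groups by key, and a reduce function. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative assumptions; a real framework would distribute each phase across machines and handle failures transparently.

```python
from collections import defaultdict

def map_fn(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: each key is processed independently of the others.
    return dict(reduce_fn(key, values) for key, values in groups.items())

word_counts = run_mapreduce(["big data is big", "data flow"], map_fn, reduce_fn)
```

Because every map call and every reduce key is independent, the work is "embarrassingly parallel": the framework can split it freely without the programmer writing any coordination code.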
Strengths
1) Massively scalable
2) Can operate on
commodity hardware
3) Transparency:
Data locality
Fault tolerance
Weaknesses
Flexibility
Data flow (and algorithm design) is forced to conform to an
oversimplified computational model
Efficiency (wrt Data Flow)
Programs handle data non-optimally ⇒ performance loss
Take Away
Improved flexibility can improve efficiency, but the converse does
not hold
Focus on performance gains leads to a wide variety of
application-specific MapReduce derivatives
Data Flow Improvements
Extensions to existing frameworks focus on more flexible work flow,
while new frameworks for specific applications focus on improved
data flow.
3 Main Directions:
Iterative MapReduce
Graph Dependent Models
Time Aware Models
Iterative MapReduce
Primarily inspired by Machine Learning applications that iteratively
update a set of dynamic variables using a set of static data
instances.
Why
Data IO is slow!
How
Hold invariant data in memory
Hold reductions in memory for convergence checking
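
The two "How" points can be sketched with a toy one-dimensional k-means loop. This is an illustrative assumption, not the code of any framework named in this talk: the static points are loaded once and kept in memory (`cached_points`), while the previous reduction (the centroids) is kept so that convergence can be checked without writing to disk. All function names are hypothetical.

```python
def assign(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return nearest, point

def recompute(centroids, assignments):
    """Reduce step: average the points of each centroid; keep it if empty."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for idx, point in assignments:
        sums[idx] += point
        counts[idx] += 1
    return [sums[i] / counts[i] if counts[i] else centroids[i]
            for i in range(len(centroids))]

def iterate(points, centroids, tol=1e-9, max_iters=100):
    cached_points = list(points)  # invariant data: loaded once, reused every pass
    iteration = 0
    for iteration in range(1, max_iters + 1):
        new_centroids = recompute(
            centroids, [assign(p, centroids) for p in cached_points])
        # Convergence check against the previous reduction held in memory.
        shift = max(abs(a - b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, iteration

final_centroids, iterations_used = iterate(
    [1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

In plain MapReduce each pass would be a separate job that rereads the points from distributed storage, which is the IO cost this design avoids.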
Iterative MapReduce: Frameworks
Spark – A very popular framework that supports lazy
evaluation, efficient fault tolerance, and iterative machine
learning tasks... much faster than Hadoop!
HaLoop – The first framework to introduce convergence
checking data optimizations
Twister – Iterative MapReduce with an optional combine
step before reduction (also available in Hadoop)
Shredder/Incoop – A GPU-accelerated iterative MapReduce
framework with intelligent data-chunking algorithms
Graph Dependent Models
Two Flavors – Frameworks built to support graph updates
(PageRank), and frameworks that build data flow DAGs from
execution requests. Both flavors rely on graph structure for
optimization.
Why
Graphs contain information about dependencies!
How
Limit computation based on need
Recognize potential for optimization based on structure
(cycles, vertical and horizontal fusion)
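
One way to picture vertical fusion is as function composition: a chain of per-record stages is collapsed into a single stage, so the data is traversed once and no intermediate collection is materialized. The sketch below is a simplified assumption of the idea, with hypothetical helper names; real optimizers operate on a full execution graph rather than a flat list of stages.

```python
from functools import reduce

def fuse(stages):
    """Compose a chain of per-record functions into a single function."""
    return lambda record: reduce(lambda value, stage: stage(value), stages, record)

def run_unfused(records, stages):
    # One full pass (and one intermediate list) per stage.
    for stage in stages:
        records = [stage(r) for r in records]
    return records

def run_fused(records, stages):
    # A single pass over the input; intermediates live only per record.
    fused = fuse(stages)
    return [fused(r) for r in records]

stages = [str.strip, str.lower, len]
lines = ["  Map ", "REDUCE", " fuse  "]
unfused_result = run_unfused(lines, stages)
fused_result = run_fused(lines, stages)
```

Both versions compute the same result; the fused version simply does it with fewer passes, which is the kind of saving a dependency graph makes visible.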
Graph Dependent Models: Frameworks
Percolator – Developed by Google to replace MapReduce.
Updates only a portion of PageRank indices by propagating
the influence of new, or altered, web pages.
Pregel – A general framework for superstep-based graph
updates. Allows for dynamic changes to graph structure.
CIEL – Uses runtime analysis of dependency graphs to
dynamically optimize execution. Runtime analysis allows for
detection of iterative/recursive patterns.
FlumeJava – Preprocesses execution plan to generate a DAG,
which is used to perform fusion-based optimizations.
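
The superstep model can be illustrated with a toy single-process loop that propagates the maximum value through a graph. This is a sketch under stated assumptions, not Pregel's API: the function name `propagate_max` and the dictionary-based graph are hypothetical. In each superstep a vertex reads the messages sent to it in the previous superstep, updates its value, and sends messages only if the value changed; a vertex that sends nothing has effectively voted to halt.

```python
def propagate_max(values, edges):
    """Run supersteps until no messages are in flight; return final values."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to its out-neighbours.
    inbox = {v: [] for v in values}
    for src in values:
        for dst in edges.get(src, []):
            inbox[dst].append(values[src])
    supersteps = 1
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for dst in edges.get(vertex, []):
                    outbox[dst].append(values[vertex])
            # Otherwise the vertex stays silent (votes to halt).
        inbox = outbox  # barrier: messages become visible in the next superstep
        supersteps += 1
    return values, supersteps

graph_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph_edges)
```

Only vertices whose value changed do any work in the next superstep, which is how the graph structure limits computation based on need.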
Time Aware Models
Time aware models use logical timestamps in conjunction with
graphical data flow models to achieve greater flexibility in program
design, while reducing computational latency.
Why
More flexible program design eliminates the need to combine
several application-specific frameworks
How
More expressive language to define complex parallel patterns
A communication transparency layer to provide automated
event coordination
Time Aware Models: Frameworks
Naiad – Represents computation using a graph with stateful
nodes that are capable of sending data flow messages without
global coordination. Asynchronous events are coordinated by
providing a guarantee that messages will be delivered at a
given time, or that the system will be capable of delivering a
message by a given time. The result is a system that
outperforms popular frameworks in their target domain.
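
The idea of coordinating asynchronous events with logical timestamps can be sketched with a toy stateful vertex that counts records per timestamp (epoch) and emits a result only when it is notified that the epoch is complete. Everything here is a hypothetical simplification: the names `on_recv`, `on_notify`, and `run` are assumptions for illustration, and the driver knows the whole stream in advance, whereas a real system has to track progress across distributed workers.

```python
from collections import defaultdict

class CountPerEpoch:
    """Stateful vertex: counts records per logical timestamp (epoch)."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.emitted = []

    def on_recv(self, record, epoch):
        # Records from different epochs may be interleaved arbitrarily.
        self.counts[epoch] += 1

    def on_notify(self, epoch):
        # Called only once no more records with this epoch can arrive.
        self.emitted.append((epoch, self.counts.pop(epoch, 0)))

def run(vertex, stream):
    """Deliver (record, epoch) pairs and notify when an epoch is complete."""
    remaining = defaultdict(int)
    for _, epoch in stream:
        remaining[epoch] += 1  # toy stand-in for distributed progress tracking
    for record, epoch in stream:
        vertex.on_recv(record, epoch)
        remaining[epoch] -= 1
        if remaining[epoch] == 0:
            vertex.on_notify(epoch)

vertex = CountPerEpoch()
run(vertex, [("a", 0), ("b", 1), ("c", 0), ("d", 1), ("e", 1)])
```

The vertex never waits on a global barrier: epoch 0 is finalized as soon as its last record arrives, even though epoch 1 is still in flight.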
Take Away
MapReduce is a simple model that
provides useful transparency, but it
was not meant to do everything!
With increased flexibility comes
programmer responsibility...
The Ultimate Goal
A single framework that provides
transparent scalability/fault
tolerance, a friendly interface,
automatic optimization, and
expressivity.
References I
Jeffrey Dean and Sanjay Ghemawat,
“MapReduce: simplified data processing on large clusters,”
Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and
Bongki Moon,
“Parallel data processing with MapReduce: a survey,”
ACM SIGMOD Record, vol. 40, no. 4, pp. 11–20, 2012.
References II
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin
Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica,
“Resilient distributed datasets: A fault-tolerant abstraction for in-memory
cluster computing,”
in Proceedings of the 9th USENIX conference on Networked Systems
Design and Implementation. USENIX Association, 2012, pp. 2–2.
Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert,
Ilan Horn, Naty Leiser, and Grzegorz Czajkowski,
“Pregel: a system for large-scale graph processing,”
in Proceedings of the 2010 ACM SIGMOD International Conference on
Management of data. ACM, 2010, pp. 135–146.
References III
Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams,
Robert R Henry, Robert Bradshaw, and Nathan Weizenbaum,
“FlumeJava: easy, efficient data-parallel pipelines,”
in ACM SIGPLAN Notices. ACM, 2010, vol. 45, pp. 363–375.
Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul
Barham, and Martin Abadi,
“Naiad: a timely dataflow system,”
in Proceedings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles. ACM, 2013, pp. 439–455.
References IV
Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A Thekkath,
Yuan Yu, and Li Zhuang,
“Nectar: Automatic management of data and computation in
datacenters,”
in OSDI, 2010, pp. 75–88.
The End