Outline
MAPREDUCE11
Talks in HADOOP WORLD 2010
Other Interesting Papers
Paper Introduction
Other Papers
An Introduction of Recent Research on MapReduce
Yu Liu
The Graduate University for Advanced Studies
July 8th, 2011
Yu Liu An Introduction of Recent Research on MapReduce
Outline
1 Papers in MAPREDUCE11
2 Talks in HADOOP WORLD 2010
3 Other Interesting Papers
MAPREDUCE11
Sessions
1 Environments and Extensions to the MapReduce Programming Model
2 MapReduce Applications
3 Performance and Feature Improvements of MapReduce
4 Keynote by Greg Malewicz, Google Research: Beyond MapReduce
Paper List
1 Otus: Resource Attribution and Metrics Correlation in Data
Intensive Clusters (1)
2 Phoenix++: Modular MapReduce for Shared-Memory Systems (1)
3 Static Type Checking of Hadoop MapReduce Programs (1)
4 Tall and Skinny QR factorizations in MapReduce architectures (2)
5 Rapid Parallel Genome Indexing with MapReduce (2)
6 Full-Text Indexing for Optimizing Selection Operations in
Large-Scale Data Analytics (2)
7 MapReducing a Genomic Sequencing Workflow (2)
8 Exploring MapReduce Efficiency with Highly-Distributed Data (3)
9 Parallelizing large-scale data processing applications with data
skew: a case study in product-offer matching (3)
The home page
Tyson Condie et al.: MapReduce Online, NSDI ’10
James Demmel et al.: Communication-avoiding parallel and
sequential QR factorizations, EECS-2008-74
Otus: Resource Attribution in Data-Intensive Clusters
Authors: Kai Ren, Julio López, Garth Gibson
@Carnegie Mellon University
Basic content of this paper:
An approach for facilitating performance analyses of distributed
data-intensive applications
Background:
Understanding the resource requirements of frameworks like
Hadoop, Dryad, etc., and the performance characteristics of the
applications is inherently difficult due to the distributed nature and
scale of the computing platform.
Otus: Resource Attribution in Data-Intensive Clusters
Problems:
Traditional cluster monitoring tools do not provide the information
needed to answer fundamental questions about application
performance in data-intensive environments.
Solutions:
Attribute resource utilization to the components of interest at
different layers of the cluster software stack. The collected data
is then correlated to infer the resource utilization of each service
component and job process in the cluster.
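The attribution step can be illustrated with a minimal sketch. It assumes that per-process samples have already been collected and that each process can be mapped to a logical owner (a job task or a service daemon). The sample layout, the `owners` table, and the `attribute` function are hypothetical and are not taken from Otus itself.

```python
from collections import defaultdict

# Hypothetical per-process samples: (timestamp, node, pid, cpu_seconds, rss_mb).
samples = [
    (0, "node1", 101, 0.8, 200),
    (0, "node1", 102, 0.1, 50),
    (0, "node2", 201, 0.5, 300),
    (1, "node1", 101, 0.9, 220),
    (1, "node2", 201, 0.4, 310),
]

# Hypothetical mapping from (node, pid) to the logical owner of the process.
owners = {
    ("node1", 101): "job_A/map_task",
    ("node1", 102): "datanode",
    ("node2", 201): "job_A/reduce_task",
}

def attribute(samples, owners):
    """Sum CPU time and track peak memory per owner; unknown pids go to 'other'."""
    usage = defaultdict(lambda: {"cpu": 0.0, "peak_rss_mb": 0})
    for _ts, node, pid, cpu, rss in samples:
        owner = owners.get((node, pid), "other")
        usage[owner]["cpu"] += cpu
        usage[owner]["peak_rss_mb"] = max(usage[owner]["peak_rss_mb"], rss)
    return dict(usage)

usage = attribute(samples, owners)
```

Grouping by owner rather than by node is what separates this view from a node-level monitoring dashboard: the same table can answer "how much CPU did job_A use" across machines.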
Phoenix++: Modular MapReduce for Shared-Memory
Systems
The Phoenix home page
Authors: Justin Talbot, Richard M. Yoo, Christos Kozyrakis
@Computer Systems Laboratory, Stanford University
Basic content of this paper:
Phoenix is a shared-memory implementation of Google’s
MapReduce. Phoenix++ is a new implementation that, according to
the paper, achieves a 4.7-fold performance improvement and
increased scalability.
Problems:
Performance issue of Phoenix: it adopts a static MapReduce
pipeline similar to cluster-based implementations.
Inefficient Key-Value Storage
Ineffective Combiner
Exposed Task Chunking
Solutions:
Abstractions for intermediate data: Containers
More effective combiner implementation: Combiner Objects
Hide the task chunking granularity
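The container and combiner-object ideas can be sketched as follows. This is a single-threaded Python analogue for illustration only: Phoenix++ itself is a C++ library, and the class names `SumCombiner` and `HashContainer` are hypothetical, not its API. The point shown is that each emitted value is folded into per-key state immediately, instead of being stored as a list of intermediate pairs and combined later.

```python
class SumCombiner:
    """Combiner object: folds each emitted value into a running total."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


class HashContainer:
    """Container: maps each intermediate key to its own combiner object."""

    def __init__(self, combiner_factory):
        self._factory = combiner_factory
        self._slots = {}

    def emit(self, key, value):
        # Combine on emit, so no per-key list of values is ever materialized.
        if key not in self._slots:
            self._slots[key] = self._factory()
        self._slots[key].add(value)

    def items(self):
        return {key: comb.total for key, comb in self._slots.items()}


def word_count(lines):
    container = HashContainer(SumCombiner)
    for line in lines:                # map phase
        for word in line.split():
            container.emit(word, 1)
    return container.items()


counts = word_count(["a b a", "b a"])
```

Swapping `HashContainer` for another container type (for example a fixed-size array when the key range is known) would not change the map function, which is the modularity the slide refers to.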
Other Modularity in Phoenix++
Sort is optional.
Custom sorting functions can be defined over key-value pairs
Custom memory allocators.
Static Type Checking of Hadoop MapReduce Programs
Authors: Jens Dörre, Sven Apel, Christian Lengauer
@University of Passau, Germany
Basic content of this paper:
Provide a static check for Hadoop programs without asking the
user to write any more code.
Background:
Higher-order functions of functional languages can be strongly
typed using parametric polymorphism but in Hadoop, the
connection between the two phases of a MapReduce computation
is unsafe: there is no static type check of the generic type
parameters involved.
Problems:
In many MapReduce implementations, MapReduce programs are
not type checked at compile time.
Solutions:
A static type checker for Hadoop, using the Java 5 compiler.
Users write the job setup in the main function using the combinators.
The Hadoop job configuration can be generated automatically from
the combinator code.
The real code (listing not preserved in this transcript):
Two important functions:
check: uses a chaining combinator to check the interface
between the mapper and the combiner function, and another
one to check the interface between the result and the reducer
function.
configureTypeSafeJob: Generates the Hadoop job
configuration.
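The idea of a chaining combinator whose type parameters tie the mapper output to the reducer input can be sketched with Python type hints. This is only an analogue: the paper works with Java generics and the Java compiler, whereas Python enforces such annotations only when an external static checker (for example mypy) is run. The names `chain`, `tokenize`, and `total` are hypothetical and are not the paper's `check` or `configureTypeSafeJob`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")


def chain(
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], V3],
) -> Callable[[Iterable[Tuple[K1, V1]]], Dict[K2, V3]]:
    """Connect a mapper to a reducer; K2 and V2 must agree on both sides."""

    def job(records):
        groups = defaultdict(list)
        for key, value in records:
            for k2, v2 in mapper(key, value):
                groups[k2].append(v2)      # group intermediate values by key
        return {k2: reducer(k2, values) for k2, values in groups.items()}

    return job


def tokenize(_offset: int, line: str) -> Iterable[Tuple[str, int]]:
    return [(word, 1) for word in line.split()]


def total(_word: str, counts: List[int]) -> int:
    return sum(counts)


word_count = chain(tokenize, total)
result = word_count([(0, "a b a"), (6, "b a")])
```

Because `K2` and `V2` appear in both parameter types, a reducer annotated to take, say, `List[str]` values should be rejected by a static checker when passed to `chain` together with `tokenize`, before any data is processed.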
Tall and Skinny QR factorizations in MapReduce
architectures
Authors: Paul G. Constantine (Sandia National Laboratories,
Albuquerque), David F. Gleich (Sandia National Laboratories,
Livermore)
Basic content of this paper:
Implementation of the tall and skinny QR (TSQR) factorization in
the MapReduce framework
Background:
Demmel et al. derived a communication-avoiding version of the QR
factorization (CAQR) that trades flops for messages and is well
suited to MapReduce, where computationally intensive processes
operate locally on subsets of the data.
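The TSQR idea can be sketched in plain Python: each row block is factored locally, the small R factors are stacked, and a final factorization of the stack yields the R of the whole matrix. This is not the authors' Hadoop implementation. Modified Gram-Schmidt is swapped in for Householder QR purely for brevity, and the sketch assumes every block has at least as many rows as columns and full column rank. The function names are hypothetical.

```python
import math


def r_factor(rows):
    """Upper-triangular R of a QR factorization via modified Gram-Schmidt.

    Assumes at least as many rows as columns and full column rank.
    """
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]   # column-major copy
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / R[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - R[j][k] * a for a, b in zip(q, cols[k])]
    return R


def tsqr_r(rows, block_size):
    """R factor of a tall matrix, computed from independently factored blocks."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    local_rs = [r_factor(block) for block in blocks]       # "map": one R per block
    stacked = [row for r in local_rs for row in r]         # gather at one reducer
    return r_factor(stacked)                               # "reduce": R of the stack


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
     [7.0, 8.0], [9.0, 10.0], [11.0, 13.0]]
R_direct = r_factor(A)
R_tsqr = tsqr_r(A, block_size=3)
```

Only the n-by-n factors travel between the two stages, so the data moved is proportional to the number of blocks times n squared rather than to the number of rows. Both routines return an R with a positive diagonal, so the two results agree up to rounding.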
The Implementation
1. multi-Mapper-single-Reducer
2. 2 iterations of Map-Reduce
It seems they don’t know our work... I think we can do better.
Rapid Parallel Genome Indexing with MapReduce
Authors: Rohith K. Menon et al.
@Department of Computer Science, Stony Brook University
Basic content of this paper:
A novel parallel algorithm for constructing the suffix array and the
Burrows-Wheeler Transform (BWT) of a sequence leveraging the
unique features of the MapReduce parallel programming model.
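For readers unfamiliar with the two structures, the following is a naive sequential sketch of what is being computed. It is not the paper's parallel algorithm: it simply sorts all suffixes, which is far too slow for genome-scale input. The function names and the `$` sentinel are illustrative choices.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform: the character preceding each sorted suffix."""
    s = text + sentinel            # the sentinel sorts before the letters used here
    sa = suffix_array(s)
    # For position 0, s[-1] is the sentinel, i.e. the cyclic predecessor.
    return "".join(s[i - 1] for i in sa)


sa = suffix_array("banana$")
transformed = bwt("banana")
```

Here `sa` is `[6, 5, 3, 1, 0, 4, 2]` and `transformed` is `"annb$aa"`; the BWT is read directly off the suffix array, which is why the two are usually built together.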
Full-Text Indexing for Optimizing Selection Operations in
Large-Scale Data Analytics
Authors: Jimmy Lin et al.
@Twitter
Basic content of this paper:
This paper addresses one inefficient aspect of Hadoop-based
processing: the need to perform a full scan of the entire dataset,
even in cases where it is clearly not necessary to do so. It is
possible to leverage a full-text index to optimize selection
operations on text fields within records.
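The contrast between a full scan and an index-driven selection can be sketched in a few lines. This is an in-memory illustration under assumed data, not the system described in the paper; the record layout and the functions `build_index`, `select_scan`, and `select_indexed` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records with a free-text field.
records = [
    {"id": 0, "text": "hadoop cluster monitoring"},
    {"id": 1, "text": "genome indexing with mapreduce"},
    {"id": 2, "text": "mapreduce on a hadoop cluster"},
]


def build_index(records):
    """Inverted index: term -> set of positions of records containing it."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        for term in rec["text"].split():
            index[term].add(pos)
    return index


def select_scan(records, term):
    """Baseline: inspect every record."""
    return [rec["id"] for rec in records if term in rec["text"].split()]


def select_indexed(records, index, term):
    """Look up the term and touch only the matching records."""
    return [records[pos]["id"] for pos in sorted(index.get(term, ()))]


index = build_index(records)
by_scan = select_scan(records, "hadoop")
by_index = select_indexed(records, index, "hadoop")
```

Both selections return the same record ids, but the indexed version reads only the records that match, which is where the saving comes from when the selection is highly selective.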
MapReducing a Genomic Sequencing Workflow
Authors: Luca Pireddu et al.
@CRS4
Main content
A MapReduce workflow that harnesses Hadoop to post-process the
data produced by DNA sequencing machines.
Exploring MapReduce Efficiency with Highly-Distributed
Data
Authors: Michael Cardosa et al.
@University of Minnesota
Basic content of this paper:
Propose recommendations for alternative (and even hierarchical)
distributed MapReduce setup configurations, depending on the
workload and data set.
Parallelizing large-scale data processing applications with
data skew: a case study in product-offer matching
Authors: Ekaterina Gonina et al.
@UC Berkeley
A case study of parallelizing an example large-scale application
(offer matching, a core part of online shopping) on an example
MapReduce-based distributed computation engine (DryadLINQ).
Tyson Condie et al.: MapReduce Online, NSDI ’10
James Demmel et al.: Communication-avoiding parallel and
sequential QR factorizations, EECS-2008-74
The end
Questions?
