An Introduction of Recent Research on MapReduce (2011)

Outline
MAPREDUCE11
Talks in HADOOP WORLD 2010
Other Interesting Papers
Paper Introduction
Other Papers
An Introduction of Recent Research on MapReduce
Yu Liu
The Graduate University for Advanced Studies
July 8th, 2011

Outline
1 Papers in MAPREDUCE11
2 Talks in HADOOP WORLD 2010
3 Other Interesting Papers

MAPREDUCE11
Sessions
1 Environments and Extensions to the MapReduce Programming Model
2 MapReduce Applications
3 Performance and Feature Improvements of MapReduce
4 Keynote by Greg Malewicz, Google Research: Beyond MapReduce

Paper List (session number in parentheses)
1 Otus: Resource Attribution and Metrics Correlation in Data-Intensive Clusters (1)
2 Phoenix++: Modular MapReduce for Shared-Memory Systems (1)
3 Static Type Checking of Hadoop MapReduce Programs (1)
4 Tall and Skinny QR factorizations in MapReduce architectures (2)
5 Rapid Parallel Genome Indexing with MapReduce (2)
6 Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics (2)
7 MapReducing a Genomic Sequencing Workflow (2)
8 Exploring MapReduce Efficiency with Highly-Distributed Data (3)
9 Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching (3)

The home page

Tyson Condie et al.: MapReduce Online, NSDI '10
James Demmel et al.: Communication-avoiding parallel and
sequential QR factorizations, EECS-2008-74

Otus: Resource Attribution in Data-Intensive Clusters
Authors: Kai Ren, Julio López, Garth Gibson
@Carnegie Mellon University
Basic content of this paper:
An approach for facilitating performance analysis of distributed
data-intensive applications
Background:
Understanding the resource requirements of frameworks such as
Hadoop and Dryad, and the performance characteristics of their
applications, is inherently difficult because of the distributed
nature and scale of the computing platform.

Problems:
Traditional cluster monitoring tools do not provide the information
needed to answer fundamental questions about application
performance in data-intensive environments.
Solutions:
Attribute resource utilization to the components of interest in
different layers of the cluster software stack. The collected data
is correlated to infer the resource utilization of each service
component and job process in the cluster.
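
A minimal sketch of the attribution idea, not Otus itself: per-process
samples are joined with a process-to-owner mapping and summed per
component. The sample layout, the owner names, and the function
attribute are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-process samples: (timestamp, node, pid, cpu_seconds, rss_mb)
samples = [
    (0, "node1", 101, 1.5, 200),
    (0, "node1", 102, 0.5, 80),
    (0, "node2", 201, 2.0, 300),
    (10, "node1", 101, 2.5, 220),
]

# Hypothetical mapping from (node, pid) to the logical owner of the process
owners = {
    ("node1", 101): "job_A/map_0",
    ("node1", 102): "datanode",
    ("node2", 201): "job_A/reduce_0",
}


def attribute(samples, owners):
    """Sum CPU time and track peak memory per job or service component."""
    usage = defaultdict(lambda: {"cpu_seconds": 0.0, "peak_rss_mb": 0})
    for _ts, node, pid, cpu, rss in samples:
        owner = owners.get((node, pid), "unattributed")
        component = owner.split("/")[0]  # "job_A/map_0" -> "job_A"
        entry = usage[component]
        entry["cpu_seconds"] += cpu
        entry["peak_rss_mb"] = max(entry["peak_rss_mb"], rss)
    return dict(usage)


usage = attribute(samples, owners)
```

Here usage groups the two job_A tasks on different nodes under one
key, which is the kind of per-job view a node-level monitor alone
does not give.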

Phoenix++: Modular MapReduce for Shared-Memory
Systems
The Phoenix home page
Authors: Justin Talbot, Richard M. Yoo, Christos Kozyrakis
@Computer Systems Laboratory, Stanford University
Phoenix is a shared-memory implementation of Google’s
MapReduce. Phoenix++ is a new implementation which, according to
this paper, achieves a 4.7-fold performance improvement and better
scalability.

Problems:
Performance issue of Phoenix: it adopts a static MapReduce
pipeline similar to cluster-based implementations.
Inefficient Key-Value Storage
Ineffective Combiner
Exposed Task Chunking
Solutions:
Abstractions for intermediate data: Containers
More effective combiner implementation: Combiner Objects
Hide the task chunking granularity
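
A rough sketch of the container and combiner-object idea. Phoenix++
is a C++ library; this is only a Python illustration, and the class
names SumCombiner and HashContainer are assumptions, not the
Phoenix++ API. Each emitted value is folded into a per-key combiner
right away instead of being buffered in a list of values.

```python
from collections import defaultdict


class SumCombiner:
    """Folds each emitted value into a running total instead of storing it."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


class HashContainer:
    """Intermediate key-value storage: one combiner object per key."""

    def __init__(self, combiner_factory):
        self.slots = defaultdict(combiner_factory)

    def emit(self, key, value):
        self.slots[key].add(value)


def word_count(lines):
    container = HashContainer(SumCombiner)
    for line in lines:  # "map" phase: emit (word, 1) for every word
        for word in line.split():
            container.emit(word, 1)
    # Values were already combined on emit, so only the totals are read out.
    return {key: combiner.total for key, combiner in container.slots.items()}


counts = word_count(["a b a", "b a"])
```

Swapping HashContainer for another storage class (for example a
fixed-size array when the key range is known) would leave the map
code unchanged, which is the modularity the slide refers to.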

Other Modularity in Phoenix++
Sorting is optional.
Custom sorting functions can be defined over key-value pairs.
Custom memory allocators can be plugged in.

Static Type Checking of Hadoop MapReduce Programs
Authors: Jens Dörre, Sven Apel, Christian Lengauer
@University of Passau, Germany
Provides a static check for Hadoop programs without asking the
user to write any additional code.
Background:
Higher-order functions in functional languages can be strongly
typed using parametric polymorphism. In Hadoop, however, the
connection between the two phases of a MapReduce computation
is unsafe: there is no static type check of the generic type
parameters involved.

Problems:
In many MapReduce implementations, MapReduce programs are
not type checked at compile time.
Solutions:
A static type checker for Hadoop that relies on the Java 5 compiler.
Users write code in the main function using the combinators.
The Hadoop job configuration can be generated automatically from
the combinator code.

The real code: [code listing from the slide not preserved]

Two important functions:
check: uses a chaining combinator to check the interface
between the mapper and the combiner function, and another
one to check the interface between the result and the reducer
function.
configureTypeSafeJob: generates the Hadoop job
configuration.

Tall and Skinny QR factorizations in MapReduce
architectures
Authors: Paul G. Constantine (1), David F. Gleich (2)
(1) Sandia National Laboratories, Albuquerque; (2) Sandia National
Laboratories, Livermore
Implementation of the tall and skinny QR (TSQR) factorization in
the MapReduce framework
Background:
Demmel et al. derived a communication-avoiding version of the QR
factorization (CAQR) that trades flops for messages and is ideal for
MapReduce, where computationally intensive processes operate
locally on subsets of the data.
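
A small sketch of the TSQR idea in pure Python: each row block gets
its own local QR, and only the small R factors are combined in a
second QR. This is not the authors' implementation. Modified
Gram-Schmidt is swapped in here for brevity, the function names are
hypothetical, and every block is assumed to have full column rank
(at least as many rows as columns).

```python
import math


def qr_r(rows):
    """Upper-triangular R of a QR factorization (modified Gram-Schmidt)."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]  # work column-wise
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            # Project out the already-orthonormalized column i.
            r[i][j] = sum(a * b for a, b in zip(cols[i], cols[j]))
            cols[j] = [b - r[i][j] * a for a, b in zip(cols[i], cols[j])]
        r[j][j] = math.sqrt(sum(b * b for b in cols[j]))
        cols[j] = [b / r[j][j] for b in cols[j]]
    return r


def tsqr_r(matrix, block_rows):
    """R factor of a tall matrix computed block by block."""
    blocks = [matrix[i:i + block_rows] for i in range(0, len(matrix), block_rows)]
    local_rs = [qr_r(block) for block in blocks]    # map: one small QR per block
    stacked = [row for r in local_rs for row in r]  # gather the n-by-n R factors
    return qr_r(stacked)                            # reduce: QR of the stacked Rs


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0], [11.0, 13.0]]
r_direct = qr_r(a)
r_blocked = tsqr_r(a, 3)
```

Up to rounding, r_blocked matches r_direct, while the second stage
only ever sees two 2-by-2 factors instead of the full 6-by-2 matrix.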

The Implementation
1. Multiple mappers, single reducer
2. Two iterations of MapReduce
It seems they don’t know our work... I think we can do better.

Rapid Parallel Genome Indexing with MapReduce
Authors: Rohith K. Menon et al.
@Department of Computer Science, Stony Brook University
A novel parallel algorithm for constructing the suffix array and the
Burrows-Wheeler Transform (BWT) of a sequence, leveraging the
unique features of the MapReduce parallel programming model.
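
For reference, the two structures can be defined with a naive
sequential sketch. This is only the textbook definition, not the
paper's parallel algorithm; the function names and the "$" sentinel
are assumptions made for the example.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform of text, read off its suffix array."""
    text = text + sentinel  # the sentinel is assumed to sort before every symbol
    sa = suffix_array(text)
    # For each sorted suffix, take the character just before it (cyclically).
    return "".join(text[i - 1] for i in sa)


sa = suffix_array("banana$")
transformed = bwt("banana")
```

Sorting whole suffixes like this is far too slow for genome-sized
inputs, which is why a parallel construction is of interest.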

Full-Text Indexing for Optimizing Selection Operations in
Large-Scale Data Analytics
Authors: Jimmy Lin et al.
@Twitter
This paper addresses one inefficient aspect of Hadoop-based
processing: the need to perform a full scan of the entire dataset,
even in cases where it is clearly not necessary to do so. It is
possible to leverage a full-text index to optimize selection
operations on text fields within records.
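
A toy sketch of the idea, assuming a simple in-memory inverted index
(the paper's actual index format is not described on the slide, and
build_index and select are hypothetical names): the index maps each
term to the records containing it, so a selection touches only the
matching records instead of scanning all of them.

```python
from collections import defaultdict


def build_index(records):
    """Map each lower-cased term to the ids of the records that contain it."""
    index = defaultdict(set)
    for record_id, text in enumerate(records):
        for term in set(text.lower().split()):
            index[term].add(record_id)
    return index


def select(records, index, term):
    """Return the records containing term without scanning the others."""
    return [records[i] for i in sorted(index.get(term.lower(), ()))]


records = ["hadoop job failed", "user login ok", "Hadoop job finished"]
index = build_index(records)
hits = select(records, index, "hadoop")
```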

MapReducing a Genomic Sequencing Workflow
Authors: Luca Pireddu et al.
@CRS4
Main content:
A MapReduce workflow that harnesses Hadoop to post-process the
data produced by DNA sequencing machines.

Exploring MapReduce Efficiency with Highly-Distributed
Data
Authors: Michael Cardosa et al.
@University of Minnesota
Proposes recommendations for alternative (and even hierarchical)
distributed MapReduce setup configurations, depending on the
workload and data set.

Parallelizing large-scale data processing applications with
data skew: a case study in product-offer matching
Authors: Ekaterina Gonina et al.
@UC Berkeley
A case study of parallelizing an example large-scale application
(offer matching, a core part of online shopping) on an example
MapReduce-based distributed computation engine (DryadLINQ).

The end
Questions?
