Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Presented by:
Jithin Raveendran
S7 IT
Roll No: 31

Guided by:
Prof. Remesh Babu
1
BigData???
 Buzz-word "big data": large-scale distributed data-processing applications that operate on exceptionally large amounts of data.
 Roughly 2.5 exabytes of data are created every day; so much that 90% of the data in the world today has been created in the last two years alone.
2
3
Case study with Hadoop MapReduce
Hadoop:
 An open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.
 Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
 Inspired by
 Google MapReduce
 GFS (Google File System)
HDFS
Map/Reduce
4
Apache Hadoop has two pillars
• HDFS
• Self-healing
• High-bandwidth clustered storage
• MapReduce
• Retrieval system
• The Mapper function tells the cluster which data points we want to retrieve
• The Reducer function then takes all the data and aggregates it
5
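The division of labor between the two functions can be sketched with a minimal word count in plain Python; this illustrates the map/reduce idea only and is not the Hadoop API:

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word, 1

def reducer(pairs):
    """Aggregate all values emitted for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "big data"]
pairs = [p for line in lines for p in mapper(line)]
print(reducer(pairs))  # {'big': 3, 'data': 2, 'cluster': 1}
```

In Hadoop the same two roles run distributed across the cluster, with the framework handling the shuffle of pairs between them.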
HDFS - Architecture 
6
HDFS - Architecture
 Name Node:
 Centerpiece of an HDFS file system.
 Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.
 Responds to requests by returning a list of relevant DataNode servers where the data lives.
 Data Node:
 Stores data in the Hadoop file system.
 A functional file system has more than one DataNode, with data replicated across them.
7
HDFS - Architecture
 Secondary Name Node:
 Acts as a checkpoint for the NameNode.
 Takes snapshots of the NameNode's state for use whenever a backup is needed.
 HDFS Features:
 Rack awareness
 Reliable storage
 High throughput
8
MapReduce Architecture
• Job Client:
• Submits jobs
• Job Tracker:
• Coordinates jobs
• Task Tracker:
• Executes job tasks
9
MapReduce Architecture
1. Clients submit jobs to the Job Tracker
2. The Job Tracker talks to the NameNode
3. The Job Tracker creates an execution plan
4. The Job Tracker submits work to Task Trackers
5. Task Trackers report progress via heartbeats
6. The Job Tracker manages the phases
7. The Job Tracker updates the status
10
Current System:
 MapReduce provides a standardized framework.
 Limitation:
 Inefficient at incremental processing.
11
Proposed System
 Dache: a data-aware cache system for big-data applications using the MapReduce framework.
 Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
12
Related Work
 Google Bigtable - handles incremental processing
 Google Percolator - an incremental processing platform
 RAMCloud - a distributed computing platform that keeps data in RAM
13
Technical challenges to be addressed
 Cache description scheme:
 Data-aware caching requires each data object to be indexed by its content.
 Provide customizable indexing that enables applications to describe their operations and the content of their generated partial results. This is a nontrivial task.
 Cache request and reply protocol:
 The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.
14
Cache Description
 Map phase cache description scheme
 "Cache" refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.
 A piece of cached data is stored in a Distributed File System (DFS).
 The content of a cache item is described by the original data and the operations applied to it.
2-tuple: {Origin, Operation}
Origin: the name of a file in the DFS.
Operation: a linear list of operations performed on the Origin file.
15
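As a sketch, the 2-tuple description might be modeled like this in Python; the field names and operation strings are illustrative, not Dache's actual data structures:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheDescription:
    """2-tuple describing a map-phase cache item: {Origin, Operation}."""
    origin: str            # name of a file in the DFS
    operations: tuple = () # linear list of operations applied to the origin

item = CacheDescription(origin="/dfs/input/log-2013.txt",
                        operations=("split:64MB", "word_count_map"))

def matches(request: CacheDescription, stored: CacheDescription) -> bool:
    """A stored item satisfies a request when both the origin file
    and the applied operation list are identical."""
    return (request.origin == stored.origin
            and request.operations == stored.operations)

print(matches(item, item))  # True
```

A cache item is only reusable when a new job's description matches one already recorded, which is why both fields are part of the index.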
Cache Description
 Reduce phase cache description scheme
 The input for the reduce phase is also a list of key-value pairs, where each value could be a list of values.
 The original input and the applied operations are required.
 The original input is obtained by storing the intermediate results of the map phase in the DFS.
16
Protocol 
Relationship between job types and cache 
organization 
• When processing each file split, the 
cache manager reports the previous 
file splitting scheme used in its 
cache item. 
17
Protocol
Relationship between job types and cache organization
 To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new results to the cache.
 Find the best match among overlapping results [choose 'ab' instead of 'a'].
18
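The best-match rule can be sketched as picking the longest cached prefix that still covers the query; this is an illustrative simplification of the overlap test, not the paper's code:

```python
def best_cache_match(query_prefix, cached_prefixes):
    """Among cache items whose results cover the query (their prefix is a
    prefix of the query), pick the longest one: it leaves the least
    residual filtering work for the new job."""
    candidates = [p for p in cached_prefixes if query_prefix.startswith(p)]
    return max(candidates, key=len, default=None)

# The cache already holds results for words starting with 'a' and 'ab'.
print(best_cache_match("ab", ["a", "ab"]))  # 'ab' is chosen over 'a'
print(best_cache_match("ac", ["a", "ab"]))  # only 'a' covers this query
```

Choosing 'ab' over 'a' means the new job reuses the smaller, more specific result set instead of re-filtering the broader one.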
Protocol
Cache item submission
 Mapper and reducer nodes/processes record cache items into their local storage space.
 A cache item should be put on the same machine as the worker process that generates it.
 A worker node/process contacts the cache manager each time before it begins processing an input data file.
 The worker process receives the tentative description and fetches the cache item.
19
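The submission and lookup exchange can be sketched with a toy in-memory stand-in for the cache manager; all names here are hypothetical, and the real Dache manager runs as a separate server process:

```python
class CacheManager:
    """Toy in-memory stand-in for Dache's cache-manager server."""

    def __init__(self):
        self._items = {}  # description -> location of the cached data

    def submit(self, description, location):
        """Called by a mapper/reducer after it records a cache item
        in its local storage space."""
        self._items[description] = location

    def lookup(self, description):
        """Called by a worker before it starts processing an input file;
        returns a tentative cache location, or None on a miss."""
        return self._items.get(description)

mgr = CacheManager()
mgr.submit(("input_a.txt", "word_count_map"), "/dfs/cache/item-001")
print(mgr.lookup(("input_a.txt", "word_count_map")))  # /dfs/cache/item-001
print(mgr.lookup(("input_b.txt", "word_count_map")))  # None (cache miss)
```

On a hit, the worker fetches the item from the returned location instead of recomputing it; on a miss, it processes the input and submits the new item.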
Lifetime management of cache items
 Cache manager - determines how long a cache item can be kept in the DFS.
 Two types of policies for determining the lifetime of a cache item:
1. Fixed storage quota
• Least Recently Used (LRU) eviction is employed
2. Optimal utility
• Estimates the saved computation time, ts, gained by caching an item for a given amount of time, ta.
• ts and ta are used to derive the monetary gain and cost.
20
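Under the fixed-storage-quota policy, LRU eviction can be sketched as follows; one assumption here is that the quota is counted in items, whereas a real deployment would count bytes of DFS storage:

```python
from collections import OrderedDict

class FixedQuotaCache:
    """LRU eviction under a fixed storage quota (quota in items here;
    a real cache manager would budget bytes of DFS storage)."""

    def __init__(self, quota):
        self.quota = quota
        self._items = OrderedDict()

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        while len(self._items) > self.quota:
            self._items.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)     # a hit refreshes recency
            return self._items[key]
        return None

c = FixedQuotaCache(quota=2)
c.put("a", 1); c.put("b", 2)
c.get("a")         # 'a' becomes most recently used
c.put("c", 3)      # quota exceeded: 'b' is evicted
print(c.get("b"))  # None
print(c.get("a"))  # 1
```

The optimal-utility policy would replace the eviction rule with a comparison of estimated saved computation time against the storage cost over ta.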
Cache request and reply
Map cache:
 Cache requests must be sent out before the file-splitting phase.
 The job tracker issues cache requests to the cache manager.
 The cache manager replies with a list of cache descriptions.
Reduce cache:
• First, compare the requested cache item with the cached items in the cache manager's database.
• The cache manager identifies the overlaps between the original input files of the requested cache and the stored cache.
• A linear scan is used here.
21
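The linear scan over the cache manager's database can be sketched like this; it is a simplification that compares only the input-file sets and ignores the operations part of the description:

```python
def find_overlaps(requested_inputs, stored_items):
    """Linearly scan the cache manager's database, returning each stored
    cache item together with the input files it shares with the request."""
    requested = set(requested_inputs)
    overlaps = []
    for item_id, item_inputs in stored_items:      # one pass: linear scan
        shared = requested & set(item_inputs)
        if shared:
            overlaps.append((item_id, sorted(shared)))
    return overlaps

stored = [("cache-1", ["part-00", "part-01"]),
          ("cache-2", ["part-02"]),
          ("cache-3", ["part-03"])]
print(find_overlaps(["part-01", "part-02"], stored))
# [('cache-1', ['part-01']), ('cache-2', ['part-02'])]
```

Each overlapping item lets the reduce phase reuse previously aggregated results for the shared inputs and recompute only the remainder.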
Performance Evaluation 
Implementation 
 Extend Hadoop to implement Dache by changing the components that 
are open to application developers. 
 The cache manager is implemented as an independent server. 
22
Experiment settings
 Hadoop is run in pseudo-distributed mode on a server that has
 an 8-core CPU running at 3 GHz,
 16 GB of memory,
 a SATA disk.
 Two applications benchmark the speedup of Dache over Hadoop:
 word-count and tera-sort.
23
Results 
24
Results 
25
Results 
26
Conclusion
 Dache requires minimal change to the original MapReduce programming model.
 Application code requires only slight changes in order to utilize Dache.
 Dache is implemented in Hadoop by extending the relevant components.
 Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.
 Minimal execution time and CPU utilization.
27
Future Work
 This scheme consumes a large amount of cache storage.
 A better cache management system will be needed.
28
29
Editor's Notes

1. As you can see... can you imagine how big it would become if we took the statistics of just one day?
2. It works on a master-slave model. The NameNode holds the details about which data belong to which DataNodes, and how many copies exist.
3. Whenever we put data on HDFS, it is broken up into several pieces. Rack awareness ensures that multiple copies of the data reside on DataNodes in multiple racks.
4. Incremental processing refers to applications that incrementally grow the input data and continuously apply computations to the input in order to generate output.