Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Presented by:
Jithin Raveendran
S7 IT
Roll No: 31

Guided by:
Prof. Remesh Babu
1
BigData???
 Buzz-word "big data": large-scale distributed data-processing applications that operate on exceptionally large amounts of data.
 Roughly 2.5 exabytes of data are created every day; so much that 90% of the data in the world today has been created in the last two years alone.
2
3
Case study with Hadoop MapReduce
Hadoop:
 An open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.
 Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
 Inspired by
 Google MapReduce
 GFS (Google File System)
HDFS
Map/Reduce
4
Apache Hadoop has two pillars
• HDFS
• Self-healing
• High-bandwidth clustered storage
• MapReduce
• Retrieval system
• The Mapper function tells the cluster which data points we want to retrieve
• The Reducer function then takes all the data and aggregates it
5
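The division of labor between the two functions can be sketched with a minimal word count in plain Python; this illustrates the map/reduce idea only and is not the Hadoop API:

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word, 1

def reducer(pairs):
    """Aggregate all values emitted for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "big data"]
pairs = [p for line in lines for p in mapper(line)]
print(reducer(pairs))  # {'big': 3, 'data': 2, 'cluster': 1}
```

In Hadoop the same two roles run distributed across the cluster, with the framework handling the shuffle of pairs between them.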
HDFS - Architecture 
6
HDFS - Architecture
 Name Node:
 Centerpiece of an HDFS file system.
 Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.
 Responds to requests by returning a list of relevant DataNode servers where the data lives.
 Data Node:
 Stores data in the Hadoop file system.
 A functional file system has more than one DataNode, with data replicated across them.
7
HDFS - Architecture
 Secondary Name Node:
 Acts as a checkpoint for the NameNode.
 Takes snapshots of the NameNode's state for use whenever a backup is needed.
 HDFS Features:
 Rack awareness
 Reliable storage
 High throughput
8
MapReduce Architecture
• Job Client:
• Submits jobs
• Job Tracker:
• Coordinates jobs
• Task Tracker:
• Executes job tasks
9
MapReduce Architecture
1. Clients submit jobs to the Job Tracker
2. The Job Tracker talks to the NameNode
3. The Job Tracker creates an execution plan
4. The Job Tracker submits work to Task Trackers
5. Task Trackers report progress via heartbeats
6. The Job Tracker manages the phases
7. The Job Tracker updates the status
10
Current System:
 MapReduce provides a standardized framework.
 Limitation:
 Inefficient at incremental processing.
11
Proposed System
 Dache: a data-aware cache system for big-data applications using the MapReduce framework.
 Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
12
Related Work
 Google Bigtable - handles incremental processing
 Google Percolator - an incremental processing platform
 RAMCloud - a distributed computing platform that keeps data in RAM
13
Technical challenges to be addressed
 Cache description scheme:
 Data-aware caching requires each data object to be indexed by its content.
 Provide customizable indexing that enables applications to describe their operations and the content of their generated partial results. This is a nontrivial task.
 Cache request and reply protocol:
 The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.
14
Cache Description
 Map phase cache description scheme
 "Cache" refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.
 A piece of cached data is stored in a Distributed File System (DFS).
 The content of a cache item is described by the original data and the operations applied to it.
2-tuple: {Origin, Operation}
Origin: the name of a file in the DFS.
Operation: a linear list of operations performed on the Origin file.
15
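As a sketch, the 2-tuple description might be modeled like this in Python; the field names and operation strings are illustrative, not Dache's actual data structures:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheDescription:
    """2-tuple describing a map-phase cache item: {Origin, Operation}."""
    origin: str            # name of a file in the DFS
    operations: tuple = () # linear list of operations applied to the origin

item = CacheDescription(origin="/dfs/input/log-2013.txt",
                        operations=("split:64MB", "word_count_map"))

def matches(request: CacheDescription, stored: CacheDescription) -> bool:
    """A stored item satisfies a request when both the origin file
    and the applied operation list are identical."""
    return (request.origin == stored.origin
            and request.operations == stored.operations)

print(matches(item, item))  # True
```

A cache item is only reusable when a new job's description matches one already recorded, which is why both fields are part of the index.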
Cache Description
 Reduce phase cache description scheme
 The input for the reduce phase is also a list of key-value pairs, where each value could be a list of values.
 The original input and the applied operations are required.
 The original input is obtained by storing the intermediate results of the map phase in the DFS.
16
Protocol 
Relationship between job types and cache 
organization 
• When processing each file split, the 
cache manager reports the previous 
file splitting scheme used in its 
cache item. 
17
Protocol
Relationship between job types and cache organization
 To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new results to the cache.
 Find the best match among overlapping results [choose 'ab' instead of 'a'].
18
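The best-match rule can be sketched as picking the longest cached prefix that still covers the query; this is an illustrative simplification of the overlap test, not the paper's code:

```python
def best_cache_match(query_prefix, cached_prefixes):
    """Among cache items whose results cover the query (their prefix is a
    prefix of the query), pick the longest one: it leaves the least
    residual filtering work for the new job."""
    candidates = [p for p in cached_prefixes if query_prefix.startswith(p)]
    return max(candidates, key=len, default=None)

# The cache already holds results for words starting with 'a' and 'ab'.
print(best_cache_match("ab", ["a", "ab"]))  # 'ab' is chosen over 'a'
print(best_cache_match("ac", ["a", "ab"]))  # only 'a' covers this query
```

Choosing 'ab' over 'a' means the new job reuses the smaller, more specific result set instead of re-filtering the broader one.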
Protocol
Cache item submission
 Mapper and reducer nodes/processes record cache items into their local storage space.
 A cache item should be put on the same machine as the worker process that generates it.
 A worker node/process contacts the cache manager each time before it begins processing an input data file.
 The worker process receives the tentative description and fetches the cache item.
19
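The submission and lookup exchange can be sketched with a toy in-memory stand-in for the cache manager; all names here are hypothetical, and the real Dache manager runs as a separate server process:

```python
class CacheManager:
    """Toy in-memory stand-in for Dache's cache-manager server."""

    def __init__(self):
        self._items = {}  # description -> location of the cached data

    def submit(self, description, location):
        """Called by a mapper/reducer after it records a cache item
        in its local storage space."""
        self._items[description] = location

    def lookup(self, description):
        """Called by a worker before it starts processing an input file;
        returns a tentative cache location, or None on a miss."""
        return self._items.get(description)

mgr = CacheManager()
mgr.submit(("input_a.txt", "word_count_map"), "/dfs/cache/item-001")
print(mgr.lookup(("input_a.txt", "word_count_map")))  # /dfs/cache/item-001
print(mgr.lookup(("input_b.txt", "word_count_map")))  # None (cache miss)
```

On a hit, the worker fetches the item from the returned location instead of recomputing it; on a miss, it processes the input and submits the new item.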
Lifetime management of cache items
 Cache manager - determines how long a cache item can be kept in the DFS.
 Two types of policies for determining the lifetime of a cache item:
1. Fixed storage quota
• Least Recently Used (LRU) eviction is employed
2. Optimal utility
• Estimates the saved computation time, ts, gained by caching an item for a given amount of time, ta.
• ts and ta are used to derive the monetary gain and cost.
20
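Under the fixed-storage-quota policy, LRU eviction can be sketched as follows; one assumption here is that the quota is counted in items, whereas a real deployment would count bytes of DFS storage:

```python
from collections import OrderedDict

class FixedQuotaCache:
    """LRU eviction under a fixed storage quota (quota in items here;
    a real cache manager would budget bytes of DFS storage)."""

    def __init__(self, quota):
        self.quota = quota
        self._items = OrderedDict()

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        while len(self._items) > self.quota:
            self._items.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)     # a hit refreshes recency
            return self._items[key]
        return None

c = FixedQuotaCache(quota=2)
c.put("a", 1); c.put("b", 2)
c.get("a")         # 'a' becomes most recently used
c.put("c", 3)      # quota exceeded: 'b' is evicted
print(c.get("b"))  # None
print(c.get("a"))  # 1
```

The optimal-utility policy would replace the eviction rule with a comparison of estimated saved computation time against the storage cost over ta.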
Cache request and reply
Map cache:
 Cache requests must be sent out before the file-splitting phase.
 The job tracker issues cache requests to the cache manager.
 The cache manager replies with a list of cache descriptions.
Reduce cache:
• First, compare the requested cache item with the cached items in the cache manager's database.
• The cache manager identifies the overlaps between the original input files of the requested cache and the stored cache.
• A linear scan is used here.
21
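The linear scan over the cache manager's database can be sketched like this; it is a simplification that compares only the input-file sets and ignores the operations part of the description:

```python
def find_overlaps(requested_inputs, stored_items):
    """Linearly scan the cache manager's database, returning each stored
    cache item together with the input files it shares with the request."""
    requested = set(requested_inputs)
    overlaps = []
    for item_id, item_inputs in stored_items:      # one pass: linear scan
        shared = requested & set(item_inputs)
        if shared:
            overlaps.append((item_id, sorted(shared)))
    return overlaps

stored = [("cache-1", ["part-00", "part-01"]),
          ("cache-2", ["part-02"]),
          ("cache-3", ["part-03"])]
print(find_overlaps(["part-01", "part-02"], stored))
# [('cache-1', ['part-01']), ('cache-2', ['part-02'])]
```

Each overlapping item lets the reduce phase reuse previously aggregated results for the shared inputs and recompute only the remainder.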
Performance Evaluation 
Implementation 
 Extend Hadoop to implement Dache by changing the components that 
are open to application developers. 
 The cache manager is implemented as an independent server. 
22
Experiment settings
 Hadoop is run in pseudo-distributed mode on a server that has
 an 8-core CPU running at 3 GHz,
 16 GB of memory,
 a SATA disk.
 Two applications benchmark the speedup of Dache over Hadoop:
 word-count and tera-sort.
23
Results 
24
Results 
25
Results 
26
Conclusion
 Dache requires minimal change to the original MapReduce programming model.
 Application code requires only slight changes in order to utilize Dache.
 Dache is implemented in Hadoop by extending the relevant components.
 Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.
 Minimal execution time and CPU utilization.
27
Future Work
 This scheme consumes a large amount of cache storage.
 A better cache management system will be needed.
28
29
Editor's Notes

1. As you can see... can you imagine how big it would become if we took the statistics of just one day?
2. It works on a master-slave model. The NameNode holds the details about which data belong to which DataNodes, and how many copies exist.
3. Whenever we put data on HDFS, it is broken up into several pieces. Rack awareness ensures that multiple copies of the data reside on DataNodes in multiple racks.
4. Incremental processing refers to applications that incrementally grow the input data and continuously apply computations to the input in order to generate output.