Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework


Dache - a data aware cache system for big-data applications using the MapReduce framework.

Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.

Published in: Data & Analytics

  1. Presented by: Jithin Raveendran, S7 IT, Roll No. 31. Guided by: Prof. Remesh Babu.
  2. Big Data?
     • "Big data" is a buzzword for large-scale distributed data processing applications that operate on exceptionally large amounts of data.
     • About 2.5 exabytes (2.5 quintillion bytes) of data are created per day, so much that 90% of the data in the world today has been created in the last two years alone.
  4. Case study with Hadoop MapReduce
     Hadoop:
     • Open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.
     • Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
     • Inspired by Google MapReduce (Hadoop counterpart: Map/Reduce) and GFS, the Google File System (Hadoop counterpart: HDFS).
  5. Apache Hadoop has two pillars
     • HDFS
       • Self-healing
       • High-bandwidth clustered storage
     • MapReduce
       • Retrieval system
       • The mapper function tells the cluster which data points we want to retrieve
       • The reducer function then takes all the data and aggregates it
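
The mapper/reducer split described on this slide can be illustrated with a toy word count. This is a minimal single-process sketch in Python, not Hadoop code: the names `mapper`, `reducer`, and `run_job` are hypothetical, and the grouping step only imitates what a real cluster does between the two phases.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Aggregate all counts emitted for a single word."""
    return word, sum(counts)

def run_job(lines):
    """Toy driver: map every line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(word, counts) for word, counts in groups.items())

result = run_job(["big data big cache", "data aware cache"])
```

Here `result` maps each word to its total count across both input lines.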
  6. HDFS Architecture
  7. HDFS Architecture
     NameNode:
     • Centerpiece of an HDFS file system.
     • Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file.
     • Responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
     DataNode:
     • Stores data in the Hadoop file system.
     • A functional file system has more than one DataNode, with data replicated across them.
  8. HDFS Architecture
     Secondary NameNode:
     • Acts as a checkpoint for the NameNode.
     • Takes snapshots of the NameNode, which are used whenever a backup is needed.
     HDFS features:
     • Rack awareness
     • Reliable storage
     • High throughput
  9. MapReduce Architecture
     • Job Client: submits jobs
     • Job Tracker: coordinates jobs
     • Task Tracker: executes job tasks
  10. MapReduce Architecture
      1. Clients submit jobs to the Job Tracker
      2. The Job Tracker talks to the NameNode
      3. The Job Tracker creates an execution plan
      4. The Job Tracker submits work to the Task Trackers
      5. Task Trackers report progress via heartbeats
      6. The Job Tracker manages the phases
      7. The Job Tracker updates the status
  11. Current System
      • MapReduce is used to provide a standardized framework.
      • Limitation: inefficiency in incremental processing.
  12. Proposed System
      • Dache: a data-aware cache system for big-data applications using the MapReduce framework.
      • Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
  13. Related Work
      • Google Bigtable: handles incremental processing
      • Google Percolator: incremental processing platform
      • RAMCloud: distributed computing platform that keeps data in RAM
  14. Technical challenges that need to be addressed
      Cache description scheme:
      • Data-aware caching requires each data object to be indexed by its content.
      • The scheme must provide customizable indexing that enables applications to describe their operations and the content of their generated partial results. This is a nontrivial task.
      Cache request and reply protocol:
      • The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.
  15. Cache Description: map phase cache description scheme
      • Cache refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.
      • A piece of cached data is stored in a Distributed File System (DFS).
      • The content of a cache item is described by the original data and the operations applied, as a 2-tuple: {Origin, Operation}
        Origin: name of a file in the DFS.
        Operation: linear list of available operations performed on the Origin file.
  16. Cache Description: reduce phase cache description scheme
      • The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values.
      • The original input and the applied operations are required to describe a cache item.
      • The original input is obtained by storing the intermediate results of the map phase in the DFS.
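
The {Origin, Operation} description can be pictured as a small immutable record. The sketch below is one possible Python representation, assuming operations are identified by plain strings; the class name `CacheDescription`, the `then` helper, and the file path are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CacheDescription:
    """Describes a cache item by its source file and the operations applied to it."""
    origin: str                   # name of a file in the DFS
    operations: Tuple[str, ...]   # linear list of operations performed on origin

    def then(self, operation):
        """Return the description of the result of applying one more operation."""
        return CacheDescription(self.origin, self.operations + (operation,))

# Hypothetical file name and operation labels, for illustration only.
raw = CacheDescription("/data/logs.txt", ())
counted = raw.then("word_count")
sorted_counts = counted.then("sort")
```

Because the record is frozen, two descriptions with the same origin and the same operation list compare equal and hash identically, so a description can serve as a lookup key for the cached data it describes.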
  17. Protocol: relationship between job types and cache organization
      • When processing each file split, the cache manager reports the previous file splitting scheme used in its cache item.
  18. Protocol: relationship between job types and cache organization
      • To find words starting with ‘ab’, we use the cached results for words starting with ‘a’, and also add the new result to the cache.
      • Find the best match among overlapping results (for example, choose ‘ab’ instead of ‘a’).
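
The prefix example on this slide can be sketched as follows. This is an illustrative Python sketch, assuming the cache is a dictionary from a prefix to the list of words found for it; `best_match` and `words_with_prefix` are hypothetical names, and "best match" is interpreted here as the longest cached prefix that still covers the request.

```python
def best_match(requested_prefix, cached_prefixes):
    """Return the longest cached prefix that covers requested_prefix, or None."""
    candidates = [p for p in cached_prefixes if requested_prefix.startswith(p)]
    return max(candidates, key=len, default=None)

def words_with_prefix(prefix, cache, all_words):
    """Answer a prefix query from the narrowest usable cached result."""
    source_key = best_match(prefix, cache)
    # Fall back to the full input only when no cached result covers the query.
    source = cache[source_key] if source_key is not None else all_words
    result = [w for w in source if w.startswith(prefix)]
    cache[prefix] = result  # the new, narrower result is cached as well
    return result

all_words = ["able", "about", "apple", "bat", "cab"]
cache = {"a": ["able", "about", "apple"]}
ab_words = words_with_prefix("ab", cache, all_words)
```

After this call the cache holds results for both ‘a’ and ‘ab’, so a later query for ‘abo’ would be answered from the smaller ‘ab’ result.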
  19. Protocol: cache item submission
      • Mapper and reducer nodes/processes record cache items in their local storage space.
      • A cache item should be put on the same machine as the worker process that generates it.
      • A worker node/process contacts the cache manager each time before it begins processing an input data file.
      • The worker process receives the tentative description and fetches the cache item.
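
The submission flow above (ask the cache manager first, compute and register only on a miss) could look roughly like this. It is a toy in-memory sketch in Python, not the Dache implementation: `CacheManager`, `process_file`, the dictionary standing in for local storage, and the location naming scheme are all assumptions made for illustration.

```python
class CacheManager:
    """Toy in-memory stand-in for the cache manager server."""

    def __init__(self):
        self._items = {}  # (origin, operation) -> location of the cache item

    def request(self, origin, operation):
        """Return the location of a matching cache item, or None on a miss."""
        return self._items.get((origin, operation))

    def submit(self, origin, operation, location):
        """Register a cache item that a worker has stored locally."""
        self._items[(origin, operation)] = location

def process_file(manager, local_store, origin, operation, compute):
    """Contact the cache manager before processing; compute only on a miss."""
    location = manager.request(origin, operation)
    if location is not None:
        return local_store[location], True   # cache hit: fetch the stored item
    result = compute(origin)
    location = f"cache/{operation}/{origin}"  # hypothetical naming scheme
    local_store[location] = result            # kept next to the worker that made it
    manager.submit(origin, operation, location)
    return result, False

manager, store, calls = CacheManager(), {}, []

def compute(origin):
    calls.append(origin)
    return len(origin)

first = process_file(manager, store, "input.txt", "length", compute)
second = process_file(manager, store, "input.txt", "length", compute)
```

The second call is served from the cache, so `compute` runs only once for the same file and operation.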
  20. Lifetime management of cache items
      • The cache manager determines how long a cache item can be kept in the DFS.
      • Two types of policies determine the lifetime of a cache item:
        1. Fixed storage quota
           • Least Recently Used (LRU) is employed.
        2. Optimal utility
           • Estimates the saved computation time, ts, by caching a cache item for a given amount of time, ta.
           • ts and ta are used to derive the monetary gain and cost.
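
Only the fixed storage quota policy is sketched here, since the slides do not give the formulas behind the optimal utility policy. The Python sketch below assumes each cache item has a known size in bytes; the class name `LRUCacheStore` and its methods are hypothetical.

```python
from collections import OrderedDict

class LRUCacheStore:
    """Tracks cache items under a fixed storage quota with LRU eviction."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> size, least recently used first

    def touch(self, key):
        """Mark an item as recently used; return False if it is not stored."""
        if key not in self._items:
            return False
        self._items.move_to_end(key)
        return True

    def add(self, key, size):
        """Store an item, evicting least recently used items to fit the quota.

        Returns the evicted keys. An item larger than the whole quota is still
        stored after everything else is evicted; a real system would reject it.
        """
        evicted = []
        if key in self._items:
            self.used_bytes -= self._items.pop(key)
        while self._items and self.used_bytes + size > self.quota_bytes:
            old_key, old_size = self._items.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_key)
        self._items[key] = size
        self.used_bytes += size
        return evicted

store = LRUCacheStore(quota_bytes=100)
store.add("a", 40)
store.add("b", 40)
store.touch("a")              # "b" is now the least recently used item
evicted = store.add("c", 40)  # does not fit, so "b" is evicted
```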
  21. Cache request and reply
      Map cache:
      • Cache requests must be sent out before the file splitting phase.
      • The job tracker issues cache requests to the cache manager.
      • The cache manager replies with a list of cache descriptions.
      Reduce cache:
      • First, compare the requested cache item with the cached items in the cache manager's database.
      • The cache manager identifies the overlaps between the original input files of the requested cache and those of the stored cache.
      • A linear scan is used here.
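
The linear scan for the reduce cache might be sketched as below. This Python sketch assumes each stored cache item records the list of original input files it was built from; the function name `find_overlaps` and the item and file identifiers are hypothetical.

```python
def find_overlaps(requested_files, stored_items):
    """Scan every stored item once and report the input files it shares
    with the request. stored_items is a list of (item_id, input_files)."""
    requested = set(requested_files)
    overlaps = []
    for item_id, files in stored_items:
        common = requested & set(files)
        if common:
            overlaps.append((item_id, sorted(common)))
    return overlaps

stored = [
    ("r1", ["f1", "f2"]),
    ("r2", ["f3"]),
    ("r3", ["f2", "f4"]),
]
matches = find_overlaps(["f2", "f4", "f5"], stored)
```

The cost grows with the number of stored items, which is the trade-off of a linear scan: simple to implement, but slower as the cache manager's database grows.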
  22. Performance Evaluation: implementation
      • Hadoop is extended to implement Dache by changing the components that are open to application developers.
      • The cache manager is implemented as an independent server.
  23. Experiment settings
      • Hadoop is run in pseudo-distributed mode on a server that has:
        • an 8-core CPU, each core running at 3 GHz
        • 16 GB memory
        • a SATA disk
      • Two applications are used to benchmark the speedup of Dache over Hadoop: word-count and tera-sort.
  24. Results (slides 24-26)
  27. Conclusion
      • Dache requires minimum change to the original MapReduce programming model.
      • Application code only requires slight changes in order to utilize Dache.
      • Dache is implemented in Hadoop by extending relevant components.
      • Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.
      • Execution time and CPU utilization are reduced.
  28. Future Work
      • This scheme uses a large amount of cache space.
      • A better cache management system will be needed.