
Map Reduce


Published in: Technology


  1. Why do we need a grid
     - Storage --> a single disk cannot host all the data
     - Computation --> a single CPU cannot provide all the computing needed
     - Parallel jobs --> serial execution is no longer a viable option
  2. What we expect from a framework
     - Distributed storage
     - Job specification platform
     - Job splitting/merging
     - Job execution and monitoring
  3. Basic attributes expected
     - Resource management
       - Disk
       - CPU
       - Memory
       - Network bandwidth
     - Fault tolerance
       - Network failure
       - Machine failure
       - Job/code bugs
       - ...
     - Scalability
  4. Hadoop Core
     - Separate distributed file system based on the Google File System architecture --> HDFS
     - Separate job splitting and merging mechanism
       - MapReduce framework on top of the distributed file system
       - Provides a custom job specification mechanism --> input format, mapper, partitioner, combiner, reducer, output format
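The job stages named above (input --> mapper --> partitioner --> combiner --> reducer) can be sketched in plain Python. This is a minimal word-count illustration of how the stages fit together, not Hadoop's actual API; all names here are invented for the example:

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word, 1

def combiner(word, counts):
    yield word, sum(counts)      # map-side pre-aggregation cuts shuffle traffic

def partitioner(word, num_reducers):
    return hash(word) % num_reducers   # decides which reducer gets this key

def reducer(word, counts):
    yield word, sum(counts)

def run_job(lines, num_reducers=2):
    # Map phase: treat each input line as one mapper's split, combined locally.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        local = defaultdict(list)
        for word, one in mapper(line):
            local[word].append(one)
        for word, counts in local.items():
            for w, c in combiner(word, counts):
                partitions[partitioner(w, num_reducers)][w].append(c)
    # Shuffle sorts keys within each partition; reduce phase merges values.
    result = {}
    for part in partitions:
        for word in sorted(part):
            for w, total in reducer(word, part[word]):
                result[w] = total
    return result
```

For example, `run_job(["a b a", "b c"])` yields the counts `a: 2, b: 2, c: 1`.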
  5. HDFS attributes
     - Distributed, reliable, scalable
     - Optimized for streaming reads and very large data sets
     - Assumes write-once, read-many access
     - No local caching, due to large files and streaming reads
     - High data replication
     - Fits logically with MapReduce
     - Synchronized access to metadata --> NameNode
     - Metadata (edit log, FSImage) stored in the NameNode's local OS file system
  6. HDFS (diagram copied from the HDFS design document)
  7. MapReduce framework attributes
     - Fair isolation --> easy synchronization and failover
     - ...
  8. MapReduce (diagram copied from the Yahoo! tutorial)
  9. (Diagram copied from the Yahoo! tutorial)
  10. Fault tolerance goal
     - Hadoop assumes that at any time at least one machine is down
     - HDFS
       - Block-level replication
       - Replicated and persistent metadata
       - Rack awareness, accounting for whole-rack failure
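The rack-aware replication idea can be sketched as follows. This mirrors HDFS's default policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack), but the data structures and node names are invented for illustration, and it assumes every rack has at least two nodes:

```python
import random

def place_replicas(writer, nodes_by_rack):
    """Pick 3 replica nodes; nodes_by_rack maps rack id -> list of nodes."""
    writer_rack = next(r for r, nodes in nodes_by_rack.items() if writer in nodes)
    # First replica: the writer's own node (cheapest write).
    replicas = [writer]
    # Second replica: a node in a different rack, so a whole-rack
    # failure cannot take out every copy.
    other_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = random.choice(nodes_by_rack[other_rack])
    replicas.append(second)
    # Third replica: a different node in that same remote rack,
    # avoiding a second cross-rack transfer.
    third = random.choice([n for n in nodes_by_rack[other_rack] if n != second])
    replicas.append(third)
    return replicas
```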
  11. Fault tolerance goal contd.
     - MapReduce
       - No dependencies assumed between tasks
       - Tasks from a failed node can be transferred to other nodes without any state information
         - Mapper --> whole tasks are re-executed on other nodes
         - Reducer --> only unexecuted tasks are transferred, since all completed results are already written to output
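Because tasks carry no cross-task state, rescheduling amounts to re-running the same input split on another node. A toy sketch of that retry loop (the node names and failure model are invented; `sum` stands in for the deterministic task work):

```python
def execute(split, node, healthy):
    # The task work itself is deterministic, so re-running it is safe.
    if not healthy[node]:
        raise RuntimeError(f"node {node} is down")
    return sum(split)

def run_with_retry(splits, nodes, healthy):
    results = []
    for split in splits:
        for node in nodes:          # try nodes in order until one succeeds
            try:
                results.append(execute(split, node, healthy))
                break               # no state to migrate: the split is enough
            except RuntimeError:
                continue            # node failure: reschedule elsewhere
    return results
```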
  12. Resource management goal
     - CPU / memory
       - Mechanisms allow streaming directly to the file descriptor --> no user-level operations for very large objects
       - Optimized sorting: the order can mostly be decided from the raw bytes, without instantiating objects around them
       - ...
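"Deciding the order from the bytes" is the idea behind Hadoop's RawComparator: when keys are serialized in an order-preserving encoding (for example, fixed-width big-endian integers), lexicographic byte comparison matches numeric comparison, so records can be sorted without deserializing the keys. A small Python illustration:

```python
import struct

def encode_key(n):
    # 4-byte big-endian unsigned int: byte order equals numeric order.
    return struct.pack(">I", n)

# Records carry serialized keys; the sort touches only raw bytes.
records = [(encode_key(n), f"value-{n}") for n in (42, 7, 1000)]
records.sort(key=lambda rec: rec[0])

# Decode only to check the result; the sort itself never did this.
sorted_keys = [struct.unpack(">I", key)[0] for key, _ in records]
```

Here `sorted_keys` comes out as `[7, 42, 1000]`, matching numeric order even though the comparisons were purely byte-wise.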
  13. Resource management goal contd.
     - Bandwidth
       - HDFS architecture ensures a read request is served from the nearest node (replication)
       - The MapReduce framework ensures operations are executed nearest to the data --> moving operations is cheaper than moving data
       - Optimized operations at every stage --> combiner, data replication (parallel buffering and transfer from one node), ...
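"Moving operations is cheaper than moving data" amounts to locality-aware scheduling: prefer a node that already holds a replica of the task's block, and fall back to a remote node only when no local slot is free. A hypothetical sketch (the data structures are invented, not the JobTracker's actual ones):

```python
def schedule(blocks, replica_locations, free_slots):
    """blocks: list of block ids; replica_locations: {block: [nodes]};
    free_slots: {node: available task slots}. Returns {block: node}."""
    assignment = {}
    for block in blocks:
        local = [n for n in replica_locations[block] if free_slots.get(n, 0) > 0]
        if local:
            node = local[0]                              # data-local: no copy
        else:
            node = max(free_slots, key=free_slots.get)   # remote fallback
        assignment[block] = node
        free_slots[node] = free_slots.get(node, 0) - 1
    return assignment
```

For example, with two blocks both replicated only on `n1` and a single free slot there, the first task runs data-local on `n1` and the second falls back to a remote node.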
  14. Scalability goal
     - Flat scalability --> adding or removing a node is fairly straightforward
  15. Sub-projects
     - ZooKeeper for small shared information (useful for synchronization, locks, leader election, and many other sharing problems in distributed systems)
     - HBase for semi-structured data (an implementation of Google's Bigtable design)
     - Hive for ad hoc query analysis (currently supports insertion into multiple tables, GROUP BY, and multi-table selection; ORDER BY is under construction)
     - Avro for data serialization, applicable to MapReduce
  16. How about other frameworks?
  17. Questions?