Map Reduce
Published in: Technology


Transcript

  • 1. Why do we need a grid?
    • Storage --> a single disk cannot host all the data
    • Computation --> a single CPU cannot provide all the computing needs
    • Parallel jobs --> serial execution is no longer a viable option
  • 2. What we expect from a framework
    • Distributed storage
    • Job specification platform
    • Job splitting/merging
    • Job execution and monitoring
  • 3. Basic attributes expected
    • Resource management
      • Disk
      • CPU
      • Memory
      • Network bandwidth
    • Fault tolerance
      • Network failure
      • Machine failure
      • Job/code bug
    • Scalability
  • 4. Hadoop Core
    • Separate distributed file system based on a Google File System style architecture --> HDFS
    • Separate job splitting and merging mechanism
      • MapReduce framework on top of the distributed file system
      • Provides a custom job specification mechanism --> input format, mapper, partitioner, combiner, reducer, output format
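The job specification stages named on this slide can be sketched as a single-process simulation. This is a minimal illustration in plain Python, not Hadoop's API: the function names (`mapper`, `combiner`, `partitioner`, `reducer`, `run_job`), the word-count job, and the toy hash are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby


def mapper(line):
    """Emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def combiner(key, values):
    """Pre-aggregate map output locally, before anything is shuffled."""
    yield key, sum(values)


def partitioner(key, num_reducers):
    """Decide which reducer receives a key (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Merge all values that share a key into the final result."""
    yield key, sum(values)


def group_sorted(pairs):
    """Sort pairs by key and group the values of equal keys."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def run_job(splits, num_reducers=2):
    """Run map -> combine -> partition -> sort -> reduce over input splits."""
    partitions = defaultdict(list)
    for split in splits:  # one simulated map task per input split
        map_output = [kv for line in split for kv in mapper(line)]
        for key, values in group_sorted(map_output):
            for k, v in combiner(key, values):
                partitions[partitioner(k, num_reducers)].append((k, v))
    result = {}
    for part in range(num_reducers):  # one simulated reduce task per partition
        for key, values in group_sorted(partitions[part]):
            for k, v in reducer(key, values):
                result[k] = v
    return result
```

Calling `run_job([["a b a"], ["b c"]])` treats each inner list as one input split. The input format and output format stages are reduced here to plain lists and a dictionary.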
  • 5. HDFS attributes
    • Distributed, reliable, scalable
    • Optimized for streaming reads and very large data sets
    • Assumes write once, read many times
    • No local caching, because files are large and reads are streamed
    • High data replication
    • Fits logically with MapReduce
    • Synchronized access to metadata --> namenode
    • Metadata (edit log, FsImage) is stored in the namenode's local OS file system
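The relationship between the two metadata files can be illustrated with a small sketch: the image is a snapshot of the namespace, and the edit log records the changes made since that snapshot. The data layout below (a dict of path to replication factor, and tuples for edits) is a hypothetical simplification for illustration, not the namenode's on-disk format.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op in ("create", "set_replication"):
        namespace[path] = edit[2]  # value is the replication factor
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edit_log):
    """Rebuild the current namespace: start from the image, replay the log."""
    namespace = dict(fsimage)  # copy, so the stored image stays untouched
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edit_log):
    """Fold the log into a new image and start again with an empty log."""
    return load_namespace(fsimage, edit_log), []
```

For example, `load_namespace({"/a": 3}, [("create", "/b", 2), ("delete", "/a")])` replays two edits on top of a one-file image. Periodic checkpointing keeps the log short, so a restart does not have to replay a long history.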
  • 6. HDFS (copied from the HDFS design document)
  • 7. MapReduce framework attributes
    • Fair isolation --> easy synchronization and failover
    • ...
  • 8. MapReduce (copied from the Yahoo! tutorial)
  • 9. (Copied from the Yahoo! tutorial)
  • 10. Fault tolerance goal
    • Hadoop assumes that at least one machine is down at any given time
    • HDFS
      • Block-level replication
      • Replicated and persistent metadata
      • Rack awareness, including handling of whole-rack failure
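Why rack awareness matters for block replication can be shown with a simplified placement sketch. The policy below (first replica on the writing node, second on a different rack, third next to the second) is an assumption loosely modeled on rack-aware placement in general; the exact HDFS default varies between versions and is not reproduced here. The `topology` dict and both function names are hypothetical.

```python
def place_replicas(topology, writer, replication=3):
    """Choose nodes for one block so that the replicas span two racks.

    topology maps a rack name to the list of node names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]  # first replica stays on the writing node
    remote = sorted(n for n in rack_of if rack_of[n] != rack_of[writer])
    if remote and len(chosen) < replication:
        chosen.append(remote[0])  # second replica goes to another rack
    remote_rack = rack_of[chosen[-1]] if len(chosen) > 1 else None
    # Remaining replicas: prefer the rack of the second replica, then any node.
    candidates = sorted(
        (n for n in rack_of if n not in chosen),
        key=lambda n: (rack_of[n] != remote_rack, n),
    )
    chosen.extend(candidates[: max(0, replication - len(chosen))])
    return chosen


def survives_rack_failure(topology, replicas, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(node not in topology[failed_rack] for node in replicas)
```

With `topology = {"r1": ["a", "b"], "r2": ["c", "d"]}`, a block written from node `a` ends up on nodes in both racks, so losing either rack entirely still leaves a readable copy.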
  • 11. Fault tolerance goal (contd.)
    • MapReduce
      • No dependency is assumed between tasks
      • Tasks from a failed node can be transferred to other nodes without any state information
        • Mapper --> the whole task is re-executed on another node
        • Reducer --> only unexecuted tasks need to be transferred, since all completed results are already written to output
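Because tasks carry no state and do not depend on each other, recovering from a node failure amounts to running the same task again somewhere else. The sketch below illustrates that idea only; `run_with_retries` and the `run_task(node, task_input)` callback are hypothetical names, and a real scheduler would also handle timeouts and speculative execution.

```python
def run_with_retries(tasks, nodes, run_task):
    """Run every task, moving it to the next node when one fails.

    tasks maps a task id to its input; run_task(node, task_input) is
    expected to raise RuntimeError when the node fails.
    """
    results = {}
    assignments = {}
    for task_id, task_input in tasks.items():
        for node in nodes:
            try:
                results[task_id] = run_task(node, task_input)
                assignments[task_id] = node
                break
            except RuntimeError:
                # Nothing to recover: the task carried no state, so it is
                # simply restarted from its original input on another node.
                continue
        else:
            raise RuntimeError(f"task {task_id} failed on every node")
    return results, assignments
```

The only information that moves to the new node is the task's original input, which is what makes the transfer cheap.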
  • 12. Resource management goal
    • CPU/memory
      • Mechanisms are provided for streaming directly to the file descriptor --> no user-level operations for very large objects
      • Optimized sorting is possible: the order can mostly be decided from the raw bytes, without instantiating objects around them
      • ....
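Sorting from raw bytes works when the serialized form is chosen so that byte order matches key order. The sketch below shows one such encoding, fixed-width big-endian unsigned integers, using Python's `struct` module. The helper names are assumptions for illustration, and this does not reproduce Hadoop's own comparator interface.

```python
import struct


def serialize(n):
    """Encode a non-negative integer as 8 big-endian bytes."""
    return struct.pack(">Q", n)


def deserialize(raw):
    """Decode 8 big-endian bytes back into an integer."""
    return struct.unpack(">Q", raw)[0]


def sort_raw(records):
    """Sort serialized records by comparing their bytes only.

    No record is deserialized: for this encoding, lexicographic byte
    order is the same as numeric order.
    """
    return sorted(records)
```

The encoding choice is what makes this safe. A little-endian layout, or signed values, would not sort correctly by plain byte comparison and would need a custom comparison rule.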
  • 13. Resource management goal (contd.)
    • Bandwidth
      • The HDFS architecture ensures that a read request is served from the nearest replica
      • The MapReduce framework ensures that operations are executed as close to the data as possible --> moving computation is cheaper than moving data
      • Optimized operations at every stage --> combiner, data replication (parallel buffering and transfer from one node), ...
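The preference for running work next to the data can be sketched as a three-level choice: a node that holds the block, then a node in the same rack as a replica, then any free node. The function and its arguments below are hypothetical and only illustrate the ordering, not the actual scheduler.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Pick the free node closest to the data and report the locality level.

    block_replicas: nodes that store a copy of the input block
    free_nodes:     nodes with a free task slot, in preference order
    rack_of:        dict mapping every node to its rack
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"  # no network transfer needed
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"  # transfer stays inside one rack
    if free_nodes:
        return free_nodes[0], "off-rack"  # data must cross racks
    return None, "none"
```

Only the last case moves the block across racks, which is the most expensive path in terms of bandwidth.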
  • 14. Scalability goal
    • Flat scalability --> adding or removing a node is fairly straightforward
  • 15. Subprojects
    • ZooKeeper for small shared information (useful for synchronization, locks, leader election and many other sharing problems in distributed systems)
    • HBase for semi-structured data (provides an implementation of the Google Bigtable design)
    • Hive for ad hoc query analysis (currently supports insertion into multiple tables, group by, and multi-table selection; order by is under construction)
    • Avro for data serialization, applicable to MapReduce
  • 16. How about other frameworks?
  • 17. Questions?