Meethadoop

2,945 views

Published on

Hadoop

1 Comment
4 Likes
Statistics
Notes
  • i want say thanks to all slideshare guys..
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total views
2,945
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
235
Comments
1
Likes
4
Embeds 0
No embeds

No notes for slide

Meethadoop

  1. 1. Hadoop
  2. 2. Overview <ul><li>Introduction </li></ul><ul><li>What is Hadoop? </li></ul><ul><li>Introduction to MapReduce Programming Model </li></ul><ul><li>Hadoop Distributed File System(HDFS)‏ </li></ul><ul><li>HDFS Architecture </li></ul><ul><li>Hadoop Mapreduce Architecture </li></ul><ul><li>Basic Functioning of Hadoop </li></ul><ul><li>Hadoop a hit!! </li></ul>
  3. 3. Hadoop
  4. 4. Thinking at scale <ul><li>Need to process 100TB datasets. </li></ul><ul><li>On 1 node: </li></ul><ul><li>– scanning @ 50MB/s = 23 days </li></ul><ul><li>On 1000 node cluster: </li></ul><ul><li>– scanning @ 50MB/s = 33 min </li></ul><ul><li>Need Efficient, Reliable and Usable framework </li></ul>
  5. 5. What we need?
  6. 6. What we need?
  7. 7. What we need? Job data data data
  8. 8. What we need? data job data
  9. 9. How?
  10. 10. Hadoop is at your service!!
  11. 11. Apache Hadoop is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware. Hadoop
  12. 12. Two components <ul><li>Hadoop Distributed Filesystem: HDFS stores data </li></ul><ul><li>on nodes in the cluster with the goal of providing greater </li></ul><ul><li>bandwidth across the cluster. </li></ul><ul><li>Hadoop MapReduce : It is a computational paradigm </li></ul><ul><li>called Map/Reduce, which takes an application and divides </li></ul><ul><li>it into multiple fragments of work, each of which can be executed on any node in the cluster </li></ul>
  13. 13. Hadoop Distributed Filesystem HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information.
  14. 14. HDFS:Motivation <ul><li>Based on Google’s GFS </li></ul><ul><li>Redundant storage of massive amounts of data on cheap and unreliable computers </li></ul><ul><li>Why not use an existing file system? </li></ul><ul><li>– Different workload and design priorities </li></ul><ul><li>– Handles much bigger dataset sizes than other </li></ul><ul><li>filesystems </li></ul>
  15. 15. Assumptions <ul><li>High component failure rates </li></ul><ul><li>– Inexpensive commodity components fail all the time </li></ul><ul><li>“ Modest” number of HUGE files </li></ul><ul><li>– Just a few million </li></ul><ul><li>– Each is 100MB or larger; multi-GB files typical </li></ul><ul><li>Files are write-once, mostly appended to </li></ul><ul><li>– Perhaps concurrently </li></ul><ul><li>Large streaming reads </li></ul>
  16. 16. HDFS Design <ul><li>Files stored as blocks </li></ul><ul><li>– Much larger size than most filesystems (default is 64MB)‏ </li></ul><ul><li>Reliability through replication </li></ul><ul><li>– Each block replicated across 3+ DataNodes </li></ul><ul><li>Single master (NameNode) coordinates access, metadata </li></ul><ul><li>-- Simple centralized management </li></ul>
  17. 17. HDFS Design <ul><li>No data caching </li></ul><ul><li>– Little benefit due to large data sets, streaming </li></ul><ul><li>reads </li></ul><ul><li>Familiar interface, but customize the API </li></ul><ul><li>– Simplify the problem; focus on distributed apps </li></ul>
  18. 18. Hdfs NameNode Hdfs DataNode Hdfs DataNode Hdfs aware application Posix API HDFS API Hdfs view Network Stack Regular filesystem Specific drivers.. HDFS Client Block Diagram Client computer
  19. 19. HDFS Architecture <ul><li>Master-Slave Architecture </li></ul><ul><li>HDFS Master “Namenode” </li></ul><ul><li>– Manages all filesystem metadata </li></ul><ul><li>Transactions are logged, merged at startup </li></ul><ul><li>– Controls read/write access to files </li></ul><ul><li>– Manages block replication </li></ul><ul><li>HDFS Slaves “Datanodes” </li></ul><ul><li>– Notifies NameNode about block-IDs it has </li></ul><ul><li>– Serve read/write requests from clients </li></ul><ul><li>– Perform replication tasks upon instruction by namenode </li></ul>
  20. 21. HDFS: Handling Failures <ul><li>NameNode failure </li></ul><ul><li>– A single point of failure </li></ul><ul><li>Secondary NameNode provides consistency </li></ul><ul><li>– Copies FsImage and Transaction Log from </li></ul><ul><li>NameNode & merges them </li></ul><ul><li>– Uploads new FSImage to the NameNode </li></ul><ul><li>DataNode failures </li></ul><ul><li>Uses CRC checks to avoid data corruption. </li></ul>
  21. 22. Hadoop MapReduce Map Reduce is a programming model and an associated implementation for processing and generating large data sets.
  22. 23. Map/Reduce Programming Model <ul><li>Borrows from functional programming </li></ul><ul><li>Users implement interface of two Functions: – map (in_key, in_value) -> (out_key, intermediate_value) list </li></ul><ul><li>– reduce (out_key, intermediate_value list) -> out_value list </li></ul>
  23. 24. Map <ul><li>Records from the data source are fed into the map function as key*value pairs. </li></ul><ul><li>map() produces one or more intermediate values along with an output key from the input. </li></ul>
  24. 25. Map <ul><li>map (in_key, in_value) ->(out_key, intermediate_value) list </li></ul>
  25. 26. Example: Upper-case Mapperlet <ul><li>let map(k, v) = emit(k.toUpper(), v.toUpper()) </li></ul><ul><li>(“foo”, “bar”) => (“FOO”, “BAR”) </li></ul><ul><li>(“Foo”, “other”) => (“FOO”, “OTHER”)‏ </li></ul><ul><li>(“key2”, “data”) => (“KEY2”, “DATA”)‏ </li></ul>
  26. 27. Reduce <ul><li>After the map phase is over, all the intermediate values for a given output key are combined together into a list </li></ul><ul><li>reduce() combines those intermediate values into one or more final values for that same output key </li></ul><ul><li>(in practice, usually only one final value per key)‏ </li></ul>
  27. 28. Reduce <ul><li>reduce (out_key, intermediate_value list) -> out_value list </li></ul>
  28. 29. Example: Sum Reducer <ul><li>let reduce(k, vals) = sum = 0 foreach int v in vals: sum += v emit(k, sum)‏ </li></ul><ul><li>(“A”, [42, 100, 312]) => (“A”, 454) (“B”, [12, 6, -2]) => (“B”, 16)‏ </li></ul>
  29. 30. MapReduce DataFlow Example:Word Count Hi,how are you? Iam good Hello Hello how are you? Not so good Are 1 hi 1 how 1 you 1 Are 1 Hello 1 Hello 1 how 1 you 1 Are 2 Hello 2 Hi 1 how 2 you 2 Are[1 1] Hello[1 1] Hi[1] how[1 1] you[1 1] Map Reduce Input Intermediate results Output merged Sorted
  30. 32. Parallelism <ul><li>map() functions run in parallel, creating different intermediate values from different input data sets. </li></ul><ul><li>reduce() functions also run in parallel, each working on a different output key. </li></ul><ul><li>All values are processed independently </li></ul><ul><li>Bottleneck: reduce phase can’t start until map phase is completely finished. </li></ul>
  31. 33. Combining Phase <ul><li>Run on mapper nodes after map phase </li></ul><ul><li>“ Mini-reduce,” only on local map output </li></ul><ul><li>Used to save bandwidth before sending data to full reducer </li></ul><ul><li>Reducer can be combiner if commutative & associative </li></ul><ul><li>– e.g., SumReducer </li></ul>
  32. 34. Hadoop Map-Reduce Architecture <ul><li>Master-Slave architecture </li></ul><ul><li>Map-Reduce Master “Jobtracker” </li></ul><ul><li>– Accepts MR jobs submitted by users </li></ul><ul><li>– Assigns Map and Reduce tasks to </li></ul><ul><li>Tasktrackers </li></ul><ul><li>– Monitors task and tasktracker status, </li></ul><ul><li>re-executes tasks upon failure </li></ul>
  33. 35. Hadoop Map-Reduce Architecture <ul><li>Map-Reduce Slaves “Tasktrackers” </li></ul><ul><li>– Run Map and Reduce tasks upon instruction from the Jobtracker </li></ul><ul><li>– Manage storage and transmission of intermediate output </li></ul>
  34. 36. MapReduce: Client <ul><li>Define Mapper and Reducer classes and a “launching” program. </li></ul><ul><li>Language support – Java, C++ – Streaming model </li></ul><ul><li>Special case – Maps for parallelizing only </li></ul>
  35. 37. Hadoop Architecture
  36. 38. Hadoop is flourishing...
  37. 39. Hadoop is now a part of...
  38. 40. Hadoop users..
  39. 41. Thank you

×