An Introduction to Hadoop


An Introduction to Hadoop and the MapReduce paradigm. (A presentation that I did in mid-2010.)

Published in: Technology

  • Hello and welcome to An Introduction to Hadoop
  • Before we begin, I’m …
  • Here’s what I want to accomplish in this talk
  • Read this quote. That data is something like 4 exabytes.
  • User-generated content
  • Traditional retailers are creating this data too
  • And government has troves of data too
  • One way to do that analysis is through Hadoop
  • Rackspace for log processing. Netflix for recommendations. LinkedIn for social graph. SU for page recommendations.
  • HDFS cluster/healing. MapReduce
  • Not cheap servers
  • In the cluster there are two kinds of nodes….
  • Let’s talk about HDFS
  • So let’s look at an example: Word Count. WC is the hello world for MapReduce jobs.
  • An Introduction to Hadoop

    1. An introduction to
    2. Hello
        • Processing against a 156 node cluster
        • Certified Hadoop Developer
        • Certified Hadoop System Administrator
    3. Goals
        • Why should you care?
        • What is it?
        • How does it work?
    4. Data Everywhere
        “Every two days now we create as much information as we did from the dawn of civilization up until 2003”
        • Eric Schmidt
        • then CEO of Google
        • Aug 4, 2010
    5. Data Everywhere
    6. Data Everywhere
    7. Data Everywhere
    8. The Hadoop Project
        • Originally based on papers published by Google in 2003 and 2004
        • Hadoop started in 2006 at Yahoo!
        • Top-level Apache Software Foundation project
        • Large, active user base and user groups
        • Very active development, strong development team
    9. Who Uses Hadoop?
    10. Hadoop Components
        • Storage (HDFS): self-healing, high-bandwidth clustered storage
        • Processing (MapReduce): fault-tolerant distributed processing
    11. Typical Cluster
        • 3 to 4,000 commodity servers
        • Each server
            • 2x quad-core
            • 16-24 GB RAM
            • 4-12 TB disk space
        • 20-30 servers per rack
    12. 2 Kinds of Nodes
        • Master Nodes
        • Slave Nodes
    13. Master Nodes
        • NameNode
            • only 1 per cluster
            • metadata server and database
            • SecondaryNameNode helps with some housekeeping
        • JobTracker
            • only 1 per cluster
            • job scheduler
    14. Slave Nodes
        • DataNodes
            • 1 to 4,000 per cluster
            • block data storage
        • TaskTrackers
            • 1 to 4,000 per cluster
            • task execution
    15. HDFS Basics
        • HDFS is a filesystem written in Java
        • Sits on top of a native filesystem
        • Provides redundant storage for massive amounts of data
        • Uses cheap(ish), unreliable computers
    16. HDFS Data
        • Data is split into blocks and stored on multiple nodes in the cluster
            • Each block is usually 64 MB or 128 MB (conf)
        • Each block is replicated multiple times (conf)
            • Replicas stored on different data nodes
        • Large files, 100 MB+
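To make the block and replication numbers on this slide concrete, here is a minimal arithmetic sketch. The function name `hdfs_footprint` and the 1,000 MB example file are made up for illustration; the 64 MB block size and replication factor of 3 are the configurable values mentioned in the slides.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total block replicas, raw MB stored) for one file."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    total_replicas = num_blocks * replication
    # The final block only occupies as much disk as the data it holds,
    # so raw usage is file size times replication, not blocks times block size.
    raw_mb = file_size_mb * replication
    return num_blocks, total_replicas, raw_mb

# Hypothetical 1,000 MB file with the slide's 64 MB blocks and 3 replicas.
print(hdfs_footprint(1000, block_size_mb=64, replication=3))  # (16, 48, 3000)
```

A file much smaller than the block size still costs one block entry in the NameNode's metadata, which is one reason HDFS favors large files.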
    17. NameNode
        • A single NameNode stores all metadata
        • Filenames, locations on DataNodes of each block, owner, group, etc.
        • All information maintained in RAM for fast lookup
        • Filesystem metadata size is limited to the amount of available RAM on the NameNode
    18. SecondaryNameNode
        • The Secondary NameNode is not a failover NameNode
        • Does memory-intensive administrative functions for the NameNode
        • Should run on a separate machine
    19. Data Node
        • DataNodes store file contents
        • Stored as opaque ‘blocks’ on the underlying filesystem
        • Different blocks of the same file will be stored on different DataNodes
        • Same block is stored on three (or more) DataNodes for redundancy
    20. Self-healing
        • DataNodes send heartbeats to the NameNode
            • After a period without any heartbeats, a DataNode is assumed to be lost
            • NameNode determines which blocks were on the lost node
            • NameNode finds other DataNodes with copies of these blocks
            • These DataNodes are instructed to copy the blocks to other nodes
            • Replication is actively maintained
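The self-healing steps on this slide can be sketched as a small simulation. This is a toy model, not Hadoop's implementation: the function names, the node names `dn1` to `dn4`, and the 600-second timeout are all assumptions chosen for illustration.

```python
HEARTBEAT_TIMEOUT = 600  # seconds; an illustrative value, not a Hadoop default

def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}

def plan_re_replication(block_locations, dead_nodes, all_nodes, target=3):
    """Return (block, source_node, destination_node) copy instructions."""
    live_nodes = sorted(set(all_nodes) - dead_nodes)
    plan = []
    for block in sorted(block_locations):
        live_holders = sorted(block_locations[block] - dead_nodes)
        missing = target - len(live_holders)
        if missing <= 0 or not live_holders:
            continue  # fully replicated, or no surviving copy to read from
        # Copy to live nodes that do not already hold this block.
        candidates = [n for n in live_nodes if n not in live_holders]
        for dest in candidates[:missing]:
            plan.append((block, live_holders[0], dest))
    return plan

# Hypothetical cluster state: dn3 has been silent for 810 seconds.
last_heartbeat = {"dn1": 1000, "dn2": 1000, "dn3": 200, "dn4": 1000}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
dead = find_dead_nodes(last_heartbeat, now=1010)
plan = plan_re_replication(block_locations, dead, last_heartbeat.keys())
print(dead)  # {'dn3'}
print(plan)  # [('blk_1', 'dn1', 'dn4'), ('blk_2', 'dn2', 'dn1')]
```

The sketch picks sources and destinations in alphabetical order to stay deterministic; a real cluster would also weigh rack placement and load.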
    21. HDFS Data Storage
        • NameNode holds file metadata
        • DataNodes hold the actual data
            • Block size is 64 MB, 128 MB, etc.
            • Each block replicated three times
        NameNode:
            foo.txt: blk_1, blk_2, blk_3
            bar.txt: blk_4, blk_5
        DataNodes (each block appears three times across the nodes):
            blk_1 blk_2 blk_3 blk_5 blk_1 blk_3 blk_4 blk_1 blk_4 blk_5 blk_2 blk_4 blk_2 blk_3 blk_5
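The split between metadata and data on this slide can be illustrated with two in-memory maps. The file-to-block mapping is taken from the slide; the placement of replicas on `dn1` to `dn5` and the `locate` helper are assumptions, since the transcript does not show which node holds which replica.

```python
# File-to-block mapping, as shown on the slide.
file_blocks = {
    "foo.txt": ["blk_1", "blk_2", "blk_3"],
    "bar.txt": ["blk_4", "blk_5"],
}

# Hypothetical placement of the three replicas of each block on five DataNodes.
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn4", "dn5"],
    "blk_3": ["dn1", "dn2", "dn5"],
    "blk_4": ["dn2", "dn3", "dn4"],
    "blk_5": ["dn3", "dn4", "dn5"],
}

def locate(filename):
    """Return [(block_id, [datanodes])] in file order: the metadata a client
    needs before it reads the block contents from the DataNodes themselves."""
    return [(blk, block_locations[blk]) for blk in file_blocks[filename]]

print(locate("bar.txt"))
# [('blk_4', ['dn2', 'dn3', 'dn4']), ('blk_5', ['dn3', 'dn4', 'dn5'])]
```

Only these small maps live on the NameNode; the block contents never pass through it in this model.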
    22. What is MapReduce?
        • MapReduce is a method for distributing a task across multiple nodes
        • Automatic parallelization and distribution
        • Each node processes data stored on that node (processing goes to the data)
    23. Features of MapReduce
        • Fault-tolerance
        • Status and monitoring tools
        • A clean abstraction for programmers
    24. JobTracker
        • MapReduce jobs are controlled by a software daemon known as the JobTracker
        • The JobTracker resides on a master node
            • Assigns Map and Reduce tasks to other nodes on the cluster
            • These nodes each run a software daemon known as the TaskTracker
            • The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
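As a rough illustration of how a scheduler can send "processing to the data", the sketch below assigns one map task per block and prefers a node that already stores that block. It is a greedy toy model under assumed names (`assign_map_tasks`, `dn1` to `dn3`, one slot per node); the real JobTracker's scheduling is more involved.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots: {node: number of free task slots}
    Returns {block_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignments = {}
    for block in sorted(block_locations):
        local = [n for n in block_locations[block] if slots.get(n, 0) > 0]
        if local:
            node = local[0]
        else:
            # No free slot next to the data: fall back to any node with capacity.
            remote = [n for n in sorted(slots) if slots[n] > 0]
            if not remote:
                break  # cluster is full; remaining tasks would have to wait
            node = remote[0]
        slots[node] -= 1
        assignments[block] = (node, bool(local))
    return assignments

# Hypothetical input: three blocks, three nodes with one free slot each.
result = assign_map_tasks(
    {"blk_1": ["dn1", "dn2"], "blk_2": ["dn1"], "blk_3": ["dn1"]},
    {"dn1": 1, "dn2": 1, "dn3": 1},
)
print(result)
# {'blk_1': ('dn1', True), 'blk_2': ('dn2', False), 'blk_3': ('dn3', False)}
```

Only the first task runs data-local here because `dn1` has a single slot; the other two blocks must be read over the network.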
    25. Two Parts
        • Developer specifies two functions:
            • map()
            • reduce()
        • The framework does the rest
    26. map()
        • The Mapper reads data in the form of key/value pairs
        • It outputs zero or more key/value pairs
        map(key_in, value_in) -> (key_out, value_out)
    27. reduce()
        • After the Map phase all the intermediate values for a given intermediate key are combined together into a list
        • The lists are distributed across one or more Reducers; the list for a given key goes to a single Reducer
        • The Reducer outputs zero or more final key/value pairs
            • These are written to HDFS
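The grouping step between map and reduce can be sketched as follows. This is a simplified model: the `shuffle` function is hypothetical, and `zlib.crc32` is only a stand-in for a stable hash so that every value for a key lands in the same partition.

```python
import zlib
from collections import defaultdict

def shuffle(pairs, num_reducers):
    """Group intermediate (key, value) pairs by key and split the keys
    across reducers. Returns one {key: [values]} dict per reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Stable hash of the key decides which reducer receives it.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    return [dict(p) for p in partitions]

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
for reducer_id, groups in enumerate(shuffle(pairs, num_reducers=2)):
    print(reducer_id, groups)
```

Because the partition depends only on the key, each key's full list of values reaches exactly one reducer, no matter which mapper emitted the pairs.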
    28. map() Word Count
        map(String input_key, String input_value)
            foreach word w in input_value
                emit(w, 1)
        Input:
            (1234, "to be or not to be")
            (5678, "to see or not to see")
        Output:
            ("to",1), ("be",1), ("or",1), ("not",1), ("to",1), ("be",1)
            ("to",1), ("see",1), ("or",1), ("not",1), ("to",1), ("see",1)
    29. reduce() Word Count
        reduce(String output_key, List intermediate_vals)
            set count = 0
            foreach v in intermediate_vals:
                count += v
            emit(output_key, count)
        Input:
            ("to", [1,1,1,1]) ("be", [1,1]) ("or", [1,1]) ("not", [1,1]) ("see", [1,1])
        Output:
            ("to", 4) ("be", 2) ("or", 2) ("not", 2) ("see", 2)
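The two Word Count slides can be combined into a single-process sketch that runs the map, grouping, and reduce steps on the same two input records. This only mimics the data flow; `run_job` is a hypothetical driver, not a Hadoop API, and nothing here is distributed.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    """Emit (word, 1) for every word in the input line."""
    for word in input_value.split():
        yield (word, 1)

def reduce_fn(output_key, intermediate_vals):
    """Sum the counts collected for one word."""
    count = 0
    for v in intermediate_vals:
        count += v
    yield (output_key, count)

def run_job(records):
    """Run map, group values by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    results = []
    for key in grouped:  # keys in first-seen order, matching the slide
        results.extend(reduce_fn(key, grouped[key]))
    return results

records = [(1234, "to be or not to be"), (5678, "to see or not to see")]
print(run_job(records))
# [('to', 4), ('be', 2), ('or', 2), ('not', 2), ('see', 2)]
```

Hadoop itself hands keys to each reducer in sorted order; the sketch keeps first-seen order only so the output lines up with the slide.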
    30. Resources
    31. Questions?