An Introduction to Hadoop


An Introduction to Hadoop and the MapReduce paradigm. (A presentation that I did in mid-2010.)
Speaker notes
  • Hello and welcome to An Introduction to Hadoop
  • Before we begin, I’m …
  • Here’s what I want to accomplish in this talk
  • Read this quote. That data is something like 4 exabytes.
  • User-generated content
  • Traditional retailers are creating this data too
  • And government has troves of data too
  • One way to do that analysis is through Hadoop
  • Rackspace for log processing. Netflix for recommendations. LinkedIn for social graph. StumbleUpon for page recommendations.
  • HDFS for clustered, self-healing storage. MapReduce for processing.
  • Not cheap servers
  • In the cluster there are two kinds of nodes…
  • Let’s talk about HDFS
  • So let’s look at an example: Word Count. WC is the hello world for MapReduce jobs.
  • Transcript

    • 1. An Introduction to Hadoop
    • 2. Hello
      • Processing against a 156-node cluster
      • Certified Hadoop Developer
      • Certified Hadoop System Administrator
    • 3. Goals
      • Why should you care?
      • What is it?
      • How does it work?
    • 4. Data Everywhere: “Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
      • Eric Schmidt
      • then CEO of Google
      • Aug 4, 2010
    • 5-7. Data Everywhere (image slides: user-generated content, retail data, government data)
    • 8. The Hadoop Project
      • Originally based on papers published by Google in 2003 and 2004
      • Hadoop started in 2006 at Yahoo!
      • Top level Apache Foundation project
      • Large, active user base, user groups
      • Very active development, strong development team
    • 9. Who Uses Hadoop?
    • 10. Hadoop Components
      • HDFS (Storage): self-healing, high-bandwidth clustered storage
      • MapReduce (Processing): fault-tolerant distributed processing
    • 11. Typical Cluster
      • 3 to 4,000 commodity servers
      • Each server
        • 2x quad-core
        • 16-24 GB RAM
        • 4-12 TB disk space
      • 20-30 servers per rack
    • 12. 2 Kinds of Nodes
      • Master Nodes
      • Slave Nodes
    • 13. Master Nodes
      • NameNode
        • only 1 per cluster
        • metadata server and database
        • SecondaryNameNode helps with some housekeeping
      • JobTracker
        • only 1 per cluster
        • job scheduler
    • 14. Slave Nodes
      • DataNodes
        • 1-4000 per cluster
        • block data storage
      • TaskTrackers
        • 1-4000 per cluster
        • task execution
    • 15. HDFS Basics
      • HDFS is a filesystem written in Java
      • Sits on top of a native filesystem
      • Provides redundant storage for massive amounts of data
      • Use cheap(ish), unreliable computers
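      In code, applications reach HDFS through Hadoop's Java FileSystem API. A minimal sketch (the path and file contents are illustrative assumptions, not from the slides):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsHello {
          public static void main(String[] args) throws Exception {
            // Picks up cluster settings (e.g. fs.default.name) from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits large files into blocks behind the scenes
            Path file = new Path("/user/demo/hello.txt");   // hypothetical path
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("Hello, HDFS!");
            out.close();

            // Read it back
            System.out.println(fs.open(file).readUTF());
            fs.close();
          }
        }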
    • 16. HDFS Data
      • Data is split into blocks and stored on multiple nodes in the cluster
        • Each block is usually 64 MB or 128 MB (configurable)
      • Each block is replicated multiple times (configurable)
        • Replicas stored on different data nodes
      • Intended for large files, 100 MB+
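      Both knobs can be set from client code. A sketch using the Hadoop 1.x property names dfs.block.size and dfs.replication (the values and path are illustrative):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsTuning {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks for files this client creates
            conf.setInt("dfs.replication", 3);                  // three replicas per block
            FileSystem fs = FileSystem.get(conf);

            // Replication can also be changed per file after the fact
            fs.setReplication(new Path("/user/demo/big.log"), (short) 2); // hypothetical path
            fs.close();
          }
        }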
    • 17. NameNode
      • A single NameNode stores all metadata
      • Filenames, locations on DataNodes of each block, owner, group, etc.
      • All information maintained in RAM for fast lookup
      • Filesystem metadata size is limited to the amount of available RAM on the NameNode
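      • Rule of thumb (commonly cited, not from the slide): each file, directory, and block object costs roughly 150 bytes of NameNode heap, so 100 million blocks need about 100,000,000 × 150 B ≈ 15 GB of RAM, however much disk the cluster has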
    • 18. SecondaryNameNode
      • The Secondary NameNode is not a failover NameNode
      • Does memory-intensive administrative functions for the NameNode
      • Should run on a separate machine
    • 19. DataNode
      • DataNodes store file contents
      • Stored as opaque ‘blocks’ on the underlying filesystem
      • Different blocks of the same file will be stored on different DataNodes
      • Same block is stored on three (or more) DataNodes for redundancy
    • 20. Self-healing
      • DataNodes send heartbeats to the NameNode
        • After a period without any heartbeats, a DataNode is assumed to be lost
        • NameNode determines which blocks were on the lost node
        • NameNode finds other DataNodes with copies of these blocks
        • These DataNodes are instructed to copy the blocks to other nodes
        • Replication is actively maintained
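      Hadoop's real re-replication logic lives inside the NameNode; the toy sketch below only illustrates the decision sequence above, and every name in it is hypothetical:

        import java.util.*;

        public class ReReplicationSketch {
          // block id -> set of DataNodes currently holding a replica
          static Map<String, Set<String>> blockLocations = new HashMap<>();
          static final int TARGET_REPLICAS = 3;

          // Called after a DataNode has missed too many heartbeats
          static void handleLostNode(String lostNode, List<String> liveNodes) {
            for (Map.Entry<String, Set<String>> e : blockLocations.entrySet()) {
              Set<String> holders = e.getValue();
              if (!holders.remove(lostNode)) continue;   // block wasn't on the lost node
              if (holders.isEmpty()) continue;           // all replicas lost; nothing to copy from
              while (holders.size() < TARGET_REPLICAS) {
                String source = holders.iterator().next();            // any surviving replica
                String target = pickNodeNotIn(holders, liveNodes);    // a node without a copy
                System.out.printf("copy %s from %s to %s%n", e.getKey(), source, target);
                holders.add(target);                     // replication actively maintained
              }
            }
          }

          static String pickNodeNotIn(Set<String> holders, List<String> liveNodes) {
            for (String n : liveNodes) if (!holders.contains(n)) return n;
            throw new IllegalStateException("not enough live nodes");
          }
        }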
    • 21. HDFS Data Storage
      • NameNode holds file metadata
      • DataNodes hold the actual data
        • Block size is 64 MB, 128 MB, etc
        • Each block replicated three times
      NameNode (metadata):
        foo.txt: blk_1, blk_2, blk_3
        bar.txt: blk_4, blk_5
      DataNodes (each block on three nodes):
        DataNode 1: blk_1, blk_2, blk_3, blk_5
        DataNode 2: blk_1, blk_3, blk_4
        DataNode 3: blk_1, blk_4, blk_5
        DataNode 4: blk_2, blk_4
        DataNode 5: blk_2, blk_3, blk_5
    • 22. What is MapReduce?
      • MapReduce is a method for distributing a task across multiple nodes
      • Automatic parallelization and distribution
      • Each node processes data stored on that node (processing goes to the data)
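      The paradigm itself fits on one machine. A toy, single-JVM sketch of map → shuffle → reduce; Hadoop's contribution is distributing exactly these phases across a cluster:

        import java.util.*;

        public class MapReduceToy {
          public static void main(String[] args) {
            List<String> lines = Arrays.asList("to be or not to be", "to see or not to see");

            // Map phase: each input record becomes zero or more (key, value) pairs
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String line : lines)
              for (String word : line.split(" "))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

            // Shuffle: group all values by key (the framework does this between phases)
            Map<String, List<Integer>> groups = new TreeMap<>();
            for (Map.Entry<String, Integer> p : pairs)
              groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

            // Reduce phase: combine each key's values into a final result
            for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
              int count = 0;
              for (int v : g.getValue()) count += v;
              System.out.println(g.getKey() + "\t" + count);
            }
          }
        }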
    • 23. Features of MapReduce
      • Fault-tolerance
      • Status and monitoring tools
      • A clean abstraction for programmers
    • 24. JobTracker
      • MapReduce jobs are controlled by a software daemon known as the JobTracker
      • The JobTracker resides on a master node
        • Assigns Map and Reduce tasks to other nodes on the cluster
        • These nodes each run a software daemon known as the TaskTracker
        • The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
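      From the developer's side all of this sits behind job submission. A minimal driver sketch against the org.apache.hadoop.mapreduce API of that era; it wires in the word-count Mapper and Reducer sketched under slides 28 and 29 below, and the class names are mine:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {
          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // see slide 28 sketch
            job.setReducerClass(WordCountReducer.class);  // see slide 29 sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            // waitForCompletion submits the job to the JobTracker and polls for progress
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }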
    • 25. Two Parts
      • Developer specifies two functions:
        • map()
        • reduce()
      • The framework does the rest
    • 26. map()
      • The Mapper reads data in the form of key/value pairs
      • It outputs zero or more key/value pairs
      map(key_in, value_in) -> list of (key_out, value_out)
    • 27. reduce()
      • After the Map phase, all the intermediate values for a given intermediate key are combined into a list
      • This list is given to one or more Reducers
      • The Reducer outputs zero or more final key/value pairs
        • These are written to HDFS
    • 28. map() Word Count
      map(String input_key, String input_value):
        foreach word w in input_value:
          emit(w, 1)
      Input:  (1234, “to be or not to be”)
              (5678, “to see or not to see”)
      Output: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1),
              (“to”,1), (“see”,1), (“or”,1), (“not”,1), (“to”,1), (“see”,1)
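      A sketch of the same Mapper in Hadoop's Java API (the class name is mine):

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            // key is the line's byte offset in the file; value is the line of text
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
              word.set(tok.nextToken());
              context.write(word, ONE);  // emit (word, 1)
            }
          }
        }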
    • 29. reduce() Word Count
      reduce(String output_key, List intermediate_vals):
        set count = 0
        foreach v in intermediate_vals:
          count += v
        emit(output_key, count)
      Input:  (“to”, [1,1,1,1])  (“be”, [1,1])  (“or”, [1,1])  (“not”, [1,1])  (“see”, [1,1])
      Output: (“to”, 4)  (“be”, 2)  (“or”, 2)  (“not”, 2)  (“see”, 2)
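      And a matching Reducer sketch; Hadoop hands it each word together with an Iterable of its intermediate counts:

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) count += v.get();  // sum the 1s for this word
            context.write(key, new IntWritable(count));     // emit (word, total); written to HDFS
          }
        }

      Packaged into a jar with the driver from slide 24, this would run as hadoop jar wordcount.jar WordCountDriver <in> <out> (jar and class names are mine).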
    • 30. Resources
    • 31. Questions?