An Introduction to Hadoop
An Introduction to Hadoop and the MapReduce paradigm. (A presentation that I did in mid-2010.)


Speaker Notes

  • Hello and welcome to An Introduction to Hadoop.
  • Before we begin, I’m …
  • Here’s what I want to accomplish in this talk
  • Read this quote. That data is something like 4 exabytes.
  • User-generated content
  • Traditional retailers are creating this data too
  • And government has troves of data too
  • One way to do that analysis is through Hadoop
  • Rackspace uses it for log processing, Netflix for recommendations, LinkedIn for its social graph, and StumbleUpon for page recommendations.
  • HDFS provides self-healing clustered storage; MapReduce provides the processing.
  • Not cheap servers.
  • In the cluster there are two kinds of nodes…
  • Let's talk about HDFS.
  • So let's look at an example: word count. Word count is the "hello world" of MapReduce jobs.

An Introduction to Hadoop: Presentation Transcript

  • An Introduction to Hadoop
  • Hello
    • Processing against a 156-node cluster
    • Certified Hadoop Developer
    • Certified Hadoop System Administrator
  • Goals
    • Why should you care?
    • What is it?
    • How does it work?
  • Data Everywhere: “Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
    • Eric Schmidt, then CEO of Google, Aug 4, 2010
  • Data Everywhere (image slides: user-generated content, retail data, government data)
  • The Hadoop Project
    • Originally based on papers published by Google in 2003 and 2004
    • Hadoop started in 2006 at Yahoo!
    • Top level Apache Foundation project
    • Large, active user base, user groups
    • Very active development, strong development team
  • Who Uses Hadoop?
  • Hadoop Components
    • HDFS (Storage): self-healing, high-bandwidth clustered storage
    • MapReduce (Processing): fault-tolerant distributed processing
  • Typical Cluster
    • 3 to 4,000 commodity servers
    • Each server:
      • 2x quad-core CPUs
      • 16-24 GB RAM
      • 4-12 TB disk space
    • 20-30 servers per rack
  • Two Kinds of Nodes: Master Nodes and Slave Nodes
  • Master Nodes
    • NameNode
      • only 1 per cluster
      • metadata server and database
      • SecondaryNameNode helps with some housekeeping
    • JobTracker
      • only 1 per cluster
      • job scheduler
  • Slave Nodes
    • DataNodes
      • 1 to 4,000 per cluster
      • block data storage
    • TaskTrackers
      • 1 to 4,000 per cluster
      • task execution
  • HDFS Basics
    • HDFS is a filesystem written in Java
    • Sits on top of a native filesystem
    • Provides redundant storage for massive amounts of data
    • Use cheap(ish), unreliable computers
  • HDFS Data
    • Data is split into blocks and stored on multiple nodes in the cluster
      • Each block is usually 64 MB or 128 MB (configurable)
    • Each block is replicated multiple times (configurable)
      • Replicas stored on different data nodes
    • Designed for large files, 100 MB+ (a short HDFS API sketch follows below)
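To make the block size and replication settings concrete, here is a minimal sketch against the HDFS Java API of that era: FileSystem.create accepts a per-file replication factor and block size. The NameNode URI, path, and values are illustrative assumptions, not from the original slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative NameNode address; normally supplied by core-site.xml
            conf.set("fs.default.name", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/foo.txt");  // illustrative path
            boolean overwrite = true;
            int bufferSize = 4096;
            short replication = 3;                       // copies of each block
            long blockSize = 64L * 1024 * 1024;          // 64 MB blocks

            // create() lets the caller override replication and block size per file
            FSDataOutputStream out = fs.create(file, overwrite, bufferSize, replication, blockSize);
            out.writeBytes("to be or not to be\n");
            out.close();
            fs.close();
        }
    }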
  • NameNode
    • A single NameNode stores all metadata
    • Filenames, locations on DataNodes of each block, owner, group, etc.
    • All information maintained in RAM for fast lookup
    • Filesystem metadata size is therefore limited by the RAM available on the NameNode (as a rule of thumb, each file, directory, or block costs on the order of 150 bytes of heap, so 10 million blocks is roughly 1.5 GB)
  • SecondaryNameNode
    • The Secondary NameNode is not a failover NameNode
    • Does memory-intensive administrative functions for the NameNode
    • Should run on a separate machine
  • Data Node
    • DataNodes store file contents
    • Stored as opaque ‘blocks’ on the underlying filesystem
    • Different blocks of the same file will be stored on different DataNodes
    • Same block is stored on three (or more) DataNodes for redundancy
  • Self-healing
    • DataNodes send heartbeats to the NameNode
      • After a period without any heartbeats, a DataNode is assumed to be lost
      • NameNode determines which blocks were on the lost node
      • NameNode finds other DataNodes with copies of these blocks
      • These DataNodes are instructed to copy the blocks to other nodes
      • Replication is actively maintained (a toy sketch of this loop follows below)
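The re-replication loop above can be sketched in a few lines. This is a toy model of the decision logic only, not Hadoop's actual NameNode code; the class, field names, and timeout value are all illustrative assumptions.

    import java.util.*;

    // Toy sketch of the NameNode's self-healing logic: NOT Hadoop's real code.
    public class SelfHealingSketch {
        static final long HEARTBEAT_TIMEOUT_MS = 10 * 60 * 1000; // assume a node is lost after ~10 min of silence
        static final int TARGET_REPLICAS = 3;

        final Map<String, Long> lastHeartbeat = new HashMap<>();          // DataNode -> last heartbeat time
        final Map<String, Set<String>> blocksOnNode = new HashMap<>();    // DataNode -> blocks it holds
        final Map<String, Set<String>> replicaLocations = new HashMap<>(); // block -> live holders

        void checkHeartbeats(long now) {
            Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> entry = it.next();
                if (now - entry.getValue() <= HEARTBEAT_TIMEOUT_MS) continue;
                String deadNode = entry.getKey();
                it.remove(); // the node is assumed lost
                // Determine which blocks were on the lost node...
                Set<String> lostBlocks = blocksOnNode.remove(deadNode);
                if (lostBlocks == null) continue;
                for (String block : lostBlocks) {
                    Set<String> holders = replicaLocations.get(block);
                    holders.remove(deadNode);
                    // ...and instruct a surviving holder to copy each under-replicated block elsewhere.
                    if (holders.size() < TARGET_REPLICAS && !holders.isEmpty()) {
                        String source = holders.iterator().next();
                        System.out.println("re-replicate " + block + " from " + source);
                    }
                }
            }
        }

        public static void main(String[] args) {
            SelfHealingSketch nn = new SelfHealingSketch();
            nn.lastHeartbeat.put("dn1", 0L);
            nn.blocksOnNode.put("dn1", new HashSet<>(Arrays.asList("blk_1")));
            nn.replicaLocations.put("blk_1", new HashSet<>(Arrays.asList("dn1", "dn2", "dn3")));
            nn.checkHeartbeats(HEARTBEAT_TIMEOUT_MS + 1); // dn1 has gone silent
        }
    }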
  • HDFS Data Storage
    • NameNode holds file metadata
    • DataNodes hold the actual data
      • Block size is 64 MB, 128 MB, etc
      • Each block replicated three times
    NameNode (metadata):
      foo.txt: blk_1, blk_2, blk_3
      bar.txt: blk_4, blk_5
    DataNodes (each block stored on three nodes):
      DataNode 1: blk_1, blk_2, blk_3, blk_5
      DataNode 2: blk_1, blk_3, blk_4
      DataNode 3: blk_1, blk_4, blk_5
      DataNode 4: blk_2, blk_4
      DataNode 5: blk_2, blk_3, blk_5
  • What is MapReduce?
    • MapReduce is a method for distributing a task across multiple nodes
    • Automatic parallelization and distribution
    • Each node processes data stored on that node (processing goes to the data)
  • Features of MapReduce
    • Fault-tolerance
    • Status and monitoring tools
    • A clean abstraction for programmers
  • JobTracker
    • MapReduce jobs are controlled by a software daemon known as the JobTracker
    • The JobTracker resides on a master node
      • Assigns Map and Reduce tasks to other nodes on the cluster
      • These nodes each run a software daemon known as the TaskTracker
      • The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
  • Two Parts
    • Developer specifies two functions:
      • map()
      • reduce()
    • The framework does the rest
  • map()
    • The Mapper reads data in the form of key/value pairs
    • It outputs zero or more key/value pairs
    map(key_in, value_in) -> list of (key_out, value_out)
  • reduce()
    • After the Map phase all the intermediate values for a given intermediate key are combined together into a list
    • This list is given to one or more Reducers
    • The Reducer outputs zero or more final key/value pairs
      • These are written to HDFS
  • map() Word Count
    map(String input_key, String input_value):
      foreach word w in input_value:
        emit(w, 1)
    Input:  (1234, “to be or not to be”), (5678, “to see or not to see”)
    Output: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1),
            (“to”,1), (“see”,1), (“or”,1), (“not”,1), (“to”,1), (“see”,1)
  • reduce() Word Count (full Java version below)
    reduce(String output_key, List intermediate_vals):
      set count = 0
      foreach v in intermediate_vals:
        count += v
      emit(output_key, count)
    Input:  (“to”, [1,1,1,1]), (“be”, [1,1]), (“or”, [1,1]), (“not”, [1,1]), (“see”, [1,1])
    Output: (“to”, 4), (“be”, 2), (“or”, 2), (“not”, 2), (“see”, 2)
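The two pseudocode slides map directly onto Hadoop's Java API. For reference, here is the canonical WordCount from the Apache Hadoop tutorial, written against the org.apache.hadoop.mapreduce ("new") API of that era. It also includes the driver that wires map() and reduce() together, which is the "framework does the rest" part; input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one); // emit (word, 1) for each word
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get(); // add up the 1s for this word
                }
                result.set(sum);
                context.write(key, result); // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count"); // Job.getInstance(conf, ...) in later versions
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // safe here: sum is associative
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }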
  • Resources http://hadoop.apache.org/ http://developer.yahoo.com/hadoop/ http://www.cloudera.com/resources/?media=Video
  • Questions?