A (very) short intro to Hadoop
Upcoming SlideShare
Loading in...5
×
 

A (very) short intro to Hadoop

on

  • 15,067 views

A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington DC this past November 2011. Some of this content is also covered in the various big data classes we ...

A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington DC this past November 2011. Some of this content is also covered in the various big data classes we offer via on-site training (see http://www.scaleunlimited.com/training/)

Statistics

Views

Total Views
15,067
Views on SlideShare
12,978
Embed Views
2,089

Actions

Likes
31
Downloads
371
Comments
1

20 Embeds 2,089

http://www.scaleunlimited.com 1912
http://www.linkedin.com 54
http://193.146.210.28 28
http://localhost 20
http://paper.li 18
http://pilarlejarza.wordpress.com 13
http://winterstreetdesign.com 13
http://prodtest.sophiaapp.com 7
http://www.onlydoo.com 5
http://www.sophia.org 5
https://www.linkedin.com 2
http://webcache.googleusercontent.com 2
https://twitter.com 2
http://pmomale-ld1 2
http://prodstaging.sophiaapp.com 1
http://translate.googleusercontent.com 1
http://silverreader.com 1
http://www.winnova.fi 1
http://www.planearth.net 1
http://a0.twimg.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • how can we check replication of data on slave nodes?
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

A (very) short intro to Hadoop A (very) short intro to Hadoop Presentation Transcript

  • 1 Scale Unlimited, Inc A very short Intro to Hadoop photo by: i_pinz, flickr Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 2 Welcome to Intro to Hadoop Usually a 4 hour online class Focus on Why, What, How of Hadoop Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 3 Meet Your Instructor Developer of Bixo web mining toolkit Committer on Apache Tika Hadoop developer and trainer Solr developer and trainer Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 4 Overview A very short Intro to photo by: exfordy, flickr Hadoop Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry, since network errors happen Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 6 Hadoop to the Rescue Scalable - many servers with lots of cores and spindles Reliable - detect failures, redundant storage Fault-tolerant - auto-retry, self-healing Simple - use many servers as one really big computer Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 7 Logical Architecture Cluster Execution Storage Logically, Hadoop is simply a computing cluster that provides: a Storage layer, and an Execution layer Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 8 Storage Layer Hadoop Distributed File System (aka HDFS, or Hadoop DFS) Runs on top of regular OS file system, typically Linux ext3 Fixed-size blocks (64MB by default) that are replicated Write once, read many; optimized for streaming in and out Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 9 Execution Layer Hadoop Map-Reduce Responsible for running a job in parallel on many servers Handles re-trying a task that fails, validating complete results Jobs consist of special “map” and “reduce” operations Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 10 Scalable Cluster Rack Rack Rack Node Node Node Node ... cpu Execution disk Storage Virtual execution and storage layers span many nodes (servers) Scales linearly (sort of) with cores and disks. Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 11 Reliable Each block is replicated, typically three times. Each task must succeed, or the job fails Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 12 Fault-tolerant Failed tasks are automatically retried. Failed data transfers are automatically retried. Servers can join and leave the cluster at any time. Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 13 Simple Node Node Node Node ... cpu Execution disk Storage Reduces complexity Conceptual “operating system” that spans many CPUs & disks Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 14 Typical Hadoop Cluster Has one “master” server - high quality, beefy box. NameNode process - manages file system JobTracker process - manages tasks Has multiple “slave” servers - commodity hardware. DataNode process - manages file system blocks on local drives TaskTracker process - runs tasks on server Uses high speed network between all servers Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 15 Architectural Components Solid boxes are unique applications NameNode DataNode Dashed boxes are child JVM instances DataNode DataNode DataNode data block (on same node as parent) Dotted boxes are blocks of managed files (on same node as parent) mapper mapper child jvm mapper child jvm child jvm Client JobTracker jobs tasks TaskTracker reducer reducer child jvm reducer child jvm child jvm Master Slaves Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 16 Distributed File System A very short Intro to photo by: Graham Racher, flickr Hadoop Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 17 Virtual File System Treats many disks on many servers as one huge, logical volume Data is stored in 1...n blocks The “DataNode” process manages blocks of data on a slave. The “NameNode” process keeps track of file metadata on the master. Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 18 Replication Each block is stored on several different disks (default is 3) Hadoop tries to copy blocks to different servers and racks. Protects data against disk, server, rack failures. Reduces the need to move data to code. Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 19 Error Recovery Slaves constantly “check in” with the master. Data is automatically replicated if a disk or server “goes away”. Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 20 Limitations Optimized for streaming data in/out So no random access to data in a file Data rates ≈ 30% - 50% of max raw disk rate No append support currently Write once, read many Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 21 NameNode Runs on master node Is a single point of failure There are no built-in software hot failover mechanisms Maintains filesystem metadata, the “Namespace” files and hierarchical directories Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 22 DataNodes DataNode DataNode DataNode DataNode data block Files stored on HDFS are chunked and stored as blocks on DataNode Manages storage attached to the nodes that they run on Data never flows through NameNode, only DataNodes Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 23 Map-Reduce Paradigm A very short Intro to photo by: Graham Racher, flickr Hadoop Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 24 Definitions Key Value Pair -> two units of data, exchanged between Map & Reduce Map -> The ‘map’ function in the MapReduce algorithm user defined converts each input Key Value Pair to 0...n output Key Value Pairs Reduce -> The ‘reduce’ function in the MapReduce algorithm user defined converts each input Key + all Values to 0...n output Key Value Pairs Group -> A built-in operation that happens between Map and Reduce ensures each Key passed to Reduce includes all Values Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 25 All Together Mapper Mapper Shuffle Reducer Mapper Shuffle Reducer Mapper Shuffle Reducer Mapper Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 26 How MapReduce Works Map translates input to keys and values to new keys and values [K1,V1] Map [K2,V2] System Groups each unique key with all its values [K2,V2] Group [K2,{V2,V2,....}] Reduce translates the values of each unique key to new keys and values [K2,{V2,V2,....}] Reduce [K3,V3] Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 27 Canonical Example - Word Count Word Count Read a document, parse out the words, count the frequency of each word Specifically, in MapReduce With a document consisting of lines of text Translate each line of text into key = word and value = 1 e.g. <“the”,1> <“quick”,1> <“brown”,1> <“fox”,1> <“jumped”,1> Sum the values (ones) for each unique word Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 28 [1, When in the Course of human events, it becomes] [2, necessary for one people to dissolve the political bands] [3, which have connected them with another, and to assume] [K1,V1] [n, {k,k,k,k,...}] Map [K2,V2] [When,1] [in,1] [the,1] [Course,1] [k,1] User Group Defined [K2,{V2,V2,....}] [When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] [k,{v,v,...}] Reduce [When,4] [peope,6] [dissolve,3] [K3,V3] [connected,5] [k,sum(v,v,v,..)] Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 29 Divide & Conquer (splitting data) Because The Map function only cares about the current key and value, and The Reduce function only cares about the current key and its values Then A Mapper can invoke Map on an arbitrary number of input keys and values or just some fraction of the input data set A Reducer can invoke Reduce on an arbitrary number of the unique keys but all the values for that key Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 30 [K1,V1] [K2,V2] [K2,{V2,V2,...}] [K3,V3] Mapper Mapper Shuffle Reducer Mapper Shuffle Reducer Mapper Shuffle Reducer Mapper Mappers must complete before Reducers can begin split1 split2 split3 split4 ... part-00000 part-00001 part-000N file directory Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 31 Divide & Conquer (parallelizable) Because Each Mapper is independent and processes part of the whole, and Each Reducer is independent and processes part of the whole Then Any number of Mappers can run on each node, and Any number of Reducers can run on each node, and The cluster can contain any number of nodes Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 32 JobTracker jobs Client JobTracker Is a single point of failure Determines # Mapper Tasks from file splits via InputFormat Uses predefined value for # Reducer Tasks Client applications use JobClient to submit jobs and query status Command line use hadoop job <commands> Web status console use http://jobtracker-server:50030/ Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 33 TaskTracker mapper mapper child jvm mapper child jvm child jvm TaskTracker reducer reducer child jvm reducer child jvm child jvm Spawns each Task as a new child JVM Max # mapper and reducer tasks set independently Can pass child JVM opts via mapred.child.java.opts Can re-use JVM to avoid overhead of task initialization Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 34 Hadoop (sort of) Deep Thoughts A very short Intro to photo by: Graham Racher, flickr Hadoop Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 35 Avoiding Hadoop Hadoop is a big hammer - but not every problem is a nail Small data Real-time data Beware the Hadoopaphile Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 36 Avoiding Map-Reduce Writing Hadoop code is painful and error prone Hive & Pig are good solutions for People Who Like SQL Cascading is a good solution for complex, stable workflows Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 37 Leveraging the Eco-System Many open source projects built on top of Hadoop HBase - scalable NoSQL data store Sqoop - getting SQL data in/out of Hadoop Other projects work well with Hadoop Kafka/Scribe - getting log data in/out of Hadoop Avro - data serialization/storage format Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 38 Get Involved Join the mailing list - http://hadoop.apache.org/mailing_lists.html Go to user group meetings - e.g. http://hadoop.meetup.com/ Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 39 Learn More Buy the book - Hadoop: The Definitive Guide, 2nd edition Try the tutorials http://hadoop.apache.org/common/docs/current/mapred_tutorial.html Get training (danger, personal plug) http://www.scaleunlimited.com/training http://www.cloudera.com/hadoop-training Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011
  • 40 Resources Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html Users groups - e.g. http://hadoop.meetup.com/ Hadoop API - http://hadoop.apache.org/common/docs/current/api/ Hadoop: The Definitive Guide, 2nd edition by Tom White Cascading: http://www.cascading.org Datameer: http://www.datameer.com Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.Monday, December 19, 2011