An Intro to Hadoop

Moore’s law has finally hit the wall: CPU clock speeds have stalled, and in some cases even decreased, over the last few years. The industry is reacting with hardware that packs an ever-growing number of cores and with software that can leverage “grids” of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to easily take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is widely considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is being used in production by prominent names such as Yahoo, IBM, Amazon, Adobe, AOL, Facebook, and Hulu.

In this session, you’ll start by learning the vocabulary unique to the distributed computing space. Next, we’ll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We’ll then examine the auto-replicating, redundant, and self-healing HDFS filesystem. Finally, we’ll fire up several Hadoop nodes and watch our calculation get devoured live by the Hadoop grid. At the talk’s conclusion, you’ll feel equipped to take on any massive data set, and any processing job, your employer can throw at you with ease.
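
As a concrete illustration of the map and reduce functions described above, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes). It assumes a separate driver class, not shown, that wires the two classes into a Job and supplies the input and output paths.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: for each line of input, emit an intermediate (word, 1) pair.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: Hadoop has already grouped the pairs by key, so each call
        // sees one word and all of its counts; sum them for the final total.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }

The grouping between the two phases (Hadoop's shuffle and sort) is what delivers every count for a given word to a single reduce call.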

An Intro to Hadoop: Presentation Transcript

  • Hadoop
  • Talk Metadata
  • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google, Inc. (OSDI 2004). Seminal paper on MapReduce: http://labs.google.com/papers/mapreduce.html
  • MapReduce history
  • http://www.cern.ch/
  • Origins (http://en.wikipedia.org/wiki/Mapreduce)
  • Today (http://hadoop.apache.org)
  • Today (http://wiki.apache.org/hadoop/PoweredBy)
  • Today
  • Why Hadoop?
  • 1 tb / 4 gb / $74.85
  • $10,000 vs $1,000
  • Buy your way out of failure vs. Failure is inevitable: go cheap (This concept doesn’t work well at weddings or dinner parties)
  • Sproinnnng! Bzzzt! Crrrkt! http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/ http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
  • Structured / Unstructured
  • NOSQL (http://nosql.mypopescu.com/)
  • Applications
  • Particle Physics (http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg)
  • Financial Trends
  • Contextual Ads
  • Hadoop Family
  • Hadoop Components
  • the Players / the PlayAs http://www.flickr.com/photos/mandj98/3804322095/ http://www.flickr.com/photos/8583446@N05/3304141843/ http://www.flickr.com/photos/joits/219824254/ http://www.flickr.com/photos/streetfly_jz/2312194534/ http://www.flickr.com/photos/sybrenstuvel/2811467787/ http://www.flickr.com/photos/lacklusters/2080288154/
  • MapReduce
  • The process
  • Start / Map
  • Grouping
  • Reduce
  • MapReduce Demo
  • HDFS
  • HDFS Basics
  • Data Overload
  • HDFS Demo (see the FileSystem API sketch below)
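
For readers following along without the live demo, the sketch below shows roughly what such an HDFS session does through Hadoop's Java FileSystem API: write a file, then read it back. The NameNode URI (hdfs://localhost:9000) and the path /demo/hello.txt are placeholder values for a single-node setup, not values from the talk.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address for a single-node setup;
            // 'fs.default.name' is the 0.20-era key (later renamed fs.defaultFS).
            conf.set("fs.default.name", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/demo/hello.txt");

            // Write a small file; HDFS replicates its blocks across DataNodes.
            FSDataOutputStream out = fs.create(path, true);
            out.writeBytes("hello from hdfs\n");
            out.close();

            // Read it back.
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
            System.out.println(in.readLine());
            in.close();

            fs.close();
        }
    }
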
  • Pig
  • Pig Basics (http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code)
  • Pig Sample
    A = load 'passwd' using PigStorage(':');
    B = foreach A generate $0 as id;
    dump B;
    store B into 'id.out';
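
For context, the sample reads a Unix-style passwd file, using the PigStorage loader to split each line on ':'; the foreach projects the first field ($0) of every record and names it id; dump prints the resulting relation to the console, and store writes it to the output location 'id.out'. A script like this can be run in local mode with pig -x local, or against the cluster by omitting that flag.
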
  • Pig demo
  • Hive
  • Hive Basics
  • Sqoop (http://www.cloudera.com/hadoop-sqoop)
  • HBase
  • HBase Basics
  • HBase Demo (see the Java client sketch below)
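
To give a flavour of what an HBase demo exercises, here is a hedged sketch using the Java client API of the era covered by this talk (HTable, Put, Get; later HBase releases replace HTable with the Connection/Table interfaces). The table name, column family, and row key are made-up examples, not objects from the presentation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for the cluster address.
            Configuration conf = HBaseConfiguration.create();

            // 'talks' and the 'info' column family are hypothetical and assumed
            // to have been created already (e.g. via the HBase shell).
            HTable table = new HTable(conf, "talks");

            // Store one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("intro-to-hadoop"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("title"),
                    Bytes.toBytes("An Intro to Hadoop"));
            table.put(put);

            // Fetch the row back and pull the cell value out of the Result.
            Result row = table.get(new Get(Bytes.toBytes("intro-to-hadoop")));
            byte[] title = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("title"));
            System.out.println(Bytes.toString(title));

            table.close();
        }
    }
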
  • Amazon Elastic MapReduce: launched in April 2009; results can be saved to S3 buckets (http://aws.amazon.com/elasticmapreduce/#functionality)
  • EMR Languages
  • EMR Pricing: pay for both columns additively, i.e. the normal EC2 instance rate plus the Elastic MapReduce charge on top
  • EMR Functions
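
In broad strokes, an Elastic MapReduce run of that era worked like this: upload the input data and the job itself (a Hadoop JAR, or mapper and reducer scripts for the streaming API) to S3, start a job flow that provisions a temporary Hadoop cluster on EC2, and collect the output from the S3 bucket named when the job flow was created. Billing combines the normal EC2 instance-hour rate with an additional Elastic MapReduce charge per instance-hour, which is what the pricing slide's two columns refer to.
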
  • Final Thoughts
  • “Ha! Your Hadoop is slower than my Hadoop!” “Shut up! I’m reducing.” (http://www.flickr.com/photos/robryb/14826486/sizes/l/)
  • Does this Hadoop make my list look ugly? (http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/)
  • Use Hadoop! (http://www.flickr.com/photos/robryb/14826417/sizes/l/)
  • Thanks
  • Hadoop
  • Contact
  • Credits