Hadoop v0.3.1

Matthew McCullough's Hadoop presentation to the Tampa JUG

Transcript

  • 1. Hadoop Divide and conquer gigantic data © Matthew McCullough, Ambient Ideas, LLC
  • 2. Talk Metadata Twitter @matthewmccull #HadoopIntro Matthew McCullough Ambient Ideas, LLC matthewm@ambientideas.com http://ambientideas.com/blog http://speakerrate.com/matthew.mccullough
  • 3. MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com), Google, Inc. To appear in OSDI 2004. [Slide shows the first page of the paper. The abstract describes MapReduce as a programming model and associated implementation for processing and generating large data sets: users specify a map function that produces intermediate key/value pairs and a reduce function that merges all values sharing the same key, while the runtime handles parallelization, fault tolerance, data distribution, and load balancing across large clusters of commodity machines.]
  • 4. [Zoomed view of the paper's title, authors, and abstract]
  • 5. [Closer zoom on the opening of the abstract]
  • 6. [Closer zoom on the abstract, continued]
  • 7. MapReduce history “A programming model and implementation for processing and generating large data sets”
  • 8. Origins Open-source implementation of Google's MapReduce Created by Doug Cutting OpenSource at Apache
  • 9. Today 0.20.1 current version Dozens of companies contributing Hundreds of companies using
  • 10. Why Hadoop?
  • 11. $74.85
  • 12. $74.85 4gb
  • 13. 1tb $74.85 4gb
  • 14. vs
  • 15. $10,000 vs $1,000
  • 16. vs
  • 17. Buy Your way out of Failure vs Failure is inevitable Go Cheap
  • 18. Failure is inevitable Go Cheap
  • 19. Sproinnnng! Bzzzt! Crrrkt!
  • 20. server Funerals No pagers go off when machines die Report of dead machines once a week Clean out the carcasses
  • 21. Robustness attributes prevented from bleeding into application code Data redundancy Node death Retries Data geography Parallelism Scalability
  • 22. Hadoop for what?
  • 23. Structured
  • 24. Structured Unstructured
  • 25. NOSQL
  • 26. NOSQL Death of the RDBMS is a lie
  • 27. NOSQL Death of the RDBMS is a lie NoJOINs
  • 28. NOSQL Death of the RDBMS is a lie NoJOINs NoNormalization
  • 29. NOSQL Death of the RDBMS is a lie NoJOINs NoNormalization Big-data tools are solving different issues than RDBMSes
  • 30. Applications
  • 31. Applications Protein folding (pharmaceuticals)
  • 32. Applications Protein folding (pharmaceuticals) Search engines
  • 33. Applications Protein folding (pharmaceuticals) Search engines Sorting
  • 34. Applications Protein folding (pharmaceuticals) Search engines Sorting Classification (government intelligence)
  • 35. Applications
  • 36. Applications Price search
  • 37. Applications Price search Steganography
  • 38. Applications Price search Steganography Analytics
  • 39. Applications Price search Steganography Analytics Primes (code breaking)
  • 40. Particle Physics
  • 41. Particle Physics Large Hadron Collider
  • 42. Particle Physics Large Hadron Collider 15 petabytes of data per year
  • 43. Financial Trends
  • 44. Financial Trends Daily trade performance analysis
  • 45. Financial Trends Daily trade performance analysis Market trending
  • 46. Financial Trends Daily trade performance analysis Market trending Uses employee desktops during off hours
  • 47. Financial Trends Daily trade performance analysis Market trending Uses employee desktops during off hours Fiscally responsible/economical
  • 48. Contextual Ads
  • 49. Contextual Ads
  • 50. 30% of Amazon sales are from recommendations
  • 51. Not right now...
  • 52. Not right now... Do you expect to tackle a very large problem before you: change jobs change industries retire die see the heat death of the universe
  • 53. In the next decade, the class (scale) of problems we are aiming to solve will grow exponentially.
  • 54. MapReduce
  • 55. MapReduce map then... um... reduce.
  • 56. The process
  • 57. The process Every item in dataset is parallel candidate for Map
  • 58. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2)
  • 59. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2) Collects and groups pairs from all lists by key
  • 60. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2) Collects and groups pairs from all lists by key Reduce in parallel on each group
  • 61. The process Every item in dataset is parallel candidate for Map Map(k1,v1) -> list(k2,v2) Collects and groups pairs from all lists by key Reduce in parallel on each group Reduce(k2, list(v2)) -> list(v3)
  • 62. FP For the Grid MapReduce Functional programming on a distributed processing platform
  • 63. The Goal
  • 64. The Goal Provide the occurrence count of each distinct word across all documents
  • 65. Start
  • 66. Map
  • 67. Grouping
  • 68. Reduce
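    To make the Map(k1,v1) -> list(k2,v2) and Reduce(k2, list(v2)) -> list(v3) signatures concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API (org.apache.hadoop.mapreduce). Class names such as WordCountSketch, TokenMapper, and SumReducer are illustrative, the driver/job setup is omitted, and details may differ slightly between Hadoop versions.

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class WordCountSketch {

          // Map(k1,v1) -> list(k2,v2): emit (word, 1) for every word in the input line
          public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
              for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                  word.set(token);
                  context.write(word, ONE);   // one pair per word occurrence
                }
              }
            }
          }

          // Reduce(k2, list(v2)) -> list(v3): sum the 1s grouped under each word
          public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable c : counts) {
                sum += c.get();
              }
              context.write(word, new IntWritable(sum));  // (word, total occurrences)
            }
          }
        }

    A driver class would wire these two classes into a Job, point it at input and output paths in HDFS, and submit it to the cluster, which is what the demo on the next slide walks through.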
  • 69. MapReduce Demo
  • 70. Have Code, Will Travel Code travels to the data Opposite of traditional systems
  • 71. Speed Test
  • 72. Competition TeraSort Jim Gray, MSFT 1985 paper Derived sort benchmark http://sortbenchmark.org/ 209 seconds (2007) 120 seconds (2009)
  • 73. Nodes
  • 74. Processing Nodes Anonymous “No identity” is good Commodity equipment
  • 75. Master Node Master is a special machine Use high quality hardware Single point of failure But recoverable
  • 76. Hadoop Family
  • 77. Hadoop Components Pig Hive Core Common Chukwa HBase HDFS
  • 78. the Players
  • 79. the PlayAs
  • 80. the PlayAs Chukwa ZooKeeper Common HBase Hive HDFS
  • 81. HDFS
  • 82. HDFS Basics
  • 83. HDFS Basics Based on Google's GFS (Google File System)
  • 84. HDFS Basics Based on Google's GFS (Google File System) Replicated data store
  • 85. HDFS Basics Based on Google's GFS (Google File System) Replicated data store Stored in 64MB blocks
  • 86. Data Overload
  • 87. Data Overload
  • 88. Data Overload
  • 89. Data Overload
  • 90. Data Overload
  • 91. HDFS Replicating Rack-location aware Configurable redundancy factor Self-healing
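    As a sketch of how the configurable redundancy factor shows up in practice, the standard HDFS client API (org.apache.hadoop.fs.FileSystem) lets a client set the default replication through configuration or change it per file. The path and factor values below are only examples, assuming the usual dfs.replication property.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReplicationSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "3");           // redundancy factor for files this client creates

            FileSystem fs = FileSystem.get(conf);       // connects to the NameNode named in the site config
            Path important = new Path("/data/important.log");   // example path

            // Keep extra copies of a particularly valuable file; if a node holding
            // a copy dies, HDFS re-replicates the blocks elsewhere (self-healing).
            fs.setReplication(important, (short) 5);

            fs.close();
          }
        }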
  • 92. HDFS Demo
  • 93. Pig
  • 94. Pig Basics Yahoo-authored add-on DSL & tool Origin: Pig Latin Analyzes large data sets High-level language for expressing data analysis programs
  • 95. Pig Questions Ask big questions on unstructured data How many ___? Should we ____? Decide on the questions you want to ask long after you've collected the data.
  • 96. Pig Sample
        A = load 'passwd' using PigStorage(':');  -- split each line of the passwd file on ':'
        B = foreach A generate $0 as id;          -- keep only the first field (the user name)
        dump B;                                   -- print the result to the console
        store B into 'id.out';                    -- also write the result out to storage
  • 97. Pig demo
  • 98. HBase
  • 99. HBase Basics Structured data store Notice we didn’t say relational Relies on ZooKeeper and HDFS
  • 100. NoSQL Voldemort Google BigTable MongoDB HBase
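    A small sketch of what "structured but not relational" looks like from the classic HBase Java client: one cell written and read back by row key, column family, and qualifier. The table and column names are made up, and factory and method names vary a bit across HBase releases (for example, Put.add was later renamed addColumn).

        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseSketch {
          public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml (ZooKeeper quorum, etc.) from the classpath
            HTable table = new HTable(HBaseConfiguration.create(), "users");

            // Write: row key "user42", column family "info", qualifier "email"
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("m@example.com"));
            table.put(put);

            // Read the same cell back
            Result row = table.get(new Get(Bytes.toBytes("user42")));
            byte[] email = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));

            table.close();
          }
        }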
  • 101. HBase Demo
  • 102. Hive
  • 103. Hive Basics Authored by Facebook SQL-like interface to data in Hadoop Hive is low-level Hive-specific metadata
  • 104. Sqoop Sqoop, by Cloudera, is higher level Importing from RDBMS to Hive sqoop --connect jdbc:mysql://database.example.com/
  • 105. Sync, Async RDBMS SQL is realtime Hadoop is primarily asynchronous
  • 106. Hadoop on Amazon Web Services
  • 107. Amazon Elastic MapReduce
  • 108. Amazon Elastic MapReduce Hosted Hadoop clusters
  • 109. Amazon Elastic MapReduce Hosted Hadoop clusters True use of cloud computing
  • 110. Amazon Elastic MapReduce Hosted Hadoop clusters True use of cloud computing Easy to set up
  • 111. Amazon Elastic MapReduce Hosted Hadoop clusters True use of cloud computing Easy to set up Pay per use
  • 112. EMR Languages Supports applications in... Java PHP Perl R Ruby C++ Python
  • 113. EMR Languages Supports applications in... Java PHP Perl R Ruby C++ Python
  • 114. EMR Pricing
  • 115. EMR Pricing
  • 116. EMR Functions RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing. DescribeJobFlows: Provides status of your job flow request(s). AddJobFlowSteps: Adds additional steps to an already running job flow. TerminateJobFlows: Terminates a running job flow and shuts down all instances.
  • 117. EMR Functions RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing. DescribeJobFlows: Provides status of your job flow request(s). AddJobFlowSteps: Adds additional steps to an already running job flow. TerminateJobFlows: Terminates a running job flow and shuts down all instances.
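    For a sense of what calling these functions looks like in code, here is a hedged sketch using the AWS SDK for Java's Elastic MapReduce client (com.amazonaws.services.elasticmapreduce). The credentials, instance types, and S3 paths are placeholders, and the exact request and response classes depend on the SDK version in use.

        import com.amazonaws.auth.BasicAWSCredentials;
        import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
        import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
        import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
        import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
        import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
        import com.amazonaws.services.elasticmapreduce.model.StepConfig;

        public class EmrSketch {
          public static void main(String[] args) {
            AmazonElasticMapReduceClient emr =
                new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

            // One step: run a word-count jar already uploaded to S3 (placeholder paths)
            StepConfig wordCount = new StepConfig()
                .withName("word-count")
                .withHadoopJarStep(new HadoopJarStepConfig()
                    .withJar("s3://my-bucket/wordcount.jar")
                    .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

            // RunJobFlow: creates the job flow, starts EC2 instances, and begins processing
            RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("hadoop-demo")
                .withLogUri("s3://my-bucket/logs/")
                .withSteps(wordCount)
                .withInstances(new JobFlowInstancesConfig()
                    .withInstanceCount(4)
                    .withMasterInstanceType("m1.large")
                    .withSlaveInstanceType("m1.large"));

            RunJobFlowResult result = emr.runJobFlow(request);
            System.out.println("Started job flow " + result.getJobFlowId());
            // DescribeJobFlows, AddJobFlowSteps, and TerminateJobFlows follow the same
            // request/response pattern against the same client.
          }
        }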
  • 118. Final Thoughts
  • 119. “Ha! Your Hadoop is slower than my Hadoop!” “Shut up! I’m reducing.”
  • 120. The RDBMS is not dead Has new friends, helpers NoSQL is taking the world by storm No more throwing away perfectly good historical data
  • 121. Failure is acceptable
  • 122. Failure is acceptable ❖ Failure is inevitable
  • 123. Failure is acceptable ❖ Failure is inevitable ❖ Go cheap
  • 124. Failure is acceptable ❖ Failure is inevitable ❖ Go cheap ❖ Go distributed
  • 125. Use Hadoop!
  • 126. Hadoop Divide and conquer gigantic data Matthew McCullough Email matthewm@ambientideas.com Twitter @matthewmccull Blog http://ambientideas.com/blog
  • 127. Credits
  • 128. http://www.fontspace.com/david-rakowski/tribeca
        http://www.cern.ch/
        http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
        http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
        http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
        http://www.flickr.com/photos/mandj98/3804322095/
        http://www.flickr.com/photos/8583446@N05/3304141843/
        http://www.flickr.com/photos/joits/219824254/
        http://www.flickr.com/photos/streetfly_jz/2312194534/
        http://www.flickr.com/photos/sybrenstuvel/2811467787/
        http://www.flickr.com/photos/lacklusters/2080288154/
        http://www.flickr.com/photos/robryb/14826417/sizes/l/
        http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
        http://www.flickr.com/photos/robryb/14826486/sizes/l/
        All others, iStockPhoto.com