Hadoop at JavaZone 2010

Matthew McCullough's presentation of Hadoop to the JavaZone 2010 conference in Oslo, Norway


  1. 1. Hadoop divide and conquer petabyte-scale data © Matthew McCullough, Ambient Ideas, LLC
  2. 2. Metadata Matthew McCullough Ambient Ideas, LLC matthewm@ambientideas.com http://ambientideas.com/blog @matthewmccull Code http://github.com/matthewmccullough/hadoop-intro
  3. 3. http://delicious.com/matthew.mccullough/hadoop
  4. 4. “I use Hadoop often. That’s the sound my TiVo makes every time I skip a commercial.” –Brian Goetz, author of Java Concurrency in Practice
  5. 5. Why?
  6. 6. The computer you are using right now may very well have the fastest GHz processor you’ll ever own
  7. 7. Scale up?
  8. 8. Scale out
  9. 9. Talk Roadmap History MapReduce Filesystem DSLs
  10. 10. History
  11. 11. Doug Cutting
  12. 12. open source
  13. 13. Lucene
  14. 14. Lucene Nutch
  15. 15. Lucene Nutch Hadoop
  16. 16. Lucene Nutch Hadoop
  17. 17. Lucene Mahout Nutch Hadoop
  18. 18. Today 0.21.0 current version Dozens of companies contributing Hundreds of companies using
  19. 19. Virtual Machine
  20. 20. VM Sources Yahoo True to the OSS distribution Cloudera Desktop tools Both VMware based
  21. 21. Set up the Environment • Launch “cloudera-training 0.3.3” VMware instance • Open a terminal window Lab
  22. 22. Log Files •Attach tail to all the Hadoop logs Lab
  23. 23. processing / Data
  23. 23. processing / Data
  24. 24. Processing Framework Data Storage MapReduce HDFS Hive QL Hive Chukwa HBase Flume Avro ZooKeeper Hybrid Pig
  25. 25. Hadoop Components
  26. 26. Hadoop Components — Tool: Purpose — Common: MapReduce; HDFS: Filesystem; Pig: Analyst MapReduce language; HBase: Column-oriented data storage; Hive: SQL-like query language; ZooKeeper: Workflow & distributed transactions; Chukwa: Log file processing
  27. 27. The Players: Common, Chukwa, ZooKeeper, HBase, Hive, HDFS
  28. 28. Motivations
  29. 29. Original Goals Web Indexer Yahoo Search Engine
  30. 30. Pre-Hadoop Shortcomings Search engine update frequency Storage costs Expense of durability Ad-hoc queries
  31. 31. Anti-Patterns Cured RAM-heavy RDBMS boxes Sharding Archiving Ever-smaller-range SQL queries
  32. 32. 1 Gigabyte
  33. 33. 1 Terabyte
  34. 34. 1 Petabyte
  35. 35. 16 Petabytes
  36. 36. Near-Linear Hardware Scalability
  37. 37. Applications Protein folding pharmaceutical research Search Engine Indexing walking billions of web pages Product Recommendations based on other customer purchases Sorting terabytes to petabytes in size Classification government intelligence
  38. 38. Contextual Ads
  39. 39. SELECT reccProd.name, reccProd.id FROM products reccProd WHERE purchases.customerId = (SELECT customerId FROM customers WHERE purchases.productId = thisProd) LIMIT 5
  40. 40. 30% of Amazon sales are from recommendations
  41. 41. ACID ATOMICITY CONSISTENCY ISOLATION DURABILITY
  42. 42. CAP Consistency Availability Partition Tolerance
  43. 43. MapReduce
  44. 44. MapReduce: Simplified Data Processing on Large Clusters — Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com), Google, Inc. To appear in OSDI 2004. Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day. (The slide shows the first page of the paper; the rest of the page introduces its contributions and structure.)
  45. 45. “A programming model and implementation for processing and generating large data sets”
  46. 46. MapReduce a word counting conceptual example
  47. 47. The Goal Provide the occurrence count of each distinct word across all documents
  48. 48. Raw Data a folder of documents mydoc1.txt mydoc2.txt mydoc3.txt
  49. 49. Map break documents into words
  50. 50. Shuffle physically group (relocate) by key
  51. 51. Reduce count word occurrences
  52. 52. Reduce Again sort occurrences alphabetically
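The four phases of slides 46–52 can be sketched as a plain-Java, in-process simulation. This is illustrative only — a real Hadoop job would express the map and reduce steps as Mapper and Reducer classes — and the document contents here are invented:

```java
import java.util.*;

public class WordCountPhases {
    // Map -> Shuffle -> Reduce, simulated in a single process
    static SortedMap<String, Integer> count(List<String> docs) {
        // Map: break documents into (word, 1) pairs
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String doc : docs)
            for (String word : doc.split("\\s+"))
                mapped.add(Map.entry(word, 1));

        // Shuffle: physically group the pairs by key (the word)
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> e : mapped)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce: sum occurrences per word; the TreeMap plays the role of the
        // second reduce that sorts the results alphabetically
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(), e.getValue().stream().mapToInt(i -> i).sum());
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the quick fox", "the lazy dog", "the fox");
        System.out.println(count(docs));  // {dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

On a grid, the shuffle is the only phase that moves data between machines; the map and reduce steps run where the data already lives.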
  53. 53. Grep.java
  54. 54.
package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}  // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
  55. 55.
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}  // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    Path tempDir = new Path("grep-temp-" +
        Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
    JobConf grepJob = new JobConf(getConf(), Grep.class);
    try {
      grepJob.setJobName("grep-search");
      FileInputFormat.setInputPaths(grepJob, args[0]);
  56. 56.
      grepJob.setJobName("grep-search");
      FileInputFormat.setInputPaths(grepJob, args[0]);
      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);
      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);

      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");
      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);
  57. 57.
      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);

      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");
      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);
      sortJob.setMapperClass(InverseMapper.class);
      sortJob.setNumReduceTasks(1);              // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setOutputKeyComparatorClass(       // sort by decreasing freq
          LongWritable.DecreasingComparator.class);
      JobClient.runJob(sortJob);
    }
  58. 58.
      JobClient.runJob(grepJob);

      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");
      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);
      sortJob.setMapperClass(InverseMapper.class);
      sortJob.setNumReduceTasks(1);              // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setOutputKeyComparatorClass(       // sort by decreasing freq
          LongWritable.DecreasingComparator.class);
      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }
  59. 59. RegexMapper.java
  60. 60.
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
  61. 61.
/** A {@link Mapper} that extracts text matching a regular expression. */
public class RegexMapper<K> extends MapReduceBase
    implements Mapper<K, Text, Text, LongWritable> {

  private Pattern pattern;
  private int group;

  public void configure(JobConf job) {
    pattern = Pattern.compile(job.get("mapred.mapper.regex"));
    group = job.getInt("mapred.mapper.regex.group", 0);
  }

  public void map(K key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      output.collect(new Text(matcher.group(group)), new LongWritable(1));
    }
  }
}
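The core of RegexMapper’s map() — emit a (match, 1) pair for every regex hit, which a LongSumReducer then totals per key — can be exercised without a cluster. This sketch stands in a List for Hadoop’s OutputCollector; the class and method names are mine, not Hadoop’s:

```java
import java.util.*;
import java.util.regex.*;

public class RegexMapDemo {
    // Mimics RegexMapper.map(): collect (matched text, 1) for each regex hit
    static List<Map.Entry<String, Long>> map(String line, String regex, int group) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(line);
        while (matcher.find())
            out.add(Map.entry(matcher.group(group), 1L));
        return out;
    }

    public static void main(String[] args) {
        // Same shape of intermediate output a reducer would later sum per key
        System.out.println(map("cat bat cat", "[cb]at", 0));
        // → [cat=1, bat=1, cat=1]
    }
}
```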
  62. 62. Have Code, Will Travel Code travels to the data Opposite of traditional systems
  63. 63. Filesystem
  64. 64. “scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware”
  65. 65. HDFS Basics Open Source implementation of Google’s GFS (Google File System) Replicated data store Stored in 64MB blocks
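As a back-of-the-envelope sketch of what those defaults mean in practice — the file size is invented, and a redundancy factor of 3 is the stock setting:

```java
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024;  // 64 MB default block size

    // Number of 64 MB blocks needed to hold a file of the given size
    static long blocks(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;  // ceiling division
    }

    public static void main(String[] args) {
        long oneTerabyte = 1024L * 1024 * 1024 * 1024;
        long n = blocks(oneTerabyte);
        System.out.println(n + " blocks");                  // 16384 blocks
        System.out.println(n * 3 + " block replicas");      // 49152, with redundancy factor 3
    }
}
```

Each replica lands on a different DataNode (rack-aware), which is what makes the parallel and redundant reads on the next slide possible.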
  66. 66. HDFS Rack location aware Configurable redundancy factor Self-healing Looks almost like *NIX filesystem
  67. 67. Why HDFS? Random reads Parallel reads Redundancy
  68. 68. Data Overload
  69. 69. Test HDFS • Upload a file • List directories Lab
  70. 70. HDFS Upload • Show contents of file in HDFS • Show vocabulary of HDFS Lab
  71. 71. HDFS Distcp • Copy file to other nodes • MapReduce job starts Lab
  72. 72. HDFS Challenges Writes are re-writes today Append is planned Block size, alignment Small file inefficiencies NameNode SPOF
  73. 73. FSCK •Check filesystem Lab
  74. 74. NO SQL
  75. 75. Data Categories
  76. 76. Structured Semi-structured Unstructured
  77. 77. NOSQL is... Semi-structured not death of the RDBMS no JOINing no Normalization big-data tools solving different issues than RDBMSes
  78. 78. Grid Benefits
  79. 79. Scalable Data storage is pipelined Code travels to data Near linear hardware scalability
  80. 80. Optimization Preemptive execution Maximizes use of faster hardware New jobs trump P.E.
  81. 81. Fault Tolerant Configurable data redundancy Minimizes hardware failure impact Automatic job retries Self healing filesystem
  82. 82. Sproinnnng! Bzzzt! Poof!
  83. 83. Server Funerals No pagers go off when machines die Report of dead machines once a week Clean out the carcasses
  84. 84. Robustness attributes prevented from bleeding into application code Data redundancy Node death Retries Data geography Parallelism Scalability
  85. 85. Node TYPES
  86. 86. NameNode SecondaryNameNode DataNode JobTracker TaskTracker
  87. 87. Robust, Primary 1 Hot Backup 0..1 NameNode NFS Secondary NameNode JobTracker JobTracker Commodity Commodity 1..N 1..N TaskTracker TaskTracker DataNode DataNode
  88. 88. NameNode Catalog of data blocks Prefers large files High memory consumption Journaled file system Saves state to disk
  89. 89. Robust, Primary 1 NameNode JobTracker Commodity 1..N TaskTracker DataNode
  90. 90. Secondary NameNode Near-hot backup for NN Snapshots Missing journal entries Data loss in non-optimal setup Grid reconfig? Big IP device
  91. 91. Robust, Primary 1 Hot Backup 0..1 NameNode Secondary NameNode JobTracker JobTracker
  92. 92. NameNode, Cont’d Backup via SNN Journal to NFS with replication? Single point of failure Failover plan Big IP device, virtual IP
  93. 93. Robust, Primary 1 Hot Backup 0..1 NameNode NFS Secondary NameNode JobTracker JobTracker Commodity 1..N TaskTracker Virtual IP DataNode
  94. 94. DataNode Anonymous Data block storage “No identity” is positive trait Commodity equipment
  95. 95. Robust, Primary 1 Hot Backup 0..1 NameNode NFS Secondary NameNode JobTracker JobTracker Commodity Commodity 1..N 1..N TaskTracker TaskTracker DataNode DataNode
  96. 96. JobTracker Singleton Commonly co-located with NN Coordinates TaskTracker Load balances FairPlay scheduling Preemptive execution
  97. 97. Robust, Primary 1 Hot Backup 0..1 NameNode NFS Secondary NameNode JobTracker JobTracker Commodity Commodity 1..N 1..N TaskTracker TaskTracker DataNode DataNode
  98. 98. TaskTracker Runs MapReduce jobs Reports back to JobTracker
  99. 99. Robust, Primary 1 Hot Backup 0..1 NameNode NFS Secondary NameNode JobTracker JobTracker Commodity Commodity 1..N 1..N TaskTracker TaskTracker DataNode DataNode
  100. 100. Listing Nodes •Use Java’s JPS tool to get list of nodes on box •sudo jps -l Lab
  101. 101. Processing Nodes Anonymous “No identity” is positive trait Commodity equipment
  102. 102. Master Node Master is a special machine Use high quality hardware Single point of failure Recoverable
  103. 103. Key-value Store
  104. 104. HBase
  105. 105. HBase Basics Map-oriented storage Key value pairs Column families Stores to HDFS Fast Usable for synchronous responses
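Conceptually the storage model above is a sorted map of row key → column family → column qualifier → value. A toy in-memory analogue (this uses no real HBase API; the names are invented for illustration):

```java
import java.util.*;

public class ToyColumnStore {
    // row key -> column family -> column qualifier -> value
    private final SortedMap<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    void put(String row, String family, String qualifier, String value) {
        rows.computeIfAbsent(row, r -> new HashMap<>())
            .computeIfAbsent(family, f -> new HashMap<>())
            .put(qualifier, value);
    }

    String get(String row, String family, String qualifier) {
        return rows.getOrDefault(row, Map.of())
                   .getOrDefault(family, Map.of())
                   .get(qualifier);
    }

    public static void main(String[] args) {
        ToyColumnStore t = new ToyColumnStore();
        t.put("r2", "mycf", "q1", "x");
        System.out.println(t.get("r2", "mycf", "q1"));  // x
    }
}
```

The sorted outer map is why key lookups and key-range scans are fast while arbitrary column filters (next slides) degenerate to full-table scans.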
  106. 106. Direct Comparison Amazon SimpleDB Google BigTable (Datastore)
  107. 107. Competitors Facebook Cassandra LinkedIn Project Voldemort
  108. 108. Shortcomings HQL WHERE clause limited to key Other column filters are full-table scans No ad-hoc queries
  109. 109. hbase> help create 'mylittletable', 'mylittlecolumnfamily' describe 'mylittletable' put 'mylittletable', 'r2', 'mylittlecolumnfamily', 'x' get 'mylittletable', 'r2' scan 'mylittletable'
  110. 110. HBase • Create column family • Insert data • Select data Lab
  111. 111. DSLs
  112. 112. Pig
  113. 113. Pig Basics Yahoo-authored add-on High-level language for authoring data analysis programs Console
  114. 114. PIG Questions Ask big questions on unstructured data How many ___? Should we ____? Decide on the questions you want to ask long after you’ve collected the data.
  115. 115. Pig Data 999991,female,Mary,T,Hargrave,600 Quiet Valley Lane,Los Angeles,CA,90017,US,Mary.T.Hargrave@dodgit.com, 999992,male,Harold,J,Candelario,294 Ford Street,OAKLAND,CA,94607,US,Harold.J.Candelario@dodgit.com,ad2U 999993,female,Ruth,G,Carter,4890 Murphy Court,Shakopee,MN,55379,US,Ruth.G.Carter@mailinator.com,uaseu8e 999994,male,Lionel,J,Carter,2701 Irving Road,Saint Clairsville,OH,43950,US,Lionel.J.Carter@trashymail.c 999995,female,Georgia,C,Medina,4541 Cooks Mine Road,CLOVIS,NM,88101,US,Georgia.C.Medina@trashymail.com, 999996,male,Stanley,S,Cruz,1463 Jennifer Lane,Durham,NC,27703,US,Stanley.S.Cruz@pookmail.com,aehoh5rooG 999997,male,Justin,A,Delossantos,3169 Essex Court,MANCHESTER,VT,05254,US,Justin.A.Delossantos@mailinato 999998,male,Leonard,K,Baker,4672 Margaret Street,Houston,TX,77063,US,Leonard.K.Baker@trashymail.com,Aep 999999,female,Charissa,J,Thorne,2806 Cedar Street,Little Rock,AR,72211,US,Charissa.J.Thorne@trashymail. 1000000,male,Michael,L,Powell,2797 Turkey Pen Road,New York,NY,10013,US,Michael.L.Powell@mailinator.com
  116. 116. Pig Sample Person = LOAD 'people.csv' using PigStorage(','); Names = FOREACH Person GENERATE $2 AS name; OrderedNames = ORDER Names BY name ASC; GroupedNames = GROUP OrderedNames BY name; NameCount = FOREACH GroupedNames GENERATE group, COUNT(OrderedNames); store NameCount into 'names.out';
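For contrast, the projection/group/count that the six-line Pig script expresses takes noticeably more ceremony in plain Java. The rows below are invented stand-ins, not the real people.csv:

```java
import java.util.*;

public class NameCount {
    // Equivalent of: GENERATE $2 AS name; GROUP BY name; COUNT(...)
    static SortedMap<String, Long> nameCounts(List<String> csvLines) {
        SortedMap<String, Long> counts = new TreeMap<>();  // sorted, like ORDER ... ASC
        for (String line : csvLines) {
            String name = line.split(",")[2];  // $2: third column, the first name
            counts.merge(name, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
            "1,female,Mary,T,Hargrave",
            "2,male,Harold,J,Candelario",
            "3,female,Mary,G,Carter");
        System.out.println(nameCounts(rows));  // {Harold=1, Mary=2}
    }
}
```

Pig compiles its version into MapReduce jobs over the whole grid; the Java above only shows the semantics, not the distribution.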
  117. 117. Pig Scripting •Re-column and sort data Lab
  118. 118. Hive
  119. 119. Hive Basics Authored by Facebook SQL-like interface to Hadoop data Hive is low-level Hive-specific metadata Data warehousing
  120. 120. hive -e 'select t1.x from mylittletable t1'
  121. 121. hive -S -e 'select a.col from tab1 a' > dump.txt
  122. 122. SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;
  123. 123. Load Hive Data • Load and select Hive data Lab
  124. 124. Sync, Async RDBMS SQL is realtime Hive is primarily asynchronous
  125. 125. Sqoop
  126. 126. Sqoop A utility Imports from RDBMS Outputs plaintext, SequenceFile, or Hive
  127. 127. sqoop --connect jdbc:mysql://database.example.com/
  128. 128. Data Warehousing
  129. 129. Sync, Async RDBMS is realtime HBase is near realtime Hive is asynchronous
  130. 130. Monitoring
  131. 131. Web Status Panels NameNode http://localhost:50070/ JobTracker http://localhost:50030/
  132. 132. View Consoles • Start job • Watch consoles • Browse HDFS via web interface Lab
  133. 133. Configuration Files
  134. 134. Robust, Primary 1 Hot Backup 0..1 NameNode NFS Secondary NameNode JobTracker JobTracker Commodity Commodity 1..N 1..N TaskTracker TaskTracker DataNode DataNode
  135. 135. Grid Execution
  136. 136. Single Node start-all.sh Needs XML config ssh account setup Multiple JVMs List of processes jps -l
  137. 137. Start and Stop Grid •cd /usr/lib/hadoop-0.20/bin •sudo -u hadoop ./stop-all.sh •sudo jps -l •tail •sudo -u hadoop ./start-all.sh •sudo jps -l Lab
  138. 138. SSH Setup create key: ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa then authorize it: cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  139. 139. (Diagram: on a single node, start-all.sh invokes start-dfs.sh and start-mapred.sh, launching the NameNode and JobTracker (robust, primary, 1) alongside the TaskTracker and DataNode (commodity, 1..N))
  140. 140. Complete Grid start-all.sh Host names, ssh setup Default out-of-box experience
  141. 141. (Diagram: start-all.sh on the master runs start-dfs.sh and start-mapred.sh, starting the NameNode and JobTracker on the robust primary node (1) and a TaskTracker and DataNode on each commodity node (1..N))
  142. 142. Grid on the Cloud
  143. 143. EMR Pricing
  144. 144. Summary
  145. 145. Sapir-Whorf Hypothesis... Remixed The ability to store and process massive data influences what you decide to store.
  146. 146. Hadoop divide and conquer petabyte-scale data © Matthew McCullough, Ambient Ideas, LLC
  147. 147. Metadata Matthew McCullough Ambient Ideas, LLC matthewm@ambientideas.com http://ambientideas.com/blog @matthewmccull Code http://github.com/matthewmccullough/hadoop-intro
