
Introduction to Hadoop

PDF export of slides from my 2014 CodeMash presentation. Original reveal.js slides available at http://bit.ly/cm14_hadoop.


  1. INTRODUCTION TO HADOOP Keegan Witt (@keeganwitt)
  2. SLIDES http://bit.ly/cm14_hadoop
  3. WHO'S THE SCHLUB?
  4. AGENDA
  5. THINGS I'LL TALK ABOUT
     - Why Hadoop?
     - Hadoop ecosystem
     - Deploying Hadoop
     - Writing your first job
     - Testing your first job
     - Why not Hadoop?
     - Advanced usages
  6. THINGS I WON'T TALK ABOUT
     - Anything I lack prod experience in
     - Configuring & managing a Hadoop cluster
     - Querying & data mining (e.g. Hive, Pig, Mahout, Flume)
  7. WHY RIDE THE ELEPHANT? Source: Hadoop
  8. THE PROBLEM
     - Growing data
     - Disks are slow
     - Need higher throughput
     - More unstructured data
  9. DESIRABLE FEATURES
     - Scale out, not up
     - Easy to use
     - Built-in backups
     - Built-in fault tolerance
  10. USE CASES
     - Text mining/pattern recognition
     - Graph processing
     - Collaborative filtering
     - Clustering
  11. WHO ELSE IS RIDING? Amazon, AOL, Autodesk, eBay, Google*, Groupon, HP, IBM, Intel, J.P. Morgan, Last.fm, LinkedIn, NASA, Navteq, NSA, Rackspace, Samsung, StumbleUpon, Twitter, Visa, Yahoo
  12. CONTRIBUTORS Source: Cloudera
  13. WHAT IS HADOOP? Source: Unknown
  14. HADOOP ECOSYSTEM
  15. HDFS Source: Timo Elliot
  16. HDFS ARCHITECTURE Source: Hadoop
  17. HDFS ARCHITECTURE Source: Computer Geek Blog
  18. HBASE Source: eQuest
  19. HBASE ARCHITECTURE Source: Lars George's Blog
  20. HBASE HDFS STRUCTURE
     HFILES
         /hbase
             /<Table>
                 /<Region>
                     /<ColumnFamily>
                         /<StoreFile>
     HLOGS (WALS)
         /hbase
             /.logs
                 /<RegionServer>
                     /<HLog>
  21. LOGICAL VIEW Source: Manoj Khangaonkar's Blog
  22. MAPREDUCE
  23. DATA VIEW Source: Google
  24. SERVER VIEW Source: Hortonworks
  25. PHYSICAL VIEW Source: Microsoft
  26. DISTRIBUTING LOAD
  27. PROCESS VIEW Source: Rohit Menon's blog
  28. YARN & MAPREDUCE 2 Source: Hortonworks
  29. PARSING Source: Optimal.io
  30. SHUFFLE Source: Yahoo
  31. DEPLOYING HADOOP Source: Dilbert
  32. DEPLOYING HADOOP
     FOR EXPERIMENTING
     - From distribution's packages
     - From source
     - Cloudera QuickStart VM
     - Hortonworks Sandbox
     FOR REAL
     - Amazon EMR
     - Cloudera CDH
     - Hortonworks HDP
     - MapR
     - Microsoft HDInsight on Azure
  33. CONFIGURING HADOOP
     DEFAULTS
     - core-site.xml
     - hdfs-site.xml
     - mapred-site.xml
     - hbase-site.xml
     - hive-site.xml
     - yarn-site.xml
     OVERRIDING
     Configuration conf = new Configuration();
     conf.set("<optionKey>", "<optionValue>");
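     Overridden values travel with the job's Configuration, so tasks can read them back at runtime. Below is a minimal sketch of that round trip; the option key wordcount.case.sensitive is a made-up example, not a real Hadoop setting.

     import java.io.IOException;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Mapper;

     public class ConfigAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
         private static final IntWritable one = new IntWritable(1);
         private boolean caseSensitive;

         @Override
         protected void setup(Context context) {
             // Reads the value the driver set with conf.set(...); the second
             // argument is the default used when the option was never set.
             Configuration conf = context.getConfiguration();
             caseSensitive = conf.getBoolean("wordcount.case.sensitive", false);
         }

         @Override
         protected void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
             for (String token : line.split("\\s+")) {
                 if (!token.isEmpty()) {
                     context.write(new Text(token), one);
                 }
             }
         }
     }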
  34. WRITING YOUR FIRST JOB Source: CloudTweaks
  35. DRIVER Source: Hadoop (slightly modified)
     public class WordCount_Driver {
         public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             Job job = new Job(conf, "wordcount");
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             job.setMapperClass(WordCount_Mapper.class);
             job.setReducerClass(WordCount_Reducer.class);
             job.setInputFormatClass(TextInputFormat.class);
             job.setOutputFormatClass(TextOutputFormat.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             System.exit(job.waitForCompletion(true) ? 0 : 1);
         }
     }
  36. MAPPER Source: Hadoop
     public class WordCount_Mapper
       extends Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();
         public void map(LongWritable key, Text value, Context context)
           throws IOException, InterruptedException {
             String line = value.toString();
             StringTokenizer tokenizer = new StringTokenizer(line);
             while (tokenizer.hasMoreTokens()) {
                 word.set(tokenizer.nextToken());
                 context.write(word, one);
             }
         }
     }
  37. REDUCER Source: Hadoop
     public class WordCount_Reducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
         public void reduce(Text key, Iterable<IntWritable> values,
           Context context) throws IOException, InterruptedException {
             int sum = 0;
             for (IntWritable val : values) {
                 sum += val.get();
             }
             context.write(key, new IntWritable(sum));
         }
     }
  38. TESTING YOUR FIRST JOB
  39. MAP TEST Source: MRUnit (slightly modified)
     public class WordCount_Mapper_Test {
         private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
         @Before
         public void setUp() {
             WordCount_Mapper mapper = new WordCount_Mapper();
             mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
             mapDriver.setMapper(mapper);
         }
         @Test
         public void testMapper() {
             mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"))
                 .withOutput(new Text("cat"), new IntWritable(1))
                 .withOutput(new Text("cat"), new IntWritable(1))
                 .withOutput(new Text("dog"), new IntWritable(1))
                 .runTest();
         }
     }
  40. REDUCE TEST Source: MRUnit (slightly modified)
     public class WordCount_Reducer_Test {
         private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
         @Before
         public void setUp() {
             WordCount_Reducer reducer = new WordCount_Reducer();
             reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>();
             reduceDriver.setReducer(reducer);
         }
         @Test
         public void testReducer() {
             List<IntWritable> catValues = new ArrayList<IntWritable>();
             catValues.add(new IntWritable(1));
             catValues.add(new IntWritable(1));
             List<IntWritable> dogValues = new ArrayList<IntWritable>();
             dogValues.add(new IntWritable(1));
             reduceDriver.withInput(new Text("cat"), catValues)
                 .withInput(new Text("dog"), dogValues)
                 .withOutput(new Text("cat"), new IntWritable(2))
                 .withOutput(new Text("dog"), new IntWritable(1))
                 .runTest();
         }
     }
  41. MAPREDUCE TEST Source: MRUnit (slightly modified)
     public class WordCount_MapReduce_Test {
         private MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;
         @Before
         public void setUp() {
             WordCount_Mapper mapper = new WordCount_Mapper();
             WordCount_Reducer reducer = new WordCount_Reducer();
             mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>();
             mapReduceDriver.setMapper(mapper);
             mapReduceDriver.setReducer(reducer);
         }
         @Test
         public void testMapReduce() {
             mapReduceDriver.withInput(new LongWritable(1), new Text("cat cat dog"))
                 .addOutput(new Text("cat"), new IntWritable(2))
                 .addOutput(new Text("dog"), new IntWritable(1))
                 .runTest();
         }
     }
  42. WHAT ABOUT TDD?
  43. WHAT ABOUT SYSTEM TESTING? MiniCluster, HBaseTestingUtility (Sematext example)
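     A minimal sketch of how HBaseTestingUtility is commonly used from JUnit, assuming HBase's test jar is on the classpath; the table and column family names are made up for illustration.

     import org.apache.hadoop.hbase.HBaseTestingUtility;
     import org.apache.hadoop.hbase.client.HTable;
     import org.apache.hadoop.hbase.util.Bytes;
     import org.junit.AfterClass;
     import org.junit.BeforeClass;
     import org.junit.Test;

     public class WordCount_System_Test {
         private static HBaseTestingUtility utility;

         @BeforeClass
         public static void startCluster() throws Exception {
             // Spins up an in-process HDFS, ZooKeeper, and HBase mini cluster.
             utility = new HBaseTestingUtility();
             utility.startMiniCluster();
         }

         @AfterClass
         public static void stopCluster() throws Exception {
             utility.shutdownMiniCluster();
         }

         @Test
         public void jobRunsAgainstMiniCluster() throws Exception {
             // Create a throwaway table, then run the code under test against it.
             HTable table = utility.createTable(Bytes.toBytes("test"), Bytes.toBytes("cf"));
             // ... exercise the job against utility.getConfiguration() here ...
         }
     }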
  44. DEMO
  45. WHY NOT RIDE THE ELEPHANT? Source: geek & poke
  46. WHY NOT RIDE THE ELEPHANT?
     - Request/response model
     - External clients
     - Not much data
     - Young
  47. BEYOND WORD COUNT
  48. DEPENDENCIES
     - HADOOP_CLASSPATH
     - Überjar
     - -libjars
     CLASSPATH ORDERING
     - HADOOP_USER_CLASSPATH_FIRST
     - mapreduce.task.classpath.first -> true
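     Note that -libjars (like -files and -D) is parsed by GenericOptionsParser, which only happens when the driver implements Tool and is launched through ToolRunner. A minimal sketch of the word-count driver reworked that way, reusing the mapper and reducer classes from earlier:

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.conf.Configured;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
     import org.apache.hadoop.util.Tool;
     import org.apache.hadoop.util.ToolRunner;

     public class WordCount_ToolDriver extends Configured implements Tool {
         @Override
         public int run(String[] args) throws Exception {
             // getConf() already has -D, -libjars, -files, etc. applied by ToolRunner.
             Job job = new Job(getConf(), "wordcount");
             job.setJarByClass(WordCount_ToolDriver.class);
             job.setMapperClass(WordCount_Mapper.class);
             job.setReducerClass(WordCount_Reducer.class);
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             return job.waitForCompletion(true) ? 0 : 1;
         }

         public static void main(String[] args) throws Exception {
             System.exit(ToolRunner.run(new Configuration(), new WordCount_ToolDriver(), args));
         }
     }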
  49. CUSTOM COUNTERS
     public enum KeegansCounters {
         FOO,
         BAR;
     }
     // ...
     context.getCounter(KeegansCounters.FOO).increment(1);
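     Counter values can also be read back in the driver once the job completes; a minimal sketch (the helper class name is made up):

     import org.apache.hadoop.mapreduce.Counter;
     import org.apache.hadoop.mapreduce.Job;

     public class CounterReport {
         // Call after job.waitForCompletion(true) returns in the driver.
         public static void print(Job job) throws Exception {
             Counter foo = job.getCounters().findCounter(KeegansCounters.FOO);
             Counter bar = job.getCounters().findCounter(KeegansCounters.BAR);
             System.out.println("FOO=" + foo.getValue() + ", BAR=" + bar.getValue());
         }
     }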
  50. JOB FLOWS
     - Sequentially in main()
     - Use JobControl in main()
     - Multiple Hadoop jar commands
     - Oozie
     - Azkaban
     - ChainMapper & ChainReducer
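     A minimal sketch of the simplest option from the list above: chaining two jobs sequentially in main(), with the first job's output directory feeding the second (paths and job names are made up).

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class TwoStepFlow {
         public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             Path input = new Path(args[0]);
             Path intermediate = new Path(args[1]);
             Path output = new Path(args[2]);

             Job first = new Job(conf, "step-1");
             // ... set mapper/reducer/key/value classes for step 1 ...
             FileInputFormat.addInputPath(first, input);
             FileOutputFormat.setOutputPath(first, intermediate);
             if (!first.waitForCompletion(true)) {
                 System.exit(1);  // bail out; don't start step 2 on a failed step 1
             }

             Job second = new Job(conf, "step-2");
             // ... set mapper/reducer/key/value classes for step 2 ...
             FileInputFormat.addInputPath(second, intermediate);
             FileOutputFormat.setOutputPath(second, output);
             System.exit(second.waitForCompletion(true) ? 0 : 1);
         }
     }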
  51. SQOOP PROCESS OVERVIEW Source: DevX
  52. SQOOPING DATA FROM RDBMSS
     sqoop import \
       --connect jdbc:mysql://foo.com/db \
       --table orders \
       --fields-terminated-by '\t' \
       --lines-terminated-by '\n'
  53. SQOOPING DATA INTO RDBMSS
     sqoop export \
       --connect jdbc:mysql://foo.com/db \
       --table bar \
       --export-dir /hdfs_path/bar_data
  54. COMPRESSING INTERMEDIATE DATA
     mapred.compress.map.output -> true
     mapred.map.output.compression.codec -> org.apache.hadoop.io.compress.SnappyCodec
     COMPRESSING OUTPUT
     FileOutputFormat.setCompressOutput(job, true);
     FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
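     Put together in one driver, the two halves might look like the sketch below; this assumes the pre-YARN mapred.* property names shown above and Snappy support in the cluster's native libraries.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.io.compress.BZip2Codec;
     import org.apache.hadoop.io.compress.CompressionCodec;
     import org.apache.hadoop.io.compress.SnappyCodec;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class CompressionSetup {
         public static Job configure() throws Exception {
             Configuration conf = new Configuration();
             // Compress the map -> reduce intermediate data with Snappy (fast).
             conf.setBoolean("mapred.compress.map.output", true);
             conf.setClass("mapred.map.output.compression.codec",
                     SnappyCodec.class, CompressionCodec.class);

             Job job = new Job(conf, "compressed-job");
             // Compress the job's final output with bzip2 (splittable, but slow).
             FileOutputFormat.setCompressOutput(job, true);
             FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
             return job;
         }
     }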
  55. SKIPPING BAD RECORDS
  56. PROFILING JOBS
     - HPROF
     - Trial and error
  57. DISTRIBUTED CACHE
     COMMAND LINE (USING TOOL INTERFACE)
     - -files
     - -archives
     - -libjars
     PROGRAMMATICALLY
     - public void addCacheFile(URI uri)
     - public void addCacheArchive(URI uri)
     - public void addFileToClassPath(Path file)
     - public void addArchiveToClassPath(Path archive)
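     A minimal sketch of the programmatic route with the Hadoop 2.x Job API listed above: the driver ships a small lookup file, and the mapper reads the localized copy in setup(). The stopword-file idea is a made-up example.

     import java.io.BufferedReader;
     import java.io.FileReader;
     import java.io.IOException;
     import java.net.URI;
     import java.util.HashSet;
     import java.util.Set;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Mapper;

     public class StopwordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
         // Driver side: job.addCacheFile(new URI("/shared/stopwords.txt"));
         private final Set<String> stopwords = new HashSet<String>();

         @Override
         protected void setup(Context context) throws IOException {
             // Cached files are localized (and symlinked by name) into the
             // task's working directory, so they can be read like local files.
             URI[] cacheFiles = context.getCacheFiles();
             if (cacheFiles == null) {
                 return;
             }
             for (URI uri : cacheFiles) {
                 String localName = new Path(uri.getPath()).getName();
                 BufferedReader reader = new BufferedReader(new FileReader(localName));
                 try {
                     String word;
                     while ((word = reader.readLine()) != null) {
                         stopwords.add(word.trim());
                     }
                 } finally {
                     reader.close();
                 }
             }
         }
         // map() would then skip any token found in stopwords.
     }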
  58. SECONDARY SORTING STEPS
     - Change key to composite
     - Create Partitioner and grouping Comparator on original key
     - Create sort Comparator on composite key
  59. SECONDARY SORTING EXAMPLE
     job.setPartitionerClass(FirstPartitioner.class);
     job.setSortComparatorClass(KeyComparator.class);
     job.setGroupingComparatorClass(GroupComparator.class);
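     The slides don't show the classes themselves, so here is a minimal sketch of what FirstPartitioner and GroupComparator could look like, assuming a composite Text key of the form natural + tab + secondary (an illustrative choice, not the only one):

     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.io.WritableComparable;
     import org.apache.hadoop.io.WritableComparator;
     import org.apache.hadoop.mapreduce.Partitioner;

     // Partition on the natural key only, so all composite keys that share
     // it land on the same reducer.
     public class FirstPartitioner extends Partitioner<Text, IntWritable> {
         @Override
         public int getPartition(Text key, IntWritable value, int numPartitions) {
             String natural = key.toString().split("\t", 2)[0];
             return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
         }
     }

     // Group on the natural key only, so one reduce() call sees all values
     // for it, already sorted by the secondary part of the composite key.
     class GroupComparator extends WritableComparator {
         public GroupComparator() {
             super(Text.class, true);
         }

         @Override
         public int compare(WritableComparable a, WritableComparable b) {
             String naturalA = a.toString().split("\t", 2)[0];
             String naturalB = b.toString().split("\t", 2)[0];
             return naturalA.compareTo(naturalB);
         }
     }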
  60. RESOURCES
  61. BOOKS
  62. LINKS
     http://hadoop.apache.org/
     http://mrunit.apache.org/
     http://hbase.apache.org/
     http://avro.apache.org/
     http://www.cascading.org/
     http://pig.apache.org/
     http://hive.apache.org/
     http://flume.apache.org/
     http://oozie.apache.org/
     https://github.com/azkaban/azkaban
     http://crunch.apache.org/
     http://spark.incubator.apache.org/
     http://developer.yahoo.com/hadoop/tutorial/
     http://sortbenchmark.org/
     https://github.com/cloudera/impala
  63. QUESTIONS?
