Doug Cutting on the State of the Hadoop Ecosystem

Doug Cutting, Apache Hadoop co-founder, explains how the growth of the Hadoop ecosystem has made Hadoop a much more powerful platform, and how its continued expansion will lead to great things.



  1. The Hadoop Ecosystem: Hidden Gems
     Doug Cutting, Chief Architect, Cloudera; Chairman, Apache Software Foundation
  2. Expanding Hadoop Ecosystem
     • Hadoop: the kernel
     • HDFS: scalable storage
     • MapReduce: scalable computation
     • HBase & Accumulo: online key/value store
     • Pig & Hive: query languages
     • Sqoop: RDBMS integration
     • Flume: data collection
     • Oozie: workflow
     • Whirr: cloud deployment
     • Mahout: machine learning
  3. Some Hidden Gems
     • YARN
     • Crunch
     • Avro
     • Trevni
  4. YARN (Yet Another Resource Negotiator)
     • generic scheduler for distributed applications
       o will permit non-MapReduce applications
     • consists of:
       o Resource Manager (per cluster)
       o Node Manager (per node)
         § runs Application Masters (per job)
         § & Application Containers (per task)
     • in Hadoop 2.0
       o replaces JobTracker & TaskTracker (MR1)
     (a client-side sketch follows this slide)
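To give a feel for how a client talks to this scheduler, here is a minimal sketch against the Hadoop 2.x YarnClient API (org.apache.hadoop.yarn.client.api); it only asks the Resource Manager for the cluster's running Node Managers, it assumes a yarn-site.xml is on the classpath, and the class name ListYarnNodes is ours, not from the deck.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.yarn.api.records.NodeReport;
     import org.apache.hadoop.yarn.api.records.NodeState;
     import org.apache.hadoop.yarn.client.api.YarnClient;

     public class ListYarnNodes {
       public static void main(String[] args) throws Exception {
         // Connects to the (per-cluster) Resource Manager using the
         // Hadoop configuration found on the classpath.
         YarnClient yarn = YarnClient.createYarnClient();
         yarn.init(new Configuration());
         yarn.start();

         // Each NodeReport describes one (per-node) Node Manager.
         for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
           System.out.println(node.getNodeId()
               + " containers=" + node.getNumContainers());
         }
         yarn.stop();
       }
     }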
  5. YARN: MR2
     [architecture diagram: clients submit jobs to the Resource Manager; a
     per-job App Master runs in a container on a Node Manager, requests
     resources from the Resource Manager, and tracks MapReduce status for
     its tasks, which run in further containers on the Node Managers.
     CDH4 includes both MR1 & MR2.]
  6. Crunch
     • an API for MapReduce
       o alternative to Pig & Hive
       o inspired by Google's FlumeJava paper
       o in Java (& Scala)
     • easier to integrate application logic
       o with a full programming language
     • concepts:
       o PCollection: set of values w/ parallelDo operation
       o PTable: key/value mapping w/ groupBy operation
       o Pipeline: executor that runs MapReduce jobs
  7. Crunch Word Count
     import org.apache.crunch.DoFn;
     import org.apache.crunch.Emitter;
     import org.apache.crunch.PCollection;
     import org.apache.crunch.PTable;
     import org.apache.crunch.Pipeline;
     import org.apache.crunch.impl.mr.MRPipeline;
     import org.apache.crunch.lib.Aggregate;
     import org.apache.crunch.types.writable.Writables;

     public class WordCount {
       public static void main(String[] args) throws Exception {
         // Plans and runs the underlying MapReduce jobs
         Pipeline pipeline = new MRPipeline(WordCount.class);
         PCollection<String> lines = pipeline.readTextFile(args[0]);
         // parallelDo: apply a DoFn to every element, in parallel
         PCollection<String> words = lines.parallelDo("my splitter",
             new DoFn<String, String>() {
               public void process(String line, Emitter<String> emitter) {
                 for (String word : line.split("\\s+")) {
                   emitter.emit(word);
                 }
               }
             }, Writables.strings());
         // group identical words and count occurrences
         PTable<String, Long> counts = Aggregate.count(words);
         pipeline.writeTextFile(counts, args[1]);
         pipeline.run();
       }
     }
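Assuming the class above is packaged in a jar along with the Crunch dependencies, a run might look like the following; the jar name and paths here are hypothetical:

     hadoop jar crunch-wordcount.jar WordCount input.txt output

The word counts are then written as text files under the output directory.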
  8. Scrunch Word Count
     class WordCountExample {
       val pipeline = new Pipeline[WordCountExample]

       def wordCount(fileName: String) = {
         pipeline.read(from.textFile(fileName))
                 .flatMap(_.toLowerCase.split("\\W+"))
                 .filter(!_.isEmpty())
                 .count
       }
     }
  9. Avro: a format for Big Data
     • expressive
       o records, arrays, unions, enums
     • efficient
       o compact binary, compressed, splittable
     • interoperable
       o langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
       o tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
     • dynamic
       o can read & write without generating code
         (see the sketch after this slide)
     • evolvable
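To make the "dynamic" bullet concrete, here is a minimal sketch using Avro's generic API in Java: the schema is parsed at runtime and records are written and read back with no generated classes. The record matches the X { name, id, size } example on the next slide; the file name x.avro is just for illustration.

     import java.io.File;
     import org.apache.avro.Schema;
     import org.apache.avro.file.DataFileReader;
     import org.apache.avro.file.DataFileWriter;
     import org.apache.avro.generic.GenericData;
     import org.apache.avro.generic.GenericDatumReader;
     import org.apache.avro.generic.GenericDatumWriter;
     import org.apache.avro.generic.GenericRecord;

     public class AvroDynamic {
       public static void main(String[] args) throws Exception {
         // Schema defined at runtime; no code generation required
         Schema schema = new Schema.Parser().parse(
             "{\"type\":\"record\",\"name\":\"X\",\"fields\":["
           + "{\"name\":\"name\",\"type\":\"string\"},"
           + "{\"name\":\"id\",\"type\":\"long\"},"
           + "{\"name\":\"size\",\"type\":\"int\"}]}");

         GenericRecord rec = new GenericData.Record(schema);
         rec.put("name", "Foo");
         rec.put("id", 0L);
         rec.put("size", 5);

         // Write a compact binary container file
         File file = new File("x.avro");
         DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
             new GenericDatumWriter<GenericRecord>(schema));
         writer.create(schema, file);
         writer.append(rec);
         writer.close();

         // Read it back; the schema is embedded in the file
         DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
             file, new GenericDatumReader<GenericRecord>());
         for (GenericRecord r : reader) {
           System.out.println(r.get("name") + " " + r.get("id") + " " + r.get("size"));
         }
         reader.close();
       }
     }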
  10. Column Files
      [diagram: a record X { String name; long id; int size; } with rows
      (Foo, 0x0, 5), (Bar, 0x1, 7), (Baz, 0x2, 9). A row file (Avro,
      SequenceFile) stores whole records one after another; a column file
      (Trevni) stores all names together, then all ids, then all sizes.]
  11. Column Files
      • faster queries
        o only process columns in query
      • better compression
        o since like data is together (see the sketch after this slide)
      • data set split into row groups
        o to permit parallelism
      • to localize processing,
        o row group should be in single HDFS block
      • independent of record serialization format
        o need shredder
      • primary format?
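One way to convince yourself of the compression bullet: deflate the same synthetic records laid out row-major versus column-major and compare sizes. This is a toy illustration with java.util.zip, not any real file format; the data and class name are made up.

     import java.util.zip.Deflater;

     public class ColumnLayoutDemo {
       // Deflate a buffer and return its compressed size in bytes.
       static int compressedSize(byte[] input) {
         Deflater deflater = new Deflater();
         deflater.setInput(input);
         deflater.finish();
         byte[] out = new byte[input.length * 2 + 64];
         int n = 0;
         while (!deflater.finished()) {
           n += deflater.deflate(out, n, out.length - n);
         }
         deflater.end();
         return n;
       }

       public static void main(String[] args) {
         StringBuilder rows = new StringBuilder();
         StringBuilder names = new StringBuilder();
         StringBuilder ids = new StringBuilder();
         StringBuilder sizes = new StringBuilder();
         for (int i = 0; i < 10000; i++) {
           String name = "user" + (i % 50);  // low-cardinality column
           // row-major: fields of each record interleaved
           rows.append(name).append(',').append(i).append(',').append(i % 10).append('\n');
           // column-major: like data kept together
           names.append(name).append('\n');
           ids.append(i).append('\n');
           sizes.append(i % 10).append('\n');
         }
         System.out.println("row-major compressed:    "
             + compressedSize(rows.toString().getBytes()));
         System.out.println("column-major compressed: "
             + (compressedSize(names.toString().getBytes())
              + compressedSize(ids.toString().getBytes())
              + compressedSize(sizes.toString().getBytes())));
       }
     }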
  12. Trevni: a column file format
      • one row group per file
        o & one file per HDFS block
        o minimizes seeks, localizes query
      • shredder & assembler for Avro records
        o supports nested structures
      • compression codec per column
      • in Avro 1.7.3+
