Doug Cutting on the State of the Hadoop Ecosystem

Doug Cutting, Apache Hadoop Co-founder, explains how the growth of the Hadoop ecosystem has made Hadoop a much more powerful machine, and how the continued expansion will lead to great things.

Published in: Technology

  1. The Hadoop Ecosystem: Hidden Gems
     Doug Cutting
     Chief Architect, Cloudera
     Chairman, Apache Software Foundation
  2. Expanding Hadoop Ecosystem
     •  Hadoop: the kernel
        o  HDFS: scalable storage
        o  MapReduce: scalable computation
     •  HBase & Accumulo: online key/value store
     •  Pig & Hive: query languages
     •  Sqoop: RDBMS integration
     •  Flume: data collection
     •  Oozie: workflow
     •  Whirr: cloud deployment
     •  Mahout: machine learning
  3. Some Hidden Gems
     •  YARN
     •  Crunch
     •  Avro
     •  Trevni
  4. YARN (Yet Another Resource Negotiator)
     •  generic scheduler for distributed applications
        o  will permit non-MapReduce applications
     •  consists of:
        o  Resource Manager (per cluster)
        o  Node Manager (per node)
           §  runs Application Masters (per job)
           §  & Application Containers (per task)
     •  in Hadoop 2.0
        o  replaces JobTracker & TaskTracker (MR1)
  5. YARN: MR2
     [Diagram: Clients, a Resource Manager, and Node Managers hosting App Masters and Containers. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.]
     CDH4 includes both MR1 & MR2
  6. Crunch
     •  an API for MapReduce
        o  alternative to Pig & Hive
        o  inspired by Google's FlumeJava paper
        o  in Java (& Scala)
     •  easier to integrate application logic
        o  with a full programming language
     •  concepts:
        o  PCollection: set of values w/ parallelDo operation
        o  PTable: key/value mapping w/ groupBy operation
        o  Pipeline: executor that runs MapReduce jobs
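  7. Crunch Word Count

     // Imports assume the Apache Crunch package layout (org.apache.crunch).
     import org.apache.crunch.DoFn;
     import org.apache.crunch.Emitter;
     import org.apache.crunch.PCollection;
     import org.apache.crunch.PTable;
     import org.apache.crunch.Pipeline;
     import org.apache.crunch.impl.mr.MRPipeline;
     import org.apache.crunch.lib.Aggregate;
     import org.apache.crunch.types.writable.Writables;

     public class WordCount {
       public static void main(String[] args) throws Exception {
         Pipeline pipeline = new MRPipeline(WordCount.class);
         // args[0] is the input path, args[1] the output path.
         PCollection<String> lines = pipeline.readTextFile(args[0]);

         // Split each line on whitespace and emit one word at a time.
         PCollection<String> words = lines.parallelDo("my splitter",
             new DoFn<String, String>() {
               public void process(String line, Emitter<String> emitter) {
                 for (String word : line.split("\\s+")) {
                   emitter.emit(word);
                 }
               }
             }, Writables.strings());

         // Count occurrences of each distinct word.
         PTable<String, Long> counts = Aggregate.count(words);
         pipeline.writeTextFile(counts, args[1]);
         pipeline.run();
       }
     }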
  8. Scrunch Word Count

     class WordCountExample {
       val pipeline = new Pipeline[WordCountExample]

       def wordCount(fileName: String) = {
         pipeline.read(from.textFile(fileName))
           .flatMap(_.toLowerCase.split("\\W+"))  // split on non-word characters
           .filter(!_.isEmpty())
           .count
       }
     }
  9. Avro: a format for Big Data
     •  expressive
        o  records, arrays, unions, enums
     •  efficient
        o  compact binary, compressed, splittable
     •  interoperable
        o  langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
        o  tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
     •  dynamic
        o  can read & write without generating code
     •  evolvable
  10. Column Files

      record X {
        String name;
        long id;
        int size;
      }

      name  id   size
      Foo   0x0  5
      Bar   0x1  7
      Baz   0x2  9

      Row File (Avro, SequenceFile):
        Foo 0x0 5
        Bar 0x1 7
        Baz 0x2 9
        ...

      Column File (Trevni):
        Foo Bar Baz ...
        0x0 0x1 0x2 ...
        5   7   9   ...
  11. Column Files
      •  faster queries
         o  only process columns in query
      •  better compression
         o  since like data is together
      •  data set split into row groups
         o  to permit parallelism
      •  to localize processing,
         o  row group should be in single HDFS block
      •  independent of record serialization format
         o  need shredder
      •  primary format?
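
The shredding idea on the Column Files slides can be sketched in plain Java. This is only an illustration under stated assumptions: the class `ColumnShredSketch`, the in-memory `Columns` holder, and the methods `shred` and `totalSize` are hypothetical names made up for this example, and nothing here is the Trevni or Avro API. It reuses the slide's example record `X` and shows that a query over `size` needs to touch only the `size` column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnShredSketch {
    /** Mirrors the slide's example: record X { String name; long id; int size; } */
    public static final class X {
        public final String name;
        public final long id;
        public final int size;

        public X(String name, long id, int size) {
            this.name = name;
            this.id = id;
            this.size = size;
        }
    }

    /** Hypothetical in-memory stand-in for a column file: one list per field. */
    public static final class Columns {
        public final List<String> name = new ArrayList<>();
        public final List<Long> id = new ArrayList<>();
        public final List<Integer> size = new ArrayList<>();
    }

    /** Shredder: splits each row so that like values are stored together. */
    public static Columns shred(List<X> rows) {
        Columns c = new Columns();
        for (X r : rows) {
            c.name.add(r.name);
            c.id.add(r.id);
            c.size.add(r.size);
        }
        return c;
    }

    /** A query over one field reads only that field's column. */
    public static long totalSize(Columns c) {
        long sum = 0;
        for (int s : c.size) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<X> rows = Arrays.asList(
            new X("Foo", 0x0, 5), new X("Bar", 0x1, 7), new X("Baz", 0x2, 9));
        Columns cols = shred(rows);
        System.out.println(cols.name);        // prints [Foo, Bar, Baz]
        System.out.println(totalSize(cols));  // prints 21
    }
}
```

A real column file would write each column to a contiguous region of a file rather than keep lists in memory; the lists here only make the row-to-column transposition visible.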
  12. Trevni: a column file format
      •  one row group per file
         o  & one file per HDFS block
         o  minimizes seeks, localizes query
      •  shredder & assembler for Avro records
         o  supports nested structures
      •  compression codec per column
      •  in Avro 1.7.3+
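
The "compression codec per column" bullet can be illustrated with the JDK's `java.util.zip` classes. The sketch below is an assumption-laden example, not Trevni's implementation: the `Codec` interface and the `NONE` and `DEFLATE` constants are hypothetical names invented here, and the byte layout of the columns is arbitrary. It only shows that each column can be packed and unpacked with a codec chosen for that column.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ColumnCodecSketch {
    /** Hypothetical codec interface for this sketch; not Trevni's API. */
    public interface Codec {
        byte[] compress(byte[] raw);
        byte[] decompress(byte[] packed);
    }

    /** Pass-through codec for a column that is stored uncompressed. */
    public static final Codec NONE = new Codec() {
        public byte[] compress(byte[] raw) { return raw.clone(); }
        public byte[] decompress(byte[] packed) { return packed.clone(); }
    };

    /** DEFLATE codec built on java.util.zip. */
    public static final Codec DEFLATE = new Codec() {
        public byte[] compress(byte[] raw) {
            Deflater deflater = new Deflater();
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            return out.toByteArray();
        }

        public byte[] decompress(byte[] packed) {
            Inflater inflater = new Inflater();
            inflater.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            try {
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    // Stop instead of spinning if the input is truncated.
                    if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                        break;
                    }
                    out.write(buf, 0, n);
                }
            } catch (DataFormatException e) {
                throw new IllegalStateException("corrupt column data", e);
            } finally {
                inflater.end();
            }
            return out.toByteArray();
        }
    };

    public static void main(String[] args) {
        // Two columns from the slide's example, serialized in an arbitrary way.
        byte[] names = "Foo\nBar\nBaz\n".getBytes(StandardCharsets.UTF_8);
        byte[] sizes = {5, 7, 9};

        // Each column is packed with its own codec.
        byte[] packedNames = DEFLATE.compress(names);
        byte[] packedSizes = NONE.compress(sizes);

        System.out.println(Arrays.equals(names, DEFLATE.decompress(packedNames))); // prints true
        System.out.println(Arrays.equals(sizes, NONE.decompress(packedSizes)));    // prints true
    }
}
```

Choosing the codec per column lets a writer spend CPU on columns that compress well and skip columns that do not, which fits the earlier point that like data stored together compresses better.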
