Avro Data | Washington DC HUG


Published in: Technology


  • 1. Avro Data. Doug Cutting, Cloudera & Apache. Avro, Nutch, Hadoop, Pig, Hive, HBase, Zookeeper, Whirr, Cassandra and Mahout are trademarks of the Apache Software Foundation.
  • 2. How did we get here?
  • 3. 2002-2003: Nutch, SequenceFile, Writable
  • 4. 2004-2005: Nutch, MapReduce, NDFS, SequenceFile, Writable
  • 5. 2006: Hadoop, MapReduce, HDFS, SequenceFile, Writable
  • 6. 2007: HBase, Pig, Zookeeper, Hadoop, MapReduce, HDFS, SequenceFile, Writable
  • 7. 2008: HBase, Pig, Hive, Mahout, Zookeeper, Hadoop, MapReduce, Cassandra, HDFS, SequenceFile, Writable
  • 8. 2009-2010: Your Application Here, Whirr, Oozie, Hue, ..., HBase, Pig, Hive, Mahout, Flume, Zookeeper, Hadoop, MapReduce, Cassandra, HDFS, SequenceFile, Writable
  • 9. Today
      ● we face an exploding combination of
          ● tools
          ● data formats
          ● programming languages
      ● each combination may require a new adapter
      ● more tools and languages are good
          ● but more formats might not be
          ● Google claims benefits from a common format
  • 10. Data Format Properties
      ● expressive
          ● supports complex, nested data structures
      ● efficient
          ● fast and small
      ● dynamic
          ● programs can process & define new datatypes
      ● file format
          ● standalone
          ● splittable, compressed, sortable
  • 11. Data Format Comparison

                              CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
      language independent    yes   yes        no             yes           yes
      expressive              no    yes        yes            yes           yes
      efficient               no    no         yes            yes           yes
      dynamic                 yes   yes        no             no            yes
      standalone              ?     yes        no             no            yes
      splittable              ?     ?          yes            ?             yes
      sortable                yes   ?          yes            no            yes
  • 12. Avro
      ● specification-based design
          ● permits independent implementations
          ● schema in JSON to simplify implementations
      ● dynamic implementations the norm
          ● static, codegen-based implementations too
      ● file format specified
          ● standalone, splittable, compressed
      ● efficient binary encoding
          ● factors schema out of instances
      ● sortable
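
To make "factors schema out of instances" concrete, here is a minimal sketch of how a `Block` record (id, length, hosts) could be written in Avro's binary encoding using only the JDK. It follows the Avro specification's rules for ints/longs (zig-zag varints), strings (length then UTF-8 bytes), arrays (counted blocks ended by a zero count) and records (field values in schema order, with no names or tags). The class and method names (`AvroEncodingSketch`, `writeLong`, `writeString`, `encodeBlock`) are made up for illustration and are not part of the Avro library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the Avro library.
public class AvroEncodingSketch {

    // Ints and longs: zig-zag mapping, then a base-128 varint (low-order group first).
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);            // small magnitudes map to small codes
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // 7 data bits plus a continuation bit
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Strings: byte length written as a long, followed by the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // A Block record: only the field values, in the order the schema declares them.
    static byte[] encodeBlock(String id, int length, List<String> hosts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, id);
        writeLong(out, length);
        if (!hosts.isEmpty()) {
            writeLong(out, hosts.size());         // one array block holding every item
            for (String host : hosts) {
                writeString(out, host);
            }
        }
        writeLong(out, 0);                        // a zero count terminates the array
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeBlock("b1", 64, Arrays.asList("h1", "h2"));
        System.out.println(bytes.length + " bytes: " + Arrays.toString(bytes));
    }
}
```

Because no field names or type tags are written, the bytes can only be interpreted by a reader that has the writer's schema, which is why Avro files and RPC sessions carry the schema once rather than per record.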
  • 13. IDL Schemas for authoring static datatypes

      // a simple three-element record
      record Block {
        string id;
        int length;
        array<string> hosts;
      }

      // a linked list of ints
      record IntList {
        int value;
        union { null, IntList } next;
      }
  • 14. JSON Schemas for interchange

      // a simple three-element record
      {"name": "Block", "type": "record",
       "fields": [
         {"name": "id", "type": "string"},
         {"name": "length", "type": "int"},
         {"name": "hosts", "type": {"type": "array", "items": "string"}}
       ]}

      // a linked list of ints
      {"name": "IntList", "type": "record",
       "fields": [
         {"name": "value", "type": "int"},
         {"name": "next", "type": ["null", "IntList"]}
       ]}
  • 15. Dynamic Schemas, e.g., in Java

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;
      import org.apache.avro.Schema;
      import org.apache.avro.Schema.Field;
      import org.apache.avro.Schema.Type;

      // a simple three-element record
      Schema block = Schema.createRecord("Block", "a block", null, false);
      List<Field> blockFields = new ArrayList<Field>();
      blockFields.add(new Field("id", Schema.create(Type.STRING), null, null));
      blockFields.add(new Field("length", Schema.create(Type.INT), null, null));
      blockFields.add(new Field("hosts",
          Schema.createArray(Schema.create(Type.STRING)), null, null));
      block.setFields(blockFields);

      // a linked list of ints; the record refers to itself in the "next" union
      Schema list = Schema.createRecord("MyList", "a list", null, false);
      List<Field> listFields = new ArrayList<Field>();
      listFields.add(new Field("value", Schema.create(Type.INT), null, null));
      listFields.add(new Field("next",
          Schema.createUnion(Arrays.asList(new Schema[] { Schema.create(Type.NULL), list })),
          null, null));
      list.setFields(listFields);
  • 16. Avro Schema Evolution
      ● writer's schema always provided to reader
      ● so reader can compare:
          ● the schema used to write with
          ● the schema expected by application
      ● fields that match (name & type) are read
      ● fields written that don't match are skipped
      ● expected fields not written can be identified
      ● same features as provided by numeric field ids
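
The comparison described on this slide can be sketched in plain Java. This is a simplified model, not Avro's resolver: schemas are reduced to ordered maps from field name to a type name, and the names `SchemaResolutionSketch`, `Resolution` and `resolve` are hypothetical. Actual Avro schema resolution covers more than this (for example default values, aliases and numeric type promotion), none of which is modeled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; a toy model of the slide, not Avro's resolver.
public class SchemaResolutionSketch {

    /** Outcome of comparing the writer's fields with the reader's expected fields. */
    static final class Resolution {
        final List<String> read = new ArrayList<>();     // name and type match
        final List<String> skipped = new ArrayList<>();  // written, but not wanted as-is
        final List<String> missing = new ArrayList<>();  // expected, but never written
    }

    // Each map goes from field name to a type name, in declaration order.
    static Resolution resolve(Map<String, String> writer, Map<String, String> reader) {
        Resolution r = new Resolution();
        for (Map.Entry<String, String> w : writer.entrySet()) {
            if (w.getValue().equals(reader.get(w.getKey()))) {
                r.read.add(w.getKey());
            } else {
                r.skipped.add(w.getKey());
            }
        }
        for (String name : reader.keySet()) {
            if (!r.read.contains(name)) {
                r.missing.add(name);
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Map<String, String> writer = new LinkedHashMap<>();
        writer.put("id", "string");
        writer.put("length", "int");
        writer.put("hosts", "array<string>");

        Map<String, String> reader = new LinkedHashMap<>();
        reader.put("id", "string");
        reader.put("length", "int");
        reader.put("checksum", "long");   // hypothetical field added by a newer application

        Resolution r = resolve(writer, reader);
        System.out.println("read=" + r.read + " skipped=" + r.skipped + " missing=" + r.missing);
    }
}
```

Matching by name plays the role that numeric field ids play in Thrift and Protocol Buffers: old data stays readable after fields are added or dropped, as long as the reader has the schema the data was written with.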
  • 17. Avro MapReduce API
      ● single-valued inputs and outputs
          ● key/value pairs only required for intermediate data
      ● map(IN, Collector<OUT>)
          ● map-only jobs never need to create k/v pairs
      ● map(IN, Collector<Pair<K,V>>)
      ● reduce(K, Iterable<V>, Collector<OUT>)
          ● if IN and OUT are pairs, default is sort
  • 18. Avro Java MapReduce Example

      // map: emit a (word, 1) pair for every token in the input text
      public void map(String text, AvroCollector<Pair<String,Long>> c,
                      Reporter r) throws IOException {
        StringTokenizer i = new StringTokenizer(text.toString());
        while (i.hasMoreTokens())
          c.collect(new Pair<String,Long>(i.nextToken(), 1L));
      }

      // reduce: sum the counts collected for each word
      public void reduce(String word, Iterable<Long> counts,
                         AvroCollector<Pair<String,Long>> c,
                         Reporter r) throws IOException {
        long sum = 0;
        for (long count : counts)
          sum += count;
        c.collect(new Pair<String,Long>(word, sum));
      }
  • 19. Avro Status
      ● Current
          ● APIs: C, C++, C#, Java, Python, PHP, Ruby
              – interoperable data & RPC
          ● Integration: Pig, Hive, Flume, Crunch, etc.
          ● Conversion: SequenceFile, Thrift, Protobuf
          ● Java MapReduce API
      ● Upcoming
          ● MapReduce APIs for more languages
              – efficient, rich data
  • 20. Summary
      ● Ecosystem needs a common data format
          ● that's expressive, efficient, dynamic, etc.
      ● Avro meets this need
          ● but switching data formats is a slow process