Avro Data | Washington DC HUG

  1. Avro Data
     Doug Cutting, Cloudera & Apache
     Avro, Nutch, Hadoop, Pig, Hive, HBase, Zookeeper, Whirr, Cassandra and Mahout are trademarks of the Apache Software Foundation.
  2. How did we get here?
  3. 2002-2003: Nutch, SequenceFile, Writable
  4. 2004-2005: Nutch, MapReduce, NDFS, SequenceFile, Writable
  5. 2006: Hadoop, MapReduce, HDFS, SequenceFile, Writable
  6. 2007: HBase, Pig, Zookeeper, Hadoop MapReduce, HDFS, SequenceFile, Writable
  7. 2008: HBase, Pig, Hive, Mahout, Zookeeper, Hadoop MapReduce, Cassandra, HDFS, SequenceFile, Writable
  8. 2009-2010: Your Application Here, Whirr, Oozie, Hue, ..., HBase, Pig, Hive, Mahout, Flume, Zookeeper, Hadoop MapReduce, Cassandra, HDFS, SequenceFile, Writable
  9. Today
     ● face an exploding combination of
       ● tools
       ● data formats
       ● programming languages
     ● may require a new adapter for each combination
     ● more tools and languages are good
       ● but more formats might not be
       ● Google claims benefits of a common format
  10. Data Format Properties
      ● expressive
        ● supports complex, nested data structures
      ● efficient
        ● fast and small
      ● dynamic
        ● programs can process & define new datatypes
      ● file format
        ● standalone
        ● splittable, compressed, sortable
  11. Data Format Comparison

                            CSV    XML/JSON   SequenceFile   Thrift & PB   Avro
      language independent  yes    yes        no             yes           yes
      expressive            no     yes        yes            yes           yes
      efficient             no     no         yes            yes           yes
      dynamic               yes    yes        no             no            yes
      standalone            ?      yes        no             no            yes
      splittable            ?      ?          yes            ?             yes
      sortable              yes    ?          yes            no            yes
  12. Avro
      ● specification-based design
        ● permits independent implementations
        ● schemas are written in JSON to simplify implementations
      ● dynamic implementations are the norm
        ● static, codegen-based implementations too
      ● file format specified
        ● standalone, splittable, compressed
      ● efficient binary encoding
        ● factors the schema out of instances (see the sketch below)
      ● sortable
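The last two points can be seen directly in the generic Java API. The sketch below is not from the talk; it assumes the org.apache.avro Java library (class and method names may differ slightly by version) and a made-up two-field Block schema. It encodes a record to bytes that contain only field values, with no names or tags, and then compares two encoded records without deserializing them.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryData;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class BinaryEncodingSketch {
      // A two-field record schema, parsed from JSON (see the next two slides).
      static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"name\": \"Block\", \"type\": \"record\", \"fields\": ["
          + "{\"name\": \"id\", \"type\": \"string\"},"
          + "{\"name\": \"length\", \"type\": \"int\"}]}");

      // Encode one record; the output holds only the field values, in schema
      // order, with no field names or numeric tags.
      static byte[] encode(String id, int length) throws Exception {
        GenericRecord r = new GenericData.Record(SCHEMA);
        r.put("id", id);
        r.put("length", length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(r, enc);
        enc.flush();
        return out.toByteArray();
      }

      public static void main(String[] args) throws Exception {
        byte[] a = encode("blk_1", 64);
        byte[] b = encode("blk_2", 128);
        System.out.println(a.length + " bytes");   // a handful of bytes per record
        // Sort order is defined by the schema, so encoded records can be
        // compared without deserializing them.
        System.out.println(BinaryData.compare(a, 0, b, 0, SCHEMA));   // negative: a sorts before b
      }
    }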
  13. IDL
      Schemas for authoring static datatypes

      // a simple three-element record
      record Block {
        string id;
        int length;
        array<string> hosts;
      }

      // a linked list of ints
      record IntList {
        int value;
        union { null, IntList } next;
      }
  14. JSON
      Schemas for interchange

      // a simple three-element record
      {"name": "Block", "type": "record", "fields": [
        {"name": "id", "type": "string"},
        {"name": "length", "type": "int"},
        {"name": "hosts", "type": {"type": "array", "items": "string"}}
      ]}

      // a linked list of ints
      {"name": "IntList", "type": "record", "fields": [
        {"name": "value", "type": "int"},
        {"name": "next", "type": ["null", "IntList"]}
      ]}
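To connect the JSON schema with the file-format properties claimed on slide 12, here is a minimal sketch (not from the slides) that writes and reads an Avro data file for the Block schema above with the generic Java API. The Block.avsc path, file name, and codec choice are illustrative; the container file stores the schema once in its header, so the records themselves stay compact and the file remains standalone and splittable.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class BlockFileSketch {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("Block.avsc"));  // the JSON schema above
        File file = new File("blocks.avro");

        // Write: the schema goes once into the file header; records are binary.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(6));   // compressed, still splittable
        writer.create(schema, file);
        GenericRecord block = new GenericData.Record(schema);
        block.put("id", "blk_1");
        block.put("length", 64);
        block.put("hosts", java.util.Arrays.asList("host1", "host2"));
        writer.append(block);
        writer.close();

        // Read: no external metadata is needed; the file is standalone.
        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
        for (GenericRecord r : reader)
          System.out.println(r);
        reader.close();
      }
    }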
  15. Dynamic Schemas
      e.g., in Java

      Schema block = Schema.createRecord("Block", "a block", null, false);
      List<Field> fields = new ArrayList<Field>();
      fields.add(new Field("id", Schema.create(Type.STRING), null, null));
      fields.add(new Field("length", Schema.create(Type.INT), null, null));
      fields.add(new Field("hosts",
          Schema.createArray(Schema.create(Type.STRING)), null, null));
      block.setFields(fields);

      Schema list = Schema.createRecord("MyList", "a list", null, false);
      fields = new ArrayList<Field>();
      fields.add(new Field("value", Schema.create(Type.INT), null, null));
      fields.add(new Field("next",
          Schema.createUnion(Arrays.asList(new Schema[] { Schema.create(Type.NULL), list })),
          null, null));
      list.setFields(fields);
  16. Avro Schema Evolution
      ● the writer's schema is always provided to the reader
      ● so the reader can compare:
        ● the schema used to write the data with
        ● the schema expected by the application
      ● fields that match (name & type) are read
      ● fields that were written but don't match are skipped
      ● expected fields that were not written can be identified
      ● same features as provided by numeric field ids
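A minimal sketch (not from the slides) of that resolution with the generic Java API: a record is written with one schema and read back with a different one. The field names and the added "replicas" field with its default are made up for illustration.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class EvolutionSketch {
      public static void main(String[] args) throws Exception {
        // Schema the data was written with: id and length.
        Schema writer = new Schema.Parser().parse(
            "{\"name\":\"Block\",\"type\":\"record\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"length\",\"type\":\"int\"}]}");
        // Schema the application expects: length was dropped,
        // replicas was added with a default value.
        Schema reader = new Schema.Parser().parse(
            "{\"name\":\"Block\",\"type\":\"record\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"replicas\",\"type\":\"int\",\"default\":3}]}");

        // Write a record with the writer schema.
        GenericRecord rec = new GenericData.Record(writer);
        rec.put("id", "blk_1");
        rec.put("length", 64);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
        enc.flush();

        // Read it back, giving the reader both schemas so it can resolve them:
        // "id" matches and is read, "length" is skipped, "replicas" takes its default.
        Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord result =
            new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
        System.out.println(result);
      }
    }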
  17. Avro MapReduce API
      ● single-valued inputs and outputs
        ● key/value pairs only required for intermediate data
      ● map(IN, Collector<OUT>)
        ● map-only jobs never need to create k/v pairs
      ● map(IN, Collector<Pair<K,V>>)
      ● reduce(K, Iterable<V>, Collector<OUT>)
        ● if IN and OUT are pairs, default is sort
  18. Avro Java MapReduce Example

      public void map(String text, AvroCollector<Pair<String,Long>> c,
                      Reporter r) throws IOException {
        StringTokenizer i = new StringTokenizer(text.toString());
        while (i.hasMoreTokens())
          c.collect(new Pair<String,Long>(i.nextToken(), 1L));
      }

      public void reduce(String word, Iterable<Long> counts,
                         AvroCollector<Pair<String,Long>> c,
                         Reporter r) throws IOException {
        long sum = 0;
        for (long count : counts)
          sum += count;
        c.collect(new Pair<String,Long>(word, sum));
      }
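For context, a sketch of how this word count might be wired together with the org.apache.avro.mapred API (AvroMapper, AvroReducer, AvroJob) that the example appears to use. The class names, the input and output paths, and the switch from String to CharSequence (Avro strings arrive as Utf8 at runtime) are assumptions, not from the talk.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Type;
    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroJob;
    import org.apache.avro.mapred.AvroMapper;
    import org.apache.avro.mapred.AvroReducer;
    import org.apache.avro.mapred.Pair;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Reporter;

    public class AvroWordCount {
      // Mapper: one input string in, (word, 1) pairs out.
      public static class WordCountMapper
          extends AvroMapper<CharSequence, Pair<CharSequence, Long>> {
        @Override
        public void map(CharSequence text, AvroCollector<Pair<CharSequence, Long>> c,
                        Reporter r) throws IOException {
          StringTokenizer i = new StringTokenizer(text.toString());
          while (i.hasMoreTokens())
            c.collect(new Pair<CharSequence, Long>(i.nextToken(), 1L));
        }
      }

      // Reducer: sums the counts for each word.
      public static class WordCountReducer
          extends AvroReducer<CharSequence, Long, Pair<CharSequence, Long>> {
        @Override
        public void reduce(CharSequence word, Iterable<Long> counts,
                           AvroCollector<Pair<CharSequence, Long>> c,
                           Reporter r) throws IOException {
          long sum = 0;
          for (long count : counts)
            sum += count;
          c.collect(new Pair<CharSequence, Long>(word, sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(AvroWordCount.class);
        job.setJobName("avro-wordcount");

        // Single-valued input (strings); pair-valued output (word, count).
        AvroJob.setInputSchema(job, Schema.create(Type.STRING));
        AvroJob.setOutputSchema(job,
            Pair.getPairSchema(Schema.create(Type.STRING), Schema.create(Type.LONG)));

        AvroJob.setMapperClass(job, WordCountMapper.class);
        AvroJob.setReducerClass(job, WordCountReducer.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
      }
    }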
  19. Avro Status
      ● Current
        ● APIs: C, C++, C#, Java, Python, PHP, Ruby
          – interoperable data & RPC
        ● Integration: Pig, Hive, Flume, Crunch, etc.
        ● Conversion: SequenceFile, Thrift, Protobuf
        ● Java MapReduce API
      ● Upcoming
        ● MapReduce APIs for more languages
          – efficient, rich data
  20. Summary
      ● the ecosystem needs a common data format
        ● one that's expressive, efficient, dynamic, etc.
      ● Avro meets this need
        ● but switching data formats is a slow process
