3 avro hug-2010-07-21
Presentation Transcript

  • 1. Introduction to Apache Avro
    Doug Cutting, 21 July 2010
  • 2. Avro is...
    ● data serialization
    ● file format
    ● RPC format
  • 3. Existing Serialization Systems: Protocol Buffers & Thrift
    ● expressive
    ● efficient (small & fast)
    ● but not very dynamic
      – cannot browse arbitrary data
    ● viewing a new datatype
      – requires code generation & load
    ● writing a new datatype
      – requires generating schema text
      – plus code generation & load
  • 4. Avro Serialization
    ● specifies a serialization format
    ● schema language is in JSON
      – each language already has a JSON parser
    ● each language implements a data reader & writer
      – in normal code
    ● code generation is optional
      – sometimes useful in statically typed languages
    ● data is untagged
      – schema required to read/write
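To make the "untagged" point concrete: per the Avro specification, an int or long is written as a zig-zag-encoded variable-length integer, with no field name or numeric tag preceding it on the wire. A rough stdlib-only sketch (class and method names are mine, not Avro's API):

```java
import java.io.ByteArrayOutputStream;

// Sketch of Avro's untagged binary encoding for longs:
// zig-zag maps signed -> unsigned, then little-endian base-128 varint.
public class ZigZagSketch {
    public static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);            // zig-zag: 0,-1,1,-2 -> 0,1,2,3
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {               // more than 7 bits remain
            out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits + continuation bit
            z >>>= 7;
        }
        out.write((int) z);                       // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // small values fit in one byte; nothing identifies the field
        if (encodeLong(0).length != 1 || encodeLong(0)[0] != 0) throw new AssertionError();
        if (encodeLong(-1)[0] != 1) throw new AssertionError(); // -1 sits right after 0
        if (encodeLong(1)[0] != 2) throw new AssertionError();
        byte[] b = encodeLong(64);                // zig-zags to 128: needs 2 bytes
        if (b.length != 2 || b[0] != (byte) 0x80 || b[1] != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Because nothing on the wire names the field, the reader must hold the writer's schema to decode the stream — which is what motivates the schema-evolution rules on the next slide.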
  • 5. Avro Schema Evolution
    ● writer's schema always provided to reader
    ● so reader can compare:
      – the schema used to write, with
      – the schema expected by the application
    ● fields that match (name & type) are read
    ● fields written that don't match are skipped
    ● expected fields not written can be identified
    ● same features as provided by numeric field ids
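The name-matching rule above can be sketched with plain collections. This is a toy illustration, not Avro's real resolver (the method and map keys are illustrative names):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy version of Avro schema resolution: match writer and reader record
// fields by name; partition into read / skipped / missing.
public class ResolveSketch {
    public static Map<String, List<String>> resolve(
            List<String> writerFields, List<String> readerFields) {
        Set<String> writer = new LinkedHashSet<>(writerFields);
        Set<String> reader = new LinkedHashSet<>(readerFields);
        List<String> read = new ArrayList<>();    // in both schemas: decoded
        List<String> skipped = new ArrayList<>(); // written only: skipped over
        List<String> missing = new ArrayList<>(); // expected only: needs a default
        for (String f : writer)
            (reader.contains(f) ? read : skipped).add(f);
        for (String f : reader)
            if (!writer.contains(f)) missing.add(f);
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("read", read);
        out.put("skipped", skipped);
        out.put("missing", missing);
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> r = resolve(
            List.of("id", "length", "hosts"),   // writer's schema fields
            List.of("id", "hosts", "owner"));   // reader's expected fields
        if (!r.get("read").equals(List.of("id", "hosts"))) throw new AssertionError();
        if (!r.get("skipped").equals(List.of("length"))) throw new AssertionError();
        if (!r.get("missing").equals(List.of("owner"))) throw new AssertionError();
        System.out.println("ok");
    }
}
```

In real Avro, the "missing" fields are filled from defaults declared in the reader's schema (or resolution fails), and types must also be compatible, not just names.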
  • 6. Avro JSON Schemas

    // a simple three-element record
    {"name": "Block", "type": "record",
     "fields": [
       {"name": "id",     "type": "string"},
       {"name": "length", "type": "int"},
       {"name": "hosts",  "type": {"type": "array", "items": "string"}}
     ]}

    // a linked list of strings or ints
    {"name": "MyList", "type": "record",
     "fields": [
       {"name": "value", "type": ["string", "int"]},
       {"name": "next",  "type": ["MyList", "null"]}
     ]}
  • 7. Avro IDL Schemas

    // a simple three-element record
    record Block {
      string id;
      int length;
      array<string> hosts;
    }

    // a linked list of strings or ints
    record MyList {
      union {string, int} value;
      union {MyList, null} next;
    }
  • 8. Hadoop Data Formats
    ● Today, primarily
      ● text
        – pro: interoperable
        – con: not expressive, inefficient
      ● Java Writable
        – pro: expressive, efficient
        – con: platform-specific, fragile
  • 9. Avro Data
    ● expressive
    ● small & fast
    ● dynamic
    ● schema stored with data
      – but factored out of instances
    ● APIs permit reading & creating new datatypes
      – without generating & loading code
  • 10. Avro Data
    ● includes a file format
      – replacement for SequenceFile
    ● includes a textual encoding
    ● handles versioning
      – if schema changes, can still process data
    ● hope Hadoop apps will upgrade from text; standardize on Avro for data
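For context on the file format above: per the Avro specification, an object container file opens with the 4-byte magic Obj\x01, followed by a file-metadata map (which stores the schema under avro.schema) and a 16-byte sync marker used to delimit blocks. A minimal sketch of recognizing that magic, not a full reader:

```java
import java.util.Arrays;

// Sketch: the fixed 4-byte magic that opens every Avro container file.
public class AvroMagicSketch {
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    public static boolean looksLikeAvroFile(byte[] header) {
        return header.length >= 4
            && Arrays.equals(Arrays.copyOf(header, 4), MAGIC);
    }

    public static void main(String[] args) {
        // an Avro header passes; a SequenceFile header ("SEQ"...) does not
        if (!looksLikeAvroFile(new byte[]{'O', 'b', 'j', 1, 42})) throw new AssertionError();
        if (looksLikeAvroFile(new byte[]{'S', 'E', 'Q', 6})) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Storing the schema once in the header (rather than per record) is what "factored out of instances" on the previous slide refers to.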
  • 11. Avro MapReduce API
    ● single-valued inputs and outputs
      – key/value pairs only required for intermediate data
    ● map(IN, Collector<OUT>)
      – map-only jobs never need to create k/v pairs
    ● map(IN, Collector<Pair<K,V>>)
    ● reduce(K, Iterable<V>, Collector<OUT>)
      – if IN and OUT are pairs, default is sort
    ● in Avro trunk today, built on the Hadoop 0.20 APIs
      – in the Avro 1.4.0 release next month
  • 12. Avro MapReduce Example

    public void map(Utf8 text, AvroCollector<Pair<Utf8,Long>> c,
                    Reporter r) throws IOException {
      StringTokenizer i = new StringTokenizer(text.toString());
      while (i.hasMoreTokens())
        c.collect(new Pair<Utf8,Long>(new Utf8(i.nextToken()), 1L));
    }

    public void reduce(Utf8 word, Iterable<Long> counts,
                       AvroCollector<Pair<Utf8,Long>> c,
                       Reporter r) throws IOException {
      long sum = 0;
      for (long count : counts)
        sum += count;
      c.collect(new Pair<Utf8,Long>(word, sum));
    }
  • 13. Avro RPC
    ● leverage versioning support
      – permit different versions of services to interoperate
    ● for Hadoop, will
      – let apps talk to clusters running different versions
      – provide cross-language access
  • 14. Avro IDL Protocol

    @namespace("org.apache.avro.test")
    protocol HelloWorld {
      record Greeting {
        string who;
        string what;
      }

      Greeting hello(Greeting greeting);
    }
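The IDL above corresponds to a JSON protocol declaration (the .avpr form). The following is my hand-written approximation based on the Avro spec's protocol format, not output from the idl tool, so details may differ slightly:

```json
{"protocol": "HelloWorld",
 "namespace": "org.apache.avro.test",
 "types": [
   {"name": "Greeting", "type": "record",
    "fields": [{"name": "who",  "type": "string"},
               {"name": "what", "type": "string"}]}
 ],
 "messages": {
   "hello": {
     "request":  [{"name": "greeting", "type": "Greeting"}],
     "response": "Greeting"
   }
 }
}
```

As with schemas, this JSON form is what implementations actually exchange and compare; the IDL is sugar for people to write.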
  • 15. Avro Status
    ● Current
      – C, C++, Java, Python & Ruby APIs
      – interoperable RPC and data
      – MapReduce API for Java
    ● Upcoming
      – MapReduce APIs for other languages (efficient, rich data)
      – RPC used in Flume, HBase, Cassandra, Hadoop, etc.
        ● inter-version compatibility
        ● non-Java clients
  • 16. Questions?