Introduction to Apache Avro

Doug Cutting
21 July, 2010
3 avro hug-2010-07-21

  1. Introduction to Apache Avro
     Doug Cutting
     21 July, 2010
  2. Avro is...
     ● data serialization
     ● file format
     ● RPC format
  3. Existing Serialization Systems: Protocol Buffers & Thrift
     ● expressive
     ● efficient (small & fast)
     ● but not very dynamic
         ● cannot browse arbitrary data
         ● viewing a new datatype
             – requires code generation & load
         ● writing a new datatype
             – requires generating schema text
             – plus code generation & load
  4. Avro Serialization
     ● specifies a serialization format
     ● schema language is in JSON
         ● each language already has a JSON parser
     ● each language implements a data reader & writer
         ● in normal code
     ● code generation is optional
         ● sometimes useful in statically typed languages
     ● data is untagged
         ● schema required to read/write
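
The "data is untagged" point on the Avro Serialization slide can be made concrete with a small sketch. It is not taken from the slides: it assumes the zig-zag, variable-length integer encoding and length-prefixed UTF-8 strings described in the Avro specification, it uses only the JDK rather than the Avro library, and the class and method names (UntaggedEncodingSketch, encodeBlock) are made up for illustration. Only the first two fields of the Block record are encoded, to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UntaggedEncodingSketch {

    // Zig-zag maps values of small magnitude (positive or negative) to small
    // unsigned values, which are then written 7 bits at a time, low bits first.
    public static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length (encoded as above) followed by the bytes.
    public static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Hypothetical helper: encodes {string id; int length} as field values in
    // schema order, with no field names, field ids, or type tags in the output.
    public static byte[] encodeBlock(String id, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, id);
        writeLong(out, length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeBlock("b1", 300)) hex.append(String.format("%02x ", b));
        System.out.println(hex.toString().trim()); // prints "04 62 31 d8 04"
    }
}
```

Nothing in those five bytes says where the string ends and the integer begins, or what either value is called, which is why a reader needs the schema to make sense of them.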
  5. Avro Schema Evolution
     ● writer's schema always provided to reader
     ● so reader can compare:
         ● the schema used to write with
         ● the schema expected by application
     ● fields that match (name & type) are read
     ● fields written that don't match are skipped
     ● expected fields not written can be identified
     ● same features as provided by numeric field ids
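
The comparison described on the Avro Schema Evolution slide can be sketched in plain Java. This is a simplified illustration, not Avro's actual resolution code: the Field class and the resolve method are hypothetical, types are compared as plain strings, and values are assumed to be already decoded into a list in writer order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaResolutionSketch {

    // Hypothetical minimal field description: a name and a type name.
    public static final class Field {
        final String name;
        final String type;
        public Field(String name, String type) { this.name = name; this.type = type; }
    }

    // Returns the values the reader expects, keyed by field name. Names of
    // expected fields that the writer never wrote are appended to `missing`.
    public static Map<String, Object> resolve(List<Field> writer, List<Object> written,
                                              List<Field> reader, List<String> missing) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Field want : reader) {
            boolean found = false;
            for (int i = 0; i < writer.size(); i++) {
                Field got = writer.get(i);
                // Fields that match on name and type are read.
                if (got.name.equals(want.name) && got.type.equals(want.type)) {
                    result.put(want.name, written.get(i));
                    found = true;
                    break;
                }
            }
            // Expected fields that were not written can be identified.
            if (!found) missing.add(want.name);
        }
        // Written fields the reader does not expect are never copied: skipped.
        return result;
    }

    public static void main(String[] args) {
        List<Field> writer = Arrays.asList(
            new Field("id", "string"), new Field("length", "int"), new Field("hosts", "array"));
        List<Object> written = Arrays.<Object>asList("b1", 300, Arrays.asList("h1", "h2"));
        List<Field> reader = Arrays.asList(
            new Field("id", "string"), new Field("length", "int"), new Field("owner", "string"));
        List<String> missing = new ArrayList<>();
        System.out.println(resolve(writer, written, reader, missing)); // {id=b1, length=300}
        System.out.println(missing);                                   // [owner]
    }
}
```

Matching by name plays the role that numeric field ids play in Protocol Buffers and Thrift: the writer can add "hosts" and the reader can expect "owner" without either side breaking.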
  6. Avro JSON Schemas

     // a simple three-element record
     {"name": "Block", "type": "record",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "length", "type": "int"},
        {"name": "hosts", "type": {"type": "array", "items": "string"}}
      ]
     }

     // a linked list of strings or ints
     {"name": "MyList", "type": "record",
      "fields": [
        {"name": "value", "type": ["string", "int"]},
        {"name": "next", "type": ["MyList", "null"]}
      ]
     }
  7. Avro IDL Schemas

     // a simple three-element record
     record Block {
       string id;
       int length;
       array<string> hosts;
     }

     // a linked list of strings or ints
     record MyList {
       union {string, int} value;
       union {MyList, null} next;
     }
  8. Hadoop Data Formats
     ● Today, primarily
         ● text
             – pro: interoperable
             – con: not expressive, inefficient
         ● Java Writable
             – pro: expressive, efficient
             – con: platform-specific, fragile
  9. Avro Data
     ● expressive
     ● small & fast
     ● dynamic
         ● schema stored with data
             – but factored out of instances
         ● APIs permit reading & creating
             – new datatypes without generating & loading code
  10. Avro Data
      ● includes a file format
          ● replacement for SequenceFile
      ● includes a textual encoding
      ● handles versioning
          ● if schema changes
          ● can still process data
      ● hope Hadoop apps will
          ● upgrade from text; standardize on Avro for data
  11. Avro MapReduce API
      ● Single-valued inputs and outputs
          ● key/value pairs only required for intermediate
      ● map(IN, Collector<OUT>)
          ● map-only jobs never need to create k/v pairs
      ● map(IN, Collector<Pair<K,V>>)
      ● reduce(K, Iterable<V>, Collector<OUT>)
          ● if IN and OUT are pairs, default is sort
      ● In Avro trunk today, built on Hadoop 0.20 APIs.
          ● in Avro 1.4.0 release next month
  12. Avro MapReduce Example

      // map: emit a (word, 1) pair for each token in the input text
      public void map(Utf8 text, AvroCollector<Pair<Utf8,Long>> c,
                      Reporter r) throws IOException {
        StringTokenizer i = new StringTokenizer(text.toString());
        while (i.hasMoreTokens())
          c.collect(new Pair<Utf8,Long>(new Utf8(i.nextToken()), 1L));
      }

      // reduce: sum the counts collected for one word
      public void reduce(Utf8 word, Iterable<Long> counts,
                         AvroCollector<Pair<Utf8,Long>> c,
                         Reporter r) throws IOException {
        long sum = 0;
        for (long count : counts)
          sum += count;
        c.collect(new Pair<Utf8,Long>(word, sum));
      }
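
To see how the map and reduce signatures from the Avro MapReduce slides fit together, here is a minimal in-memory sketch of the same word count. It runs without Avro or Hadoop: Pair and Collector below are hypothetical stand-ins for Avro's Pair and AvroCollector, String replaces Utf8, the Reporter argument is dropped, and the run method is an invented driver that imitates the group-and-sort step between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {

    // Hypothetical stand-ins so the sketch needs only the JDK.
    public static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    public interface Collector<T> { void collect(T datum); }

    // map(IN, Collector<Pair<K,V>>): single-valued input, pairs as intermediate output.
    public static void map(String text, Collector<Pair<String, Long>> c) {
        StringTokenizer i = new StringTokenizer(text);
        while (i.hasMoreTokens())
            c.collect(new Pair<String, Long>(i.nextToken(), 1L));
    }

    // reduce(K, Iterable<V>, Collector<OUT>): one call per distinct key.
    public static void reduce(String word, Iterable<Long> counts,
                              Collector<Pair<String, Long>> c) {
        long sum = 0;
        for (long count : counts)
            sum += count;
        c.collect(new Pair<String, Long>(word, sum));
    }

    // Invented driver: groups intermediate values by key (sorted), then reduces.
    public static Map<String, Long> run(List<String> lines) {
        TreeMap<String, List<Long>> groups = new TreeMap<>();
        for (String line : lines)
            map(line, p -> groups.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value));

        Map<String, Long> output = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : groups.entrySet())
            reduce(e.getKey(), e.getValue(), p -> output.put(p.key, p.value));
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not to be"))); // {be=2, not=1, or=1, to=2}
    }
}
```

Key/value pairs appear only between map and reduce here, which mirrors the slide's point that inputs and outputs are single values and pairs are required only for intermediate data.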
  13. Avro RPC
      ● leverage versioning support
          ● permit different versions of services to interoperate
      ● for Hadoop, will
          ● let apps talk to clusters running different versions
          ● provide cross-language access
  14. Avro IDL Protocol

      @namespace("org.apache.avro.test")
      protocol HelloWorld {
        record Greeting {
          string who;
          string what;
        }

        Greeting hello(Greeting greeting);
      }
  15. Avro Status
      ● Current
          ● C, C++, Java, Python & Ruby APIs
          ● Interoperable RPC and data
          ● MapReduce API for Java
      ● Upcoming
          ● MapReduce APIs for other languages
              – efficient, rich data
          ● RPC used in Flume, HBase, Cassandra, Hadoop, etc.
              – inter-version compatibility
              – non-Java clients
  16. Questions?