Introduction to Apache Avro

Doug Cutting
21 July, 2010
Avro is...
●   data serialization
●   file format
●   RPC format
Existing Serialization Systems:
   Protocol Buffers & Thrift
●   expressive
●   efficient (small & fast)
●   but not very dynamic
    ●   cannot browse arbitrary data
    ●   viewing a new datatype
        –   requires code generation & load
    ●   writing a new datatype
        –   requires generating schema text
        –   plus code generation & load
Avro Serialization
●   specifies a serialization format
●   schemas are written in JSON
    ●   every language already has a JSON parser
●   each language implements a data reader & writer
    ●   in normal code
●   code generation is optional
    ●   sometimes useful in statically typed languages
●   data is untagged
    ●   schema required to read/write
Avro Schema Evolution
●   writer's schema always provided to reader
●   so reader can compare:
    ●   the schema used to write with
    ●   the schema expected by application
●   fields that match (name & type) are read
●   fields written that don't match are skipped
●   expected fields not written can be identified
●   same features as provided by numeric field ids
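The resolution rules above can be sketched with Avro's generic Java API, which accepts both a writer's and a reader's schema. This is a minimal illustration, not from the slides: the schemas, field names, and the post-1.5 `EncoderFactory`/`DecoderFactory` style are assumptions.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
  // Writer's schema: id + length (illustrative).
  static final Schema WRITER = new Schema.Parser().parse(
      "{\"name\": \"Rec\", \"type\": \"record\", \"fields\": ["
      + "{\"name\": \"id\", \"type\": \"string\"},"
      + "{\"name\": \"length\", \"type\": \"int\"}]}");

  // Reader's schema: drops 'length', adds 'owner' with a default.
  static final Schema READER = new Schema.Parser().parse(
      "{\"name\": \"Rec\", \"type\": \"record\", \"fields\": ["
      + "{\"name\": \"id\", \"type\": \"string\"},"
      + "{\"name\": \"owner\", \"type\": \"string\", \"default\": \"unknown\"}]}");

  public static String roundTrip() throws Exception {
    GenericRecord rec = new GenericData.Record(WRITER);
    rec.put("id", "blk-1");
    rec.put("length", 64);

    // Encode with the writer's schema.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(WRITER).write(rec, enc);
    enc.flush();

    // Decode with both schemas: 'length' is skipped, and the
    // missing 'owner' field is filled from the reader's default.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord back =
        new GenericDatumReader<GenericRecord>(WRITER, READER).read(null, dec);
    return back.get("id") + "/" + back.get("owner");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(roundTrip());
  }
}
```

No numeric field ids appear anywhere: matching is by field name, with defaults covering fields the writer never wrote.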
Avro JSON Schemas
// a simple three-element record
{"name": "Block", "type": "record":,
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "length", "type": "integer"},
    {"name": "hosts", "type":
      {"type": "array:, "items": "string"}}
  ]
}

// a linked list of strings or ints
{"name": "MyList", "type": "record":,
  "fields": [
    {"name": "value", "type": ["string", "int"]},
    {"name": "next", "type": ["MyList", "null"]}
  ]
}
Avro IDL Schemas
// a simple three-element record
record Block {
  string id;
  int length;
  array<string> hosts;
}

// a linked list of strings or ints
record MyList {
  union {string, int} value;
  union {MyList, null} next;
}
Hadoop Data Formats
●   Today, primarily
    ●    text
         –   pro: interoperable
         –   con: not expressive, inefficient
    ●    Java Writable
         –   pro: expressive, efficient
         –   con: platform-specific, fragile
Avro Data
●   expressive
●   small & fast
●   dynamic
    ●   schema stored with data
        –   but factored out of instances
    ●   APIs permit reading & creating
        –   new datatypes without generating & loading code
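The "no code generation" claim can be sketched with the generic API: a schema arrives as a JSON string at runtime and instances are built and read immediately. The `Point` schema below is a made-up example, not from the slides.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericDemo {
  public static String build() {
    // A schema never seen before, parsed at runtime from JSON.
    Schema schema = new Schema.Parser().parse(
        "{\"name\": \"Point\", \"type\": \"record\", \"fields\": ["
        + "{\"name\": \"x\", \"type\": \"int\"},"
        + "{\"name\": \"y\", \"type\": \"int\"}]}");

    // No generated class is loaded; GenericRecord works for any schema.
    GenericRecord p = new GenericData.Record(schema);
    p.put("x", 3);
    p.put("y", 4);
    return p.getSchema().getName() + ":" + p.get("x") + "," + p.get("y");
  }

  public static void main(String[] args) {
    System.out.println(build());
  }
}
```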
Avro Data
●   includes a file format
    ●   replacement for SequenceFile
●   includes a textual encoding
●   handles versioning
    ●   if schema changes
    ●   can still process data
●   hope Hadoop apps will
    ●   upgrade from text; standardize on Avro for data
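A rough sketch of the file format API (`org.apache.avro.file`): the schema is written once in the file header and factored out of the records, so the reader needs no schema argument. The `Word` schema and record contents are illustrative assumptions.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DataFileDemo {
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"name\": \"Word\", \"type\": \"record\", \"fields\": ["
      + "{\"name\": \"word\", \"type\": \"string\"},"
      + "{\"name\": \"count\", \"type\": \"long\"}]}");

  public static long writeAndSum() throws Exception {
    File f = File.createTempFile("words", ".avro");
    f.deleteOnExit();

    // Write: schema goes into the header, records into binary blocks.
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(SCHEMA));
    writer.create(SCHEMA, f);
    for (long i = 1; i <= 3; i++) {
      GenericRecord r = new GenericData.Record(SCHEMA);
      r.put("word", "w" + i);
      r.put("count", i);
      writer.append(r);
    }
    writer.close();

    // Read: the schema is recovered from the file header.
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        f, new GenericDatumReader<GenericRecord>());
    long sum = 0;
    for (GenericRecord r : reader)
      sum += (Long) r.get("count");
    reader.close();
    return sum;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(writeAndSum());
  }
}
```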
Avro MapReduce API
●   Single-valued inputs and outputs
    ●   key/value pairs only required for intermediate
●   map(IN, Collector<OUT>)
    ●   map-only jobs never need to create k/v pairs
●   map(IN, Collector<Pair<K,V>>)
●   reduce(K, Iterable<V>, Collector<OUT>)
    ●   if IN and OUT are pairs, default is sort
●   In Avro trunk today, built on Hadoop 0.20 APIs.
    ●   in the Avro 1.4.0 release next month
Avro MapReduce Example
public void map(Utf8 text, AvroCollector<Pair<Utf8,Long>> c,
                Reporter r) throws IOException {

  StringTokenizer i = new StringTokenizer(text.toString());

  while (i.hasMoreTokens())
    c.collect(new Pair<Utf8,Long>(new Utf8(i.nextToken()), 1L));
}

public void reduce(Utf8 word, Iterable<Long> counts,
                   AvroCollector<Pair<Utf8,Long>> c,
                   Reporter r) throws IOException {
  long sum = 0;
  for (long count : counts)
    sum += count;
  c.collect(new Pair<Utf8,Long>(word, sum));
}
Avro RPC


●   leverage versioning support
    ●   permit different versions of services to interoperate
●   for Hadoop, will
    ●   let apps talk to clusters running different versions
    ●   provide cross-language access
Avro IDL Protocol

@namespace("org.apache.avro.test")

protocol HelloWorld {

  record Greeting {
    string who;
    string what;
  }

  Greeting hello(Greeting greeting);
}
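The protocol above can be exercised through Avro's generic RPC classes (`avro-ipc`) with no generated stubs. This is a hedged sketch under several assumptions: the protocol is written in its JSON form (which the runtime parses directly), an in-process `LocalTransceiver` stands in for a network server, and the responder's reply logic is made up for illustration.

```java
import org.apache.avro.Protocol;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.ipc.LocalTransceiver;
import org.apache.avro.ipc.generic.GenericRequestor;
import org.apache.avro.ipc.generic.GenericResponder;

public class HelloRpcDemo {
  // JSON form of the HelloWorld protocol from the IDL slide.
  static final Protocol PROTOCOL = Protocol.parse(
      "{\"protocol\": \"HelloWorld\", \"namespace\": \"org.apache.avro.test\","
      + " \"types\": [{\"name\": \"Greeting\", \"type\": \"record\", \"fields\": ["
      + "   {\"name\": \"who\", \"type\": \"string\"},"
      + "   {\"name\": \"what\", \"type\": \"string\"}]}],"
      + " \"messages\": {\"hello\": {"
      + "   \"request\": [{\"name\": \"greeting\", \"type\": \"Greeting\"}],"
      + "   \"response\": \"Greeting\"}}}");

  // Server side: a generic responder, no generated classes.
  static class HelloResponder extends GenericResponder {
    HelloResponder() { super(PROTOCOL); }
    @Override
    public Object respond(Protocol.Message message, Object request) {
      GenericRecord in = (GenericRecord) ((GenericRecord) request).get("greeting");
      GenericRecord out = new GenericData.Record(getLocal().getType("Greeting"));
      out.put("who", in.get("who").toString());
      out.put("what", "hello back");  // made-up reply for the sketch
      return out;
    }
  }

  public static String call() throws Exception {
    // Client side: an in-process transceiver wired straight to the responder.
    GenericRequestor client =
        new GenericRequestor(PROTOCOL, new LocalTransceiver(new HelloResponder()));
    GenericRecord g = new GenericData.Record(PROTOCOL.getType("Greeting"));
    g.put("who", "avro");
    g.put("what", "hi");
    GenericRecord params =
        new GenericData.Record(PROTOCOL.getMessages().get("hello").getRequest());
    params.put("greeting", g);
    GenericRecord reply = (GenericRecord) client.request("hello", params);
    return reply.get("who") + ":" + reply.get("what");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(call());
  }
}
```

Swapping `LocalTransceiver` for a socket-based transceiver/server pair gives the same call over the network, which is what makes cross-version and cross-language access work.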
Avro Status
●   Current
    ●   C, C++, Java, Python & Ruby APIs
    ●   Interoperable RPC and data
    ●   MapReduce API for Java
●   Upcoming
    ●   MapReduce APIs for other languages
        –   efficient, rich data
    ●   RPC used in Flume, HBase, Cassandra, Hadoop, etc.
        –   inter-version compatibility
        –   non-Java clients
Questions?
