Introduction to Apache Avro

Doug Cutting
21 July, 2010
Avro is...
●   data serialization
●   file format
●   RPC format
Existing Serialization Systems:
   Protocol Buffers & Thrift
●   expressive
●   efficient (small & fast)
●   but not very dynamic
    ●   cannot browse arbitrary data
    ●   viewing a new datatype
        –   requires code generation & load
    ●   writing a new datatype
        –   requires generating schema text
        –   plus code generation & load
Avro Serialization
●   specifies a serialization format
●   schemas are written in JSON
    ●   every language already has a JSON parser
●   each language implements a data reader & writer
    ●   in normal code
●   code generation is optional
    ●   sometimes useful in statically typed languages
●   data is untagged
    ●   schema required to read/write
Avro Schema Evolution
●   writer's schema always provided to reader
●   so reader can compare:
    ●   the schema used to write with
    ●   the schema expected by application
●   fields that match (name & type) are read
●   fields written that don't match are skipped
●   expected fields not written can be identified
●   same features as provided by numeric field ids
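The resolution rules above can be sketched with Avro's generic Java API, which accepts both a writer's and a reader's schema. This is a minimal illustration, not from the slides: the schemas, field names, and the post-1.5 `EncoderFactory`/`DecoderFactory` style are assumptions.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
  // Writer's schema: id + length (illustrative).
  static final Schema WRITER = new Schema.Parser().parse(
      "{\"name\": \"Rec\", \"type\": \"record\", \"fields\": ["
      + "{\"name\": \"id\", \"type\": \"string\"},"
      + "{\"name\": \"length\", \"type\": \"int\"}]}");

  // Reader's schema: drops 'length', adds 'owner' with a default.
  static final Schema READER = new Schema.Parser().parse(
      "{\"name\": \"Rec\", \"type\": \"record\", \"fields\": ["
      + "{\"name\": \"id\", \"type\": \"string\"},"
      + "{\"name\": \"owner\", \"type\": \"string\", \"default\": \"unknown\"}]}");

  public static String roundTrip() throws Exception {
    GenericRecord rec = new GenericData.Record(WRITER);
    rec.put("id", "blk-1");
    rec.put("length", 64);

    // Encode with the writer's schema.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(WRITER).write(rec, enc);
    enc.flush();

    // Decode with both schemas: 'length' is skipped, and the
    // missing 'owner' field is filled from the reader's default.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord back =
        new GenericDatumReader<GenericRecord>(WRITER, READER).read(null, dec);
    return back.get("id") + "/" + back.get("owner");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(roundTrip());
  }
}
```

No numeric field ids appear anywhere: matching is by field name, with defaults covering fields the writer never wrote.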
Avro JSON Schemas
// a simple three-element record
{"name": "Block", "type": "record":,
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "length", "type": "integer"},
    {"name": "hosts", "type":
      {"type": "array:, "items": "string"}}
  ]
}

// a linked list of strings or ints
{"name": "MyList", "type": "record":,
  "fields": [
    {"name": "value", "type": ["string", "int"]},
    {"name": "next", "type": ["MyList", "null"]}
  ]
}
Avro IDL Schemas
// a simple three-element record
record Block {
  string id;
  int length;
  array<string> hosts;
}

// a linked list of strings or ints
record MyList {
  union {string, int} value;
  union {MyList, null} next;
}
Hadoop Data Formats
●   Today, primarily
    ●    text
         –   pro: interoperable
         –   con: not expressive, inefficient
    ●    Java Writable
         –   pro: expressive, efficient
         –   con: platform-specific, fragile
Avro Data
●   expressive
●   small & fast
●   dynamic
    ●   schema stored with data
        –   but factored out of instances
    ●   APIs permit reading & creating
        –   new datatypes without generating & loading code
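The "no code generation" claim can be sketched with the generic API: a schema arrives as a JSON string at runtime and instances are built and read immediately. The `Point` schema below is a made-up example, not from the slides.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericDemo {
  public static String build() {
    // A schema never seen before, parsed at runtime from JSON.
    Schema schema = new Schema.Parser().parse(
        "{\"name\": \"Point\", \"type\": \"record\", \"fields\": ["
        + "{\"name\": \"x\", \"type\": \"int\"},"
        + "{\"name\": \"y\", \"type\": \"int\"}]}");

    // No generated class is loaded; GenericRecord works for any schema.
    GenericRecord p = new GenericData.Record(schema);
    p.put("x", 3);
    p.put("y", 4);
    return p.getSchema().getName() + ":" + p.get("x") + "," + p.get("y");
  }

  public static void main(String[] args) {
    System.out.println(build());
  }
}
```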
Avro Data
●   includes a file format
    ●   replacement for SequenceFile
●   includes a textual encoding
●   handles versioning
    ●   if schema changes
    ●   can still process data
●   hope Hadoop apps will
    ●   upgrade from text; standardize on Avro for data
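A rough sketch of the file format API (`org.apache.avro.file`): the schema is written once in the file header and factored out of the records, so the reader needs no schema argument. The `Word` schema and record contents are illustrative assumptions.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DataFileDemo {
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"name\": \"Word\", \"type\": \"record\", \"fields\": ["
      + "{\"name\": \"word\", \"type\": \"string\"},"
      + "{\"name\": \"count\", \"type\": \"long\"}]}");

  public static long writeAndSum() throws Exception {
    File f = File.createTempFile("words", ".avro");
    f.deleteOnExit();

    // Write: schema goes into the header, records into binary blocks.
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(SCHEMA));
    writer.create(SCHEMA, f);
    for (long i = 1; i <= 3; i++) {
      GenericRecord r = new GenericData.Record(SCHEMA);
      r.put("word", "w" + i);
      r.put("count", i);
      writer.append(r);
    }
    writer.close();

    // Read: the schema is recovered from the file header.
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        f, new GenericDatumReader<GenericRecord>());
    long sum = 0;
    for (GenericRecord r : reader)
      sum += (Long) r.get("count");
    reader.close();
    return sum;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(writeAndSum());
  }
}
```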
Avro MapReduce API
●   Single-valued inputs and outputs
    ●   key/value pairs only required for intermediate
●   map(IN, Collector<OUT>)
    ●   map-only jobs never need to create k/v pairs
●   map(IN, Collector<Pair<K,V>>)
●   reduce(K, Iterable<V>, Collector<OUT>)
    ●   if IN and OUT are pairs, default is sort
●   In Avro trunk today, built on Hadoop 0.20 APIs.
    ●   in the Avro 1.4.0 release next month
Avro MapReduce Example
public void map(Utf8 text, AvroCollector<Pair<Utf8,Long>> c,
                Reporter r) throws IOException {

  StringTokenizer i = new StringTokenizer(text.toString());

  while (i.hasMoreTokens())
    c.collect(new Pair<Utf8,Long>(new Utf8(i.nextToken()), 1L));
}

public void reduce(Utf8 word, Iterable<Long> counts,
                   AvroCollector<Pair<Utf8,Long>> c,
                   Reporter r) throws IOException {
  long sum = 0;
  for (long count : counts)
    sum += count;
  c.collect(new Pair<Utf8,Long>(word, sum));
}
Avro RPC


●   leverage versioning support
    ●   permit different versions of services to interoperate
●   for Hadoop, will
    ●   let apps talk to clusters running different versions
    ●   provide cross-language access
Avro IDL Protocol

@namespace("org.apache.avro.test")

protocol HelloWorld {

  record Greeting {
    string who;
    string what;
  }

  Greeting hello(Greeting greeting);
}
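The protocol above can be exercised through Avro's generic RPC classes (`avro-ipc`) with no generated stubs. This is a hedged sketch under several assumptions: the protocol is written in its JSON form (which the runtime parses directly), an in-process `LocalTransceiver` stands in for a network server, and the responder's reply logic is made up for illustration.

```java
import org.apache.avro.Protocol;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.ipc.LocalTransceiver;
import org.apache.avro.ipc.generic.GenericRequestor;
import org.apache.avro.ipc.generic.GenericResponder;

public class HelloRpcDemo {
  // JSON form of the HelloWorld protocol from the IDL slide.
  static final Protocol PROTOCOL = Protocol.parse(
      "{\"protocol\": \"HelloWorld\", \"namespace\": \"org.apache.avro.test\","
      + " \"types\": [{\"name\": \"Greeting\", \"type\": \"record\", \"fields\": ["
      + "   {\"name\": \"who\", \"type\": \"string\"},"
      + "   {\"name\": \"what\", \"type\": \"string\"}]}],"
      + " \"messages\": {\"hello\": {"
      + "   \"request\": [{\"name\": \"greeting\", \"type\": \"Greeting\"}],"
      + "   \"response\": \"Greeting\"}}}");

  // Server side: a generic responder, no generated classes.
  static class HelloResponder extends GenericResponder {
    HelloResponder() { super(PROTOCOL); }
    @Override
    public Object respond(Protocol.Message message, Object request) {
      GenericRecord in = (GenericRecord) ((GenericRecord) request).get("greeting");
      GenericRecord out = new GenericData.Record(getLocal().getType("Greeting"));
      out.put("who", in.get("who").toString());
      out.put("what", "hello back");  // made-up reply for the sketch
      return out;
    }
  }

  public static String call() throws Exception {
    // Client side: an in-process transceiver wired straight to the responder.
    GenericRequestor client =
        new GenericRequestor(PROTOCOL, new LocalTransceiver(new HelloResponder()));
    GenericRecord g = new GenericData.Record(PROTOCOL.getType("Greeting"));
    g.put("who", "avro");
    g.put("what", "hi");
    GenericRecord params =
        new GenericData.Record(PROTOCOL.getMessages().get("hello").getRequest());
    params.put("greeting", g);
    GenericRecord reply = (GenericRecord) client.request("hello", params);
    return reply.get("who") + ":" + reply.get("what");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(call());
  }
}
```

Swapping `LocalTransceiver` for a socket-based transceiver/server pair gives the same call over the network, which is what makes cross-version and cross-language access work.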
Avro Status
●   Current
    ●   C, C++, Java, Python & Ruby APIs
    ●   Interoperable RPC and data
    ●   MapReduce API for Java
●   Upcoming
    ●   MapReduce APIs for other languages
        –   efficient, rich data
    ●   RPC used in Flume, HBase, Cassandra, Hadoop, etc.
        –   inter-version compatibility
        –   non-Java clients
Questions?
