Philip Zeyliger, Cloudera
January 19, 2009
A data serialization system
A schema language
A compact serialized form
An RPC framework
A handful of APIs, in a handful of languages
Support for dynamic access
Simple but expressive schema evolution
Same "space" as Apache Thrift, Google Protocol Buffers,
Binary JSON, and XDR. Subtle differences with all of them.
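
To make "a schema language" and "a compact serialized form" concrete, here is a minimal sketch of how a small record could be encoded by hand. It assumes the binary encoding rules as described in the Avro specification (zig-zag variable-length longs, length-prefixed UTF-8 strings, record fields concatenated in schema order). The function names and the `SCHEMA` constant are illustrative only and are not part of any Avro API.

```python
# Hand-rolled sketch of Avro's binary encoding for a two-field record.
# Standard library only; names are hypothetical, not an Avro API.

# The schema is plain JSON. The serialized form carries no field names or
# tags, so a reader needs this schema to interpret the bytes.
SCHEMA = {
    "type": "record",
    "name": "Test",
    "fields": [
        {"name": "a", "type": "long"},
        {"name": "b", "type": "string"},
    ],
}


def encode_long(n: int) -> bytes:
    """Zig-zag, then variable-length encode a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length in bytes (as a long) followed by the UTF-8 data."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(a: int, b: str) -> bytes:
    """Field values in schema order, with nothing in between."""
    return encode_long(a) + encode_string(b)


if __name__ == "__main__":
    print(encode_record(27, "foo").hex())  # prints 3606666f6f
```

Five bytes for a long plus a three-character string: the schema, not the payload, carries the structure.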
Data File Format
(important for Hadoop!)
* Append only with same
* Arbitrary metadata
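
A rough sketch of the data file layout, assuming the object container format as described in the Avro specification: a 4-byte magic, a metadata map that includes the schema under `avro.schema`, a 16-byte sync marker, then blocks that each end with the same marker. The helper names and the `owner` metadata key are hypothetical; this is not the Avro library's writer.

```python
# Sketch of an Avro data file header and blocks, standard library only.
# Helper names are illustrative and not part of any Avro API.
import json
import os

MAGIC = b"Obj\x01"  # assumed: version 1 of the container format


def encode_long(n: int) -> bytes:
    """Zig-zag, then variable-length encode a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(b: bytes) -> bytes:
    """Length prefix followed by the raw bytes (strings use the same shape)."""
    return encode_long(len(b)) + b


def encode_header(schema, extra_meta: dict, sync: bytes) -> bytes:
    """Magic, metadata map (schema plus arbitrary user entries), sync marker."""
    meta = {"avro.schema": json.dumps(schema).encode("utf-8")}
    meta.update(extra_meta)
    out = bytearray(MAGIC)
    out += encode_long(len(meta))  # a single map block holding every entry
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0)  # a zero count terminates the map
    out += sync
    return bytes(out)


def encode_block(encoded_records: list, sync: bytes) -> bytes:
    """Record count, byte size, the records themselves, then the sync marker."""
    body = b"".join(encoded_records)
    return encode_long(len(encoded_records)) + encode_long(len(body)) + body + sync


if __name__ == "__main__":
    sync = os.urandom(16)
    data = encode_header("long", {"owner": b"demo"}, sync)
    data += encode_block([encode_long(1), encode_long(2)], sync)
    # Appending only adds another block ending in the same sync marker.
    data += encode_block([encode_long(3)], sync)
    print(len(data), "bytes,", data.count(sync), "sync markers")
```

Because every block is delimited by the sync marker, a reader can seek to an arbitrary offset and scan forward to the next block boundary, which is what makes the format splittable for Hadoop.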
Using Avro in the shuffle (MR-1126)
Note that Avro schemas let you specify sort order;
hand-written binary comparators are a thing of the past
Many Writables can be Avro+Reflection instead
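
The sort-order point above can be sketched as follows. This assumes the per-field `order` attribute from the Avro specification (`ascending` by default, `descending`, or `ignore`). The `Event` schema and `make_comparator` are hypothetical, and the sketch compares decoded values for readability, whereas the appeal for the shuffle is that the same ordering can be applied directly to the serialized bytes.

```python
# Sketch: derive a record comparator from the "order" attributes in a schema.
# Standard library only; the schema and function names are illustrative.
import json
from functools import cmp_to_key

SCHEMA_JSON = """
{"type": "record", "name": "Event",
 "fields": [
   {"name": "user",  "type": "string"},
   {"name": "score", "type": "long",   "order": "descending"},
   {"name": "note",  "type": "string", "order": "ignore"}
 ]}
"""


def make_comparator(schema: dict):
    """Compare records field by field, in schema order, honoring each field's order."""
    fields = [(f["name"], f.get("order", "ascending")) for f in schema["fields"]]

    def compare(x: dict, y: dict) -> int:
        for name, order in fields:
            if order == "ignore":
                continue  # this field never affects the ordering
            a, b = x[name], y[name]
            if a != b:
                result = -1 if a < b else 1
                return -result if order == "descending" else result
        return 0

    return compare


if __name__ == "__main__":
    records = [
        {"user": "bob", "score": 5, "note": "x"},
        {"user": "alice", "score": 1, "note": "z"},
        {"user": "alice", "score": 9, "note": "a"},
    ]
    compare = make_comparator(json.loads(SCHEMA_JSON))
    for r in sorted(records, key=cmp_to_key(compare)):
        print(r["user"], r["score"])
    # prints: alice 9, alice 1, bob 5 (one per line)
```

The ordering lives in the schema rather than in a per-type comparator class, so no sorting code has to be written for each new record type.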
Avro sort order leaves hand-writing RawComparators in
the past; for Streaming, you now get fast comparators for
Avro for Hadoop RPC (e.g., HDFS-982)
Open up protocols for cross-language use
compile     Generates Java code for the given schema.
fragtojson  Renders a binary-encoded Avro datum as JSON.
fromjson    Reads JSON records and writes an Avro data file.
genavro     Generates a JSON schema from a GenAvro file.
getschema   Prints out schema of an Avro data file.
induce      Induce a schema/protocol from Java class/interface.
jsontofrag  Renders a JSON-encoded Avro datum as binary.
rpcreceive  Opens an HTTP RPC Server and listens for one message.
rpcsend     Sends a single RPC message.
tojson      Dumps an Avro data file as JSON, one record per line.
1.3 to be released soon...
Good time to try it out!
Trying not to evolve the serialized format.
APIs are evolving.
Transports are evolving.