Apache AVRO
  What's new?
Philip Zeyliger, Cloudera
   (AVRO committer)

     Boston HUG
   January 19, 2010
What's AVRO?

 A data serialization system
 Includes:
     A schema language
     A compact serialized form
     An RPC framework
     A handful of APIs, in a handful of languages
 Goals:
     Cross-language
     Support for dynamic access
     Simple but expressive schema evolution

 Same "space" as Apache Thrift, Google Protocol Buffers,
 Binary JSON, and XDR. Subtle differences with all of them.
AVRO Protocols & Schemas
@namespace("org.apache.avro.demo")
protocol CurrencyConversion {
  enum Currency {
    USD,
    GBP,
    EUR,
    JPY
  }
  record Money {
    Currency currency;
    int amount;
  }
  error UnknownRateError {
    Currency currency;
  }
  Money convert(Money input, Currency targetCurrency)
    throws UnknownRateError;
  double rate(Currency input, Currency output) throws UnknownRateError;
}




              "genavro" IDL (AVRO-258)
$ java -jar avro-tools-1.2.0-dev.jar genavro < demo.genavro
{
  "protocol" : "CurrencyConversion",
  "namespace" : "org.apache.avro.demo",
  "types" : [ {
    "type" : "enum",
    "name" : "Currency",
    "symbols" : [ "USD", "GBP", "EUR", "JPY" ]
  }, {
    "type" : "record",
    "name" : "Money",
    "fields" : [ {
      "name" : "currency",
      "type" : "Currency"
    }, {
      "name" : "amount",
      "type" : "int"
    } ]
  }, {
    "type" : "error",
    "name" : "UnknownRateError",
    "fields" : [ {
      "name" : "currency",
      "type" : "Currency"
    } ]
  } ],
  "messages" : {
    "convert" : {
      "request" : [ {
        "name" : "input",
        "type" : "Money"
      }, {
        "name" : "targetCurrency",
        "type" : "Currency"
      } ],
      "response" : "Money",
      "errors" : [ "UnknownRateError" ]
    },
    "rate" : {
      "request" : [ {
        "name" : "input",
        "type" : "Currency"
      }, {
        "name" : "output",
        "type" : "Currency"
      } ],
      "response" : "double",
      "errors" : [ "UnknownRateError" ]
    }
  }
}




             JSON Representation of Protocol and Schemas
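
Because the protocol is plain JSON, generic tooling can inspect it without generated code. The following is a minimal sketch using only Python's standard `json` module, not the Avro library; the `signatures` helper is hypothetical and the protocol literal is an abbreviated copy of the JSON above with "types" left out.

```python
import json

# Abbreviated copy of the protocol JSON ("types" omitted for brevity).
PROTOCOL_JSON = """
{
  "protocol": "CurrencyConversion",
  "namespace": "org.apache.avro.demo",
  "messages": {
    "convert": {
      "request": [{"name": "input", "type": "Money"},
                  {"name": "targetCurrency", "type": "Currency"}],
      "response": "Money",
      "errors": ["UnknownRateError"]
    },
    "rate": {
      "request": [{"name": "input", "type": "Currency"},
                  {"name": "output", "type": "Currency"}],
      "response": "double",
      "errors": ["UnknownRateError"]
    }
  }
}
"""


def signatures(protocol_json):
    """Hypothetical helper: render each message as an IDL-like signature."""
    protocol = json.loads(protocol_json)
    result = []
    for name, message in protocol["messages"].items():
        params = ", ".join(f'{p["type"]} {p["name"]}' for p in message["request"])
        line = f'{message["response"]} {name}({params})'
        errors = ", ".join(message.get("errors", []))
        if errors:
            line += f" throws {errors}"
        result.append(line)
    return result


for line in signatures(PROTOCOL_JSON):
    print(line)
```

The printed lines mirror the two message declarations in the genavro source, which is a quick way to check that the IDL and its JSON form agree.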
Types

primitive               complex
string                  record
bytes                   array
int & long              map: string -> T
float & double          union
boolean                 fixed<N>
null                    enum
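
To make "compact serialized form" concrete, here is a minimal sketch of how a few primitives are encoded, following the binary encoding described in the Avro specification: int and long use zig-zag plus variable-length coding, and strings are a length followed by UTF-8 bytes. The function names are illustrative assumptions, not the Avro Python API.

```python
import struct


def encode_long(n):
    """Zig-zag, then base-128 varint. Avro uses this for both int and long."""
    n = (n << 1) ^ (n >> 63)      # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)   # low 7 bits, with the continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_long(buf, pos=0):
    """Inverse of encode_long; returns (value, position after the value)."""
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos


def encode_string(s):
    """Length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_double(x):
    """Eight bytes, little-endian IEEE 754."""
    return struct.pack("<d", x)


def encode_boolean(b):
    """A single byte: 1 for true, 0 for false."""
    return b"\x01" if b else b"\x00"


print(encode_long(-1).hex())        # 01
print(encode_long(64).hex())        # 8001
print(encode_string("foo").hex())   # 06666f6f
```

No field names or type tags appear in the output, which is why a reader needs the writer's schema to interpret the bytes.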
Schema Evolution & Projection
         AVRO binary data never travels without its schema. This
         allows dynamic tooling.
         Writer's Schema and Reader's Schema may be different.
{ /* Writer */                      { /* Reader */
  "type" : "record",                  "type" : "record",
  "name" : "Person",                  "name" : "Person",
  "fields" : [ {                      "fields" : [ {
    "name" : "first",                   "name" : "first",
    "type" : "string"                   "type" : "string"
  }, {                                }, {
    "name" : "sport",                   "name" : "age",
    "type" : "string"                   "type" : "int",
  } ]                                   "default" : 0
}                                     } ]
                                    }

Serialized Data:                   Data presented to application:

    "Alice", "Ultimate Frisbee"     "Alice", 0
APIs

 Python
    Dynamic
 Java
    Specific (generated code)
    Generic (container-based)
    Reflection (induces schemas from classes)
 C
 C++
 Ruby
C API

#include <avro.h>

/* Write a string datum into an in-memory buffer. */
char buf[64];
avro_writer_t writer = avro_writer_memory(buf, sizeof(buf));
avro_schema_t writers_schema = avro_schema_string();
avro_datum_t datum = avro_string("Hello, world!");
avro_write_data(writer, writers_schema, datum);

/* Read it back, resolving the writer's schema against the reader's. */
avro_reader_t reader = avro_reader_memory(buf, sizeof(buf));
avro_schema_t readers_schema = avro_schema_string();
avro_datum_t read_datum;
avro_read_data(reader, writers_schema, readers_schema, &read_datum);
Data File Format
(AVRO-160)

Features:
* Splittable (important for Hadoop!)
* Append-only, with the same schema
* Compression
* Compression
* Arbitrary metadata
* Simple
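
Why sync markers make a file splittable can be shown with a toy container. The layout below is an assumption made up for this sketch (fixed-width integers, JSON metadata, a made-up magic value) and is not byte-compatible with real Avro data files. What it shares with them is the idea: a header carrying metadata and a random sync marker, followed by compressed blocks that each end with that marker.

```python
import json
import os
import struct
import zlib

MAGIC = b"TOY1"       # made-up magic for this sketch; real Avro files differ
SYNC_SIZE = 16


def write_file(schema, blocks):
    """Build a file image: header, then one compressed block per list of records."""
    sync = os.urandom(SYNC_SIZE)                      # random per-file sync marker
    meta = json.dumps({"schema": schema, "codec": "deflate"}).encode("utf-8")
    out = bytearray(MAGIC + struct.pack("<I", len(meta)) + meta + sync)
    for records in blocks:
        raw = b"".join(struct.pack("<I", len(r)) + r for r in records)
        payload = zlib.compress(raw)
        out += struct.pack("<II", len(records), len(payload)) + payload + sync
    return bytes(out)


def read_split(data, start, end):
    """Return records of every block whose preceding sync marker starts in [start, end)."""
    assert data[:4] == MAGIC
    (meta_len,) = struct.unpack_from("<I", data, 4)
    header_sync = 8 + meta_len
    sync = data[header_sync:header_sync + SYNC_SIZE]
    pos = data.find(sync, max(start, header_sync))    # first marker at or after start
    records = []
    while pos != -1 and pos < end:
        pos += SYNC_SIZE                              # step over the marker
        if pos >= len(data):
            break                                     # that marker closed the last block
        count, size = struct.unpack_from("<II", data, pos)
        raw = zlib.decompress(data[pos + 8:pos + 8 + size])
        offset = 0
        for _ in range(count):
            (n,) = struct.unpack_from("<I", raw, offset)
            records.append(raw[offset + 4:offset + 4 + n])
            offset += 4 + n
        pos += 8 + size                               # now at the marker ending this block
    return records


blocks = [[b"a", b"bb"], [b"ccc"], [b"dddd", b"e"]]
data = write_file('"bytes"', blocks)
middle = len(data) // 2
# Two readers, each given half of the byte range, together see every record once.
print(read_split(data, 0, middle) + read_split(data, middle, len(data)))
```

A reader handed an arbitrary byte range scans forward to the next marker and starts there, so a split boundary never has to line up with a block boundary.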
Hadoop Integration
 Users
    AvroInputFormat/AvroOutputFormat (MR-815)
    Using AVRO in the shuffle (MR-1126)
        AVRO schemas let you specify sort order, so hand-written
        RawComparators are a thing of the past; for Streaming, you
        now get fast comparators for free!
    Many Writables can be AVRO+Reflection instead
 Framework
    AVRO for Hadoop RPC (e.g., HDFS-982)
 Goals
    Open up protocols for cross-language use
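
A schema-driven sort order might look like the sketch below. In the Avro specification, a record field can carry an "order" attribute of "ascending" (the default), "descending", or "ignore", and records compare field by field in schema order. Avro's own comparator works directly on the serialized bytes; this sketch compares decoded records for brevity. The schema is a hypothetical variant of Money with a string currency and an extra `note` field.

```python
from functools import cmp_to_key

# Hypothetical schema for illustration: sort by currency, then by amount descending,
# and never look at the note field.
SCHEMA = {
    "type": "record",
    "name": "Money",
    "fields": [
        {"name": "currency", "type": "string"},
        {"name": "amount", "type": "int", "order": "descending"},
        {"name": "note", "type": "string", "order": "ignore"},
    ],
}


def compare(schema, a, b):
    """Compare two records field by field, honoring each field's 'order'."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue
        x, y = a[field["name"]], b[field["name"]]
        if x != y:
            result = -1 if x < y else 1
            return -result if order == "descending" else result
    return 0


records = [
    {"currency": "USD", "amount": 5, "note": "z"},
    {"currency": "EUR", "amount": 1, "note": "y"},
    {"currency": "USD", "amount": 9, "note": "a"},
]
ordered = sorted(records, key=cmp_to_key(lambda a, b: compare(SCHEMA, a, b)))
print([(r["currency"], r["amount"]) for r in ordered])
# [('EUR', 1), ('USD', 9), ('USD', 5)]
```

Since the ordering lives in the schema, any language binding can sort the same data the same way without a hand-written comparator per type.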
avro-tools

Available tools:
   compile Generates Java code for the given schema.
fragtojson Renders a binary-encoded Avro datum as JSON.
  fromjson Reads JSON records and writes an Avro data file.
   genavro Generates a JSON schema from a GenAvro file
 getschema Prints out schema of an Avro data file.
    induce Induce a schema/protocol from Java class/interface.
jsontofrag Renders a JSON-encoded Avro datum as binary.
rpcreceive Opens an HTTP RPC Server and listens for one message.
   rpcsend Sends a single RPC message.
    tojson Dumps an Avro data file as JSON, one record per line.
1.3 to be released soon...

Good time to try it out!

What's evolving?
  Trying not to evolve the serialized format.
  APIs are evolving.
  Transports are evolving.
Obligatory Links

  Web page:
  http://hadoop.apache.org/avro/
  Mailing list:
  avro-user-subscribe@hadoop.apache.org
  Source repository:
  http://svn.apache.org/repos/asf/hadoop/avro/
Thanks!


    Questions?


    Philip Zeyliger
philip@cloudera.com

Apache AVRO (Boston HUG, Jan 19, 2010)
