Avro
 Etymology & History
 Sexy Tractors
 Project Drivers & Overview
 Serialization
 RPC
 Hadoop Support
Etymology
 British aircraft manufacturer
 1910-1963
History
 Doug Cutting – Cloudera, Hadoop project founder
 2002 – Nutch
 2004 – Google GFS, MapReduce whitepapers
 2005 – NDFS & MR, Writable & SequenceFile
 2006 – Hadoop split from Nutch, renamed NDFS to
  HDFS
 2007 – Yahoo gets involved, HBase, Pig, Zookeeper
 2008 – TeraSort contest winner, Hive, Mahout,
  Cassandra
 2009 – Oozie, Flume, Hue
History
 Underlying serialization system basically unchanged
 Additional language support and data formats
 Language, data format combinatorial explosion
    C++ JSON to Java BSON
    Python Smile to PHP CSV
 Apr 2009 – Avro proposal
 May 2010 – Top-level project
Sexy Tractors
 Data serialization tools, like tractors, aren’t sexy
 They should be!
 Dollar for dollar storage capacity has increased
  exponentially, doubling every 1.5-2 years
 Throughput of magnetic storage and network has not
  maintained this pace
 Distributed systems are the norm
 Efficient data serialization techniques and tools are
  vital
Project Drivers
 Common data format for serialization and RPC
 Dynamic
 Expressive
 Efficient
 File format
    Well defined
    Standalone
    Splittable & compressed
Biased Comparison

                       CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language Independent   Yes   Yes        No             Yes           Yes
Expressive             No    Yes        Yes            Yes           Yes
Efficient              No    No         Yes            Yes           Yes
Dynamic                Yes   Yes        No             No            Yes
Standalone             ?     Yes        No             No            Yes
Splittable             ?     ?          Yes            ?             Yes
Project Overview
 Specification based design
 Dynamic implementations
 File format
 Schemas
    Must support JSON implementation
    IDL often supported
    Evolvable
 First class Hadoop support
Specification Based Design
 Schemas
 Encoding
 Sort order
 Object container files
 Codecs
 Protocol
 Protocol wire format
 Schema resolution
Specification Based Design
 Schemas
    Primitive types
        Null, boolean, int, long, float, double, bytes, string
    Complex types
      Records, enums, arrays, maps, unions and fixed

    Named types
      Records, enums, fixed
      Name & namespace

    Aliases
    http://avro.apache.org/docs/current/spec.html#schemas
Schema Example
log-message.avpr

{
    "namespace": "com.emoney",
    "name": "LogMessage",
    "type": "record",
    "fields": [
       {"name": "level", "type": "string", "comment" : "this is ignored"},
       {"name": "message", "type": "string", "description" : "this is the message"},
       {"name": "dateTime", "type": "long"},
       {"name": "exceptionMessage", "type": ["null", "string"]}
    ]
}
Specification Based Design
 Encodings
    JSON – for debugging
    Binary
 Sort order
    Efficient sorting by a system other than the writer
    Sorting binary-encoded data without deserialization
Specification Based Design
 Object container files
    Schema
    Serialized data written to binary-encoded blocks
    Blocks may be compressed
    Synchronization markers
 Codecs
    Null
    Deflate
    Snappy (optional)
    LZO (future)
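The block layout above can be sketched with only the JDK: an object count, the payload size, the deflate-compressed payload, then a 16-byte sync marker that lets a reader re-align at any split boundary. This is a toy, not the Avro library; real Avro writes the count and size as zig-zag varlongs, where this sketch uses single bytes for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy sketch of one data block in an object container file:
//   count | size | deflate(payload) | 16-byte sync marker
public class ToyContainerBlock {

    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    public static byte[] inflate(byte[] compressed) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    // count and size as single bytes for brevity; Avro uses zig-zag varlongs
    public static byte[] writeBlock(int count, byte[] payload, byte[] sync) {
        byte[] compressed = deflate(payload);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(count);
        out.write(compressed.length);
        out.write(compressed, 0, compressed.length);
        out.write(sync, 0, sync.length);          // marker makes splits findable
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sync = new byte[16];               // normally random, fixed per file
        StringBuilder records = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            records.append("INFO started\n");
        }
        byte[] block = writeBlock(20, records.toString().getBytes(), sync);
        System.out.println("payload " + records.length()
                + " bytes, block " + block.length + " bytes");
    }
}
```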
Specification Based Design
 Protocol
    Protocol name
    Namespace
    Types
        Named types used in messages
    Messages
        Uniquely named message
        Request
        Response
        Errors
 Wire format
   Transports
   Framing
   Handshake
Protocol
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",

    "types": [
       {"name": "Greeting", "type": "record", "fields": [ {"name": "message", "type": "string"}]},
       {"name": "Curse", "type": "error", "fields": [ {"name": "message", "type": "string"}]} ],

    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting" }],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
Schema Resolution & Evolution
   Writer's schema always provided to the reader
   Compare schema used by writer & schema expected by reader
   Fields that match name & type are read
   Fields written that don’t match are skipped
   Expected fields not written can be identified
      Error or provide default value
 Same features as provided by numeric field ids
    Keeps fields symbolic, no index IDs written in data
 Allows for projections
    Very efficient at skipping fields
 Aliases
    Allows projections from 2 different types
       User transaction
          Count, date
       Batch
          Count, date
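A minimal sketch of the name-based matching rules above, using plain maps rather than the Avro API. The field names come from the earlier LogMessage example; the `host` field and its default are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of Avro-style schema resolution: fields are matched by NAME,
// writer-only fields are skipped, and reader-only fields fall back to a
// default or raise an error. Not the Avro library's API.
public class SchemaResolutionSketch {

    public static Map<String, Object> resolve(
            Map<String, Object> written,           // record as the writer produced it
            List<String> readerFields,             // fields the reader expects
            Map<String, Object> readerDefaults) {  // defaults for unwritten fields
        Map<String, Object> result = new LinkedHashMap<>();
        for (String field : readerFields) {
            if (written.containsKey(field)) {
                result.put(field, written.get(field));        // name matches: read it
            } else if (readerDefaults.containsKey(field)) {
                result.put(field, readerDefaults.get(field)); // not written: default
            } else {
                throw new IllegalStateException("no value or default for " + field);
            }
        }
        // anything in `written` but absent from readerFields is simply skipped
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("level", "WARN");
        written.put("message", "disk nearly full");
        written.put("dateTime", 1305000000000L);

        // Reader projects away "dateTime" and adds "host" with a default.
        Map<String, Object> record = resolve(
                written,
                Arrays.asList("level", "message", "host"),
                Collections.<String, Object>singletonMap("host", "unknown"));
        System.out.println(record);
    }
}
```

Because matching is symbolic, neither side ever writes numeric field IDs into the data, and a projection simply lists fewer reader fields.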
Implementations
   Core – parse schemas, read & write binary data for a schema
   Data file – read & write Avro data files
   Codec – supported codecs
   RPC/HTTP – make and receive calls over HTTP
Implementation   Core   Data file   Codec             RPC/HTTP
C                Yes    Yes         Deflate           Yes
C++              Yes    Yes         ?                 Yes
C#               Yes    No          N/A               No
Java             Yes    Yes         Deflate, Snappy   Yes
Python           Yes    Yes         Deflate           Yes
Ruby             Yes    Yes         Deflate           Yes
PHP              Yes    Yes         ?                 No
API
 Generic
    Generic attribute/value data structure
    Best suited for dynamic processing
 Specific
    Each record corresponds to a different kind of object in the
     programming language
    RPC systems typically use this
 Reflect
    Schemas generated via reflection
    Converting an existing codebase to use Avro
API
 Low-level
    Schema
    Encoders
    DatumWriter
    DatumReader
 High-level
    DataFileWriter
    DataFileReader
Java Example
Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));

OutputStream outputStream = new FileOutputStream("data.avro");

DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));

writer.setCodec(CodecFactory.deflateCodec(1));
writer.create(schema, outputStream);

writer.append(new Message());

writer.close();
Java Example
DataFileReader<Message> reader = new DataFileReader<Message>(
         new File("data.avro"),
         new GenericDatumReader<Message>());

for (Message next : reader) {
  System.out.println("next: " + next);
}
RPC
 Server
    SocketServer (non-standard)
    SaslSocketServer
    HttpServer
    NettyServer
    DatagramServer (non-standard)
 Responder
    Generic
    Reflect
    Specific
 Client
    Corresponding Transceiver
    LocalTransceiver
 Requestor
RPC
 Client
    Corresponding Transceiver for each server
    LocalTransceiver
 Requestor
RPC Server
Protocol protocol = Protocol.parse(new File("protocol.avpr"));

InetSocketAddress address = new InetSocketAddress("localhost", 33333);

GenericResponder responder = new GenericResponder(protocol) {
   @Override
   public Object respond(Protocol.Message message, Object request)
   throws Exception {
     ...
   }
};

new SocketServer(responder, address).join();
Hadoop Support
 File writers and readers
 Replacing RPC with Avro
    In Flume already
 Pig support is in
 Splittable
    Set block size when writing
 Tether jobs
    Connector framework for other languages
    Hadoop Pipes
Future
 RPC
    HBase, Cassandra, Hadoop core
 Hive in progress
 Tether jobs
    Actual MapReduce implementations in other languages
Avro
 Dynamic
 Expressive
 Efficient
 Specification based design
 Language implementations are fairly solid
 Serialization or RPC or both
 First class Hadoop support
 Currently 1.5.1
 Sexy tractors
