Avro
 Etymology & History
 Sexy Tractors
 Project Drivers & Overview
 Serialization
 RPC
 Hadoop Support
Etymology
 British aircraft manufacturer
 1910-1963
History
 Doug Cutting – Cloudera, Hadoop project founder
 2002 – Nutch
 2004 – Google GFS, MapReduce whitepapers
 2005 – NDFS & MR, Writable & SequenceFile
 2006 – Hadoop split from Nutch, renamed NDFS to
  HDFS
 2007 – Yahoo gets involved, HBase, Pig, Zookeeper
 2008 – TeraSort contest winner, Hive, Mahout,
  Cassandra
 2009 – Oozie, Flume, Hue
History
 Underlying serialization system basically unchanged
 Additional language support and data formats
 Language, data format combinatorial explosion
    C++ JSON to Java BSON
    Python Smile to PHP CSV
 Apr 2009 – Avro proposal
 May 2010 – Top-level project
Sexy Tractors
 Data serialization tools, like tractors, aren’t sexy
 They should be!
 Dollar for dollar storage capacity has increased
  exponentially, doubling every 1.5-2 years
 Throughput of magnetic storage and network has not
  maintained this pace
 Distributed systems are the norm
 Efficient data serialization techniques and tools are
  vital
Project Drivers
 Common data format for serialization and RPC
 Dynamic
 Expressive
 Efficient
 File format
    Well defined
    Standalone
    Splittable & compressed
Biased Comparison

                       CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language Independent   Yes   Yes        No             Yes           Yes
Expressive             No    Yes        Yes            Yes           Yes
Efficient              No    No         Yes            Yes           Yes
Dynamic                Yes   Yes        No             No            Yes
Standalone             ?     Yes        No             No            Yes
Splittable             ?     ?          Yes            ?             Yes
Project Overview
 Specification based design
 Dynamic implementations
 File format
 Schemas
    Must support JSON implementation
    IDL often supported
    Evolvable
 First class Hadoop support
Specification Based Design
 Schemas
 Encoding
 Sort order
 Object container files
 Codecs
 Protocol
 Protocol wire format
 Schema resolution
Specification Based Design
 Schemas
    Primitive types
        Null, boolean, int, long, float, double, bytes, string
    Complex types
      Records, enums, arrays, maps, unions and fixed

    Named types
      Records, enums, fixed
      Name & namespace

    Aliases
    http://avro.apache.org/docs/current/spec.html#schemas
Schema Example
log-message.avpr

{
    "namespace": "com.emoney",
    "name": "LogMessage",
    "type": "record",
    "fields": [
       {"name": "level", "type": "string", "comment" : "this is ignored"},
       {"name": "message", "type": "string", "description" : "this is the message"},
       {"name": "dateTime", "type": "long"},
       {"name": "exceptionMessage", "type": ["null", "string"]}
    ]
}
Specification Based Design
 Encodings
    JSON – for debugging
    Binary
 Sort order
    Efficient sorting by a system other than the writer
    Sorting binary-encoded data without deserialization
Specification Based Design
 Object container files
    Schema
    Serialized data written to binary-encoded blocks
    Blocks may be compressed
    Synchronization markers
 Codecs
    Null
    Deflate
    Snappy (optional)
    LZO (future)
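The block layout above can be sketched with only the JDK: an object count, the payload size, the deflate-compressed payload, then a 16-byte sync marker that lets a reader re-align at any split boundary. This is a toy, not the Avro library; real Avro writes the count and size as zig-zag varlongs, where this sketch uses single bytes for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy sketch of one data block in an object container file:
//   count | size | deflate(payload) | 16-byte sync marker
public class ToyContainerBlock {

    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    public static byte[] inflate(byte[] compressed) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    // count and size as single bytes for brevity; Avro uses zig-zag varlongs
    public static byte[] writeBlock(int count, byte[] payload, byte[] sync) {
        byte[] compressed = deflate(payload);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(count);
        out.write(compressed.length);
        out.write(compressed, 0, compressed.length);
        out.write(sync, 0, sync.length);          // marker makes splits findable
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sync = new byte[16];               // normally random, fixed per file
        StringBuilder records = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            records.append("INFO started\n");
        }
        byte[] block = writeBlock(20, records.toString().getBytes(), sync);
        System.out.println("payload " + records.length()
                + " bytes, block " + block.length + " bytes");
    }
}
```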
Specification Based Design
 Protocol
    Protocol name
    Namespace
    Types
        Named types used in messages
    Messages
        Uniquely named message
        Request
        Response
        Errors
 Wire format
   Transports
   Framing
   Handshake
Protocol
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",

    "types": [
       {"name": "Greeting", "type": "record", "fields": [ {"name": "message", "type": "string"}]},
       {"name": "Curse", "type": "error", "fields": [ {"name": "message", "type": "string"}]} ],

    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting" }],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
Schema Resolution & Evolution
   Writer's schema always provided to the reader
   Compare schema used by writer & schema expected by reader
   Fields that match name & type are read
   Fields written that don’t match are skipped
   Expected fields not written can be identified
      Error or provide default value
 Same features as provided by numeric field ids
    Keeps fields symbolic, no index IDs written in data
 Allows for projections
    Very efficient at skipping fields
 Aliases
    Allows projections from 2 different types
       User transaction
          Count, date
       Batch
          Count, date
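A minimal sketch of the name-based matching rules above, using plain maps rather than the Avro API. The field names come from the earlier LogMessage example; the `host` field and its default are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of Avro-style schema resolution: fields are matched by NAME,
// writer-only fields are skipped, and reader-only fields fall back to a
// default or raise an error. Not the Avro library's API.
public class SchemaResolutionSketch {

    public static Map<String, Object> resolve(
            Map<String, Object> written,           // record as the writer produced it
            List<String> readerFields,             // fields the reader expects
            Map<String, Object> readerDefaults) {  // defaults for unwritten fields
        Map<String, Object> result = new LinkedHashMap<>();
        for (String field : readerFields) {
            if (written.containsKey(field)) {
                result.put(field, written.get(field));        // name matches: read it
            } else if (readerDefaults.containsKey(field)) {
                result.put(field, readerDefaults.get(field)); // not written: default
            } else {
                throw new IllegalStateException("no value or default for " + field);
            }
        }
        // anything in `written` but absent from readerFields is simply skipped
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("level", "WARN");
        written.put("message", "disk nearly full");
        written.put("dateTime", 1305000000000L);

        // Reader projects away "dateTime" and adds "host" with a default.
        Map<String, Object> record = resolve(
                written,
                Arrays.asList("level", "message", "host"),
                Collections.<String, Object>singletonMap("host", "unknown"));
        System.out.println(record);
    }
}
```

Because matching is symbolic, neither side ever writes numeric field IDs into the data, and a projection simply lists fewer reader fields.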
Implementations
   Core – parse schemas, read & write binary data for a schema
   Data file – read & write Avro data files
   Codec – supported codecs
   RPC/HTTP – make and receive calls over HTTP
Implementation   Core   Data file   Codec             RPC/HTTP
C                Yes    Yes         Deflate           Yes
C++              Yes    Yes         ?                 Yes
C#               Yes    No          N/A               No
Java             Yes    Yes         Deflate, Snappy   Yes
Python           Yes    Yes         Deflate           Yes
Ruby             Yes    Yes         Deflate           Yes
PHP              Yes    Yes         ?                 No
API
 Generic
    Generic attribute/value data structure
    Best suited for dynamic processing
 Specific
    Each record corresponds to a different kind of object in the
     programming language
    RPC systems typically use this
 Reflect
    Schemas generated via reflection
    Converting an existing codebase to use Avro
API
 Low-level
    Schema
    Encoders
    DatumWriter
    DatumReader
 High-level
    DataFileWriter
    DataFileReader
Java Example
Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));

OutputStream outputStream = new FileOutputStream("data.avro");

DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));

writer.setCodec(CodecFactory.deflateCodec(1));
writer.create(schema, outputStream);

writer.append(new Message());

writer.close();
Java Example
DataFileReader<Message> reader = new DataFileReader<Message>(
         new File("data.avro"),
         new GenericDatumReader<Message>());

for (Message next : reader) {
  System.out.println("next: " + next);
}
RPC
 Server
    SocketServer (non-standard)
    SaslSocketServer
    HttpServer
    NettyServer
    DatagramServer (non-standard)
 Responder
    Generic
    Reflect
    Specific
 Client
    Corresponding Transceiver
    LocalTransceiver
 Requestor
RPC
 Client
    Corresponding Transceiver for each server
    LocalTransceiver
 Requestor
RPC Server
Protocol protocol = Protocol.parse(new File("protocol.avpr"));

InetSocketAddress address = new InetSocketAddress("localhost", 33333);

GenericResponder responder = new GenericResponder(protocol) {
   @Override
   public Object respond(Protocol.Message message, Object request)
   throws Exception {
     ...
   }
};

new SocketServer(responder, address).join();
Hadoop Support
 File writers and readers
 Replacing RPC with Avro
    In Flume already
 Pig support is in
 Splittable
    Set block size when writing
 Tether jobs
    Connector framework for other languages
    Hadoop Pipes
Future
 RPC
    HBase, Cassandra, Hadoop core
 Hive in progress
 Tether jobs
    Actual MapReduce implementations in other languages
Avro
 Dynamic
 Expressive
 Efficient
 Specification based design
 Language implementations are fairly solid
 Serialization or RPC or both
 First class Hadoop support
 Currently 1.5.1
 Sexy tractors
