5. History
Underlying serialization system basically unchanged
Additional language support and data formats
Combinatorial explosion of languages × data formats
C++ JSON to Java BSON
Python Smile to PHP CSV
Apr 2009 – Avro proposal
May 2010 – Top-level project
6. Sexy Tractors
Data serialization tools, like tractors, aren’t sexy
They should be!
Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5-2 years
Throughput of magnetic storage and networks has not maintained this pace
Distributed systems are the norm
Efficient data serialization techniques and tools are vital
7. Project Drivers
Common data format for serialization and RPC
Dynamic
Expressive
Efficient
File format
Well defined
Standalone
Splittable & compressed
8. Biased Comparison
                      CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language independent  Yes   Yes        No             Yes           Yes
Expressive            No    Yes        Yes            Yes           Yes
Efficient             No    No         Yes            Yes           Yes
Dynamic               Yes   Yes        No             No            Yes
Standalone            ?     Yes        No             No            Yes
Splittable            ?     ?          Yes            ?             Yes
9. Project Overview
Specification based design
Dynamic implementations
File format
Schemas
Must support JSON implementation
IDL often supported
Evolvable
First class Hadoop support
10. Specification Based Design
Schemas
Encoding
Sort order
Object container files
Codecs
Protocol
Protocol write format
Schema resolution
11. Specification Based Design
Schemas
Primitive types
Null, boolean, int, long, float, double, bytes, string
Complex types
Records, enums, arrays, maps, unions and fixed
Named types
Records, enums, fixed
Name & namespace
Aliases
http://avro.apache.org/docs/current/spec.html#schemas
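As a concrete illustration, a record schema combining primitive types, a complex type (array), and a union might look like this (the record and field names here are hypothetical):

```json
{
  "type": "record",
  "name": "Message",
  "namespace": "org.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "body", "type": "string"},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "replyTo", "type": ["null", "long"], "default": null}
  ]
}
```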
13. Specification Based Design
Encodings
JSON – for debugging
Binary
Sort order
Efficient sorting by a system other than the writer
Sorting binary-encoded data without deserialization
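The binary encoding of int and long values can be sketched without any Avro dependency: Avro writes them as zig-zag integers in variable-length (base-128) form, the same scheme Protocol Buffers uses for signed varints. A minimal sketch (class and method names are illustrative, not Avro API):

```java
public class ZigZagDemo {
    // Zig-zag maps signed values to unsigned so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Emit the zig-zag value as a little-endian base-128 varint:
    // 7 payload bits per byte, high bit set on all but the last byte.
    static byte[] varint(long v) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // -1 zig-zags to 1 and fits in a single byte
        System.out.println(varint(zigZag(-1)).length);
    }
}
```

Because small absolute values encode to short byte sequences, common ints and longs usually take one or two bytes on disk.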
14. Specification Based Design
Object container files
Schema
Serialized data written to binary-encoded blocks
Blocks may be compressed
Synchronization markers
Codecs
Null
Deflate
Snappy (optional)
LZO (future)
15. Specification Based Design
Protocol
Protocol name
Namespace
Types
Named types used in messages
Messages
Uniquely named message
Request
Response
Errors
Wire format
Transports
Framing
Handshake
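Putting the elements above together, a minimal .avpr protocol file might look like this (the protocol, record, and message names are hypothetical):

```json
{
  "protocol": "MessageService",
  "namespace": "org.example",
  "types": [
    {"type": "record", "name": "Message",
     "fields": [{"name": "body", "type": "string"}]}
  ],
  "messages": {
    "send": {
      "request": [{"name": "msg", "type": "Message"}],
      "response": "string"
    }
  }
}
```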
17. Schema Resolution & Evolution
Writer's schema is always provided to the reader
Compare schema used by writer & schema expected by reader
Fields that match name & type are read
Fields written that don’t match are skipped
Expected fields not written can be identified
Error or provide default value
Same features as provided by numeric field ids
Keeps fields symbolic, no index IDs written in data
Allows for projections
Very efficient at skipping fields
Aliases
Allows projections from two different types using aliases
User transaction: count, date
Batch: count, date
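Resolution can be sketched with a pair of schemas (field names and defaults here are hypothetical). Writer's schema:

```json
{"type": "record", "name": "Batch", "fields": [
  {"name": "count", "type": "int"},
  {"name": "date", "type": "long"}
]}
```

Reader's schema, where "total" picks up the old "count" field via an alias and the new "source" field falls back to its default:

```json
{"type": "record", "name": "Batch", "fields": [
  {"name": "total", "type": "int", "aliases": ["count"]},
  {"name": "date", "type": "long"},
  {"name": "source", "type": "string", "default": "unknown"}
]}
```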
18. Implementations
Core – parse schemas, read & write binary data for a schema
Data file – read & write Avro data files
Codec – supported codecs
RPC/HTTP – make and receive calls over HTTP
Implementation   Core   Data file   Codec             RPC/HTTP
C                Yes    Yes         Deflate           Yes
C++              Yes    Yes         ?                 Yes
C#               Yes    No          N/A               No
Java             Yes    Yes         Deflate, Snappy   Yes
Python           Yes    Yes         Deflate           Yes
Ruby             Yes    Yes         Deflate           Yes
PHP              Yes    Yes         ?                 No
19. API
Generic
Generic attribute/value data structure
Best suited for dynamic processing
Specific
Each record corresponds to a different kind of object in the programming language
RPC systems typically use this
Reflect
Schemas generated via reflection
Converting an existing codebase to use Avro
21. Java Example
Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));
OutputStream outputStream = new FileOutputStream("data.avro");
DataFileWriter<Message> writer =
new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));
writer.setCodec(CodecFactory.deflateCodec(1));
writer.create(schema, outputStream);
writer.append(new Message ());
writer.close();
22. Java Example
// Iterate over the records in the file; the reader uses the schema embedded in it
DataFileReader<Message> reader = new DataFileReader<Message>(
    new File("data.avro"),
    new GenericDatumReader<Message>());
for (Message next : reader) {
    System.out.println("next: " + next);
}
24. RPC
Client
Corresponding Transceiver for each server
LocalTransceiver
Requestor
25. RPC Server
// Parse the protocol, then serve it on a socket with a generic responder
Protocol protocol = Protocol.parse(new File("protocol.avpr"));
InetSocketAddress address = new InetSocketAddress("localhost", 33333);
GenericResponder responder = new GenericResponder(protocol) {
    @Override
    public Object respond(Protocol.Message message, Object request)
            throws Exception {
        ...
    }
};
new SocketServer(responder, address).join();
26. Hadoop Support
File writers and readers
Replacing RPC with Avro
In Flume already
Pig support is in
Splittable
Set block size when writing
Tether jobs
Connector framework for other languages
Hadoop Pipes
27. Future
RPC
HBase, Cassandra, Hadoop core
Hive in progress
Tether jobs
Actual MapReduce implementations in other languages
28. Avro
Dynamic
Expressive
Efficient
Specification based design
Language implementations are fairly solid
Serialization or RPC or both
First class Hadoop support
Currently 1.5.1
Sexy tractors