    Avro Presentation Transcript

    • Avro Etymology & History Sexy Tractors Project Drivers & Overview Serialization RPC Hadoop Support
    • Etymology British aircraft manufacturer 1910-1963
    • History Doug Cutting – Cloudera, Hadoop project founder 2002 – Nutch 2004 – Google GFS, MapReduce whitepapers 2005 – NDFS & MR, Writable & SequenceFile 2006 – Hadoop split from Nutch, renamed NDFS to HDFS 2007 – Yahoo gets involved, HBase, Pig, Zookeeper 2008 – Terrasort contest winner, Hive, Mahout, Cassandra 2009 – Oozie, Flume, Hue
    • History Underlying serialization system basically unchanged Additional language support and data formats Language, data format combinatorial explosion  C++ JSON to Java BSON  Python Smile to PHP CSV Apr 2009 – Avro proposal May 2010 – Top-level project
    • Sexy Tractors
      - Data serialization tools, like tractors, aren't sexy – they should be!
      - Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5-2 years
      - Throughput of magnetic storage and networks has not kept this pace
      - Distributed systems are the norm
      - Efficient data serialization techniques and tools are vital
    • Project Drivers
      - Common data format for serialization and RPC
      - Dynamic
      - Expressive
      - Efficient
      - File format: well defined, standalone, splittable & compressed
    • Biased Comparison

                             CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
      Language independent   Yes   Yes        No             Yes           Yes
      Expressive             No    Yes        Yes            Yes           Yes
      Efficient              No    No         Yes            Yes           Yes
      Dynamic                Yes   Yes        No             No            Yes
      Standalone             ?     Yes        No             No            Yes
      Splittable             ?     ?          Yes            ?             Yes
    • Project Overview
      - Specification-based design
      - Dynamic implementations
      - File format
      - Schemas: must support JSON implementation; IDL often supported; evolvable
      - First-class Hadoop support
    • Specification Based Design Schemas Encoding Sort order Object container files Codecs Protocol Protocol write format Schema resolution
    • Specification Based Design Schemas  Primitive types  Null, boolean, int, long, float, double, bytes, string  Complex types  Records, enums, arrays, maps, unions and fixed  Named types  Records, enums, fixed  Name & namespace  Aliases  http://avro.apache.org/docs/current/spec.html#schema s
    • Schema Example: log-message.avpr

      {
        "namespace": "com.emoney",
        "name": "LogMessage",
        "type": "record",
        "fields": [
          {"name": "level", "type": "string", "comment": "this is ignored"},
          {"name": "message", "type": "string", "description": "this is the message"},
          {"name": "dateTime", "type": "long"},
          {"name": "exceptionMessage", "type": ["null", "string"]}
        ]
      }
    • Specification Based Design Encodings  JSON – for debugging  Binary Sort order  Efficient sorting by system other than writer  Sorting binary-encoded data without deserialization
    • Specification Based Design Object container files  Schema  Serialized data written to binary-encoded blocks  Blocks may be compressed  Synchronization markers Codecs  Null  Deflate  Snappy (optional)  LZO (future)
    • Specification Based Design Protocol  Protocol name  Namespace  Types  Named types used in messages  Messages  Uniquely named message  Request  Response  Errors Wire format  Transports  Framing  Handshake
    • Protocol Example

      {
        "namespace": "com.acme",
        "protocol": "HelloWorld",
        "doc": "Protocol Greetings",
        "types": [
          {"name": "Greeting", "type": "record", "fields": [
            {"name": "message", "type": "string"}]},
          {"name": "Curse", "type": "error", "fields": [
            {"name": "message", "type": "string"}]}
        ],
        "messages": {
          "hello": {
            "doc": "Say hello.",
            "request": [{"name": "greeting", "type": "Greeting"}],
            "response": "Greeting",
            "errors": ["Curse"]
          }
        }
      }
    • Schema Resolution & Evolution
      - The writer's schema is always provided to the reader
      - The schema used by the writer is compared with the schema the reader expects
      - Fields that match by name & type are read
      - Written fields that don't match are skipped
      - Expected fields that were not written can be identified: raise an error or supply a default value
      - Same features as numeric field IDs provide, but fields stay symbolic – no index IDs are written into the data
      - Allows projections: very efficient at skipping fields
      - Aliases: allow projections from two different types, e.g. a user transaction (count, date) and a batch (count, date)
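The matching rules above can be sketched as a small table-driven check (a hypothetical helper for illustration, not Avro's actual API): each field is read, skipped, defaulted, or rejected depending on whether the writer, the reader, or both declare it.

```java
import java.util.*;

public class Resolution {
    // Per-field resolution outcome, following the rules above:
    // matched by name -> READ; writer-only -> SKIP (projection);
    // reader-only with a default -> DEFAULT; reader-only without -> ERROR.
    // readerDefaults maps field name -> default value, or null for "no default".
    static Map<String, String> resolve(Set<String> writerFields,
                                       Map<String, String> readerDefaults) {
        Map<String, String> actions = new LinkedHashMap<String, String>();
        for (String f : writerFields)
            actions.put(f, readerDefaults.containsKey(f) ? "READ" : "SKIP");
        for (Map.Entry<String, String> e : readerDefaults.entrySet())
            if (!writerFields.contains(e.getKey()))
                actions.put(e.getKey(), e.getValue() != null
                        ? "DEFAULT(" + e.getValue() + ")" : "ERROR");
        return actions;
    }

    public static void main(String[] args) {
        // Writer wrote {level, message, dateTime}; reader expects
        // {message, dateTime, host} where only host has a default.
        Set<String> writer = new LinkedHashSet<String>(
                Arrays.asList("level", "message", "dateTime"));
        Map<String, String> reader = new LinkedHashMap<String, String>();
        reader.put("message", null);
        reader.put("dateTime", null);
        reader.put("host", "unknown");
        System.out.println(resolve(writer, reader));
    }
}
```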
    • Implementations
      - Core – parse schemas, read & write binary data for a schema
      - Data file – read & write Avro data files
      - Codec – supported codecs
      - RPC/HTTP – make and receive calls over HTTP

      Implementation   Core   Data file   Codec             RPC/HTTP
      C                Yes    Yes         Deflate           Yes
      C++              Yes    Yes         ?                 Yes
      C#               Yes    No          N/A               No
      Java             Yes    Yes         Deflate, Snappy   Yes
      Python           Yes    Yes         Deflate           Yes
      Ruby             Yes    Yes         Deflate           Yes
      PHP              Yes    Yes         ?                 No
    • API Generic  Generic attribute/value data structure  Best suited for dynamic processing Specific  Each record corresponds to a different kind of object in the programming language  RPC systems typically use this Reflect  Schemas generated via reflection  Converting an existing codebase to use Avro
    • API Low-level  Schema  Encoders  DatumWriter  DatumReader High-level  DataFileWriter  DataFileReader
    • Java Example: Writing

      Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));
      OutputStream outputStream = new FileOutputStream("data.avro");
      DataFileWriter<Message> writer = new DataFileWriter<Message>(
          new GenericDatumWriter<Message>(schema));
      writer.setCodec(CodecFactory.deflateCodec(1));
      writer.create(schema, outputStream);
      writer.append(new Message());
      writer.close();
    • Java Example: Reading

      DataFileReader<Message> reader = new DataFileReader<Message>(
          new File("data.avro"), new GenericDatumReader<Message>());
      for (Message next : reader) {
        System.out.println("next: " + next);
      }
    • RPC Server  SocketServer (non-standard)  SaslSocketServer  HttpServer  NettyServer  DatagramServer (non-standard) Responder  Generic  Reflect  Specific Client  Corresponding Transceiver  LocalTransceiver Requestor
    • RPC Client  Corresponding Transceiver for each server  LocalTransceiver Requestor
    • RPC Server Example

      Protocol protocol = Protocol.parse(new File("protocol.avpr"));
      InetSocketAddress address = new InetSocketAddress("localhost", 33333);
      GenericResponder responder = new GenericResponder(protocol) {
        @Override
        public Object respond(Protocol.Message message, Object request) throws Exception {
          ...
        }
      };
      new SocketServer(responder, address).join();
    • Hadoop Support
      - File writers and readers
      - Replacing Hadoop RPC with Avro (already done in Flume)
      - Pig support is in
      - Splittable: set the block size when writing
      - Tether jobs: connector framework for other languages, like Hadoop Pipes
    • Future
      - RPC: HBase, Cassandra, Hadoop core; Hive in progress
      - Tether jobs: actual MapReduce implementations in other languages
    • Avro Dynamic Expressive Efficient Specification based design Language implementations are fairly solid Serialization or RPC or both First class Hadoop support Currently 1.5.1 Sexy tractors