Avro


  1. 1. Avro Etymology & History Sexy Tractors Project Drivers & Overview Serialization RPC Hadoop Support
  2. Etymology
     • Avro was a British aircraft manufacturer, 1910–1963
  3. History
     • Doug Cutting – Cloudera, Hadoop project founder
     • 2002 – Nutch
     • 2004 – Google GFS & MapReduce whitepapers
     • 2005 – NDFS & MR; Writable & SequenceFile
     • 2006 – Hadoop split from Nutch; NDFS renamed to HDFS
     • 2007 – Yahoo gets involved; HBase, Pig, ZooKeeper
     • 2008 – TeraSort contest winner; Hive, Mahout, Cassandra
     • 2009 – Oozie, Flume, Hue
  4. History
     • Underlying serialization system basically unchanged
     • Additional language support and data formats added over time
     • Language / data-format support became a combinatorial explosion
       (e.g., C++ and JSON to Java and BSON, Python and Smile to PHP and CSV)
     • Apr 2009 – Avro proposal
     • May 2010 – Top-level Apache project
  5. Sexy Tractors
     • Data serialization tools, like tractors, aren’t sexy – they should be!
     • Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5–2 years
     • Throughput of magnetic storage and network has not maintained this pace
     • Distributed systems are the norm
     • Efficient data serialization techniques and tools are vital
  6. Project Drivers
     • Common data format for serialization and RPC
     • Dynamic
     • Expressive
     • Efficient
     • File format
        – Well defined
        – Standalone
        – Splittable & compressed
  7. Biased Comparison

                          CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
    Language Independent  Yes   Yes        No             Yes           Yes
    Expressive            No    Yes        Yes            Yes           Yes
    Efficient             No    No         Yes            Yes           Yes
    Dynamic               Yes   Yes        No             No            Yes
    Standalone            ?     Yes        No             No            Yes
    Splittable            ?     ?          Yes            ?             Yes
  8. Project Overview
     • Specification based design
     • Dynamic implementations
     • File format
     • Schemas
        – JSON representation must be supported
        – IDL often supported
        – Evolvable
     • First class Hadoop support
  9. Specification Based Design
     • Schemas
     • Encoding
     • Sort order
     • Object container files
     • Codecs
     • Protocol
     • Protocol wire format
     • Schema resolution
  10. Specification Based Design
     • Schemas
        – Primitive types: null, boolean, int, long, float, double, bytes, string
        – Complex types: records, enums, arrays, maps, unions and fixed
        – Named types: records, enums, fixed
           · Name & namespace
           · Aliases
     • http://avro.apache.org/docs/current/spec.html#schemas
  11. Schema Example

log-message.avpr:

{
  "namespace": "com.emoney",
  "name": "LogMessage",
  "type": "record",
  "fields": [
    {"name": "level", "type": "string", "comment": "this is ignored"},
    {"name": "message", "type": "string", "description": "this is the message"},
    {"name": "dateTime", "type": "long"},
    {"name": "exceptionMessage", "type": ["null", "string"]}
  ]
}
  12. Specification Based Design
     • Encodings (sketch below)
        – JSON – for debugging
        – Binary
     • Sort order
        – Efficient sorting by a system other than the writer
        – Sorting binary-encoded data without deserialization
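As a minimal sketch of the two encodings in the Java API (EncoderFactory, GenericDatumWriter, and friends are standard Avro classes; the schema.avpr resource is assumed to hold the LogMessage schema from slide 11):

Schema schema = new Schema.Parser().parse(getClass().getResourceAsStream("schema.avpr"));

GenericRecord record = new GenericData.Record(schema);
record.put("level", "INFO");
record.put("message", "hello");
record.put("dateTime", System.currentTimeMillis());
record.put("exceptionMessage", null);

DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);

// Binary encoding: compact, what you store and ship
ByteArrayOutputStream binary = new ByteArrayOutputStream();
Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(binary, null);
writer.write(record, binaryEncoder);
binaryEncoder.flush();

// JSON encoding: human-readable, handy for debugging
Encoder jsonEncoder = EncoderFactory.get().jsonEncoder(schema, System.out);
writer.write(record, jsonEncoder);
jsonEncoder.flush();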
  13. Specification Based Design
     • Object container files
        – Schema
        – Serialized data written to binary-encoded blocks
        – Blocks may be compressed
        – Synchronization markers
     • Codecs
        – Null
        – Deflate
        – Snappy (optional)
        – LZO (future)
  14. Specification Based Design
     • Protocol
        – Protocol name
        – Namespace
        – Types
           · Named types used in messages
        – Messages
           · Uniquely named message
           · Request
           · Response
           · Errors
     • Wire format
        – Transports
        – Framing
        – Handshake
  15. Protocol

{
  "namespace": "com.acme",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",

  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],

  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting"}],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
  16. Schema Resolution & Evolution
     • Writer’s schema is always provided to the reader
     • Compare the schema used by the writer with the schema expected by the reader
     • Fields that match on name & type are read
     • Fields written that don’t match are skipped
     • Expected fields not written can be identified
        – Error, or provide a default value
     • Same features as numeric field IDs provide
        – Keeps fields symbolic; no index IDs written in the data
     • Allows for projections (see the sketch below)
        – Very efficient at skipping fields
     • Aliases
        – Allow projections from 2 different types
        – e.g., user transaction (count, date) and batch (count, date)
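For example, a projection in the Java API might look like the following sketch. The reader schema below is a hypothetical subset of the LogMessage schema from slide 11; the writer’s schema comes from the data file itself, and GenericDatumReader resolves one against the other:

// Reader schema: only the field we care about
Schema readerSchema = new Schema.Parser().parse(
    "{\"type\": \"record\", \"name\": \"LogMessage\", \"namespace\": \"com.emoney\"," +
    " \"fields\": [{\"name\": \"message\", \"type\": \"string\"}]}");

GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
datumReader.setExpected(readerSchema);

// The writer's schema is read from the file header and resolved against
// the expected schema; all other fields are skipped efficiently
DataFileReader<GenericRecord> fileReader =
    new DataFileReader<GenericRecord>(new File("data.avro"), datumReader);
for (GenericRecord record : fileReader) {
  System.out.println(record.get("message"));
}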
  17. Implementations
     • Core – parse schemas, read & write binary data for a schema
     • Data file – read & write Avro data files
     • Codec – supported codecs
     • RPC/HTTP – make and receive calls over HTTP

    Implementation   Core   Data file   Codec             RPC/HTTP
    C                Yes    Yes         Deflate           Yes
    C++              Yes    Yes         ?                 Yes
    C#               Yes    No          N/A               No
    Java             Yes    Yes         Deflate, Snappy   Yes
    Python           Yes    Yes         Deflate           Yes
    Ruby             Yes    Yes         Deflate           Yes
    PHP              Yes    Yes         ?                 No
  18. 18. API Generic  Generic attribute/value data structure  Best suited for dynamic processing Specific  Each record corresponds to a different kind of object in the programming language  RPC systems typically use this Reflect  Schemas generated via reflection  Converting an existing codebase to use Avro
  19. 19. API Low-level  Schema  Encoders  DatumWriter  DatumReader High-level  DataFileWriter  DataFileReader
  20. Java Example

Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));
OutputStream outputStream = new FileOutputStream("data.avro");

DataFileWriter<Message> writer =
    new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));
writer.setCodec(CodecFactory.deflateCodec(1));
writer.create(schema, outputStream);
writer.append(new Message());
writer.close();
  21. Java Example

DataFileReader<Message> reader = new DataFileReader<Message>(
    new File("data.avro"), new GenericDatumReader<Message>());
for (Message next : reader) {
  System.out.println("next: " + next);
}
  22. 22. RPC Server  SocketServer (non-standard)  SaslSocketServer  HttpServer  NettyServer  DatagramServer (non-standard) Responder  Generic  Reflect  Specific Client  Corresponding Transceiver  LocalTransceiver Requestor
  23. 23. RPC Client  Corresponding Transceiver for each server  LocalTransceiver Requestor
  24. RPC Server

Protocol protocol = Protocol.parse(new File("protocol.avpr"));
InetSocketAddress address = new InetSocketAddress("localhost", 33333);

GenericResponder responder = new GenericResponder(protocol) {
  @Override
  public Object respond(Protocol.Message message, Object request) throws Exception {
    ...
  }
};

new SocketServer(responder, address).join();
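A matching client sketch, assuming the HelloWorld protocol from slide 15 (SocketTransceiver pairs with the non-standard SocketServer above):

Protocol protocol = Protocol.parse(new File("protocol.avpr"));
Transceiver transceiver = new SocketTransceiver(new InetSocketAddress("localhost", 33333));
GenericRequestor requestor = new GenericRequestor(protocol, transceiver);

// Build the request record for the "hello" message
GenericRecord greeting = new GenericData.Record(protocol.getType("Greeting"));
greeting.put("message", "hi there");
GenericRecord request = new GenericData.Record(protocol.getMessages().get("hello").getRequest());
request.put("greeting", greeting);

Object response = requestor.request("hello", request);
System.out.println("response: " + response);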
  25. Hadoop Support
     • File writers and readers
     • Replacing RPC with Avro
        – In Flume already
     • Pig support is in
     • Splittable
        – Set the block size when writing (sketch below)
     • Tether jobs
        – Connector framework for other languages
        – Analogous to Hadoop Pipes
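For splittability, the block boundary is controlled on the writer; a hedged sketch using DataFileWriter.setSyncInterval (the 1 MB figure is an arbitrary illustration, not a recommendation from the deck):

DataFileWriter<GenericRecord> writer =
    new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
writer.setSyncInterval(1024 * 1024); // approximate bytes between sync markers
writer.create(schema, new FileOutputStream("data.avro"));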
  26. Future
     • RPC
        – HBase, Cassandra, Hadoop core
     • Hive support in progress
     • Tether jobs
        – Actual MapReduce implementations in other languages
  27. 27. Avro Dynamic Expressive Efficient Specification based design Language implementations are fairly solid Serialization or RPC or both First class Hadoop support Currently 1.5.1 Sexy tractors
