
Avro



  1. Avro
     • Etymology & History
     • Sexy Tractors
     • Project Drivers & Overview
     • Serialization
     • RPC
     • Hadoop Support
  2. Etymology
     • Avro: a British aircraft manufacturer, 1910–1963
  3. History
     • Doug Cutting – Cloudera, Hadoop project founder
     • 2002 – Nutch
     • 2004 – Google GFS and MapReduce whitepapers
     • 2005 – NDFS & MapReduce, Writable & SequenceFile
     • 2006 – Hadoop split from Nutch; NDFS renamed to HDFS
     • 2007 – Yahoo gets involved; HBase, Pig, ZooKeeper
     • 2008 – Terasort contest winner; Hive, Mahout, Cassandra
     • 2009 – Oozie, Flume, Hue
  4. History
     • The underlying serialization system is basically unchanged
     • Additional language support and data formats create a combinatorial explosion: every language (C++, Java, Python, PHP, …) crossed with every format (JSON, BSON, Smile, CSV, …)
     • Apr 2009 – Avro proposal
     • May 2010 – top-level Apache project
  5. Sexy Tractors
     • Data serialization tools, like tractors, aren't sexy. They should be!
     • Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5–2 years
     • Throughput of magnetic storage and networks has not maintained this pace
     • Distributed systems are the norm
     • Efficient data serialization techniques and tools are vital
  6. Project Drivers
     • A common data format for serialization and RPC
     • Dynamic
     • Expressive
     • Efficient
     • A well-defined, standalone file format that is splittable & compressed
  7. Biased Comparison

                            CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
     Language independent   Yes   Yes        No             Yes           Yes
     Expressive             No    Yes        Yes            Yes           Yes
     Efficient              No    No         Yes            Yes           Yes
     Dynamic                Yes   Yes        No             No            Yes
     Standalone             ?     Yes        No             No            Yes
     Splittable             ?     ?          Yes            ?             Yes
  8. Project Overview
     • Specification-based design
     • Dynamic implementations
     • File format
     • Schemas: defined in JSON; an IDL is often supported; evolvable
     • First-class Hadoop support
  9. Specification-Based Design
     • Schemas
     • Encoding
     • Sort order
     • Object container files
     • Codecs
     • Protocol
     • Protocol wire format
     • Schema resolution
  10. Specification-Based Design: Schemas
      • Primitive types: null, boolean, int, long, float, double, bytes, string
      • Complex types: records, enums, arrays, maps, unions, and fixed
      • Named types (records, enums, fixed): have a name & namespace, and may declare aliases
      • See http://avro.apache.org/docs/current/spec.html#schemas
  11. Schema Example: log-message.avpr

      {
        "namespace": "com.emoney",
        "name": "LogMessage",
        "type": "record",
        "fields": [
          {"name": "level", "type": "string", "comment": "this is ignored"},
          {"name": "message", "type": "string", "description": "this is the message"},
          {"name": "dateTime", "type": "long"},
          {"name": "exceptionMessage", "type": ["null", "string"]}
        ]
      }
  12. Specification-Based Design: Encodings & Sort Order
      • Encodings: JSON (for debugging) and binary
      • Sort order: enables efficient sorting by a system other than the writer, and sorting of binary-encoded data without deserialization (see the sketch below)
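      A minimal sketch of that last point in the Java API: two records are binary-encoded, then compared byte-for-byte with BinaryData.compare, with no deserialization. The Pair schema here is hypothetical, not from the deck.

      import java.io.ByteArrayOutputStream;
      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.io.BinaryData;
      import org.apache.avro.io.BinaryEncoder;
      import org.apache.avro.io.EncoderFactory;

      public class SortOrderSketch {
        // Encode a record to Avro binary.
        static byte[] encode(Schema schema, GenericRecord record) throws Exception {
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
          encoder.flush();
          return out.toByteArray();
        }

        public static void main(String[] args) throws Exception {
          // Hypothetical one-field record; record fields sort in declaration order.
          Schema schema = Schema.parse(
              "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":"
            + "[{\"name\":\"key\",\"type\":\"string\"}]}");
          GenericRecord a = new GenericData.Record(schema);
          a.put("key", "apple");
          GenericRecord b = new GenericData.Record(schema);
          b.put("key", "banana");
          // Compare the encoded bytes directly, without deserializing either record.
          int cmp = BinaryData.compare(encode(schema, a), 0, encode(schema, b), 0, schema);
          System.out.println(cmp < 0 ? "a sorts before b" : "b sorts before a");
        }
      }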
  13. Specification-Based Design: Object Container Files
      • The schema is stored in the file
      • Serialized data is written to binary-encoded blocks
      • Blocks may be compressed
      • Synchronization markers between blocks
      • Codecs: null, deflate, snappy (optional), LZO (future)
  14. Specification-Based Design: Protocol
      • Protocol name
      • Namespace
      • Types: the named types used in messages
      • Messages: each uniquely named, with a request, a response, and errors
      • Wire format: transports, framing, handshake
  15. Protocol Example

      {
        "namespace": "com.acme",
        "protocol": "HelloWorld",
        "doc": "Protocol Greetings",
        "types": [
          {"name": "Greeting", "type": "record", "fields": [
            {"name": "message", "type": "string"}]},
          {"name": "Curse", "type": "error", "fields": [
            {"name": "message", "type": "string"}]}
        ],
        "messages": {
          "hello": {
            "doc": "Say hello.",
            "request": [{"name": "greeting", "type": "Greeting"}],
            "response": "Greeting",
            "errors": ["Curse"]
          }
        }
      }
  16. Schema Resolution & Evolution
      • The writer's schema is always provided to the reader
      • The schema used by the writer is compared with the schema the reader expects
      • Fields that match by name & type are read
      • Written fields that don't match are skipped
      • Expected fields that were not written can be identified: raise an error or supply a default value
      • Provides the same features as numeric field IDs while keeping fields symbolic – no index IDs are written into the data
      • Allows projections, which are very efficient at skipping fields (see the sketch below)
      • Aliases allow projections from two different types, e.g. a user transaction (count, date) and a batch (count, date) can be projected onto one reader schema
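      A minimal sketch of schema resolution in the Java generic API: the reader is constructed with both the writer's schema and its own narrower schema, so the unread field is skipped during decoding. The Txn schema and field names are hypothetical.

      import java.io.ByteArrayOutputStream;
      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumReader;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.io.BinaryEncoder;
      import org.apache.avro.io.Decoder;
      import org.apache.avro.io.DecoderFactory;
      import org.apache.avro.io.EncoderFactory;

      public class ProjectionSketch {
        public static void main(String[] args) throws Exception {
          // The writer's schema has two fields ...
          Schema writerSchema = Schema.parse(
              "{\"type\":\"record\",\"name\":\"Txn\",\"fields\":["
            + "{\"name\":\"count\",\"type\":\"int\"},"
            + "{\"name\":\"date\",\"type\":\"long\"}]}");
          // ... the reader expects only "count"; "date" will be skipped.
          Schema readerSchema = Schema.parse(
              "{\"type\":\"record\",\"name\":\"Txn\",\"fields\":["
            + "{\"name\":\"count\",\"type\":\"int\"}]}");

          GenericRecord written = new GenericData.Record(writerSchema);
          written.put("count", 42);
          written.put("date", 1234567890L);

          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(writerSchema).write(written, encoder);
          encoder.flush();

          // Resolution happens here: both schemas are handed to the reader.
          GenericDatumReader<GenericRecord> reader =
              new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
          Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
          GenericRecord projected = reader.read(null, decoder);
          System.out.println(projected.get("count")); // 42; "date" was never materialized
        }
      }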
  17. Implementations
      • Core – parse schemas, read & write binary data for a schema
      • Data file – read & write Avro data files
      • Codec – supported codecs
      • RPC/HTTP – make and receive calls over HTTP

      Implementation   Core   Data file   Codec             RPC/HTTP
      C                Yes    Yes         Deflate           Yes
      C++              Yes    Yes         ?                 Yes
      C#               Yes    No          N/A               No
      Java             Yes    Yes         Deflate, Snappy   Yes
      Python           Yes    Yes         Deflate           Yes
      Ruby             Yes    Yes         Deflate           Yes
      PHP              Yes    Yes         ?                 No
  18. API
      • Generic – a generic attribute/value data structure; best suited for dynamic processing
      • Specific – each record corresponds to a different kind of object in the programming language; RPC systems typically use this
      • Reflect – schemas generated via reflection; useful when converting an existing codebase to Avro (see the sketch below)
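      A minimal sketch of the reflect API: an existing plain Java class is introspected to produce its Avro schema. The LogEvent class is hypothetical.

      import org.apache.avro.Schema;
      import org.apache.avro.reflect.ReflectData;

      public class ReflectSketch {
        // A pre-existing class, unchanged except for being introspected.
        public static class LogEvent {
          public String level;
          public String message;
          public long dateTime;
        }

        public static void main(String[] args) {
          // Derive the Avro schema from the class via reflection.
          Schema schema = ReflectData.get().getSchema(LogEvent.class);
          System.out.println(schema.toString(true)); // pretty-printed JSON schema
        }
      }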
  19. API
      • Low-level: Schema, Encoders, DatumWriter, DatumReader (see the sketch below)
      • High-level: DataFileWriter, DataFileReader (see the Java examples on the next two slides)
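      A minimal sketch of the low-level API, here using the JSON encoder from slide 12 to dump a record in human-readable form; the schema and values are hypothetical:

      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.io.Encoder;
      import org.apache.avro.io.EncoderFactory;

      public class JsonDebugSketch {
        public static void main(String[] args) throws Exception {
          Schema schema = Schema.parse(
              "{\"type\":\"record\",\"name\":\"LogMessage\",\"fields\":["
            + "{\"name\":\"level\",\"type\":\"string\"},"
            + "{\"name\":\"message\",\"type\":\"string\"}]}");
          GenericRecord record = new GenericData.Record(schema);
          record.put("level", "INFO");
          record.put("message", "hello");
          // jsonEncoder writes the same datum as readable JSON instead of binary.
          Encoder encoder = EncoderFactory.get().jsonEncoder(schema, System.out);
          new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
          encoder.flush();
        }
      }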
  20. Java Example

      Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));
      OutputStream outputStream = new FileOutputStream("data.avro");
      DataFileWriter<Message> writer =
          new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));
      writer.setCodec(CodecFactory.deflateCodec(1));
      writer.create(schema, outputStream);
      writer.append(new Message());
      writer.close();
  21. Java Example

      DataFileReader<Message> reader = new DataFileReader<Message>(
          new File("data.avro"), new GenericDatumReader<Message>());
      for (Message next : reader) {
        System.out.println("next: " + next);
      }
  22. RPC
      • Server: SocketServer (non-standard), SaslSocketServer, HttpServer, NettyServer, DatagramServer (non-standard)
      • Responder: Generic, Reflect, Specific
      • Client: a corresponding Transceiver, LocalTransceiver; Requestor
  23. RPC Client
      • A corresponding Transceiver for each server type
      • LocalTransceiver for in-process calls
      • Requestor
  24. RPC Server Example

      Protocol protocol = Protocol.parse(new File("protocol.avpr"));
      InetSocketAddress address = new InetSocketAddress("localhost", 33333);
      GenericResponder responder = new GenericResponder(protocol) {
        @Override
        public Object respond(Protocol.Message message, Object request)
            throws Exception {
          ...
        }
      };
      new SocketServer(responder, address).join();
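      A matching client sketch, assuming the HelloWorld protocol from slide 15; the generic requestor builds the request record dynamically. Package locations for the IPC classes vary slightly between Avro versions.

      import java.io.File;
      import java.net.InetSocketAddress;
      import org.apache.avro.Protocol;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.ipc.SocketTransceiver;
      import org.apache.avro.ipc.Transceiver;
      import org.apache.avro.ipc.generic.GenericRequestor;

      public class HelloWorldClientSketch {
        public static void main(String[] args) throws Exception {
          Protocol protocol = Protocol.parse(new File("protocol.avpr"));
          Transceiver transceiver =
              new SocketTransceiver(new InetSocketAddress("localhost", 33333));
          GenericRequestor requestor = new GenericRequestor(protocol, transceiver);

          // Build the request record for the "hello" message dynamically.
          GenericRecord greeting =
              new GenericData.Record(protocol.getType("Greeting"));
          greeting.put("message", "Hello, server!");
          GenericRecord params = new GenericData.Record(
              protocol.getMessages().get("hello").getRequest());
          params.put("greeting", greeting);

          Object response = requestor.request("hello", params);
          System.out.println(response); // a Greeting record
          transceiver.close();
        }
      }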
  25. Hadoop Support
      • File writers and readers
      • Replacing Hadoop RPC with Avro – already done in Flume
      • Pig support is in
      • Splittable files: set the block size when writing (see the sketch below)
      • Tether jobs: a connector framework for other languages, comparable to Hadoop Pipes
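      A minimal sketch of the "set block size when writing" point, assuming DataFileWriter#setSyncInterval (present in recent Avro releases): it controls the approximate number of bytes between sync markers, which are the only points where a file can be split. The schema and file name are hypothetical.

      import java.io.File;
      import org.apache.avro.Schema;
      import org.apache.avro.file.DataFileWriter;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericDatumWriter;
      import org.apache.avro.generic.GenericRecord;

      public class SplittableWriteSketch {
        public static void main(String[] args) throws Exception {
          Schema schema = Schema.parse(
              "{\"type\":\"record\",\"name\":\"Line\",\"fields\":"
            + "[{\"name\":\"text\",\"type\":\"string\"}]}");
          DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
              new GenericDatumWriter<GenericRecord>(schema));
          // Approximate bytes per block; splits can only occur at sync markers.
          writer.setSyncInterval(64 * 1024);
          writer.create(schema, new File("lines.avro"));
          for (int i = 0; i < 100000; i++) {
            GenericRecord r = new GenericData.Record(schema);
            r.put("text", "record " + i);
            writer.append(r);
          }
          writer.close();
        }
      }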
  26. Future
      • RPC: HBase, Cassandra, Hadoop core; Hive is in progress
      • Tether jobs: actual MapReduce implementations in other languages
  27. Avro
      • Dynamic, expressive, efficient
      • Specification-based design
      • Language implementations are fairly solid
      • Serialization, RPC, or both
      • First-class Hadoop support
      • Currently at version 1.5.1
      • Sexy tractors
