Avro intro


An introduction to Apache Avro

Published in: Software, Technology

  1. A Brief Introduction -- Randy Abernethy, rx-m llc, 2014
  2. unstructured data
     • Databases provide their own internal serialization
       – A fast but completely closed system
       – Not well suited to unstructured data
     • Systems that must process arbitrary datasets generated by a wide range of programs need open solutions
       – XML: gigantic, glacial to parse
       – JSON: big, slow to parse
       – >> Something else in the middle? <<
       – Binary: small, fast and inflexible
     • Size matters!
       – 1 gigabyte or 6 may not matter
       – 1 petabyte or 6 usually matters
     $ ls -l order.*
     -rw-r--r-- 1 tm tm 152 May 28 02:12 order.bin
     -rw-r--r-- 1 tm tm 798 May 28 02:05 order.json
     -rw-r--r-- 1 tm tm 958 May 28 02:04 order.xml
  3. avro
     • Apache Avro
       – Apache Avro is a data serialization system
     • What?
       – Data Serialization: Apache Avro is first and foremost an excellent means for reading and writing data structures in an open and dynamic way
       – Compact: Apache Avro uses an efficient binary data format
       – Fast: Apache Avro data can be parsed quickly with low overhead
       – RPC Support: Apache Avro supplies RPC mechanisms that use the Apache Avro serialization system for platform and language independence
       – Reach: Apache Avro supports a wide range of languages (some supported outside of the project itself): C, C++, Java, Python, PHP, C#, Ruby and other externally supported languages
       – Flexibility: Apache Avro supports data evolution, allowing data structures to change over time without breaking existing systems
       – No IDL: Apache Avro uses schemas to describe data structures; schemas are encoded with the data, eliminating the need for IDL code generation and for repetitive field tags embedded in the data
       – JSON: Apache Avro schemas are described using JSON
       – Hadoop: Apache Avro is built into the Hadoop framework
  4. different?
     • Built for big data
       – Unlike other systems, Apache Avro is particularly focused on features making it efficient for use with large data sets
     • Schemas are always present
       – The schema is required to parse the data
     • Apache Avro leverages dynamic typing, making it possible for dynamic programming languages to generate data types from Apache Avro schemas on the fly
     • Completely generic data processing systems can be developed with no preexisting knowledge of the data formats to be processed
     • Apache Avro data structures are based on field names, not field IDs
       – This makes the selection and consistency of field names important to Apache Avro readers and writers
     Schema: the structure of a data system described in a formal language
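
The point about field names can be illustrated with a small pure-Python sketch. It does not use the Avro library; `resolve_record` and the sample `Order` schema are hypothetical names chosen for illustration, and real Avro schema resolution covers more cases (type promotion, aliases, unions) than this shows. The idea is only that a reader matches written fields by name, drops names it does not know, and fills missing ones from declared defaults.

```python
def resolve_record(reader_schema, written):
    """Project a written record (dict of name -> value) onto the reader's schema.

    Fields are matched purely by name. A field the reader expects but the
    writer did not supply is filled from the reader's default, if any.
    """
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError("no value or default for field %r" % name)
    return result


# Hypothetical reader schema: it added "priority" after the data was written.
READER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "namespace": "com.example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "customer", "type": "string"},
        {"name": "priority", "type": "int", "default": 0},
    ],
}

# Data written under an older schema that had a "notes" field and no "priority".
written = {"id": 42, "customer": "Ginger", "notes": "rush"}
resolved = resolve_record(READER_SCHEMA, written)
print(resolved)
```

Renaming a field in either schema would break the match in this sketch, which is why consistent field names matter to readers and writers.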
  5. schema types
     • Primitive types
       – null: no value
       – boolean: a binary value
       – int: 32-bit signed integer
       – long: 64-bit signed integer
       – float: single precision (32-bit) IEEE 754 floating-point number
       – double: double precision (64-bit) IEEE 754 floating-point number
       – bytes: sequence of 8-bit unsigned bytes
       – string: Unicode character sequence
     • Complex types
       – records: named sets of fields
         • Fields are JSON objects with a name, a type and optionally a default value
       – enums: a named set of strings; instances may hold any one of the string values
       – fixed: a named fixed-size set of bytes
       – arrays: a collection of a single type
       – maps: a string-keyed set of key/value pairs
       – unions: an array of types; instances may assume any of the types
     Records, enums and fixed types have fullnames
     Fullnames contain two parts
     • Namespace (e.g. com.example)
     • Name (e.g. Order)
     Namespaces are dot-separated sequences and are case sensitive.
     Elements referenced by undotted name alone inherit the most closely enclosing namespace.
     Fullnames must be unique and defined before use.
     Avro Java strings are: org.apache.avro.util.Utf8
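
The fullname rules above can be sketched in plain Python using only the standard `json` module. The `collect_fullnames` helper and the sample `Order` schema are hypothetical, made up for illustration; the sketch only walks records, arrays, maps and unions and is not a substitute for the Avro library's schema parser.

```python
import json

NAMED_TYPES = ("record", "enum", "fixed")


def collect_fullnames(schema, enclosing_ns=None, found=None):
    """Return the fullnames of all named types in a parsed schema, in order."""
    found = [] if found is None else found
    if isinstance(schema, list):                 # a union: visit every branch
        for branch in schema:
            collect_fullnames(branch, enclosing_ns, found)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in NAMED_TYPES:
            name = schema["name"]
            if "." in name:                      # dotted name is already a fullname
                full, ns = name, name.rsplit(".", 1)[0]
            else:                                # inherit the enclosing namespace
                ns = schema.get("namespace", enclosing_ns)
                full = "%s.%s" % (ns, name) if ns else name
            if full in found:
                raise ValueError("duplicate fullname: %s" % full)
            found.append(full)
            enclosing_ns = ns
        if kind == "record":
            for field in schema["fields"]:
                collect_fullnames(field["type"], enclosing_ns, found)
        elif kind == "array":
            collect_fullnames(schema["items"], enclosing_ns, found)
        elif kind == "map":
            collect_fullnames(schema["values"], enclosing_ns, found)
    # Plain strings (primitive names or references to named types) define nothing.
    return found


ORDER_SCHEMA = json.loads("""
{"type": "record", "name": "Order", "namespace": "com.example",
 "fields": [
   {"name": "id", "type": "long"},
   {"name": "status", "type": {"type": "enum", "name": "Status",
                               "symbols": ["OPEN", "SHIPPED"]}},
   {"name": "checksum", "type": ["null", {"type": "fixed", "name": "MD5", "size": 16}]},
   {"name": "tags", "type": {"type": "map", "values": "string"}}
 ]}
""")

names = collect_fullnames(ORDER_SCHEMA)
print(names)
```

Here `Status` and `MD5` carry no namespace of their own, so both pick up `com.example` from the enclosing `Order` record.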
  6. encoding
     • Apache Avro supports Binary and JSON encoding schemes
     Binary
     • null: 0 bytes
     • boolean: 1 byte
     • int & long: varint/zigzag
     • float: 4 bytes
     • double: 8 bytes
     • bytes: a long encoded length followed by the bytes
     • string: a long encoded length (in bytes) followed by UTF-8 characters
     • records: fields encoded in the order declared
     • enums: int encoded as the 0 based position of the value in the enum array
     • fixed: a fixed size set of bytes
     • arrays: a set of blocks
       – A block is typically a long encoded count followed by that many items
       – Small arrays are represented with a single block
       – Large data sets can be broken up into multiple blocks such that each block can be buffered in memory
       – The final block has a count of 0 and no items
       – A negative count flags abs(count) many items but is immediately followed by a long encoded size in bytes
         • This allows the reader to skip ahead quickly without decoding individual items (which may be of variable length)
     • maps: like arrays but with key/value pairs for items
     • unions: a long encoded 0 based index identifying the type to use followed by the value
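
A few of the binary rules above are easy to try by hand. The following is a minimal pure-Python sketch, not the Avro library's encoder; the function names are made up for illustration. It covers zigzag/varint longs, length-prefixed strings, and the simple single-block form of an array followed by the terminating zero-count block.

```python
def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_long(n):
    """Zigzag the value, then emit 7 bits per byte, low bits first.

    The high bit of each byte is set when more bytes follow.
    """
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(text):
    """A long byte count followed by the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_array(items, encode_item):
    """One block (count, then items) followed by the final zero-count block."""
    out = b""
    if items:
        out += encode_long(len(items))
        out += b"".join(encode_item(item) for item in items)
    return out + encode_long(0)


print(encode_long(1).hex())                      # small positive value
print(encode_long(-64).hex())                    # still a single byte
print(encode_long(64).hex())                     # needs a second byte
print(encode_string("foo").hex())
print(encode_array([3, 27], encode_long).hex())
```

Because records are just their fields encoded in declaration order, a record of strings would be the concatenation of `encode_string` calls with no tags or separators between them.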
  7. containers
     • Apache Avro schemas must always be present
     • The Apache Avro system defines a container file used to house a schema and its data
     • Files consist of
       – A header which contains metadata and a sync marker
       – One or more data blocks containing data as defined in the schema
     • Metadata
       – The header metadata can be any data useful to the file's author
       – Apache Avro assigns two metadata values:
         • avro.schema: contains the schema in JSON format
         • avro.codec: names the compression codec used on the blocks, if any
       – Mandatory codecs are null and deflate; null (no compression) is assumed if the metadata field is absent
       – Snappy is another common codec
     • Data Blocks
       – Data blocks contain a long count of objects in the block, a long size in bytes, the serialized objects (possibly compressed) and the 16-byte sync marker
       – The block prefix and sync marker allow data to be efficiently skipped during HDFS mapred splits and other similar processing
         • Corrupt blocks can also be detected
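
The container layout described above can be assembled by hand as a rough sketch. This is pure Python and not the Avro library's file writer; the helper names are hypothetical, the four magic bytes at the start of the header come from the Avro specification rather than from the slide, and the sample data uses a bare `string` schema to keep things short.

```python
import os
import zlib

MAGIC = b"Obj\x01"  # magic bytes that open a container file (per the Avro spec)


def encode_long(n):
    """Zigzag + variable-length encoding used for Avro int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data):
    """A long length followed by the raw bytes (strings use the same layout)."""
    return encode_long(len(data)) + data


def build_header(schema_json, codec, sync_marker):
    """Magic, then the metadata as an Avro map of string -> bytes, then the sync marker."""
    meta = {
        "avro.schema": schema_json.encode("utf-8"),
        "avro.codec": codec.encode("ascii"),
    }
    out = MAGIC + encode_long(len(meta))         # one map block holding every entry
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0)                        # zero-count block ends the map
    return out + sync_marker


def build_block(encoded_objects, codec, sync_marker):
    """Object count, payload size in bytes, payload, sync marker."""
    payload = b"".join(encoded_objects)
    if codec == "deflate":
        # Raw deflate stream (negative wbits: no zlib header or checksum).
        compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
        payload = compressor.compress(payload) + compressor.flush()
    elif codec != "null":
        raise ValueError("unsupported codec: %s" % codec)
    return (encode_long(len(encoded_objects)) + encode_long(len(payload))
            + payload + sync_marker)


SCHEMA = '{"type": "string"}'
SYNC = os.urandom(16)                            # 16-byte marker chosen per file
objects = [encode_bytes(b"Hello"), encode_bytes(b"Avro")]
container = build_header(SCHEMA, "null", SYNC) + build_block(objects, "null", SYNC)
print(len(container), "bytes")
```

A reader that only wants to skip a block can read the two longs, jump ahead by the byte size, and check that the next 16 bytes equal the sync marker from the header.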
  8. sort order
     • Apache Avro defines a sort order for data
       – This can be a key optimization in large data processing environments
     • Data items with identical schemas can be compared
     • Record fields can optionally specify one of three sort orders
       – ascending: standard sort order
       – descending: reverse sort order
       – ignore: values are to be ignored when sorting the records
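
The three per-field sort orders can be sketched as a comparison function in plain Python. This is only an illustration: `compare_records` and the sample schema are hypothetical, it assumes field values are simple types that Python can compare directly, and it relies on fields defaulting to ascending when no `order` attribute is given.

```python
from functools import cmp_to_key


def compare_records(schema, a, b):
    """Compare two records field by field, in the order the schema declares them."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue                              # field takes no part in sorting
        x, y = a[field["name"]], b[field["name"]]
        if x != y:
            result = -1 if x < y else 1
            return -result if order == "descending" else result
    return 0


# Hypothetical schema: customer ascending, total descending, note ignored.
SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "total", "type": "int", "order": "descending"},
        {"name": "note", "type": "string", "order": "ignore"},
    ],
}

records = [
    {"customer": "Ginger", "total": 10, "note": "b"},
    {"customer": "Gilligan", "total": 5, "note": "z"},
    {"customer": "Ginger", "total": 25, "note": "a"},
]

ordered = sorted(records, key=cmp_to_key(lambda a, b: compare_records(SCHEMA, a, b)))
for record in ordered:
    print(record["customer"], record["total"])
```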
  9. protocols
     • Apache Avro defines RPC interfaces using protocol schemas
     • Protocols are defined in JSON and contain
       – protocol: the name of the protocol (i.e. service)
       – namespace: the containing namespace (optional)
       – types: the data types defined within the protocol
       – messages: the RPC message sequences supported
     • Messages define an RPC method
       – request: the list of parameters passed
       – response: the normal return type
       – errors: the possible error return types (optional)
       – one-way: optional Boolean for request-only messages
  10. a schema
      {
        "namespace": "com.example",
        "protocol": "HelloWorld",
        "doc": "Protocol Greetings",
        "types": [
          {"name": "Greeting", "type": "record", "fields": [
            {"name": "message", "type": "string"}]},
          {"name": "Curse", "type": "error", "fields": [
            {"name": "message", "type": "string"}]}
        ],
        "messages": {
          "hello": {
            "doc": "Say hello.",
            "request": [{"name": "greeting", "type": "Greeting" }],
            "response": "Greeting",
            "errors": ["Curse"]
          }
        }
      }
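
Since a protocol is plain JSON, it can be inspected with any JSON parser before handing it to an Avro library. The sketch below loads the HelloWorld protocol from this slide with Python's standard `json` module and prints one line per message; `describe_messages` is a hypothetical helper, and it assumes request parameter types are given as simple names, as they are here.

```python
import json

PROTOCOL = json.loads("""
{
  "namespace": "com.example",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",
  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],
  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting"}],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
""")


def describe_messages(protocol):
    """Summarize each message as: name(params) -> response [throws errors]."""
    lines = []
    for name, message in sorted(protocol["messages"].items()):
        params = ", ".join("%s: %s" % (p["name"], p["type"])
                           for p in message["request"])
        line = "%s(%s) -> %s" % (name, params, message["response"])
        if message.get("errors"):
            line += " throws " + ", ".join(message["errors"])
        if message.get("one-way"):
            line += " [one-way]"
        lines.append(line)
    return lines


summary = describe_messages(PROTOCOL)
for line in summary:
    print(line)
```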
  11. handshakes
      • Schemas must always be present
      • Therefore when using Protocols:
        – The client must send the request schema to the server
        – The server must send the response and error schema to the client
      • Stateless transports require a schema handshake before each request
      • Stateful transports can maintain a schema cache, eliminating the schema exchange on successive calls
      • Handshakes use a hash of the schema to avoid sending schemas which are already consistent on both sides
      Apache Avro Handshake Records
      {
        "type": "record",
        "name": "HandshakeRequest",
        "namespace": "org.apache.avro.ipc",
        "fields": [
          {"name": "clientHash", "type": {"type": "fixed", "name": "MD5", "size": 16}},
          {"name": "clientProtocol", "type": ["null", "string"]},
          {"name": "serverHash", "type": "MD5"},
          {"name": "meta", "type": ["null", {"type": "map", "values": "bytes"}]}
        ]
      }
      {
        "type": "record",
        "name": "HandshakeResponse",
        "namespace": "org.apache.avro.ipc",
        "fields": [
          {"name": "match", "type": {"type": "enum", "name": "HandshakeMatch", "symbols": ["BOTH", "CLIENT", "NONE"]}},
          {"name": "serverProtocol", "type": ["null", "string"]},
          {"name": "serverHash", "type": ["null", {"type": "fixed", "name": "MD5", "size": 16}]},
          {"name": "meta", "type": ["null", {"type": "map", "values": "bytes"}]}
        ]
      }
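
The hash-based exchange can be sketched from the server's side in plain Python. This is an illustration built on one reading of the Avro specification's handshake rules, not the library's implementation: the `Server` class, its cache and the sample protocol text are all hypothetical, and the request and response are modeled as plain dicts shaped like the two records above.

```python
import hashlib


def protocol_hash(protocol_json):
    """16-byte MD5 of the protocol's JSON text, matching the fixed MD5 type."""
    return hashlib.md5(protocol_json.encode("utf-8")).digest()


class Server:
    def __init__(self, protocol_json):
        self.protocol = protocol_json
        self.hash = protocol_hash(protocol_json)
        self.known_clients = {}                   # clientHash -> client protocol text

    def handshake(self, request):
        client_hash = request["clientHash"]
        if request["clientProtocol"] is not None:
            self.known_clients[client_hash] = request["clientProtocol"]
        client_known = client_hash in self.known_clients
        server_hash_ok = request["serverHash"] == self.hash

        response = {"match": "NONE", "serverProtocol": None,
                    "serverHash": None, "meta": None}
        if client_known:
            response["match"] = "BOTH" if server_hash_ok else "CLIENT"
        if not server_hash_ok:                    # client needs the server's protocol
            response["serverProtocol"] = self.protocol
            response["serverHash"] = self.hash
        return response


PROTOCOL_TEXT = '{"namespace": "com.example", "protocol": "HelloWorld"}'
server = Server(PROTOCOL_TEXT)
client_hash = protocol_hash(PROTOCOL_TEXT)

# First contact: the client sends hashes only and guesses the server hash wrongly.
first = server.handshake({"clientHash": client_hash, "clientProtocol": None,
                          "serverHash": b"\x00" * 16, "meta": None})
# Retry: the client includes its protocol text and the hash it just learned.
second = server.handshake({"clientHash": client_hash, "clientProtocol": PROTOCOL_TEXT,
                           "serverHash": first["serverHash"], "meta": None})
# Later calls: both sides are cached, so only the two 16-byte hashes travel.
third = server.handshake({"clientHash": client_hash, "clientProtocol": None,
                          "serverHash": server.hash, "meta": None})
print(first["match"], second["match"], third["match"])
```

On a stateful transport the third form is all that is needed after the first successful exchange; a stateless transport repeats it before every request.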
  12. protocol session
      https://github.com/phunt/avro-rpc-quickstart
      $ sudo apt-get install libsnappy-dev
      …
      $ sudo apt-get install python-pip python-dev build-essential
      …
      $ sudo pip install python-snappy
      …
      $ sudo pip install avro
      …
      $ cat hello.avpr
      {
        "namespace": "com.example",
        "protocol": "HelloWorld",
        "doc": "Protocol Greetings",
        "types": [
          {"name": "Message", "type": "record",
           "fields": [
             {"name": "to", "type": "string"},
             {"name": "from", "type": "string"},
             {"name": "body", "type": "string"}
           ]
          }
        ],
        "messages": {
          "send": {
            "request": [{"name": "message", "type": "Message"}],
            "response": "string"
          }
        }
      }
      $ python server.py
      - - [28/May/2014 06:27:31] "POST / HTTP/1.1" 200 -
      $ python client.py Thurston_Howell Ginger "Hello Avro"
      Result: Sent message to Thurston_Howell from Ginger with body Hello Avro
  13. hello server
  14. hello client
  15. versions
      • The Avro serialization system was originally developed by Doug Cutting for use with Hadoop; it became an Apache Software Foundation project in 2009.
      • 0.0.0 Apache Inception 2009-04-01
      • 1.0.0 released 2009-07-15
      • 1.1.0 released 2009-09-15
      • 1.2.0 released 2009-10-15
      • 1.3.0 released 2010-02-26
      • 1.4.0 released 2010-09-08
      • 1.5.0 released 2011-03-11
      • 1.6.0 released 2011-11-02
      • 1.7.0 released 2012-06-11
      • 1.7.6 released 2014-01-22
      Open Source
      Community Developed
      Apache License Version 2.0
  16. avro resources
      • Web
        – avro.apache.org
        – github.com/apache/avro
      • Mail
        – Users: user@avro.apache.org
        – Developers: dev@avro.apache.org
      • Chat
        – #avro
      • Book
        – White (2012), Hadoop: The Definitive Guide, O'Reilly Media Inc. [http://www.oreilly.com]
        – Features 20 pages of Apache Avro coverage
      Randy Abernethy ra@apache.org