Apache Avro

Zafar Gilani
Muhammad Adnan Khan
Hui Shang
Outline
•   Overview
•   Comparison
•   Specification
•   SASL profile and usage
•   References
Overview
•   A data serialization system.
•   An RPC framework.
•   For: storage and communication.
•   Purpose:
    – Provide rich data structures.
    – A compact and fast binary data format.
    – Simple integration with dynamic languages.
Overview
• Avro uses JSON as its Interface Description
  Language (IDL).
  – To specify data types.
  – To specify protocols.
• Review: JavaScript Object Notation (JSON) is a
  lightweight, text-based standard for data
  interchange.
Why the need for Avro?
• Primarily used in Hadoop, where it provides a standard:
  1. Serialization format for persistent data.
  2. Wire format for communication:
    •   among Hadoop nodes.
    •   from client programs to Hadoop services.
Overview
• Avro relies on schemas.
  – The schema is stored with the data.
  – Each datum is written with no per-value overhead.
     • Thus serialization is fast and the output is small.
• Avro in RPC:
  – Schemas are exchanged during the client-server handshake.
  – With both schemas available, corresponding fields can be
    resolved by name.
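The "no per-value overhead" point can be illustrated with a minimal sketch of Avro's binary encoding for two primitive types: a long is written as a zig-zag variable-length integer, and a string as its length (a long) followed by its UTF-8 bytes. A record is just its fields written in schema order, with no field names or tags. The `User` schema, the helper names, and the sample datum are assumptions made for this illustration; this is not the Avro library API.

```python
import json


def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its byte length (a long) plus UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


# Only the types needed for this sketch are handled.
ENCODERS = {"int": encode_long, "long": encode_long, "string": encode_string}


def encode_record(schema: dict, datum: dict) -> bytes:
    """Write field values in schema order; names and types are not written."""
    return b"".join(
        ENCODERS[field["type"]](datum[field["name"]])
        for field in schema["fields"]
    )


# Hypothetical schema and datum, used only for illustration.
USER_SCHEMA = json.loads("""
{"name": "User", "type": "record", "fields": [
  {"name": "name", "type": "string"},
  {"name": "age", "type": "long"}]}
""")

if __name__ == "__main__":
    datum = {"name": "Ada", "age": 36}
    binary = encode_record(USER_SCHEMA, datum)
    print(len(binary), "bytes in binary form")
    print(len(json.dumps(datum)), "bytes as JSON text")
```

Because the reader needs the schema to know where one field ends and the next begins, the schema has to travel with the data (in files) or be exchanged in the handshake (in RPC).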
APIs
• APIs are available for:
  – Java
  – C
  – C++
  – C#
  – Python
  – Ruby
Comparison with other systems
• Avro vs. Protobuf and Thrift.
• A quick note about Thrift:
  – Initially developed at Facebook by a Google intern.
  – Closer to Google’s protobuf.
Comparison with other systems
                 Avro                Google protobuf       Thrift

Implementation   Hmm..               Cleaner               Hmm..

Error handling   Complex             Simple                OK

Extensibility    Hmm..               Richer                OK

Compatibility    Java, C, C++, C#,   That and much more,   About the same as
                 Python and Ruby     such as Adobe         protobuf
                                     ActionScript,
                                     Microsoft
                                     Silverlight, etc.
Specification
• A schema is represented as one of:
   – JSON string, naming a defined type.
   – JSON object of the form:
      • {"type": "typeName" ...attributes...}
   – JSON array, representing a union of embedded types.
• Primitive types: null, boolean, int, long, float,
  double, bytes, string
   – {"type": "string"}
• Complex types: records, enums, arrays, maps,
  unions, fixed
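The three schema forms above can be told apart with a few lines of plain Python. This is a minimal sketch using only the standard `json` module; the function name `describe` and the labels it returns are assumptions chosen for illustration, not part of any Avro API.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double",
              "bytes", "string"}
COMPLEX = {"record", "enum", "array", "map", "fixed"}


def describe(schema) -> str:
    """Classify a parsed Avro schema by its JSON form."""
    if isinstance(schema, list):
        return "union"  # a JSON array is a union of the listed types
    if isinstance(schema, dict):
        # A JSON object names its type in the "type" attribute.
        return describe(schema["type"])
    if isinstance(schema, str):
        if schema in PRIMITIVES:
            return "primitive"
        if schema in COMPLEX:
            return schema
        return "named type reference"  # e.g. a previously defined record
    raise TypeError(f"not a schema: {schema!r}")


if __name__ == "__main__":
    samples = [
        '"string"',
        '{"type": "string"}',
        '["null", "string"]',
        '{"type": "array", "items": "long"}',
        '"Greeting"',
    ]
    for text in samples:
        print(text, "->", describe(json.loads(text)))
```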
Specification, example protocol
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",

    "types": [
      {"name": "Greeting", "type": "record", "fields": [
        {"name": "message", "type": "string"}]},
      {"name": "Curse", "type": "error", "fields": [
        {"name": "message", "type": "string"}]}
    ],

    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting" }],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
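Since the protocol is ordinary JSON, it can be inspected without any Avro tooling. The sketch below loads the HelloWorld protocol shown above with Python's standard `json` module and lists its types and messages; the helper name `summarize` and the shape of its result are assumptions made for this example.

```python
import json

PROTOCOL_JSON = """
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",
    "types": [
      {"name": "Greeting", "type": "record", "fields": [
        {"name": "message", "type": "string"}]},
      {"name": "Curse", "type": "error", "fields": [
        {"name": "message", "type": "string"}]}
    ],
    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting"}],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
"""


def summarize(protocol: dict) -> dict:
    """Collect the protocol's full name, declared types, and messages."""
    return {
        "name": f'{protocol["namespace"]}.{protocol["protocol"]}',
        "types": {t["name"]: t["type"] for t in protocol["types"]},
        "messages": {
            name: {
                "request": [p["type"] for p in msg["request"]],
                "response": msg["response"],
                "errors": msg.get("errors", []),
            }
            for name, msg in protocol["messages"].items()
        },
    }


if __name__ == "__main__":
    print(json.dumps(summarize(json.loads(PROTOCOL_JSON)), indent=2))
```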
SASL profile
• Simple Authentication and Security Layer.
• Provides a framework for
  – Authentication.
  – Security of network protocols.
SASL usage
• Negotiation procedure for connection-oriented
  Avro RPC, using these status codes:
  – 0: START Used in a client's initial message.
  – 1: CONTINUE Used while negotiation is
    ongoing.
  – 2: FAIL Terminates negotiation unsuccessfully.
  – 3: COMPLETE Terminates negotiation
    successfully.
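The four status codes above can be modelled as a small enumeration with a check on the order in which they appear. This is a simplified sketch based only on the descriptions on this slide: it assumes a negotiation starts with START, never repeats START, and carries no further messages after FAIL or COMPLETE. The names `Status` and `final_status` are hypothetical, and the real profile also carries mechanism names and payloads that are not modelled here.

```python
from enum import IntEnum


class Status(IntEnum):
    START = 0      # client's initial message
    CONTINUE = 1   # negotiation is ongoing
    FAIL = 2       # terminates negotiation unsuccessfully
    COMPLETE = 3   # terminates negotiation successfully


TERMINAL = {Status.FAIL, Status.COMPLETE}


def final_status(codes) -> Status:
    """Validate a sequence of status codes and return the last one.

    Raises ValueError if a code is unknown or the ordering breaks the
    simplified rules assumed in this sketch.
    """
    seq = [Status(code) for code in codes]  # unknown codes raise ValueError
    if not seq or seq[0] is not Status.START:
        raise ValueError("negotiation must begin with START")
    for previous, current in zip(seq, seq[1:]):
        if previous in TERMINAL:
            raise ValueError(f"message sent after {previous.name}")
        if current is Status.START:
            raise ValueError("START may only appear first")
    return seq[-1]


if __name__ == "__main__":
    outcome = final_status([0, 1, 1, 3])
    print(outcome.name, "finished" if outcome in TERMINAL else "in progress")
```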
References
1. Apache Avro,
   http://avro.apache.org/docs/current/
2. Google protocol buffers vs Apache Avro,
   http://www.sammur.com/?p=36
3. Avro vs Thrift,
   http://tech.puredanger.com/2011/05/27/serialization-comparison/
4. SASL,
   http://avro.apache.org/docs/current/sasl.html