Avro
 Etymology & History
 Sexy Tractors
 Project Drivers & Overview
 Serialization
 RPC
 Hadoop Support
Etymology
 British aircraft manufacturer
 1910-1963
History
 Doug Cutting – Cloudera, Hadoop project founder
 2002 – Nutch
 2004 – Google GFS, MapReduce whitepapers
 2005 – NDFS & MR, Writable & SequenceFile
 2006 – Hadoop split from Nutch, renamed NDFS to HDFS
 2007 – Yahoo gets involved, HBase, Pig, ZooKeeper
 2008 – TeraSort contest winner, Hive, Mahout, Cassandra
 2009 – Oozie, Flume, Hue
History
 Underlying serialization system basically unchanged
 Additional language support and data formats
 Language, data format combinatorial explosion
    C++ JSON to Java BSON
    Python Smile to PHP CSV
 Apr 2009 – Avro proposal
 May 2010 – Top-level project
Sexy Tractors
 Data serialization tools, like tractors, aren’t sexy
 They should be!
 Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5-2 years
 Throughput of magnetic storage and networks has not kept pace
 Distributed systems are the norm
 Efficient data serialization techniques and tools are vital
Project Drivers
 Common data format for serialization and RPC
 Dynamic
 Expressive
 Efficient
 File format
    Well defined
    Standalone
    Splittable & compressed
Biased Comparison
                       CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language independent   Yes   Yes        No             Yes           Yes
Expressive             No    Yes        Yes            Yes           Yes
Efficient              No    No         Yes            Yes           Yes
Dynamic                Yes   Yes        No             No            Yes
Standalone             ?     Yes        No             No            Yes
Splittable             ?     ?          Yes            ?             Yes
Project Overview
 Specification based design
 Dynamic implementations
 File format
 Schemas
    Must support JSON implementation
    IDL often supported
    Evolvable
 First class Hadoop support
Specification Based Design
 Schemas
 Encoding
 Sort order
 Object container files
 Codecs
 Protocol
 Protocol wire format
 Schema resolution
Specification Based Design
 Schemas
    Primitive types
        Null, boolean, int, long, float, double, bytes, string
    Complex types
      Records, enums, arrays, maps, unions and fixed

    Named types
      Records, enums, fixed
      Name & namespace

    Aliases
    http://avro.apache.org/docs/current/spec.html#schemas
Schema Example
log-message.avpr

{
    "namespace": "com.emoney",
    "name": "LogMessage",
    "type": "record",
    "fields": [
       {"name": "level", "type": "string", "comment" : "this is ignored"},
       {"name": "message", "type": "string", "description" : "this is the message"},
       {"name": "dateTime", "type": "long"},
       {"name": "exceptionMessage", "type": ["null", "string"]}
    ]
}
Specification Based Design
 Encodings
    JSON – for debugging
    Binary
 Sort order
    Efficient sorting by system other than writer
    Sorting binary-encoded data without deserialization
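As an illustration of the binary encoding, the Avro specification writes int and long values in a variable-length zig-zag form, so small magnitudes take a single byte. The following is a minimal sketch using only the JDK; the class name `ZigZagVarint` and its methods are hypothetical and not part of the Avro API.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of the zig-zag, variable-length encoding Avro specifies for int and long. */
final class ZigZagVarint {
    /** Encodes a long into 1 to 10 bytes. */
    static byte[] encode(long value) {
        // Zig-zag maps signed to unsigned so small magnitudes get small codes:
        // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
        long zigzag = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits per byte, low-order group first; a set high bit means "more bytes follow".
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80));
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
        return out.toByteArray();
    }

    /** Decodes a single value that starts at the beginning of the array. */
    static long decode(byte[] bytes) {
        long zigzag = 0;
        int shift = 0;
        for (byte b : bytes) {
            zigzag |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
            shift += 7;
        }
        // Undo the zig-zag mapping.
        return (zigzag >>> 1) ^ -(zigzag & 1);
    }
}
```

With this scheme `encode(1)` yields the single byte 0x02 and `encode(64)` yields 0x80 0x01, matching the examples in the specification.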
Specification Based Design
 Object container files
    Schema
    Serialized data written to binary-encoded blocks
    Blocks may be compressed
    Synchronization markers
 Codecs
    Null
    Deflate
    Snappy (optional)
    LZO (future)
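The synchronization markers are what make a container file splittable: a reader dropped at an arbitrary byte offset can scan forward to the next 16-byte marker and start at the block that follows it. The sketch below models only that idea with the JDK. It is a simplification: real blocks also carry an object count and a byte length, and the marker is generated per file and stored in the header. The class `SyncScan` and its methods are hypothetical names, not Avro API.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Simplified model of a container file: every block is followed by a 16-byte sync marker. */
final class SyncScan {
    static final int SYNC_SIZE = 16;

    /** Concatenates the blocks, appending the file's sync marker after each one. */
    static byte[] writeBlocks(byte[] sync, byte[]... blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] block : blocks) {
            out.write(block, 0, block.length);
            out.write(sync, 0, sync.length);
        }
        return out.toByteArray();
    }

    /**
     * Returns the offset just past the first sync marker found at or after
     * {@code from}, which is where the next block begins, or -1 if no marker is found.
     */
    static int nextBlockStart(byte[] file, byte[] sync, int from) {
        for (int i = Math.max(from, 0); i + SYNC_SIZE <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + SYNC_SIZE), sync)) {
                return i + SYNC_SIZE;
            }
        }
        return -1;
    }
}
```

A MapReduce-style split that begins mid-block would call something like `nextBlockStart` with its start offset and skip the partial block, leaving it to the reader of the preceding split.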
Specification Based Design
 Protocol
    Protocol name
    Namespace
    Types
        Named types used in messages
    Messages
        Uniquely named message
        Request
        Response
        Errors
 Wire format
   Transports
   Framing
   Handshake
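Framing, as the Avro specification describes it, splits a message into buffers, prefixes each with a four-byte big-endian length, and ends the message with a zero-length buffer. A minimal sketch of that layout follows, assuming JDK classes only; `Framing`, `frame` and `unframe` are hypothetical names, not the Avro IPC API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Sketch of Avro-style message framing. */
final class Framing {
    /** Writes each buffer with a 4-byte big-endian length prefix, then a zero-length terminator. */
    static byte[] frame(List<byte[]> buffers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] buffer : buffers) {
            if (buffer.length == 0) {
                continue; // a zero length would be read as the end of the message
            }
            out.write(ByteBuffer.allocate(4).putInt(buffer.length).array(), 0, 4);
            out.write(buffer, 0, buffer.length);
        }
        out.write(new byte[4], 0, 4); // terminator: length 0
        return out.toByteArray();
    }

    /** Reads buffers back until the zero-length terminator is reached. */
    static List<byte[]> unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed); // ByteBuffer is big-endian by default
        List<byte[]> buffers = new ArrayList<>();
        for (int length = in.getInt(); length != 0; length = in.getInt()) {
            byte[] buffer = new byte[length];
            in.get(buffer);
            buffers.add(buffer);
        }
        return buffers;
    }
}
```

Because the receiver always knows how many bytes to expect next, a framed message can be read without parsing the payload itself.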
Protocol
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",

    "types": [
       {"name": "Greeting", "type": "record", "fields": [ {"name": "message", "type": "string"}]},
       {"name": "Curse", "type": "error", "fields": [ {"name": "message", "type": "string"}]} ],

    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting" }],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
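In the handshake, client and server identify a protocol such as the one above by a 16-byte MD5 hash of its JSON text, so the full protocol only needs to be sent when the other side does not already know it. The sketch below computes such a hash with the JDK. The class `ProtocolHash` is a hypothetical name, and hashing the text exactly as given is an assumption, since each implementation hashes its own rendering of the protocol.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: fingerprint a protocol's JSON text for use in a handshake. */
final class ProtocolHash {
    /** Returns the 16-byte MD5 digest of the given JSON text (UTF-8 encoded). */
    static byte[] md5(String protocolJson) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(protocolJson.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Renders a digest as lowercase hex for logging or comparison. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Any change to the hashed text, including whitespace, produces a different hash, which the other side treats as an unknown protocol.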
Schema Resolution & Evolution
   Writer's schema is always provided to the reader
   Compare schema used by writer & schema expected by reader
   Fields that match name & type are read
   Fields written that don’t match are skipped
   Expected fields not written can be identified
      Error or provide default value
 Same features as provided by numeric field ids
    Keeps fields symbolic, no index IDs written in data
 Allows for projections
    Very efficient at skipping fields
 Aliases
    Allows projections from 2 different types using aliases
        User transaction
            Count, date
        Batch
            Count, date
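The resolution rules above can be sketched with plain maps: fields are matched by name, fields only the writer knew are skipped, and fields only the reader expects take their default or raise an error. This is a simplified model using the JDK only. It ignores type matching and aliases, and `Resolver` and `resolve` are hypothetical names rather than Avro's resolver.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified, name-only model of resolving a written record against a reader's schema. */
final class Resolver {
    /**
     * @param written      field name to value, as produced with the writer's schema
     * @param readerFields each field the reader expects, mapped to its optional default
     */
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Optional<Object>> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Optional<Object>> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                result.put(name, written.get(name));       // matched by name: read it
            } else if (field.getValue().isPresent()) {
                result.put(name, field.getValue().get());  // not written: use the default
            } else {
                throw new IllegalStateException("no value or default for field: " + name);
            }
        }
        return result; // fields only the writer knew are never copied, i.e. skipped
    }
}
```

A reader that lists only `message` and a defaulted `host` would get exactly those two fields back from a LogMessage record, which is the projection behaviour the slide describes.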
Implementations
   Core – parse schemas, read & write binary data for a schema
   Data file – read & write Avro data files
   Codec – supported codecs
   RPC/HTTP – make and receive calls over HTTP
Implementation   Core   Data file   Codec             RPC/HTTP
C                Yes    Yes         Deflate           Yes
C++              Yes    Yes         ?                 Yes
C#               Yes    No          N/A               No
Java             Yes    Yes         Deflate, Snappy   Yes
Python           Yes    Yes         Deflate           Yes
Ruby             Yes    Yes         Deflate           Yes
PHP              Yes    Yes         ?                 No
API
 Generic
    Generic attribute/value data structure
    Best suited for dynamic processing
 Specific
    Each record corresponds to a different kind of object in the programming language
    RPC systems typically use this
 Reflect
    Schemas generated via reflection
    Converting an existing codebase to use Avro
API
 Low-level
    Schema
    Encoders
    DatumWriter
    DatumReader
 High-level
    DataFileWriter
    DataFileReader
Java Example
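import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;

// Message is assumed to be a record class generated from the schema.
Schema schema = new Schema.Parser().parse(getClass().getResourceAsStream("schema.avpr"));

OutputStream outputStream = new FileOutputStream("data.avro");

DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));

// The codec must be set before create() writes the file header.
writer.setCodec(CodecFactory.deflateCodec(1));
writer.create(schema, outputStream);

writer.append(new Message());

// Flush the last block and release the file.
writer.close();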
Java Example
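import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// GenericDatumReader returns GenericRecord instances built from the schema stored in the file.
DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
         new File("data.avro"),
         new GenericDatumReader<GenericRecord>());

for (GenericRecord next : reader) {
  System.out.println("next: " + next);
}
reader.close();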
RPC
 Server
    SocketServer (non-standard)
    SaslSocketServer
    HttpServer
    NettyServer
    DatagramServer (non-standard)
 Responder
    Generic
    Reflect
    Specific
 Client
    Corresponding Transceiver for each server
    LocalTransceiver
 Requestor
RPC Server
Protocol protocol = Protocol.parse(new File("protocol.avpr"));

InetSocketAddress address = new InetSocketAddress("localhost", 33333);

GenericResponder responder = new GenericResponder(protocol) {
   @Override
   public Object respond(Protocol.Message message, Object request)
   throws Exception {
     ...
   }
};

// The constructor only binds the socket; start() begins accepting connections.
SocketServer server = new SocketServer(responder, address);
server.start();
server.join();
Hadoop Support
 File writers and readers
 Replacing RPC with Avro
    In Flume already
 Pig support is in
 Splittable
    Set block size when writing
 Tether jobs
    Connector framework for other languages
    Hadoop Pipes
Future
 RPC
    HBase, Cassandra, Hadoop core
 Hive in progress
 Tether jobs
    Actual MapReduce implementations in other languages
Avro
 Dynamic
 Expressive
 Efficient
 Specification based design
 Language implementations are fairly solid
 Serialization or RPC or both
 First class Hadoop support
 Currently 1.5.1
 Sexy tractors
