Avro
 Etymology & History
 Sexy Tractors
 Project Drivers & Overview
 Serialization
 RPC
 Hadoop Support
Etymology
 British aircraft manufacturer
 1910-1963
History
 Doug Cutting – Cloudera, Hadoop project founder
 2002 – Nutch
 2004 – Google GFS, MapReduce whitepapers
 2005 – NDFS & MR, Writable & SequenceFile
 2006 – Hadoop split from Nutch, renamed NDFS to
  HDFS
 2007 – Yahoo gets involved, HBase, Pig, Zookeeper
 2008 – Terrasort contest winner, Hive, Mahout,
  Cassandra
 2009 – Oozie, Flume, Hue
History
 Underlying serialization system basically unchanged
 Additional language support and data formats
 Language, data format combinatorial explosion
    C++ JSON to Java BSON
    Python Smile to PHP CSV
 Apr 2009 – Avro proposal
 May 2010 – Top-level project
Sexy Tractors
 Data serialization tools, like tractors, aren’t sexy
 They should be!
 Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5-2 years
 Throughput of magnetic storage and network has not maintained this pace
 Distributed systems are the norm
 Efficient data serialization techniques and tools are vital
Project Drivers
 Common data format for serialization and RPC
 Dynamic
 Expressive
 Efficient
 File format
    Well defined
    Standalone
    Splittable & compressed
Biased Comparison
                       CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language independent   Yes   Yes        No             Yes           Yes
Expressive             No    Yes        Yes            Yes           Yes
Efficient              No    No         Yes            Yes           Yes
Dynamic                Yes   Yes        No             No            Yes
Standalone             ?     Yes        No             No            Yes
Splittable             ?     ?          Yes            ?             Yes
Project Overview
 Specification based design
 Dynamic implementations
 File format
 Schemas
    Must support JSON implementation
    IDL often supported
    Evolvable
 First class Hadoop support
Specification Based Design
 Schemas
 Encoding
 Sort order
 Object container files
 Codecs
 Protocol
 Protocol write format
 Schema resolution
Specification Based Design
 Schemas
    Primitive types
        Null, boolean, int, long, float, double, bytes, string
    Complex types
      Records, enums, arrays, maps, unions and fixed

    Named types
      Records, enums, fixed
      Name & namespace

    Aliases
    http://avro.apache.org/docs/current/spec.html#schemas
Schema Example
log-message.avpr

{
    "namespace": "com.emoney",
    "name": "LogMessage",
    "type": "record",
    "fields": [
       {"name": "level", "type": "string", "comment" : "this is ignored"},
       {"name": "message", "type": "string", "description" : "this is the message"},
       {"name": "dateTime", "type": "long"},
       {"name": "exceptionMessage", "type": ["null", "string"]}
    ]
}
Specification Based Design
 Encodings
    JSON – for debugging
    Binary
 Sort order
    Efficient sorting by system other than writer
    Sorting binary-encoded data without deserialization
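To make the binary encoding concrete: the Avro specification encodes int and long values with variable-length zig-zag coding, so small magnitudes take a single byte. The following is a minimal sketch of that one rule only, not the library's encoder; the class and method names (`ZigZagVarint`, `encodeLong`, `decodeLong`) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

class ZigZagVarint {
    // Zig-zag maps signed values to unsigned ones (0, -1, 1, -2 -> 0, 1, 2, 3),
    // then the result is written 7 bits at a time, low-order group first.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {               // more than 7 bits still to write
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: another byte follows
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // Reverse of encodeLong: reassemble the 7-bit groups, then undo the zig-zag.
    static long decodeLong(byte[] bytes) {
        long z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return (z >>> 1) ^ -(z & 1);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Prints: 0 -> 00, -1 -> 01, 1 -> 02, -64 -> 7f, 64 -> 80 01
        for (long n : new long[] {0, -1, 1, -64, 64}) {
            System.out.println(n + " -> " + hex(encodeLong(n)));
        }
    }
}
```

A timestamp such as the `dateTime` field in the earlier schema example would take more bytes than these small values, but never carries a field name or tag alongside it.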
Specification Based Design
 Object container files
    Schema
    Serialized data written to binary-encoded blocks
    Blocks may be compressed
    Synchronization markers
 Codecs
    Null
    Deflate
    Snappy (optional)
    LZO (future)
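As a rough illustration of what the deflate codec does to a block: the Avro specification describes it as raw deflate (RFC 1951), without the zlib header and checksum. The sketch below uses only `java.util.zip` and is not the Avro codec itself; `DeflateBlockSketch`, `compress` and `decompress` are hypothetical names, and the sample "block" is just repeated text standing in for serialized records.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DeflateBlockSketch {
    // Compress one block with raw deflate; 'true' selects the header-less format.
    static byte[] compress(byte[] block, int level) {
        Deflater deflater = new Deflater(level, true);
        deflater.setInput(block);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate a block produced by compress().
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater(true);
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("INFO request handled;");
        }
        byte[] block = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(block, 1); // level 1: fastest, least compression
        System.out.println(block.length + " bytes -> " + packed.length + " bytes");
    }
}
```

Because each block is compressed on its own, a reader can decompress one block without touching the rest of the file.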
Specification Based Design
 Protocol
    Protocol name
    Namespace
    Types
        Named types used in messages
    Messages
        Uniquely named message
        Request
        Response
        Errors
 Wire format
   Transports
   Framing
   Handshake
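The framing layer can be sketched in a few lines. According to the Avro specification, a framed message is a series of buffers, each prefixed with a four-byte big-endian length, and the message ends with a zero-length buffer. The code below is an illustrative sketch under that reading, not the library's transceiver code; `FramingSketch`, `frame` and `unframe` are made-up names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FramingSketch {
    // Write each buffer as <4-byte big-endian length><data>, then a zero length.
    static byte[] frame(List<byte[]> buffers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] buf : buffers) {
            if (buf.length == 0) {
                continue; // a zero length is reserved for the terminator
            }
            out.write(ByteBuffer.allocate(4).putInt(buf.length).array(), 0, 4);
            out.write(buf, 0, buf.length);
        }
        out.write(new byte[4], 0, 4); // terminator: length 0
        return out.toByteArray();
    }

    // Read buffers until the zero-length terminator is seen.
    static List<byte[]> unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed); // ByteBuffer is big-endian by default
        List<byte[]> buffers = new ArrayList<>();
        while (true) {
            int length = in.getInt();
            if (length == 0) {
                return buffers;
            }
            byte[] buf = new byte[length];
            in.get(buf);
            buffers.add(buf);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "hi".getBytes(StandardCharsets.UTF_8);
        byte[] framed = frame(Collections.singletonList(payload));
        // Prints 10: 4 length bytes + 2 data bytes + 4 terminator bytes
        System.out.println(framed.length);
    }
}
```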
Protocol
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",

    "types": [
       {"name": "Greeting", "type": "record", "fields": [ {"name": "message", "type": "string"}]},
       {"name": "Curse", "type": "error", "fields": [ {"name": "message", "type": "string"}]} ],

    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting" }],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
Schema Resolution & Evolution
   Writer's schema always provided to reader
   Compare schema used by writer & schema expected by reader
   Fields that match name & type are read
   Fields written that don’t match are skipped
   Expected fields not written can be identified
      Error or provide default value
 Same features as provided by numeric field ids
    Keeps fields symbolic, no index IDs written in data
 Allows for projections
    Very efficient at skipping fields
 Aliases
    Allows projections from 2 different types using aliases
    User transaction
        Count, date
    Batch
        Count, date
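The resolution rules above can be modelled with plain maps. This is a deliberately simplified sketch: it matches fields by name only and ignores types and aliases, which real Avro resolution also checks. `ResolutionSketch`, `resolve` and `NO_DEFAULT` are hypothetical names, and the field names in `main` are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ResolutionSketch {
    // Marker for a reader field that declares no default value.
    static final Object NO_DEFAULT = new Object();

    // written:      field name -> value, as produced with the writer's schema
    // readerFields: field name -> default value (or NO_DEFAULT), from the reader's schema
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                result.put(name, written.get(name));   // matching field: read it
            } else if (field.getValue() != NO_DEFAULT) {
                result.put(name, field.getValue());    // not written: use the default
            } else {
                throw new IllegalStateException("no value or default for field: " + name);
            }
        }
        return result; // fields only the writer knows are never copied, i.e. skipped
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("level", "INFO");
        written.put("message", "started");
        written.put("thread", "main");        // unknown to the reader: skipped

        Map<String, Object> readerFields = new LinkedHashMap<>();
        readerFields.put("level", NO_DEFAULT);
        readerFields.put("host", "unknown");  // not written: default applies

        // Prints {level=INFO, host=unknown}
        System.out.println(resolve(written, readerFields));
    }
}
```

A projection is the same operation with a reader schema that lists only the fields of interest.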
Implementations
   Core – parse schemas, read & write binary data for a schema
   Data file – read & write Avro data files
   Codec – supported codecs
   RPC/HTTP – make and receive calls over HTTP
Implementation   Core   Data file   Codec             RPC/HTTP
C                Yes    Yes         Deflate           Yes
C++              Yes    Yes         ?                 Yes
C#               Yes    No          N/A               No
Java             Yes    Yes         Deflate, Snappy   Yes
Python           Yes    Yes         Deflate           Yes
Ruby             Yes    Yes         Deflate           Yes
PHP              Yes    Yes         ?                 No
API
 Generic
    Generic attribute/value data structure
    Best suited for dynamic processing
 Specific
    Each record corresponds to a different kind of object in the programming language
    RPC systems typically use this
 Reflect
    Schemas generated via reflection
    Converting an existing codebase to use Avro
API
 Low-level
    Schema
    Encoders
    DatumWriter
    DatumReader
 High-level
    DataFileWriter
    DataFileReader
Java Example
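import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;

// Parse the writer's schema from a classpath resource.
Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));

OutputStream outputStream = new FileOutputStream("data.avro");

// Message stands in for a record class that matches the schema.
DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));

// The codec must be set before create(); 1 is the deflate compression level.
writer.setCodec(CodecFactory.deflateCodec(1));
writer.create(schema, outputStream);

writer.append(new Message());

// close() flushes the last block to the file.
writer.close();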
Java Example
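// No schema is passed in: the reader takes the writer's schema from the file header.
DataFileReader<Message> reader = new DataFileReader<Message>(
         new File("data.avro"),
         new GenericDatumReader<Message>());

for (Message next : reader) {
  System.out.println("next: " + next);
}

reader.close();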
RPC
 Server
    SocketServer (non-standard)
    SaslSocketServer
    HttpServer
    NettyServer
    DatagramServer (non-standard)
 Responder
    Generic
    Reflect
    Specific
 Client
    Corresponding Transceiver for each server
    LocalTransceiver
 Requestor
RPC Server
Protocol protocol = Protocol.parse(new File("protocol.avpr"));

InetSocketAddress address = new InetSocketAddress("localhost", 33333);

GenericResponder responder = new GenericResponder(protocol) {
   @Override
   public Object respond(Protocol.Message message, Object request)
   throws Exception {
     ...
   }
};

new SocketServer(responder, address).join();
Hadoop Support
 File writers and readers
 Replacing RPC with Avro
    In Flume already
 Pig support is in
 Splittable
    Set block size when writing
 Tether jobs
    Connector framework for other languages
    Hadoop Pipes
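Splittability rests on the synchronization markers in the object container file: per the Avro specification, a 16-byte marker follows every block, so a reader dropped at an arbitrary offset can scan forward to the next block boundary. The sketch below shows only that scan on an in-memory byte array; it is not the Hadoop input format, and `SyncScanSketch` and `nextBlockStart` are made-up names.

```java
import java.util.Arrays;

class SyncScanSketch {
    static final int SYNC_SIZE = 16;

    // Returns the offset just past the first sync marker found at or after
    // 'from', or -1 if no complete marker remains in the data.
    static int nextBlockStart(byte[] file, byte[] sync, int from) {
        for (int i = Math.max(from, 0); i + SYNC_SIZE <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + SYNC_SIZE), sync)) {
                return i + SYNC_SIZE;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        for (int i = 0; i < SYNC_SIZE; i++) {
            sync[i] = (byte) (100 + i); // stand-in for the file's random marker
        }
        // Fake layout: 10 data bytes, marker, 5 data bytes, marker.
        byte[] file = new byte[10 + SYNC_SIZE + 5 + SYNC_SIZE];
        Arrays.fill(file, 0, 10, (byte) 1);
        System.arraycopy(sync, 0, file, 10, SYNC_SIZE);
        Arrays.fill(file, 26, 31, (byte) 2);
        System.arraycopy(sync, 0, file, 31, SYNC_SIZE);

        // A split starting at offset 3 begins reading at the next boundary.
        System.out.println(nextBlockStart(file, sync, 3)); // prints 26
    }
}
```

Smaller blocks mean more markers and therefore finer-grained split points, at the cost of some space and compression efficiency.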
Future
 RPC
    HBase, Cassandra, Hadoop core
 Hive in progress
 Tether jobs
    Actual MapReduce implementations in other languages
Avro
 Dynamic
 Expressive
 Efficient
 Specification based design
 Language implementations are fairly solid
 Serialization or RPC or both
 First class Hadoop support
 Currently 1.5.1
 Sexy tractors

More Related Content

What's hot

Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우Arawn Park
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producerconfluent
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Micro services Architecture
Micro services ArchitectureMicro services Architecture
Micro services ArchitectureAraf Karsh Hamid
 
Spring Cloud Function: Where We Were, Where We Are, and Where We’re Going
Spring Cloud Function: Where We Were, Where We Are, and Where We’re GoingSpring Cloud Function: Where We Were, Where We Are, and Where We’re Going
Spring Cloud Function: Where We Were, Where We Are, and Where We’re GoingVMware Tanzu
 
Building Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaBuilding Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaGuido Schmutz
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Migrating Oracle database to Cassandra
Migrating Oracle database to CassandraMigrating Oracle database to Cassandra
Migrating Oracle database to CassandraUmair Mansoob
 
Serverless integration with Knative and Apache Camel on Kubernetes
Serverless integration with Knative and Apache Camel on KubernetesServerless integration with Knative and Apache Camel on Kubernetes
Serverless integration with Knative and Apache Camel on KubernetesClaus Ibsen
 
The Art of Discovering Bounded Contexts
The Art of Discovering Bounded ContextsThe Art of Discovering Bounded Contexts
The Art of Discovering Bounded ContextsNick Tune
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...confluent
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafkaemreakis
 

What's hot (20)

Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우코틀린 멀티플랫폼, 미지와의 조우
코틀린 멀티플랫폼, 미지와의 조우
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Micro services Architecture
Micro services ArchitectureMicro services Architecture
Micro services Architecture
 
Spring Cloud Function: Where We Were, Where We Are, and Where We’re Going
Spring Cloud Function: Where We Were, Where We Are, and Where We’re GoingSpring Cloud Function: Where We Were, Where We Are, and Where We’re Going
Spring Cloud Function: Where We Were, Where We Are, and Where We’re Going
 
Building Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaBuilding Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache Kafka
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Migrating Oracle database to Cassandra
Migrating Oracle database to CassandraMigrating Oracle database to Cassandra
Migrating Oracle database to Cassandra
 
Introduction to GraphQL
Introduction to GraphQLIntroduction to GraphQL
Introduction to GraphQL
 
Serverless integration with Knative and Apache Camel on Kubernetes
Serverless integration with Knative and Apache Camel on KubernetesServerless integration with Knative and Apache Camel on Kubernetes
Serverless integration with Knative and Apache Camel on Kubernetes
 
The Art of Discovering Bounded Contexts
The Art of Discovering Bounded ContextsThe Art of Discovering Bounded Contexts
The Art of Discovering Bounded Contexts
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
Building Event Driven Architectures with Kafka and Cloud Events (Dan Rosanova...
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 

Viewers also liked

Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Chicago Hadoop Users Group
 
Apache AVRO (Boston HUG, Jan 19, 2010)
Apache AVRO (Boston HUG, Jan 19, 2010)Apache AVRO (Boston HUG, Jan 19, 2010)
Apache AVRO (Boston HUG, Jan 19, 2010)Cloudera, Inc.
 
排队排队--kafka
排队排队--kafka排队排队--kafka
排队排队--kafkachernbb
 
맛만 보자 Finagle이란
맛만 보자 Finagle이란 맛만 보자 Finagle이란
맛만 보자 Finagle이란 jbugkorea
 
Microservices in the Enterprise
Microservices in the Enterprise Microservices in the Enterprise
Microservices in the Enterprise Jesus Rodriguez
 
Protobuf & Code Generation + Go-Kit
Protobuf & Code Generation + Go-KitProtobuf & Code Generation + Go-Kit
Protobuf & Code Generation + Go-KitManfred Touron
 
OpenFest 2016 - Open Microservice Architecture
OpenFest 2016 - Open Microservice ArchitectureOpenFest 2016 - Open Microservice Architecture
OpenFest 2016 - Open Microservice ArchitectureNikolay Stoitsev
 
G rpc lection1
G rpc lection1G rpc lection1
G rpc lection1eleksdev
 
G rpc lection1_theory_bkp2
G rpc lection1_theory_bkp2G rpc lection1_theory_bkp2
G rpc lection1_theory_bkp2eleksdev
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure callSunita Sahu
 
HTTP2 and gRPC
HTTP2 and gRPCHTTP2 and gRPC
HTTP2 and gRPCGuo Jing
 
아파치 쓰리프트 (Apache Thrift)
아파치 쓰리프트 (Apache Thrift) 아파치 쓰리프트 (Apache Thrift)
아파치 쓰리프트 (Apache Thrift) Jin wook
 
Building High Performance APIs In Go Using gRPC And Protocol Buffers
Building High Performance APIs In Go Using gRPC And Protocol BuffersBuilding High Performance APIs In Go Using gRPC And Protocol Buffers
Building High Performance APIs In Go Using gRPC And Protocol BuffersShiju Varghese
 

Viewers also liked (20)

Avro intro
Avro introAvro intro
Avro intro
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 
3 apache-avro
3 apache-avro3 apache-avro
3 apache-avro
 
Avro introduction
Avro introductionAvro introduction
Avro introduction
 
Apache AVRO (Boston HUG, Jan 19, 2010)
Apache AVRO (Boston HUG, Jan 19, 2010)Apache AVRO (Boston HUG, Jan 19, 2010)
Apache AVRO (Boston HUG, Jan 19, 2010)
 
排队排队--kafka
排队排队--kafka排队排队--kafka
排队排队--kafka
 
맛만 보자 Finagle이란
맛만 보자 Finagle이란 맛만 보자 Finagle이란
맛만 보자 Finagle이란
 
java thrift
java thriftjava thrift
java thrift
 
Microservices in the Enterprise
Microservices in the Enterprise Microservices in the Enterprise
Microservices in the Enterprise
 
RPC protocols
RPC protocolsRPC protocols
RPC protocols
 
Protobuf & Code Generation + Go-Kit
Protobuf & Code Generation + Go-KitProtobuf & Code Generation + Go-Kit
Protobuf & Code Generation + Go-Kit
 
OpenFest 2016 - Open Microservice Architecture
OpenFest 2016 - Open Microservice ArchitectureOpenFest 2016 - Open Microservice Architecture
OpenFest 2016 - Open Microservice Architecture
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
G rpc lection1
G rpc lection1G rpc lection1
G rpc lection1
 
G rpc lection1_theory_bkp2
G rpc lection1_theory_bkp2G rpc lection1_theory_bkp2
G rpc lection1_theory_bkp2
 
RPC: Remote procedure call
RPC: Remote procedure callRPC: Remote procedure call
RPC: Remote procedure call
 
HTTP2 and gRPC
HTTP2 and gRPCHTTP2 and gRPC
HTTP2 and gRPC
 
Apache Avro and You
Apache Avro and YouApache Avro and You
Apache Avro and You
 
아파치 쓰리프트 (Apache Thrift)
아파치 쓰리프트 (Apache Thrift) 아파치 쓰리프트 (Apache Thrift)
아파치 쓰리프트 (Apache Thrift)
 
Building High Performance APIs In Go Using gRPC And Protocol Buffers
Building High Performance APIs In Go Using gRPC And Protocol BuffersBuilding High Performance APIs In Go Using gRPC And Protocol Buffers
Building High Performance APIs In Go Using gRPC And Protocol Buffers
 

Similar to Avro

Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?mikaelbarbero
 
Web Development Environments: Choose the best or go with the rest
Web Development Environments:  Choose the best or go with the restWeb Development Environments:  Choose the best or go with the rest
Web Development Environments: Choose the best or go with the restgeorge.james
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...IndicThreads
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftTalentica Software
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsGuido Schmutz
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?lichtkind
 
Document Databases & RavenDB
Document Databases & RavenDBDocument Databases & RavenDB
Document Databases & RavenDBBrian Ritchie
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14Sri Ambati
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]LivePerson
 
Introduction To Groovy 2005
Introduction To Groovy 2005Introduction To Groovy 2005
Introduction To Groovy 2005Tugdual Grall
 
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...Maarten Balliauw
 
Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Guillaume Laforge
 

Similar to Avro (20)

Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?Language Server Protocol - Why the Hype?
Language Server Protocol - Why the Hype?
 
Web Development Environments: Choose the best or go with the rest
Web Development Environments:  Choose the best or go with the restWeb Development Environments:  Choose the best or go with the rest
Web Development Environments: Choose the best or go with the rest
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thrift
 
Not only SQL
Not only SQL Not only SQL
Not only SQL
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 
Ruby On Rails
Ruby On RailsRuby On Rails
Ruby On Rails
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
Document Databases & RavenDB
Document Databases & RavenDBDocument Databases & RavenDB
Document Databases & RavenDB
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]
 
The Glory of Rest
The Glory of RestThe Glory of Rest
The Glory of Rest
 
Php
PhpPhp
Php
 
Php
PhpPhp
Php
 
Php
PhpPhp
Php
 
Introduction To Groovy 2005
Introduction To Groovy 2005Introduction To Groovy 2005
Introduction To Groovy 2005
 
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...
 
Webtechnologies
Webtechnologies Webtechnologies
Webtechnologies
 
Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007Groovy Update - JavaPolis 2007
Groovy Update - JavaPolis 2007
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Avro

  • 2. Avro  Etymology & History  Sexy Tractors  Project Drivers & Overview  Serialization  RPC  Hadoop Support
  • 3. Etymology  British aircraft manufacturer  1910-1963
  • 4. History  Doug Cutting – Cloudera, Hadoop project founder  2002 – Nutch  2004 – Google GFS, MapReduce whitepapers  2005 – NDFS & MR, Writable & SequenceFile  2006 – Hadoop split from Nutch, renamed NDFS to HDFS  2007 – Yahoo gets involved, HBase, Pig, ZooKeeper  2008 – TeraSort contest winner, Hive, Mahout, Cassandra  2009 – Oozie, Flume, Hue
  • 5. History  Underlying serialization system basically unchanged  Additional language support and data formats  Language, data format combinatorial explosion  C++ JSON to Java BSON  Python Smile to PHP CSV  Apr 2009 – Avro proposal  May 2010 – Top-level project
  • 6. Sexy Tractors  Data serialization tools, like tractors, aren’t sexy  They should be!  Dollar for dollar storage capacity has increased exponentially, doubling every 1.5-2 years  Throughput of magnetic storage and network has not maintained this pace  Distributed systems are the norm  Efficient data serialization techniques and tools are vital
  • 7. Project Drivers  Common data format for serialization and RPC  Dynamic  Expressive  Efficient  File format  Well defined  Standalone  Splittable & compressed
  • 8. Biased Comparison

                          CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
    Language independent  Yes   Yes        No             Yes           Yes
    Expressive            No    Yes        Yes            Yes           Yes
    Efficient             No    No         Yes            Yes           Yes
    Dynamic               Yes   Yes        No             No            Yes
    Standalone            ?     Yes        No             No            Yes
    Splittable            ?     ?          Yes            ?             Yes
  • 9. Project Overview  Specification based design  Dynamic implementations  File format  Schemas  Must support JSON implementation  IDL often supported  Evolvable  First class Hadoop support
  • 10. Specification Based Design  Schemas  Encoding  Sort order  Object container files  Codecs  Protocol  Protocol write format  Schema resolution
  • 11. Specification Based Design  Schemas  Primitive types  Null, boolean, int, long, float, double, bytes, string  Complex types  Records, enums, arrays, maps, unions and fixed  Named types  Records, enums, fixed  Name & namespace  Aliases  http://avro.apache.org/docs/current/spec.html#schemas
  • 12. Schema Example log-message.avpr

    {
      "namespace": "com.emoney",
      "name": "LogMessage",
      "type": "record",
      "fields": [
        {"name": "level", "type": "string", "comment": "this is ignored"},
        {"name": "message", "type": "string", "description": "this is the message"},
        {"name": "dateTime", "type": "long"},
        {"name": "exceptionMessage", "type": ["null", "string"]}
      ]
    }
  • 13. Specification Based Design  Encodings  JSON – for debugging  Binary  Sort order  Efficient sorting by system other than writer  Sorting binary-encoded data without deserialization
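The binary encoding listed on this slide writes int and long values as variable-length zig-zag integers, as described in the Avro specification. The following is a minimal sketch in plain Java, not the Avro library's own encoder; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

public class ZigZagVarint {
    // Encode a long the way the Avro spec describes: zig-zag first, so small
    // negative numbers stay small, then 7 bits per byte, low-order group first.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);               // zig-zag: 0,-1,1,-2 -> 0,1,2,3
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {                  // more than 7 bits remain
            out.write((int) ((z & 0x7F) | 0x80));    // set the continuation bit
            z >>>= 7;
        }
        out.write((int) z);                          // final byte, continuation bit clear
        return out.toByteArray();
    }

    // Reverse of encodeLong: reassemble the 7-bit groups, then undo the zig-zag.
    static long decodeLong(byte[] bytes) {
        long z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;                               // last byte of this value
            }
            shift += 7;
        }
        return (z >>> 1) ^ -(z & 1);
    }
}
```

With this sketch, encodeLong(64) yields the two bytes 0x80 0x01 and encodeLong(-64) yields the single byte 0x7f, matching the examples in the specification.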
  • 14. Specification Based Design  Object container files  Schema  Serialized data written to binary-encoded blocks  Blocks may be compressed  Synchronization markers  Codecs  Null  Deflate  Snappy (optional)  LZO (future)
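Per the Avro specification, each block in an object container file holds an object count, the byte size of the serialized objects, the objects themselves, and the file's 16-byte synchronization marker. The sketch below lays out one such block for the null codec only. It is an illustration under that assumption, with hypothetical class and method names, and it omits the file header that carries the schema.

```java
import java.io.ByteArrayOutputStream;

public class ContainerBlockSketch {
    // Avro longs are zig-zag varints; this helper exists only to keep the sketch self-contained.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Lay out one data block: count, size, serialized objects, sync marker.
    // With a real codec, serializedObjects would be compressed before its size is written.
    static byte[] writeBlock(long objectCount, byte[] serializedObjects, byte[] syncMarker) {
        if (syncMarker.length != 16) {
            throw new IllegalArgumentException("sync marker must be 16 bytes");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, objectCount);
        writeLong(out, serializedObjects.length);
        out.write(serializedObjects, 0, serializedObjects.length);
        out.write(syncMarker, 0, syncMarker.length);
        return out.toByteArray();
    }
}
```

Because the marker recurs after every block, a reader that starts at an arbitrary offset can scan forward to the next marker and resume at a block boundary, which is what makes the format splittable.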
  • 15. Specification Based Design  Protocol  Protocol name  Namespace  Types  Named types used in messages  Messages  Uniquely named message  Request  Response  Errors  Wire format  Transports  Framing  Handshake
  • 16. Protocol

    {
      "namespace": "com.acme",
      "protocol": "HelloWorld",
      "doc": "Protocol Greetings",
      "types": [
        {"name": "Greeting", "type": "record", "fields": [
          {"name": "message", "type": "string"}]},
        {"name": "Curse", "type": "error", "fields": [
          {"name": "message", "type": "string"}]}
      ],
      "messages": {
        "hello": {
          "doc": "Say hello.",
          "request": [{"name": "greeting", "type": "Greeting"}],
          "response": "Greeting",
          "errors": ["Curse"]
        }
      }
    }
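The framing item listed under the wire format can be sketched as follows. In the Avro specification, a message is sent as a series of buffers, each prefixed with a four-byte big-endian length, and terminated by a zero-length buffer. This is a plain-Java illustration with hypothetical names, not the library's transceiver code; the buffer size is an arbitrary parameter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class FramingSketch {
    // Split a message into length-prefixed buffers and append the zero-length terminator.
    static byte[] frame(byte[] message, int maxBufferSize) {
        if (maxBufferSize <= 0) {
            throw new IllegalArgumentException("maxBufferSize must be positive");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int offset = 0; offset < message.length; offset += maxBufferSize) {
            int length = Math.min(maxBufferSize, message.length - offset);
            // ByteBuffer is big-endian by default, which is the byte order framing requires.
            out.write(ByteBuffer.allocate(4).putInt(length).array(), 0, 4);
            out.write(message, offset, length);
        }
        out.write(new byte[4], 0, 4);                // zero-length buffer ends the message
        return out.toByteArray();
    }

    // Read buffers until the zero-length terminator and concatenate their contents.
    static byte[] unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            int length = in.getInt();
            if (length == 0) {
                break;
            }
            byte[] chunk = new byte[length];
            in.get(chunk);
            out.write(chunk, 0, length);
        }
        return out.toByteArray();
    }
}
```

Framing a five-byte message with a buffer size of 3 produces two buffers (3 bytes and 2 bytes) plus the terminator, 17 bytes in total.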
  • 17. Schema Resolution & Evolution  Writer's schema always provided to reader  Compare schema used by writer & schema expected by reader  Fields that match name & type are read  Fields written that don’t match are skipped  Expected fields not written can be identified  Error or provide default value  Same features as provided by numeric field ids  Keeps fields symbolic, no index IDs written in data  Allows for projections  Very efficient at skipping fields  Aliases  Allows projections from 2 different types using aliases  User transaction  Count, date  Batch  Count, date
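The resolution rules on this slide can be modelled in a few lines. The sketch below is a deliberately simplified illustration and not how the Avro library implements resolution: it matches fields by name only (type matching and aliases are left out), and it represents a written record and the reader's defaults as plain maps. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaResolutionSketch {
    // writtenRecord:  field name -> value, as decoded with the writer's schema
    // readerFields:   field names the reader's schema expects, in order
    // readerDefaults: defaults declared in the reader's schema (may be empty)
    static Map<String, Object> resolve(Map<String, Object> writtenRecord,
                                       List<String> readerFields,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String field : readerFields) {
            if (writtenRecord.containsKey(field)) {
                result.put(field, writtenRecord.get(field));     // matching field is read
            } else if (readerDefaults.containsKey(field)) {
                result.put(field, readerDefaults.get(field));    // not written: use default
            } else {
                throw new IllegalStateException(
                        "No value or default for reader field: " + field);
            }
        }
        // Written fields the reader did not ask for are never copied, i.e. skipped.
        return result;
    }
}
```

For a record written with the LogMessage fields level, message and dateTime, a reader that expects only message and exceptionMessage (with a default) gets a two-field projection; level and dateTime are skipped.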
  • 18. Implementations  Core – parse schemas, read & write binary data for a schema  Data file – read & write Avro data files  Codec – supported codecs  RPC/HTTP – make and receive calls over HTTP

    Implementation  Core  Data file  Codec            RPC/HTTP
    C               Yes   Yes        Deflate          Yes
    C++             Yes   Yes        ?                Yes
    C#              Yes   No         N/A              No
    Java            Yes   Yes        Deflate, Snappy  Yes
    Python          Yes   Yes        Deflate          Yes
    Ruby            Yes   Yes        Deflate          Yes
    PHP             Yes   Yes        ?                No
  • 19. API  Generic  Generic attribute/value data structure  Best suited for dynamic processing  Specific  Each record corresponds to a different kind of object in the programming language  RPC systems typically use this  Reflect  Schemas generated via reflection  Converting an existing codebase to use Avro
  • 20. API  Low-level  Schema  Encoders  DatumWriter  DatumReader  High-level  DataFileWriter  DataFileReader
  • 21. Java Example

    Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));
    OutputStream outputStream = new FileOutputStream("data.avro");
    DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));
    writer.setCodec(CodecFactory.deflateCodec(1));
    writer.create(schema, outputStream);
    writer.append(new Message());
    writer.close();
  • 22. Java Example

    DataFileReader<Message> reader = new DataFileReader<Message>(
        new File("data.avro"), new GenericDatumReader<Message>());
    for (Message next : reader) {
        System.out.println("next: " + next);
    }
  • 23. RPC  Server  SocketServer (non-standard)  SaslSocketServer  HttpServer  NettyServer  DatagramServer (non-standard)  Responder  Generic  Reflect  Specific  Client  Corresponding Transceiver  LocalTransceiver  Requestor
  • 24. RPC  Client  Corresponding Transceiver for each server  LocalTransceiver  Requestor
  • 25. RPC Server

    Protocol protocol = Protocol.parse(new File("protocol.avpr"));
    InetSocketAddress address = new InetSocketAddress("localhost", 33333);
    GenericResponder responder = new GenericResponder(protocol) {
        @Override
        public Object respond(Protocol.Message message, Object request) throws Exception {
            ...
        }
    };
    new SocketServer(responder, address).join();
  • 26. Hadoop Support  File writers and readers  Replacing RPC with Avro  In Flume already  Pig support is in  Splittable  Set block size when writing  Tether jobs  Connector framework for other languages  Hadoop Pipes
  • 27. Future  RPC  HBase, Cassandra, Hadoop core  Hive in progress  Tether jobs  Actual MapReduce implementations in other languages
  • 28. Avro  Dynamic  Expressive  Efficient  Specification based design  Language implementations are fairly solid  Serialization or RPC or both  First class Hadoop support  Currently 1.5.1  Sexy tractors