A Brief Introduction
-- Randy Abernethy, rx-m llc, 2014
unstructured data
• Databases provide their own
internal serialization
– A fast but completely closed system
– Not perfect for unstructured data
• Systems requiring the ability to
process arbitrary datasets generated
by a wide range of programs need
open solutions
– XML – gigantic, glacial to parse
– JSON – big, slow to parse
– >> Something else in the middle? <<
– Binary – small, fast and inflexible
• Size matters!
– 1 gigabyte or 6 may not matter
– 1 petabyte or 6 usually matters
$ ls -l order.*
-rw-r--r-- 1 tm tm 152 May 28 02:12 order.bin
-rw-r--r-- 1 tm tm 798 May 28 02:05 order.json
-rw-r--r-- 1 tm tm 958 May 28 02:04 order.xml
avro
• Apache Avro
– Apache Avro is a data serialization system
• What?
– Data Serialization – Apache Avro is first and foremost an excellent means
for reading and writing data structures in an open and dynamic way
– Compact – Apache Avro uses an efficient binary data format
– Fast – Apache Avro data can be parsed quickly with low overhead
– RPC Support – Apache Avro supplies RPC mechanisms using the Apache
Avro serialization system for platform and language independence
– Reach – Apache Avro supports a wide range of languages (some supported
outside of the project itself): C, C++, Java, Python, PHP, C#, Ruby and other
externally supported languages
– Flexibility – Apache Avro supports data evolution, allowing data structures
to change over time without breaking existing systems
– No IDL – Apache Avro uses schemas to describe data structures; schemas
are encoded with the data, eliminating the need for IDL code generation
and for repetitive field tags embedded with the data
– JSON – Apache Avro schemas are described using JSON
– Hadoop – Apache Avro is built into the Hadoop Framework
• Built for big data
– Unlike other systems, Apache Avro is
particularly focused on features making it
efficient for use with large data sets
• Schemas are always present
– The schema is required to parse the data
• Apache Avro leverages dynamic typing, making it possible for
dynamic programming languages to generate data types
from Apache Avro Schemas on the fly
• Completely generic data processing systems can be
developed with no preexisting knowledge of the data
formats to be processed
• Apache Avro data structures are based on field names not
field IDs
– This makes the selection and consistency of field names important to Apache Avro
readers and writers
different?
Schema: the structure of a data system described in a formal language
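Because fields are matched by name rather than by numeric ID, a reader can project data written with an older or newer schema onto its own field list. The following is a minimal sketch of that idea only, not the Avro library's resolution code; `resolve_record`, `READER_FIELDS` and `OLD_RECORD` are hypothetical names introduced for illustration.

```python
def resolve_record(writer_record, reader_fields):
    """Project a decoded record onto the reader's fields, matching by name."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]   # same name: take the value
        elif "default" in field:
            resolved[name] = field["default"]      # missing: use the reader default
        else:
            raise ValueError("no value or default for field %r" % name)
    return resolved                                # writer-only fields are dropped


# Hypothetical reader-side field list (the "fields" part of a record schema).
READER_FIELDS = [
    {"name": "id", "type": "long"},
    {"name": "customer", "type": "string"},
    {"name": "priority", "type": "int", "default": 0},
]

# A record decoded from data written before "priority" existed.
OLD_RECORD = {"id": 42, "customer": "Ginger", "legacy_code": "X1"}

print(resolve_record(OLD_RECORD, READER_FIELDS))
```

Renaming a field breaks this matching, which is why consistent field names matter to both readers and writers.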
• Primitive types
– null: no value
– boolean: a binary value
– int: 32-bit signed integer
– long: 64-bit signed integer
– float: single precision (32-bit) IEEE 754
floating-point number
– double: double precision (64-bit) IEEE 754
floating-point number
– bytes: sequence of 8-bit unsigned bytes
– string: Unicode character sequence
• Complex Types
– records – named sets of fields
• Fields are JSON objects with a name, type and optionally a default value
– enums – named set of strings, instances may hold any one of the string values
– fixed – a named fixed size set of bytes
– arrays – a collection of a single type
– maps – a string keyed set of key/value pairs
– unions – an array of types, instances may assume any of the types
schematypes
Records, enums and fixed types have fullnames.
Fullnames contain two parts:
• Namespace (e.g. com.example)
• Name (e.g. Order)
Namespaces are dot-separated sequences and are case sensitive.
Elements referenced by undotted name alone inherit the most closely enclosing namespace.
Fullnames must be unique and defined before use.
Avro Java strings are: org.apache.avro.util.Utf8
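The type list and naming rules above can be made concrete with a small schema. The sketch below builds a hypothetical `Order` record as a Python dict (it serializes to the JSON schema text with `json.dumps`) and walks it to collect fullnames, applying the "inherit the enclosing namespace" rule. The schema contents and the `fullnames` helper are assumptions for illustration, not part of the Avro library.

```python
import json

# Hypothetical schema using one of each complex type.
ORDER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "namespace": "com.example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "customer", "type": "string"},
        {"name": "status", "type": {"type": "enum", "name": "Status",
                                    "symbols": ["NEW", "SHIPPED", "CANCELLED"]}},
        {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}},
        {"name": "items", "type": {"type": "array", "items": "string"}},
        {"name": "attributes", "type": {"type": "map", "values": "string"}},
        {"name": "note", "type": ["null", "string"], "default": None},
    ],
}

NAMED_TYPES = {"record", "enum", "fixed"}


def fullnames(schema, enclosing_namespace=None, found=None):
    """Collect the fullnames of all named types, in definition order."""
    found = [] if found is None else found
    if isinstance(schema, list):                      # union: visit each branch
        for branch in schema:
            fullnames(branch, enclosing_namespace, found)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if isinstance(kind, str) and kind in NAMED_TYPES:
            name = schema["name"]
            if "." in name:                           # dotted name is already a fullname
                fullname, namespace = name, name.rpartition(".")[0]
            else:                                     # inherit the enclosing namespace
                namespace = schema.get("namespace", enclosing_namespace)
                fullname = namespace + "." + name if namespace else name
            if fullname in found:
                raise ValueError("duplicate fullname %s" % fullname)
            found.append(fullname)
            for field in schema.get("fields", []):
                fullnames(field["type"], namespace, found)
        elif kind == "array":
            fullnames(schema["items"], enclosing_namespace, found)
        elif kind == "map":
            fullnames(schema["values"], enclosing_namespace, found)
    return found


print(json.dumps(ORDER_SCHEMA, indent=2))   # the schema as JSON text
print(fullnames(ORDER_SCHEMA))
```

Here `Status` and `MD5` declare no namespace of their own, so they pick up `com.example` from the enclosing record.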
• Apache Avro supports Binary and JSON encoding schemes
encoding
• null: 0 bytes
• boolean: 1 byte
• int & long: varint/zigzag
• float: 4 bytes
• double: 8 bytes
• bytes: a long encoded length followed by the bytes
• string: a long encoded length (in bytes) followed by UTF-8 characters
• records: fields encoded in the order declared
• enums: int encoded as the 0 based position of the value in the enum array
• fixed: a fixed size set of bytes
• arrays: a set of blocks
– A block is typically a long encoded count followed by that many items
– Small arrays are represented with a single block
– Large data sets can be broken up into multiple blocks such that each block can be buffered in memory
– The final block has a count of 0 and no items
– A negative count flags abs(count) many items but is immediately followed by a long encoded size in bytes
• This allows the reader to skip ahead quickly without decoding individual items (which may be of variable
length)
• maps: like arrays but with key/value pairs for items
• unions: a long encoded 0 based index identifying the type to use followed by the value
Binary
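The binary rules above are compact enough to sketch with the standard library alone. The following is an illustrative encoder for long, string, array and union values, assuming 64-bit inputs; the function names are hypothetical and this is not the Avro library's implementation.

```python
def encode_long(n):
    """Zigzag-map a signed 64-bit value, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)          # 0, -1, 1, -2 ... become 0, 1, 2, 3 ...
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos=0):
    """Return (value, next_position) for a varint/zigzag long starting at pos."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_string(text):
    """A long byte length followed by the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_array(items, encode_item):
    """A single block (count, then items) followed by the terminating 0 count."""
    out = b""
    if items:
        out += encode_long(len(items)) + b"".join(encode_item(i) for i in items)
    return out + encode_long(0)


def encode_union(index, encoded_value):
    """The 0-based branch index followed by the value encoded as that branch."""
    return encode_long(index) + encoded_value


print(encode_long(64).hex())                     # two bytes: 80 01
print(encode_string("foo").hex())                # length 3 is zigzagged to 06
print(encode_array([3, 27], encode_long).hex())  # count, items, then 00
print(encode_union(1, encode_string("a")).hex()) # ["null", "string"] holding "a"
```

A record is simply the concatenation of its encoded fields in declared order, so no field names or tags appear in the output.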
• Apache Avro schemas must always be present
• The Apache Avro system defines a container file used to house a
schema and its data
• Files consist of
– A header which contains metadata and a sync marker
– One or more data blocks containing data as defined in the schema
• Metadata
– The header metadata can be any data useful to the file’s author
– Apache Avro assigns two metadata values:
• avro.schema – containing the schema in JSON format
• avro.codec – defining the name of the compression codec used on the blocks, if any
– Mandatory codecs are null and deflate; null (no compression) is assumed if the metadata field is absent
– Snappy is another common codec
• Data Blocks
– Data blocks contain a long count of objects in the block, a long size in
bytes, the serialized objects (possibly compressed) and the 16 byte sync
mark
– The block prefix and sync mark allow data to be efficiently skipped during
HDFS mapred splits and other similar processing
• Corrupt blocks can also be detected
containers
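The container layout described above can be sketched in a few dozen lines. This is a simplified illustration that assumes the null codec, a single data block, and a four-byte magic prefix (`Obj` followed by the byte 1) at the start of the header; `write_container` and `read_container` are hypothetical helpers, not the Avro library API.

```python
import io
import os

MAGIC = b"Obj\x01"   # assumed file prefix that precedes the header metadata


def encode_long(n):
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(stream):
    shift = 0
    z = 0
    while True:
        b = stream.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            return (z >> 1) ^ -(z & 1)
        shift += 7


def encode_bytes(data):
    return encode_long(len(data)) + data


def write_container(schema_json, encoded_objects, sync=None):
    """Header (magic, metadata map, sync marker) followed by one data block."""
    sync = sync or os.urandom(16)
    meta = {"avro.schema": schema_json.encode("utf-8"), "avro.codec": b"null"}
    out = io.BytesIO()
    out.write(MAGIC)
    out.write(encode_long(len(meta)))            # metadata is a map: one block...
    for key, value in meta.items():
        out.write(encode_bytes(key.encode("utf-8")) + encode_bytes(value))
    out.write(encode_long(0))                    # ...closed by a zero count
    out.write(sync)
    payload = b"".join(encoded_objects)
    out.write(encode_long(len(encoded_objects))) # object count
    out.write(encode_long(len(payload)))         # size in bytes
    out.write(payload + sync)                    # objects, then the sync mark
    return out.getvalue()


def read_container(data):
    """Return (metadata, [(count, payload), ...]); raise on a bad sync mark."""
    stream = io.BytesIO(data)
    if stream.read(4) != MAGIC:
        raise ValueError("not a container file")
    meta = {}
    while True:
        count = decode_long(stream)
        if count == 0:
            break
        if count < 0:                            # negative count: a byte size follows
            decode_long(stream)
            count = -count
        for _ in range(count):
            key = stream.read(decode_long(stream)).decode("utf-8")
            meta[key] = stream.read(decode_long(stream))
    sync = stream.read(16)
    blocks = []
    while stream.tell() < len(data):
        count = decode_long(stream)
        size = decode_long(stream)
        payload = stream.read(size)              # could be skipped with seek() instead
        if stream.read(16) != sync:
            raise ValueError("corrupt block: sync marker mismatch")
        blocks.append((count, payload))
    return meta, blocks


objects = [encode_bytes(b"hello"), encode_bytes(b"avro")]   # two "string" values
meta, blocks = read_container(write_container('"string"', objects))
print(meta["avro.schema"], blocks)
```

Because each block carries its byte size, a reader can jump from block to block without decoding the objects, and a mismatched sync mark signals corruption.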
• Apache Avro defines a sort order for data
– This can be a key optimization in large data processing
environments
• Data items with identical schemas can be
compared
• Record fields can optionally specify one of three
sort orders
– ascending – standard sort order
– descending – reverse sort order
– ignore – values are to be ignored when sorting the records
sortorder
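One way to picture the three sort orders is a comparator that walks the record's fields in declared order. The sketch below works on already-decoded Python dicts and uses hypothetical field names; it illustrates the ordering rule only and is not how the Avro library compares binary data.

```python
from functools import cmp_to_key


def compare_records(a, b, fields):
    """Compare two records field by field, honoring each field's sort order."""
    for field in fields:
        order = field.get("order", "ascending")   # ascending when unspecified
        if order == "ignore":
            continue                               # field plays no part in sorting
        x, y = a[field["name"]], b[field["name"]]
        result = (x > y) - (x < y)
        if order == "descending":
            result = -result
        if result:
            return result                          # first differing field decides
    return 0


# Hypothetical record fields: region ascending, total descending, note ignored.
FIELDS = [
    {"name": "region", "type": "string"},
    {"name": "total", "type": "long", "order": "descending"},
    {"name": "note", "type": "string", "order": "ignore"},
]

RECORDS = [
    {"region": "west", "total": 5, "note": "m"},
    {"region": "east", "total": 10, "note": "z"},
    {"region": "east", "total": 25, "note": "a"},
]

ordered = sorted(RECORDS, key=cmp_to_key(lambda a, b: compare_records(a, b, FIELDS)))
for record in ordered:
    print(record["region"], record["total"])
```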
• Apache Avro defines RPC interfaces using protocol
schemas
• Protocols are defined in JSON and contain
– protocol – the name of the protocol (i.e. service)
– namespace – the containing namespace (optional)
– types – the data types defined within the protocol
– messages – the RPC message sequences supported
• Messages define an RPC method
– request – the list of parameters passed
– response – the normal return type
– errors – the possible error return types (optional)
– one-way – optional Boolean for request only messages
protocols
aschema
{
  "namespace": "com.example",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",
  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],
  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting" }],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
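Since a protocol is plain JSON, it can be inspected with nothing more than a JSON parser. The sketch below loads the HelloWorld protocol shown above and prints each message as a method-like signature; `describe_messages` is a hypothetical helper, and it assumes parameter types are given as simple names, as they are in this example.

```python
import json

PROTOCOL_TEXT = """
{
  "namespace": "com.example",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",
  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],
  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting"}],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
"""


def describe_messages(protocol):
    """Render each RPC message as 'service.method(params) -> response'."""
    namespace = protocol.get("namespace", "")
    service = (namespace + "." if namespace else "") + protocol["protocol"]
    lines = []
    for name, message in sorted(protocol["messages"].items()):
        params = ", ".join("%s: %s" % (p["name"], p["type"])
                           for p in message["request"])
        line = "%s.%s(%s) -> %s" % (service, name, params,
                                    message.get("response", "null"))
        if message.get("errors"):
            line += " raises " + ", ".join(message["errors"])
        if message.get("one-way", False):
            line += " [one-way]"
        lines.append(line)
    return lines


for line in describe_messages(json.loads(PROTOCOL_TEXT)):
    print(line)
```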
• Schemas must always be present
• Therefore when using Protocols:
– The client must send the request schema to the server
– The server must send the response and error schema to the client
• Stateless transports require a schema handshake before each request
• Stateful transports can maintain a schema cache, eliminating the schema exchange on successive calls
• Handshakes use a hash of the schema to avoid sending schemas which are already consistent on both sides
handshakes
{
  "type": "record",
  "name": "HandshakeRequest", "namespace":"org.apache.avro.ipc",
  "fields": [
    {"name": "clientHash",
     "type": {"type": "fixed", "name": "MD5", "size": 16}},
    {"name": "clientProtocol", "type": ["null", "string"]},
    {"name": "serverHash", "type": "MD5"},
    {"name": "meta", "type": ["null", {"type": "map", "values": "bytes"}]}
  ]
}
{
  "type": "record",
  "name": "HandshakeResponse", "namespace": "org.apache.avro.ipc",
  "fields": [
    {"name": "match",
     "type": {"type": "enum", "name": "HandshakeMatch",
              "symbols": ["BOTH", "CLIENT", "NONE"]}},
    {"name": "serverProtocol",
     "type": ["null", "string"]},
    {"name": "serverHash",
     "type": ["null", {"type": "fixed", "name": "MD5", "size": 16}]},
    {"name": "meta",
     "type": ["null", {"type": "map", "values": "bytes"}]}
  ]
}
Apache Avro Handshake Records
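The handshake records above suggest a simple server-side decision: does the server already know the client's protocol, and did the client guess the server's hash correctly? The sketch below models that decision with plain dicts shaped like the two records. `HandshakeServer` and `protocol_hash` are hypothetical names, and hashing the protocol text as-is with MD5 is an assumption made for illustration.

```python
import hashlib


def protocol_hash(protocol_text):
    """16-byte MD5 digest of the protocol's JSON text (assumed hashing input)."""
    return hashlib.md5(protocol_text.encode("utf-8")).digest()


class HandshakeServer:
    def __init__(self, server_protocol_text):
        self.protocol_text = server_protocol_text
        self.hash = protocol_hash(server_protocol_text)
        self.known_clients = {}            # clientHash -> client protocol text

    def respond(self, request):
        """Build a HandshakeResponse-shaped dict for a HandshakeRequest-shaped dict."""
        client_hash = request["clientHash"]
        if request.get("clientProtocol") is not None:
            self.known_clients[client_hash] = request["clientProtocol"]
        server_hash_ok = request["serverHash"] == self.hash
        if client_hash not in self.known_clients:
            match = "NONE"                 # client must resend with its protocol text
        elif server_hash_ok:
            match = "BOTH"                 # nothing further needs to be exchanged
        else:
            match = "CLIENT"               # server knows the client, client is stale
        return {
            "match": match,
            "serverProtocol": None if server_hash_ok else self.protocol_text,
            "serverHash": None if server_hash_ok else self.hash,
            "meta": None,
        }


PROTOCOL = '{"protocol": "HelloWorld", "namespace": "com.example", "messages": {}}'
server = HandshakeServer(PROTOCOL)

first = {"clientHash": protocol_hash(PROTOCOL), "clientProtocol": None,
         "serverHash": protocol_hash(PROTOCOL), "meta": None}
retry = dict(first, clientProtocol=PROTOCOL)      # resend including the protocol text
stale = dict(first, serverHash=b"\x00" * 16)      # wrong guess at the server's hash

for request in (first, retry, first, stale):
    print(server.respond(request)["match"])
```

After the retry the server has cached the client's protocol, so later requests carrying only the hashes succeed without resending any schema text.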
protocol
session
$ sudo apt-get install libsnappy-dev
…
$ sudo apt-get install python-pip python-dev build-essential
…
$ sudo pip install python-snappy
…
$ sudo pip install avro
…
$ cat hello.avpr
{
  "namespace": "com.example",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",
  "types": [
    {"name": "Message", "type": "record",
     "fields": [
       {"name": "to", "type": "string"},
       {"name": "from", "type": "string"},
       {"name": "body", "type": "string"}
     ]
    }
  ],
  "messages": {
    "send": {
      "request": [{"name": "message", "type": "Message"}],
      "response": "string"
    }
  }
}
$ python server.py
127.0.0.1 - - [28/May/2014 06:27:31] "POST / HTTP/1.1" 200 -
https://github.com/phunt/avro-rpc-quickstart
$ python client.py Thurston_Howell Ginger "Hello Avro"
Result: Sent message to Thurston_Howell from Ginger with body Hello Avro
helloserver
helloclient
• The Avro serialization system was originally developed by Doug Cutting
for use with Hadoop. It became an Apache Software Foundation project in
2009.
• 0.0.0 Apache Inception 2009-04-01
• 1.0.0 released 2009-07-15
• 1.1.0 released 2009-09-15
• 1.2.0 released 2009-10-15
• 1.3.0 released 2010-02-26
• 1.4.0 released 2010-09-08
• 1.5.0 released 2011-03-11
• 1.6.0 released 2011-11-02
• 1.7.0 released 2012-06-11
• 1.7.6 released 2014-01-22
versions
Open Source
Community Developed
Apache License Version 2.0
avro
resources
• Web
– avro.apache.org
– github.com/apache/avro
• Mail
– Users: user@avro.apache.org
– Developers: dev@avro.apache.org
• Chat
– #avro
• Book
– White (2012), Hadoop: The Definitive Guide,
O’Reilly Media Inc. [http://www.oreilly.com]
– Features 20 pages of Apache Avro coverage
Randy Abernethy
ra@apache.org

Avro intro

  • 1. A Brief Introduction -- Randy Abernethy, rx-m llc, 2014
  • 2. unstructured data • Databases provide their own internal serialization – A fast but completely closed system – Not perfect for unstructured data • Systems requiring the ability to process arbitrary datasets generated by a wide range of programs need open solutions – XML – gigantic, glacial to parse – JSON – big, slow to parse – >> Something else in the middle? << – Binary – small, fast and inflexible • Size matters! – 1 Gigabyte or 6, maybe doesn’t matter – 1 Petabytes or 6 usually matters $ ls -l order.* -rw-r--r-- 1 tm tm 152 May 28 02:12 order.bin -rw-r--r-- 1 tm tm 798 May 28 02:05 order.json -rw-r--r-- 1 tm tm 958 May 28 02:04 order.xml
  • 3. avro • Apache Avro – Apache Avro is a data serialization system • What? – Data Serialization – Apache Avro is first and foremost an excellent means for reading and writing data structures in an open and dynamic way – Compact – Apache Avro uses an efficient binary data format – Fast – Apache Avro data can be parsed quickly with low overhead – RPC Support – Apache Avro supplies RPC mechanisms using the Apache Avro serialization system for platform and language independence – Reach – Apache Avro supports a wide range of languages (some supported outside of the project itself): C, C++, Java, Python, PHP, C#, Ruby and other externally supported languages – Flexibility – Apache Avro supports data evolution, allowing data structures to change over time without breaking existing systems – No IDL – Apache Avro uses schemas to describe data structures, schemas are encoded with the data eliminating the need for IDL code generation and repetitive field tags embedded with the data – JSON – Apache Avro schemas are described using JSON – Hadoop – Apache Avro is built into the Hadoop Framework
  • 4. • Built for big data – Unlike other systems, Apache Avro is particularly focused on features making it efficient for use with large data sets • Schemas are always present – The schema is required to parse the data • Apache Avro leverages dynamic typing, making it possible for dynamic programming languages to generate data types from Apache Avro Schemas on the fly • Completely generic data processing systems can be developed with no preexisting knowledge of the data formats to be processed • Apache Avro data structures are based on field names not field IDs – This makes the selection and consistency of field names important to Apache Avro readers and writers different? Schema: the structure of a data system described in a formal language
  • 5. schema types
    • Primitive types
      – null: no value
      – boolean: a binary value
      – int: 32-bit signed integer
      – long: 64-bit signed integer
      – float: single precision (32-bit) IEEE 754 floating-point number
      – double: double precision (64-bit) IEEE 754 floating-point number
      – bytes: sequence of 8-bit unsigned bytes
      – string: unicode character sequence
    • Complex Types
      – records – named sets of fields
        • Fields are JSON objects with a name, type and optionally a default value
      – enums – named set of strings, instances may hold any one of the string values
      – fixed – a named fixed size set of bytes
      – arrays – a collection of a single type
      – maps – a string keyed set of key/value pairs
      – unions – an array of types, instances may assume any of the types
    Records, enums and fixed types have fullnames
    Fullnames contain two parts
    • Namespace (e.g. com.example)
    • Name (e.g. Order)
    Namespaces are dot separated sequences and case sensitive. Elements referenced by undotted name alone inherit the most closely enclosing namespace. Fullnames must be unique and defined before use.
    Avro Java strings are: org.apache.avro.util.Utf8
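The namespace rules above can be sketched in a few lines of Python. This is only an illustration: `fullname` is a hypothetical helper, not a function of the avro package, and it assumes the name and namespace strings are already valid.

```python
def fullname(name, namespace=None, enclosing_namespace=None):
    """Resolve a record, enum or fixed name to its fullname (illustrative helper)."""
    # A dotted name is already a fullname; any namespace given is ignored.
    if "." in name:
        return name
    # An explicit namespace wins; otherwise inherit the closest enclosing one.
    ns = namespace if namespace is not None else enclosing_namespace
    return f"{ns}.{name}" if ns else name


# Usage: "Order" declared with its own namespace, "Item" nested inside it.
order = fullname("Order", namespace="com.example")
item = fullname("Item", enclosing_namespace="com.example")
```

Both calls resolve into the `com.example` namespace, which is why a nested type can be referenced by its short name inside the record that declares it.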
  • 6. encoding
    • Apache Avro supports Binary and JSON encoding schemes
    Binary
    • null: 0 bytes
    • boolean: 1 byte
    • int & long: varint/zigzag
    • float: 4 bytes
    • double: 8 bytes
    • bytes: a long encoded length followed by the bytes
    • string: a long encoded length (in bytes) followed by UTF-8 characters
    • records: fields encoded in the order declared
    • enums: int encoded as the 0 based position of the value in the enum array
    • fixed: a fixed size set of bytes
    • arrays: a set of blocks
      – A block is typically a long encoded count followed by that many items
      – Small arrays are represented with a single block
      – Large data sets can be broken up into multiple blocks such that each block can be buffered in memory
      – The final block has a count of 0 and no items
      – A negative count flags abs(count) many items but is immediately followed by a long encoded size in bytes
        • This allows the reader to skip ahead quickly without decoding individual items (which may be of variable length)
    • maps: like arrays but with key/value pairs for items
    • unions: a long encoded 0 based index identifying the type to use followed by the value
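The varint/zigzag rule for int and long, and the length-prefixed and block-based encodings built on it, can be sketched in plain Python. This is a minimal sketch using only the standard library; the function names are illustrative and are not the avro package's API.

```python
def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag varint."""
    # Zigzag: small magnitudes, positive or negative, become small unsigned values.
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    # Varint: 7 bits per byte, high bit set on every byte except the last.
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data: bytes) -> int:
    """Decode one zigzag varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def encode_string(s: str) -> bytes:
    """A string is a long byte length followed by the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


def encode_long_array(items) -> bytes:
    """A small array of longs: one block (count, items), then a 0 count terminator."""
    if not items:
        return encode_long(0)
    body = b"".join(encode_long(i) for i in items)
    return encode_long(len(items)) + body + encode_long(0)
```

With this scheme the values -64 to 63 each fit in a single byte, which is one reason Avro data stays compact when fields hold small numbers.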
  • 7. containers
    • Apache Avro schemas must always be present
    • The Apache Avro system defines a container file used to house a schema and its data
    • Files consist of
      – A header which contains metadata and a sync marker
      – One or more data blocks containing data as defined in the schema
    • Metadata
      – The header metadata can be any data useful to the file’s author
      – Apache Avro assigns two metadata values:
        • avro.schema – containing the schema in JSON format
        • avro.codec – defining the name of the compression codec used on the blocks if any
      – Mandatory codecs are null and deflate, null (no compression) is assumed if the metadata field is absent
      – Snappy is another common codec
    • Data Blocks
      – Data blocks contain a long count of objects in the block, a long size in bytes, the serialized objects (possibly compressed) and the 16 byte sync mark
      – The block prefix and sync mark allow data to be efficiently skipped during HDFS mapred splits and other similar processing
        • Corrupt blocks can also be detected
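The container layout above can be sketched with the standard library alone. The sketch below assumes the null codec, a fixed sync marker chosen by the caller (real writers generate a random one), and record payloads that are already Avro-encoded; `write_container` and `scan_blocks` are illustrative names, not functions of the avro package.

```python
import io
import json

MAGIC = b"Obj\x01"  # four-byte marker that opens an Avro container file


def encode_long(n: int) -> bytes:
    """Zigzag varint encoding used for every count and size in the file."""
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(stream) -> int:
    """Read one zigzag varint from a binary stream."""
    result, shift = 0, 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7


def encode_bytes(raw: bytes) -> bytes:
    return encode_long(len(raw)) + raw


def write_container(schema, blocks, sync: bytes) -> bytes:
    """Build header (magic, metadata map, sync) followed by the data blocks."""
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"), "avro.codec": b"null"}
    out = bytearray(MAGIC)
    out += encode_long(len(meta))            # metadata is an Avro map of bytes
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0) + sync             # map terminator, then sync marker
    for count, payload in blocks:
        out += encode_long(count) + encode_long(len(payload)) + payload + sync
    return bytes(out)


def scan_blocks(data: bytes):
    """Return (metadata, object count per block) without decoding any object."""
    stream = io.BytesIO(data)
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        pairs = read_long(stream)
        if pairs == 0:
            break
        if pairs < 0:                        # negative count: a byte size follows
            pairs = -pairs
            read_long(stream)
        for _ in range(pairs):
            key = stream.read(read_long(stream)).decode("utf-8")
            meta[key] = stream.read(read_long(stream))
    sync = stream.read(16)
    counts = []
    while stream.tell() < len(data):
        count, size = read_long(stream), read_long(stream)
        stream.seek(size, io.SEEK_CUR)       # skip the serialized objects
        if stream.read(16) != sync:
            raise ValueError("corrupt block: sync marker mismatch")
        counts.append(count)
    return meta, counts
```

Because `scan_blocks` uses only the count, the size and the sync marker, it walks the file without touching the serialized objects, which is the property that makes splitting and skipping cheap.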
  • 8. sort order
    • Apache Avro defines a sort order for data
      – This can be a key optimization in large data processing environments
    • Data items with identical schemas can be compared
    • Record fields can optionally specify one of three sort orders
      – ascending – standard sort order
      – descending – reverse sort order
      – ignore – values are to be ignored when sorting the records
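The three field sort orders can be sketched as a record comparator in Python. This is a simplified illustration, assuming records are plain dicts whose field values are ints or strings that Python can compare directly; `compare_records` and `ORDER_SCHEMA` are hypothetical names, and the avro package's own comparison logic is not shown here.

```python
from functools import cmp_to_key

# Hypothetical schema used only for this illustration.
ORDER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "region", "type": "string"},                    # ascending by default
        {"name": "total", "type": "long", "order": "descending"},
        {"name": "note", "type": "string", "order": "ignore"},
    ],
}


def compare_records(schema, a, b):
    """Compare two records field by field, in the order the fields are declared."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue                      # this field never affects the result
        x, y = a[field["name"]], b[field["name"]]
        if x == y:
            continue                      # tie: fall through to the next field
        result = -1 if x < y else 1
        return -result if order == "descending" else result
    return 0


def sort_records(schema, records):
    return sorted(records, key=cmp_to_key(lambda a, b: compare_records(schema, a, b)))
```

Calling `sort_records(ORDER_SCHEMA, records)` orders records by region first and, within a region, by total from largest to smallest; the note field is never consulted.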
  • 9. protocols
    • Apache Avro defines RPC interfaces using protocol schemas
    • Protocols are defined in JSON and contain
      – protocol – the name of the protocol (i.e. service)
      – namespace – the containing namespace (optional)
      – types – the data types defined within the protocol
      – messages – the RPC message sequences supported
    • Messages define an RPC method
      – request – the list of parameters passed
      – response – the normal return type
      – errors – the possible error return types (optional)
      – one-way – optional Boolean for request only messages
  • 10. a schema
    {
      "namespace": "com.example",
      "protocol": "HelloWorld",
      "doc": "Protocol Greetings",
      "types": [
        {"name": "Greeting", "type": "record", "fields": [
          {"name": "message", "type": "string"}]},
        {"name": "Curse", "type": "error", "fields": [
          {"name": "message", "type": "string"}]}
      ],
      "messages": {
        "hello": {
          "doc": "Say hello.",
          "request": [{"name": "greeting", "type": "Greeting" }],
          "response": "Greeting",
          "errors": ["Curse"]
        }
      }
    }
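Since a protocol is ordinary JSON, it can be inspected with nothing but the standard library. The sketch below copies the HelloWorld protocol from the slide and summarizes its messages; `describe_messages` is a hypothetical helper, and no validation of the kind the avro package performs is attempted.

```python
import json

# The HelloWorld protocol from the slide, as JSON text.
HELLO_PROTOCOL = """
{
  "namespace": "com.example",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",
  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],
  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting"}],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
"""


def describe_messages(protocol_text):
    """Return the protocol's fullname and a summary of each RPC message."""
    protocol = json.loads(protocol_text)
    namespace = protocol.get("namespace")          # the namespace is optional
    name = protocol["protocol"]
    full = f"{namespace}.{name}" if namespace else name
    summary = {}
    for message_name, message in protocol["messages"].items():
        summary[message_name] = {
            "params": [param["name"] for param in message["request"]],
            "response": message["response"],
            "errors": message.get("errors", []),   # optional
            "one_way": message.get("one-way", False),
        }
    return full, summary
```

For the protocol above, the summary contains a single `hello` message that takes one `greeting` parameter, returns a `Greeting` and may raise `Curse`.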
  • 11. handshakes
    • Schemas must always be present
    • Therefore when using Protocols:
      – The client must send the request schema to the server
      – The server must send the response and error schema to the client
    • Stateless transports require a schema handshake before each request
    • Stateful transports can maintain a schema cache eliminating the schema exchange on successive calls
    • Handshakes use a hash of the schema to avoid sending schemas which are already consistent on both sides
    Apache Avro Handshake Records
    {
      "type": "record",
      "name": "HandshakeRequest",
      "namespace": "org.apache.avro.ipc",
      "fields": [
        {"name": "clientHash", "type": {"type": "fixed", "name": "MD5", "size": 16}},
        {"name": "clientProtocol", "type": ["null", "string"]},
        {"name": "serverHash", "type": "MD5"},
        {"name": "meta", "type": ["null", {"type": "map", "values": "bytes"}]}
      ]
    }
    {
      "type": "record",
      "name": "HandshakeResponse",
      "namespace": "org.apache.avro.ipc",
      "fields": [
        {"name": "match", "type": {"type": "enum", "name": "HandshakeMatch", "symbols": ["BOTH", "CLIENT", "NONE"]}},
        {"name": "serverProtocol", "type": ["null", "string"]},
        {"name": "serverHash", "type": ["null", {"type": "fixed", "name": "MD5", "size": 16}]},
        {"name": "meta", "type": ["null", {"type": "map", "values": "bytes"}]}
      ]
    }
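The hash-based exchange can be sketched as a toy server-side decision in Python. This is a simplified model under stated assumptions: requests and responses are plain dicts shaped like the handshake records, the 16-byte hash is an MD5 digest of the protocol text, and `HandshakeServer` is a hypothetical class, not the avro package's IPC implementation.

```python
import hashlib


def md5_of(protocol_text: str) -> bytes:
    """16-byte MD5 digest of a protocol's JSON text."""
    return hashlib.md5(protocol_text.encode("utf-8")).digest()


class HandshakeServer:
    """Toy model: remembers client protocols by hash and answers handshakes."""

    def __init__(self, protocol_text: str):
        self.protocol = protocol_text
        self.hash = md5_of(protocol_text)
        self.known_clients = {}          # clientHash -> client protocol text

    def handshake(self, request: dict) -> dict:
        # A client that includes its protocol text is cached for later calls.
        if request["clientProtocol"] is not None:
            self.known_clients[request["clientHash"]] = request["clientProtocol"]
        client_known = request["clientHash"] in self.known_clients
        server_hash_ok = request["serverHash"] == self.hash
        if not client_known:
            match = "NONE"               # client must resend with its protocol text
        elif server_hash_ok:
            match = "BOTH"               # nothing needs to be exchanged
        else:
            match = "CLIENT"             # client is known, but its view of us is stale
        return {
            "match": match,
            # The server's protocol is sent only when the client's hash of it was wrong.
            "serverProtocol": None if server_hash_ok else self.protocol,
            "serverHash": None if server_hash_ok else self.hash,
            "meta": None,
        }
```

In this model a first call from an unknown client yields NONE, the retry carrying `clientProtocol` yields BOTH, and later calls with the same hashes send no schema text at all, which is the saving a stateful transport's schema cache provides.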
  • 12. protocol session
    $ sudo apt-get install libsnappy-dev
    …
    $ sudo apt-get install python-pip python-dev build-essential
    …
    $ sudo pip install python-snappy
    …
    $ sudo pip install avro
    …
    $ cat hello.avpr
    {
      "namespace": "com.example",
      "protocol": "HelloWorld",
      "doc": "Protocol Greetings",
      "types": [
        {"name": "Message", "type": "record", "fields": [
          {"name": "to", "type": "string"},
          {"name": "from", "type": "string"},
          {"name": "body", "type": "string"}
        ]}
      ],
      "messages": {
        "send": {
          "request": [{"name": "message", "type": "Message"}],
          "response": "string"
        }
      }
    }
    $ python server.py
    127.0.0.1 - - [28/May/2014 06:27:31] "POST / HTTP/1.1" 200 -
    $ python client.py Thurston_Howell Ginger "Hello Avro"
    Result: Sent message to Thurston_Howell from Ginger with body Hello Avro
    https://github.com/phunt/avro-rpc-quickstart
  • 15. versions
    • The Avro serialization system was originally developed by Doug Cutting for use with Hadoop; it became an Apache Software Foundation project in 2009.
    • 0.0.0 Apache Inception 2009-04-01
    • 1.0.0 released 2009-07-15
    • 1.1.0 released 2009-09-15
    • 1.2.0 released 2009-10-15
    • 1.3.0 released 2010-02-26
    • 1.4.0 released 2010-09-08
    • 1.5.0 released 2011-03-11
    • 1.6.0 released 2011-11-02
    • 1.7.0 released 2012-06-11
    • 1.7.6 released 2014-01-22
    Open source, community developed, Apache License Version 2.0
  • 16. avro resources
    • Web
      – avro.apache.org
      – github.com/apache/avro
    • Mail
      – Users: user@avro.apache.org
      – Developers: dev@avro.apache.org
    • Chat
      – #avro
    • Book
      – White (2012), Hadoop, The Definitive Guide, O’Reilly Media Inc. [http://www.oreilly.com]
      – Features 20 pages of Apache Avro coverage
    Randy Abernethy ra@apache.org