16. Can be serialized in only one way
{"username": "MrPresident", "firstName":
"Frank", "lastName": "Underwood", "age": 50,
"email": "president@whitehouse.gov",
"badges": ["caring", "loving"]}
17. Schema is sent with every message
{
"email": "president@whitehouse.gov",
"username": "MrPresident",
"firstName": "Frank",
"lastName": "Underwood",
"age": 50,
"badges": ["caring", "loving"]
}
18. Schema can change at any time
{
"email": "president@whitehouse.gov",
"username": "MrPresident",
"first_name": "Frank",
"last_name": "Underwood",
"age": 50,
"badges": ["caring", "loving"]
}
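The rename above (firstName → first_name) is exactly the failure mode plain JSON invites: nothing checks the message against a schema, so the break only surfaces inside the consumer. A minimal Python sketch, reusing the field names from these slides:

```python
import json

# Message from a producer that renamed firstName -> first_name
# (as on this slide); the consumer still expects camelCase.
message = json.dumps({
    "email": "president@whitehouse.gov",
    "username": "MrPresident",
    "first_name": "Frank",
    "last_name": "Underwood",
    "age": 50,
    "badges": ["caring", "loving"],
})

decoded = json.loads(message)
# Parsing succeeds -- the missing key is only noticed (or silently
# defaulted) at the moment the consumer reads it.
first = decoded.get("firstName", "<missing>")
print(first)  # -> <missing>
```

A schema-checked binary format would reject such a message at decode time instead.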
23. Erik Naggum
“(...) those who think it is rational to require parsing of character
data at each and every application interface are literally retarded
and willfully blind. (...) But, alas, people prefer buggy text formats
that they can approximate rather than precise binary formats that
follow general rules that make them as easy to use as text
formats.”
http://www.xach.com/naggum/articles/3242274237190594@naggum.no.txt
26. About ASN.1
● Created in 1984 by ITU (International Telecommunication Union)
● Revised and changed many times later
● There is no ASN.2 :)
● Heavily used by telecommunications industry (UMTS, LTE, SIM
cards)
● Cryptography: PKCS, PKI (X.509)
● LDAP, RFID
33. ASN.1 - quality of libraries
Because the language is quite rich, the quality of libraries
varies greatly.
The most featureful seems to be OSS Nokalva (oss.com), but
their product is proprietary and not free.
34. ASN.1 - my opinion
Opinions about ASN.1 vary. IMHO it hasn’t reached the
critical mass required for an open-source ecosystem to
develop and for great tools to appear that support all of its
advanced features.
Bandwagon effect: once a product becomes popular, more people tend to "get
on the bandwagon" and buy it, too [Wikipedia]
35. ASN.1 demo
And let’s not forget about
openssl asn1parse -inform DER -in server-message-received
openssl asn1parse -in id_rsa.asn1-test
https://lapo.it/asn1js/
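What asn1parse walks is BER/DER’s tag-length-value structure. As a rough illustration (not a real ASN.1 library; the function name is mine), encoding a DER INTEGER by hand:

```python
def der_integer(n: int) -> bytes:
    """Encode a non-negative int as a DER INTEGER (tag 0x02).

    DER is tag-length-value: one identifier octet, a length, then the
    big-endian two's-complement content octets. The extra byte below
    keeps a leading 0x00 when the high bit is set, so the value is not
    misread as negative. Short-form lengths only (< 128 bytes).
    """
    body = n.to_bytes(n.bit_length() // 8 + 1, "big")
    assert len(body) < 128, "long-form lengths not handled in this sketch"
    return bytes([0x02, len(body)]) + body

print(der_integer(5).hex())    # -> 020105
print(der_integer(128).hex())  # -> 02020080
```

Feeding such bytes to `openssl asn1parse -inform DER` shows them back as an INTEGER.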
37. protobuf - general remarks
● Only one official serialization protocol
● 3rd party JSON encoder: https://github.com/dpp-name/protobuf-json
● ASN.1 with PER encoding is more compact: http://stackoverflow.com/a/4441622
● Tags are required (contrary to ASN.1)
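“Tags are required” means every field carries an explicit number in the schema. A minimal proto3 sketch (message and field names reuse the slides’ example, not any official schema):

```proto
syntax = "proto3";

// The tag, not the name, identifies a field on the wire: renaming a
// field is safe, renumbering it breaks old messages.
message User {
  string username        = 1;
  string first_name      = 2;
  string last_name       = 3;
  int32  age             = 4;
  string email           = 5;
  repeated string badges = 6;
}
```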
38. protobuf - who uses it?
● Google
● Hadoop
● OpenStreetMap for their PBF Format
● Ubuntu One did use it for storage (before it was shut
down)
● … know more?
39. protobuf - officially supported languages
● C++
● Java
● Python
● Objective-C (3.0+)
● C# (3.0+)
40. protobuf - unofficially supported languages
● Clojure
● Common Lisp
● Erlang
● Go
● Haskell
● JavaScript
● Lua
● Ruby
● Scala
● and many others
41. protobuf - RPC
Similarly to ASN.1, protobuf doesn’t have built-in RPC, but
take a look at http://www.grpc.io/
45. Thrift - general remarks
● Started at Facebook, now part of Apache Software Foundation
● RPC is built-in
● Supports multiple protocols (Binary, Compact, JSON,
multiplexed, simple JSON, tuple, ...) and transports
(Empty, Framed, HttpClient, ...) out of the box
● Tags are required
● http://diwakergupta.github.io/thrift-missing-guide/
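For comparison with the protobuf schema, the same record as a Thrift IDL sketch (names are illustrative); note the same explicit tags, plus a service definition since RPC is built in:

```thrift
// Tags are explicit here too; fields may be marked optional.
struct User {
  1: string username,
  2: string firstName,
  3: string lastName,
  4: i32 age,
  5: optional string email,
  6: list<string> badges,
}

// RPC is part of the IDL: the compiler generates client and server stubs.
service UserStore {
  User fetch(1: string username),
}
```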
46. Thrift - officially supported languages
● C++
● C#
● Erlang
● Haskell
● Java
● JavaScript
● Python
● Ruby
● and many others with varying protocol/transport support
47. Thrift - who uses it?
● Facebook
● Hadoop
● Cassandra DID use it but switched to CQL
● Evernote
● LastFM
52. Cap’n Proto
● Similar to the previous ones
● Tags are obligatory
● RPC is built-in
● One encoding protocol with optional "packing" for data
compression
● No encoding/decoding step; data is already kept
encoded in memory
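A Cap'n Proto schema sketch (the file ID and names are illustrative). The ordinals play the role of tags and pin down the in-memory layout, which is what makes the zero-copy “no decoding step” possible:

```capnp
@0xbf5147cbbecf40c1;

# Ordinals (@0, @1, ...) fix each field's position in the wire/memory
# layout; new fields must take the next unused ordinal.
struct User {
  username  @0 :Text;
  firstName @1 :Text;
  lastName  @2 :Text;
  age       @3 :UInt8;
  badges    @4 :List(Text);
}
```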
53. Cap’n Proto - supported languages
● C++
● Erlang
● JavaScript (Node.js only)
● Python
● Ruby
54. Cap’n Proto - supported languages (no RPC)
● C
● C#
● Go
● Java
● JavaScript
● Lua
● OCaml
61. Avro - general remarks
● Developed by Apache Software Foundation
● Primarily used in Hadoop
● Dynamic schema, can be specified either in JSON or a
custom IDL (Interface Definition Language)
● Serialization to binary or JSON
● No tags! Every field is optional
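“No tags” is visible in the schema itself. A sketch of the JSON form of an Avro record schema, reusing the slides’ field names: there are no numeric tags anywhere, because fields are matched by name:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username",  "type": "string"},
    {"name": "firstName", "type": "string"},
    {"name": "lastName",  "type": "string"},
    {"name": "age",       "type": "int"},
    {"name": "badges",    "type": {"type": "array", "items": "string"}}
  ]
}
```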
63. Avro - dynamic schema
Because the schema is dynamic, every message must be sent
along with its schema definition.
This resembles a JSON Schema packed together with the
JSON that holds the actual data.
64. Avro - dynamic schema ctd.
To decode a message you always need the schema it was
encoded with.
But: Avro has smart schema-resolution rules, so that if the
reader expects the message in a different format, Avro will
translate the decoded data to that format.
66. Avro - schema resolution for records
● the ordering of fields may be different: fields are matched by name.
● schemas for fields with the same name in both records are resolved
recursively.
● if the writer's record contains a field with a name not present in the reader's
record, the writer's value for that field is ignored.
● if the reader's record schema has a field that contains a default value, and
writer's schema does not have a field with the same name, then the reader
should use the default value from its field.
● if the reader's record schema has a field with no default value, and writer's
schema does not have a field with the same name, an error is signalled.
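The rules above (minus the recursive one) can be sketched in a few lines of plain Python. This is an illustration of the resolution logic, not Avro’s actual implementation; all names are mine:

```python
# A "schema" here is just a dict of field name -> default value,
# with NO_DEFAULT marking fields that declare no default.
NO_DEFAULT = object()

def resolve_record(writer_value, reader_schema):
    resolved = {}
    for name, default in reader_schema.items():
        if name in writer_value:
            # Fields are matched by name, regardless of order.
            resolved[name] = writer_value[name]
        elif default is not NO_DEFAULT:
            # Reader-only field with a default: use the default.
            resolved[name] = default
        else:
            # Reader-only field without a default: error.
            raise ValueError(f"no value or default for field {name!r}")
    # Writer-only fields (in writer_value but not in the reader's
    # schema) are simply ignored.
    return resolved

reader = {"first_name": NO_DEFAULT, "age": 0}
writer_value = {"first_name": "Frank", "last_name": "Underwood"}
print(resolve_record(writer_value, reader))
# -> {'first_name': 'Frank', 'age': 0}
```

Real Avro also resolves nested record schemas recursively, field by field.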
67. Avro
Avro seems like a good way to start out when you’re still
used to JSON and are not convinced by other protocols.
69. Final remarks
● Required fields are forever (except for Avro, where the "tag" is the field name)
● Can be used not only for exchange but for storage too: protobuf in ActiveMQ and Ubuntu One (just
think in terms of Thrift and take "transport" to be a database)
● For HTTP use the Content-Encoding and Content-Type headers
● Performance & size comparisons:
○ https://github.com/sidshetye/SerializersCompare (C#)
○ https://github.com/eishay/jvm-serializers (Java)
● Everyone creates their own:
○ Protocol Buffers (Google)
○ Thrift (Facebook)
○ Bond (Microsoft)