
Streaming in Scala with Avro

472 views

Published: June 2015

Published in: Technology


  1. Streaming in Scala*, with Avro
  2. • My name is Jonathan Winandy • @ahoy_jon > Me
  3. > Introduction: Streams
  4. What are streams? An abstract data structure with the following operations: • append(bytes) -> void? • readAt(int) -> null | bytes. Rule 1: ∀p ∈ ℕ, for some definition of '==': x := readAt(p); y := readAt(p); x != null => x == y. Rule 1 implies infinite cacheability once the data is available at a position. > Introduction (a Scala sketch of this interface follows the transcript)
  5. Streams are the simplest way to manage data. (diagram: a stream as a sequence of positions 0 1 2 3 4 5 6) > Introduction
  6. HARD DRIVES: magnetic and mutable. Yet we have known append-only record keeping since the 15th century (Memorial, Journal, Ledger) … so what happened? > Introduction
  7. Example architectures: classical (APP talking to a DB), replication via the DB's bin/log (APP, DB, DB, APP), and with streams (apps append to a stream that is broadcast to the other apps). The broadcast mechanism is equivalent to a DB replication mechanism.
  8. State of data encodings in the industry: • as always, worse is considered better • most streams have data encoded as CSV/TSV, JSON, or platform-specific serialisations (e.g. Java serialisation, Kryo). > Data encoding
  9. Why data encoding matters: • some streams carry very large amounts of data, so the chosen encoding must be CPU- and space-efficient • streams are processed by many programs, through many intermediaries, for many years, so the chosen encoding must be processable in a generic way. > Data encoding
  10. JSON is the common denominator. Pros: • it reaches the browser; you can produce and consume data from inside a web page. Versus a lot of cons: • inefficient • no dates, no proper numerics • very basic data structures • very error-prone. We all need JSON, but we should use it only when we can't avoid it. E.g. in our databases, we can avoid JSON ;) > Data encoding
  11. How bad is JSON? {"name":"Bob", "age":11, "gender":"Male"} spends 39 bytes on 10 bytes of data; the binary encoding shown is :02:06:62:6f:62:02:16:02:da:01. > Data encoding (a size-comparison sketch follows the transcript)
  12. A comparison table of encodings, grouped as: relevant ones (Avro), popular binaries (Thrift, Protocol Buffers), low tech (JSON, CSV), Cognitect (Fressian, Transit, EDN) and "papa ?!" (XML, RDF); the criteria are: binary, generic, schema based, specific encoding, reaches the browser, easy?, safe?, has dates?
  13. Building your own Data Empire with Scala and Avro
  14. > Sandboxing, > Pull to Push (diagram: three Collectors and the topics TOPIC, TOPIC, !?)
  15. > Sandboxing, > Durability (diagram: the collectors-and-topics setup repeated three times; Metadata ?)
  16. > Example: a Collector event defined with avro-scala-macro-annotations (a sketch follows the transcript)
  17. > Example: BaseEvent on the wire, AA | 00 | 00 | 04FFFF
  18. > Example: BaseEvent on the wire, AA | 02D5CB1A5355EE405BADDC1133C628F7A6 | 00 | 06FFFFFF (a framing sketch follows the transcript)
  19. > Example: Write Context
  20. > Example: write-context, collector-staging (Collector 1, 22222222)
  21. > Example: write-context, collector-staging (Consumer)
  22. > To improve: • Writer Context / Schema manager • Avro support: type providers from a repository, bug fixes • Producers, Encoder and Decoder for Kafka (a serializer sketch follows the transcript)
  23. Conclusion / Questions?
  24. Bonus: what is a CAS? A Content Addressable Storage is a specific "key value store" with the operations: • store(bytes) -> key • get(key) -> null | bytes. Rule 1: key = h(data), h being a cryptographic hash function like MD5 or SHA-1. Rule 2: ∀data, get(store(data)) = data. Rules 1 and 2 imply infinite cacheability and scalability. (a sketch follows the transcript)
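A few sketches expanding on the slides follow. First, a minimal in-memory take on the stream interface of slide 4; the trait and class names are illustrative, only append/readAt and rule 1 come from the slide.

```scala
import scala.collection.mutable.ArrayBuffer

// Slide 4: an append-only stream with positional reads.
// Rule 1: once readAt(p) has returned non-null, it always returns the same bytes.
trait EventStream {
  def append(bytes: Array[Byte]): Unit       // append(bytes) -> void
  def readAt(pos: Int): Option[Array[Byte]]  // readAt(int) -> null | bytes (None = nothing there yet)
}

// Illustrative in-memory implementation: positions are never mutated after append,
// which is exactly what makes every position infinitely cacheable.
class InMemoryStream extends EventStream {
  private val log = ArrayBuffer.empty[Array[Byte]]
  def append(bytes: Array[Byte]): Unit = log += bytes.clone()  // defensive copy keeps positions immutable
  def readAt(pos: Int): Option[Array[Byte]] = log.lift(pos).map(_.clone())
}
```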
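For slide 11's size comparison, a sketch that encodes a Bob-like record with the stock Apache Avro library (org.apache.avro:avro on the classpath). The schema used in the talk is not shown, so the field types here, notably gender as an enum, are assumptions, and the resulting byte count may differ a little from the 10 bytes quoted on the slide.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object JsonVsAvroSize extends App {
  // Assumed schema for the {"name":"Bob","age":11,"gender":"Male"} example on slide 11.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Person","fields":[
      |  {"name":"name","type":"string"},
      |  {"name":"age","type":"int"},
      |  {"name":"gender","type":{"type":"enum","name":"Gender","symbols":["Male","Female"]}}
      |]}""".stripMargin)

  val record: GenericRecord = new GenericData.Record(schema)
  record.put("name", "Bob")
  record.put("age", 11)
  record.put("gender", new GenericData.EnumSymbol(schema.getField("gender").schema(), "Male"))

  // Avro binary encoding of the record.
  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
  encoder.flush()

  val json = """{"name":"Bob","age":11,"gender":"Male"}"""
  println(s"JSON: ${json.getBytes("UTF-8").length} bytes, Avro binary: ${out.size()} bytes")
}
```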
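Slide 16 defines the collector's event with avro-scala-macro-annotations, which generates the Avro plumbing from an annotated case class. Since the actual code is not in the transcript and the macro setup needs a compiler plugin, the sketch below shows the equivalent hand-written pairing of a hypothetical BaseEvent case class and its Avro schema; the field names and types are assumptions.

```scala
import org.apache.avro.Schema

// Hypothetical BaseEvent; in the talk this pairing is generated by
// avro-scala-macro-annotations, here it is written out by hand.
case class BaseEvent(id: String, timestamp: Long, body: String)

object BaseEvent {
  // The schema lives next to the case class so producers and consumers agree on the encoding.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"BaseEvent","fields":[
      |  {"name":"id","type":"string"},
      |  {"name":"timestamp","type":"long"},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)
}
```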
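Slides 17 and 18 show framed BaseEvents on the wire: a leading AA marker byte, on slide 18 what looks like a 16-byte MD5 schema fingerprint, then the encoded body. The transcript does not spell the layout out, so the frame below is a hypothetical reading of those slides: marker byte, MD5 of the writer schema, Avro binary payload. A consumer would check the marker, read 16 fingerprint bytes, resolve the writer schema (for example from a schema repository), and decode the rest with a GenericDatumReader.

```scala
import java.io.ByteArrayOutputStream
import java.security.MessageDigest
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object Framing {
  val Marker: Byte = 0xAA.toByte

  // Hypothetical frame: [0xAA][16-byte MD5 of the writer schema][Avro binary body].
  def frame(schema: Schema, record: GenericRecord): Array[Byte] = {
    val fingerprint = MessageDigest.getInstance("MD5").digest(schema.toString.getBytes("UTF-8"))

    val body = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(body, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()

    val out = new ByteArrayOutputStream()
    out.write(Marker)            // marker byte 0xAA
    out.write(fingerprint)       // identifies which schema wrote this event
    out.write(body.toByteArray)  // the Avro-encoded payload
    out.toByteArray
  }
}
```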
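Slide 22 lists Kafka producers, encoders and decoders as future work. As one possible shape for the encoder side, here is a sketch of a value serializer for Kafka's producer API; only org.apache.kafka.common.serialization.Serializer is the real interface, the class name and the toBytes parameter are hypothetical.

```scala
import java.util
import org.apache.kafka.common.serialization.Serializer

// Hypothetical Kafka serializer: turns an event of type A into bytes using a
// caller-supplied encoding function, for example the Avro framing sketched above.
class AvroEventSerializer[A](toBytes: A => Array[Byte]) extends Serializer[A] {
  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
  override def serialize(topic: String, data: A): Array[Byte] =
    if (data == null) null else toBytes(data)
  override def close(): Unit = ()
}
```

An instance of such a serializer can be passed directly to the KafkaProducer constructor alongside a key serializer.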
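Finally, a minimal in-memory sketch of the CAS from the bonus slide 24, assuming SHA-1 as the hash function; everything beyond store/get and the two rules is illustrative.

```scala
import java.security.MessageDigest
import scala.collection.concurrent.TrieMap

// Slide 24: a Content Addressable Storage.
// Rule 1: key = h(data); Rule 2: get(store(data)) == data.
class Cas {
  private val blobs = TrieMap.empty[String, Array[Byte]]

  private def h(data: Array[Byte]): String =
    MessageDigest.getInstance("SHA-1").digest(data).map("%02x".format(_)).mkString

  def store(data: Array[Byte]): String = {       // store(bytes) -> key
    val key = h(data)
    blobs.putIfAbsent(key, data.clone())         // the same bytes always map to the same key
    key
  }

  def get(key: String): Option[Array[Byte]] =    // get(key) -> null | bytes
    blobs.get(key).map(_.clone())
}
```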
