Type-safe, Versioned, and Rewindable
Stream Processing
with
Apache {Avro, Kafka} and Scala
-=[ confoo.ca ]=-
Thursday February 19th 2015
Hisham Mardam-Bey
Mate1 Inc.
Overview
● Who is this guy? + quick Mate1 intro
● Before message queues
● How do we use message queues?
● Some examples
Who is this guy?
● Linux user and developer since 1996
● Started out hacking on Enlightenment
○ X11 window manager
● Worked with OpenBSD
○ building embedded network gear
● Did a whole bunch of C followed by Ruby
● Working with the JVM since 2007
● Lately enjoying Erlang and Haskell; FP-FTW! (=
github: mardambey
twitter: codewarrior
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially team of 3, around 40 now
● Engineering team has 13 geeks / geekettes
● We own and run our own hardware
○ fun!
○ mostly…
https://github.com/mate1
Some of our features...
● Lots of communication, chatting, push notifs
● Searching, matching, recommendations,
geo-location features
● Lists of... friends, blocks, people interested,
more
● News & activity feeds, counters, rating
Before message queues
● Events via DAOs into MySQL
○ More data, more events lead to more latency
○ Or build an async layer around DAOs
■ Surely better solutions exist!
● Logs rsync’ed into file servers and Hadoop
○ Once every 24 hours
● MySQL Data partitioned functionally
○ Application layer sharding
● Custom MySQL replication for BI servers
○ Built fan-in replication for MySQL
● Data processed through Java, Jython, SQL
Message queues
● Apache Kafka: fast, durable, distributed
● Stored data as JSON, in plain text
● Mapped JSON to Scala classes manually
● Used Kafka + Cassandra a lot
○ low latency reactive system (push, not pull)
○ used them to build:
■ near real time data / events feeds
■ live counters
■ lots of lists
● This was awesome; but we had some issues
and wanted some improvements.
Issues / improvements
● Did not want to keep manually marshalling
data; potential mistakes -> type safety
● Code gets complicated when maintaining
backward compatibility -> versioning
● Losing events is costly if a bug creeps
into production -> rewindable
● Wanted to save time and reuse certain logic
and parts of the system -> reusable patterns
○ more of an improvement than an issue
Type-safe
● Avoid stringified types, maps (no structure)
● Used Apache Avro for serialization:
○ Avro provides JSON / binary ser/de
○ Avro provides structuring and type safety
● Mapped Avro to Java/Scala classes
● Effectively tied:
○ Kafka topic <-> Avro schema <-> POJO
● Producers / consumers now type-safe and
compile-time checked
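
One way to picture the Kafka topic <-> Avro schema <-> POJO tie is a topic value that carries its record type, so a producer cannot be handed another topic's record. The sketch below is a dependency-free illustration, not Mate1's code: `Codec`, `Topic`, `TypedProducer`, and the `MsgSent` / `WebLog` records are hypothetical names, a comma-separated string codec stands in for Avro, and an in-memory buffer stands in for Kafka.

```scala
import scala.collection.mutable.ArrayBuffer

object TypedTopics {
  // Stand-ins for Avro-generated classes (hypothetical records).
  final case class MsgSent(from: Long, to: Long)
  final case class WebLog(url: String)

  // Stand-in for Avro ser/de: one codec per record type.
  trait Codec[A] {
    def encode(a: A): Array[Byte]
    def decode(bytes: Array[Byte]): A
  }

  implicit val msgSentCodec: Codec[MsgSent] = new Codec[MsgSent] {
    def encode(m: MsgSent): Array[Byte] = s"${m.from},${m.to}".getBytes("UTF-8")
    def decode(b: Array[Byte]): MsgSent = {
      val parts = new String(b, "UTF-8").split(",")
      MsgSent(parts(0).toLong, parts(1).toLong)
    }
  }

  // A topic name tied to exactly one record type.
  final case class Topic[A](name: String)

  val MSG_SENT: Topic[MsgSent] = Topic[MsgSent]("MSG_SENT")

  // In-memory stand-in for a Kafka producer: keeps (topic name, payload) pairs.
  class TypedProducer {
    val sent = ArrayBuffer.empty[(String, Array[Byte])]

    def send[A](topic: Topic[A], record: A)(implicit codec: Codec[A]): Unit =
      sent.append((topic.name, codec.encode(record)))
  }

  // producer.send(MSG_SENT, WebLog("/")) is rejected by the compiler:
  // MSG_SENT only accepts MsgSent records.
}
```

Because `Topic[A]` fixes the record type, a mismatched record is a compile error rather than a malformed message discovered by a consumer.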
Versioning, why?
● All was fine… until we had to alter schemas!
● Distributed producers mean:
○ multiple versions of the data being generated
● Distributed consumers mean:
○ multiple versions of the data being processed
● Rolling upgrades are the only way in prod
● Came up with a simple data format
Simple (extensible) data format
● magic: byte identifying data format / version
● schemaId: version of the schema to use
● data: plain text / binary bytes
○ ex: JSON encoded data
● assumption: schema name = Kafka topic
+----------+---------+
| magic    | 1 byte  |
| schemaId | 2 bytes |
| data     | N bytes |
+----------+---------+
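
A minimal sketch of reading and writing this layout with `java.nio.ByteBuffer`. The field sizes come from the slide; the magic value `0x01`, the big-endian byte order, and the names `Envelope`, `Message`, `encode`, and `decode` are assumptions made for illustration.

```scala
import java.nio.ByteBuffer

object Envelope {
  // Assumed value: the slide does not say which byte identifies the format.
  val Magic: Byte = 0x01

  final case class Message(magic: Byte, schemaId: Short, data: Array[Byte])

  // Layout: 1 byte magic | 2 bytes schemaId (big-endian, assumed) | N bytes data.
  def encode(schemaId: Short, data: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 2 + data.length)
    buf.put(Magic).putShort(schemaId).put(data)
    buf.array()
  }

  // Rejects messages that are too short or carry an unknown magic byte.
  def decode(bytes: Array[Byte]): Either[String, Message] =
    if (bytes.length < 3) Left(s"message too short: ${bytes.length} bytes")
    else {
      val buf = ByteBuffer.wrap(bytes)
      val magic = buf.get()
      if (magic != Magic) Left(s"unknown magic byte: $magic")
      else {
        val schemaId = buf.getShort()
        val data = new Array[Byte](buf.remaining())
        buf.get(data)
        Right(Message(magic, schemaId, data))
      }
    }
}
```

Checking the magic byte first leaves room to change the header layout later without confusing older readers.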
Schema loading
● Load schemas based on:
○ Kafka topic name (ex: WEB_LOGS, MSG_SENT, ...)
○ Schema ID / version (ex: 0, 1, 2)
● How do we store / fetch schemas?
○ local file system
○ across the network (database? some repository?)
● Decided to integrate AVRO-1124
○ a few patches in a Jira ticket
○ not part of mainstream Avro
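
Loading by topic name and schema ID can be sketched as a small cache in front of whatever store holds the schemas. This is an illustration only and not the AVRO-1124 API: `SchemaKey`, `SchemaSource`, and `CachingSchemaLoader` are hypothetical names, and schemas are kept as plain strings.

```scala
import scala.collection.concurrent.TrieMap

object SchemaLoading {
  // Key: Kafka topic name (= schema name) plus schema ID / version.
  final case class SchemaKey(topic: String, schemaId: Short)

  // Where schema text comes from: local file system, HTTP repository, database, ...
  trait SchemaSource {
    def fetch(key: SchemaKey): Option[String]
  }

  // Remembers successful lookups so repeated loads skip the backing store.
  class CachingSchemaLoader(source: SchemaSource) {
    private val cache = TrieMap.empty[SchemaKey, String]

    def load(topic: String, schemaId: Short): Option[String] = {
      val key = SchemaKey(topic, schemaId)
      cache.get(key).orElse {
        val fetched = source.fetch(key)
        fetched.foreach(schema => cache.putIfAbsent(key, schema))
        fetched
      }
    }
  }
}
```

Misses are not cached, so a schema published after the first failed lookup is still picked up on the next attempt.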
Avro Schema Repository & Resolution
● What is an Avro schema repository?
○ HTTP based repo, originally filesystem backed
● AVRO-1124: integrated (and now improved)
○ Back on GitHub (Avro + AVRO-1124)
■ https://github.com/mate1/avro
○ Also a WIP fork into a standalone project
■ https://github.com/schema-repo/schema-repo
● Avro has schema resolution / evolution
○ provides rules guaranteeing version compatibility
○ allows for data to be decoded using multiple
schemas (old and new)
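
Two of Avro's record resolution rules are easy to model without the library: a field the reader expects but the writer did not send takes the reader's default (or fails if there is none), and a field the writer sent but the reader does not know is ignored. The toy below models only those two rules over plain maps; `Field` and `resolve` are hypothetical names, and real Avro also handles type promotion, aliases, and more.

```scala
object Resolution {
  // A field in the reader's schema; `default` is used when the writer lacks it.
  final case class Field(name: String, default: Option[Any] = None)

  // Reshape a record written with another schema version into the reader's shape.
  def resolve(written: Map[String, Any], reader: Seq[Field]): Either[String, Map[String, Any]] =
    reader.foldLeft[Either[String, Map[String, Any]]](Right(Map.empty)) { (acc, field) =>
      acc.flatMap { out =>
        written.get(field.name).orElse(field.default) match {
          case Some(value) => Right(out + (field.name -> value))
          case None        => Left(s"no value or default for field '${field.name}'")
        }
      }
    }
}
```

Under these rules, adding a field with a default keeps old data readable by new code, which is what makes the rolling upgrade on the next slide workable.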
Rolling upgrades, how?
● Make new schema available in repository
● Rolling producer upgrades
○ produce old and new version of data
● Rolling consumer upgrades
○ consumers consume old and new version of data
● Eventually...
○ producers produce new version (now current)
○ consumers consume new version (now current)
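
During the rollout a consumer has to accept both schema IDs. A rough sketch, assuming a hypothetical `MsgSent` event whose second version adds a `channel` field with an assumed default of `"web"`; the payload is modelled as already-split string fields to keep the example free of dependencies.

```scala
object RollingUpgrade {
  // Two versions of the same hypothetical event.
  final case class MsgSentV1(from: Long, to: Long)
  final case class MsgSentV2(from: Long, to: Long, channel: String)

  // Upgrade old data by supplying a default for the new field (assumed default).
  def upgrade(old: MsgSentV1): MsgSentV2 = MsgSentV2(old.from, old.to, channel = "web")

  // Accept both schema IDs and always hand the newest shape to business logic.
  // Non-numeric ids in the payload would throw; input validation is left out here.
  def decode(schemaId: Short, fields: Seq[String]): Either[String, MsgSentV2] =
    (schemaId.toInt, fields) match {
      case (1, Seq(from, to))          => Right(upgrade(MsgSentV1(from.toLong, to.toLong)))
      case (2, Seq(from, to, channel)) => Right(MsgSentV2(from.toLong, to.toLong, channel))
      case (id, _)                     => Left(s"unsupported schema id $id or malformed payload")
    }
}
```

Once every producer emits version 2, the version 1 branch can be removed in a later release.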
Rewindable
● Why?
○ Re-process data due to downstream data loss
○ Buggy code causes faulty data / statistics
○ Rebuild downstream state after system crash or
restart
● How?
○ We take advantage of Kafka's design
○ Let’s take a closer look at that...
Kafka Consumers and Offsets
● Kafka consumers manage their offsets
○ Offsets not managed by the broker
○ Data is not deleted upon consumption
○ Offsets stored in Zookeeper, usually (<= 0.8.1.1)
■ This changed with Kafka 0.8.2.0! Finally!
● Kafka data retention policies
○ time / size based retention
○ key based compaction
■ infinite retention!
● Need to map offsets to points in time
○ Allows for resetting offsets to a point in time
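
Mapping offsets to points in time can be as simple as a sorted index of checkpoints recorded while consuming. A minimal sketch, assuming timestamps in epoch milliseconds; `Index`, `record`, and `offsetAt` are hypothetical names and nothing here talks to Kafka.

```scala
import scala.collection.immutable.TreeMap

object OffsetIndex {
  // Checkpoints recorded while consuming: timestamp (epoch millis) -> Kafka offset.
  final case class Index(checkpoints: TreeMap[Long, Long] = TreeMap.empty) {
    def record(timestampMs: Long, offset: Long): Index =
      copy(checkpoints = checkpoints.updated(timestampMs, offset))

    // Offset to rewind to: the latest checkpoint at or before the requested time.
    def offsetAt(timestampMs: Long): Option[Long] =
      checkpoints.rangeTo(timestampMs).lastOption.map(_._2)
  }
}
```

Rewinding to the checkpoint at or before the target time replays a few extra events rather than skipping any, so downstream processing should tolerate duplicates.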
Currently, manual rewinding
● 2 types of Kafka consumers:
○ ZK based, one event at a time
○ MySQL based, batch processing
■ Kafka + MySQL offset store + ZFS =
transactional rollbacks
■ Used to transactionally get data into MySQL
● Working on tools to automate the process
○ Specifically to take advantage of 0.8.2.0’s offset
management API
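
The batch consumer idea, as far as the slide describes it, is that processed rows and the consumer offset are committed together, so a failed batch leaves both untouched. The in-memory model below illustrates only that property; it does not model MySQL or ZFS, and `Store` and `applyBatch` are hypothetical names.

```scala
object TransactionalBatch {
  // Stand-in for one database: rows and the consumer offset live side by side,
  // so both change together or not at all.
  final case class Store(rows: Vector[String] = Vector.empty, nextOffset: Long = 0L)

  // Apply one batch read from Kafka starting at `store.nextOffset`.
  // Returns the unchanged store if any event fails to process.
  def applyBatch(store: Store, batch: Seq[String])(process: String => Either[String, String]): Store = {
    val processed = batch.foldLeft[Either[String, Vector[String]]](Right(Vector.empty)) {
      (acc, event) => acc.flatMap(rows => process(event).map(rows :+ _))
    }
    processed match {
      case Right(newRows) => Store(store.rows ++ newRows, store.nextOffset + batch.size)
      case Left(_)        => store // rollback: rows and offset both stay put
    }
  }
}
```

After a rollback the consumer re-reads from the stored offset, so the same batch is retried instead of being lost.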
Reusable
● Abstracted out some patterns, like:
○ Enrichment
○ Filtering
○ Splitting / Routing
○ Merging
● Let’s see how we use them...
Reusable
[Diagram: event flow through reusable consumers. Labels recoverable from the figure:
sources: System Events, App Events, Web Servers;
Kafka topics: Kafka, Kafka XMPP, Kafka Apple, Kafka Google, Enriched WebLog;
WebLog Consumer (Enricher), with Device Detection and GeoIP Service;
PushNotif Consumer (Router), with XMPP Consumer, APN Consumer, GCM Consumer;
EmailNotif Consumer (Filter, X msgs / hr / user), with MTA Service and Internet;
InboxCache Consumer, with Redis and Cache;
Batch Consumers, NearRealTime Consumers, Dashboards, Hadoop, MySQL.]
Fin!
That’s all folks (=
Thanks!
Questions?
Reusable
● Emerging patterns
● Enrichment
import org.apache.avro.specific.SpecificRecord

abstract class Enricher[Input <: SpecificRecord, Output <: SpecificRecord] {
  // Builds an enriched record from the input record.
  def enrich(input: Input): Output
}
● Filtering
abstract class Filter[Input <: SpecificRecord] {
  // Some(input) keeps the record, None drops it.
  def filter(input: Input): Option[Input]
}
Reusable
● More patterns
● Splitting
abstract class Splitter2[Input <: SpecificRecord, Output <: SpecificRecord] {
  def split(input: Input): (Output, Output)
}
● Merging
abstract class Merger2[Input1 <: SpecificRecord,
                       Input2 <: SpecificRecord,
                       Output <: SpecificRecord] {
  // Parameters use the declared type parameters so mismatched inputs fail at compile time.
  def merge(input1: Input1, input2: Input2): Output
}
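
A rough sketch of how two of these patterns compose. To stay runnable without Avro on the classpath, a local `Record` trait stands in for `SpecificRecord`; the `WebLog` and `EnrichedWebLog` records, the hard-coded country table, and the `pipeline` function are all hypothetical.

```scala
object Patterns {
  // Stand-in for org.apache.avro.specific.SpecificRecord (keeps the sketch dependency-free).
  trait Record

  abstract class Enricher[Input <: Record, Output <: Record] {
    def enrich(input: Input): Output
  }

  abstract class Filter[Input <: Record] {
    def filter(input: Input): Option[Input]
  }

  // Hypothetical records.
  final case class WebLog(ip: String, path: String) extends Record
  final case class EnrichedWebLog(ip: String, path: String, country: String) extends Record

  object GeoEnricher extends Enricher[WebLog, EnrichedWebLog] {
    // Hard-coded lookup standing in for a GeoIP service (made-up data).
    private val countries = Map("203.0.113.7" -> "CA")

    def enrich(input: WebLog): EnrichedWebLog =
      EnrichedWebLog(input.ip, input.path, countries.getOrElse(input.ip, "unknown"))
  }

  object KnownCountryOnly extends Filter[EnrichedWebLog] {
    def filter(input: EnrichedWebLog): Option[EnrichedWebLog] =
      if (input.country != "unknown") Some(input) else None
  }

  // Enrich every event, then keep only those the filter lets through.
  def pipeline(events: Seq[WebLog]): Seq[EnrichedWebLog] =
    events.map(e => GeoEnricher.enrich(e)).flatMap(e => KnownCountryOnly.filter(e))
}
```

Because `Filter` returns an `Option`, dropping an event is just `flatMap`, and the output type of one stage is checked against the input type of the next.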
Reusable
● Usage examples:
○ Enrich web logs
■ GeoIP
■ User-Agent, mobile device details
○ Push notifications message router / scheduler
■ OS specific notifications
■ A/B tests
○ News feed type systems
○ Cache maintenance
■ Users’ inbox, friend lists
■ Consumable data by time interval (Redis)
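
The push notification router can be pictured as a function from a notification to a destination topic. The sketch below is a guess at the shape, not the actual router: the platform set, the topic names, and the mapping of web clients to XMPP are assumptions.

```scala
object PushRouter {
  sealed trait Platform
  case object Ios extends Platform
  case object Android extends Platform
  case object Web extends Platform

  final case class PushNotif(userId: Long, platform: Platform, text: String)

  // Hypothetical topic names, one per delivery channel.
  def route(n: PushNotif): String = n.platform match {
    case Ios     => "PUSH_APN"
    case Android => "PUSH_GCM"
    case Web     => "PUSH_XMPP"
  }

  // Group a batch by destination topic, keeping arrival order within each topic.
  def routeAll(batch: Seq[PushNotif]): Map[String, Seq[PushNotif]] =
    batch.groupBy(n => route(n))
}
```

With a sealed `Platform`, adding a new device type makes the compiler flag every router that does not handle it yet.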
Data Pipeline Diagram (partial)
[Diagram. Labels recoverable from the figure:
App servers, Web servers, Other services;
Event Manager (x3); Kafka (x3); ZK (x2); Consumers (x2);
C* (x3); Play (x2); SOLR; Redis; Ejabberd (x3); APN;
annotations: NRT search, Geo-location, TTL flags, transient data.]

More Related Content

What's hot

Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Taiwan User Group
 
JRuby: Pushing the Java Platform Further
JRuby: Pushing the Java Platform FurtherJRuby: Pushing the Java Platform Further
JRuby: Pushing the Java Platform Further
Charles Nutter
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Itamar Haber
 
Serialization (Avro, Message Pack, Kryo)
Serialization (Avro, Message Pack, Kryo)Serialization (Avro, Message Pack, Kryo)
Serialization (Avro, Message Pack, Kryo)
오석 한
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon
 
Redis Modules API - an introduction
Redis Modules API - an introductionRedis Modules API - an introduction
Redis Modules API - an introduction
Itamar Haber
 
Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014
Charles Nutter
 
Serialization and performance by Sergey Morenets
Serialization and performance by Sergey MorenetsSerialization and performance by Sergey Morenets
Serialization and performance by Sergey Morenets
Alex Tumanoff
 
Experiences with Evangelizing Java Within the Database
Experiences with Evangelizing Java Within the DatabaseExperiences with Evangelizing Java Within the Database
Experiences with Evangelizing Java Within the Database
Marcelo Ochoa
 
Spry 2017
Spry 2017Spry 2017
Spry 2017
Göran Krampe
 
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Codemotion
 
Tokyo Cabinet
Tokyo CabinetTokyo Cabinet
Tokyo Cabinet
André Mayer
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Speedment, Inc.
 
Mongo db3.0 wired_tiger_storage_engine
Mongo db3.0 wired_tiger_storage_engineMongo db3.0 wired_tiger_storage_engine
Mongo db3.0 wired_tiger_storage_engine
Kenny Gorman
 
Dynamic pricing of Lyft rides using streaming
Dynamic pricing of Lyft rides using streamingDynamic pricing of Lyft rides using streaming
Dynamic pricing of Lyft rides using streaming
Amar Pai
 
Cassandra Lunch #59 Functions in Cassandra
Cassandra Lunch #59  Functions in CassandraCassandra Lunch #59  Functions in Cassandra
Cassandra Lunch #59 Functions in Cassandra
Anant Corporation
 
Tokyo Cabinet
Tokyo CabinetTokyo Cabinet
Tokyo Cabinet
ehuard
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
clairvoyantllc
 
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax AstraApache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Anant Corporation
 
Debugging Your Production JVM
Debugging Your Production JVMDebugging Your Production JVM
Debugging Your Production JVM
kensipe
 

What's hot (20)

Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-OnApache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On
 
JRuby: Pushing the Java Platform Further
JRuby: Pushing the Java Platform FurtherJRuby: Pushing the Java Platform Further
JRuby: Pushing the Java Platform Further
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Serialization (Avro, Message Pack, Kryo)
Serialization (Avro, Message Pack, Kryo)Serialization (Avro, Message Pack, Kryo)
Serialization (Avro, Message Pack, Kryo)
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
 
Redis Modules API - an introduction
Redis Modules API - an introductionRedis Modules API - an introduction
Redis Modules API - an introduction
 
Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014Bringing Concurrency to Ruby - RubyConf India 2014
Bringing Concurrency to Ruby - RubyConf India 2014
 
Serialization and performance by Sergey Morenets
Serialization and performance by Sergey MorenetsSerialization and performance by Sergey Morenets
Serialization and performance by Sergey Morenets
 
Experiences with Evangelizing Java Within the Database
Experiences with Evangelizing Java Within the DatabaseExperiences with Evangelizing Java Within the Database
Experiences with Evangelizing Java Within the Database
 
Spry 2017
Spry 2017Spry 2017
Spry 2017
 
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
 
Tokyo Cabinet
Tokyo CabinetTokyo Cabinet
Tokyo Cabinet
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
 
Mongo db3.0 wired_tiger_storage_engine
Mongo db3.0 wired_tiger_storage_engineMongo db3.0 wired_tiger_storage_engine
Mongo db3.0 wired_tiger_storage_engine
 
Dynamic pricing of Lyft rides using streaming
Dynamic pricing of Lyft rides using streamingDynamic pricing of Lyft rides using streaming
Dynamic pricing of Lyft rides using streaming
 
Cassandra Lunch #59 Functions in Cassandra
Cassandra Lunch #59  Functions in CassandraCassandra Lunch #59  Functions in Cassandra
Cassandra Lunch #59 Functions in Cassandra
 
Tokyo Cabinet
Tokyo CabinetTokyo Cabinet
Tokyo Cabinet
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax AstraApache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
 
Debugging Your Production JVM
Debugging Your Production JVMDebugging Your Production JVM
Debugging Your Production JVM
 

Viewers also liked

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Igor Anishchenko
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
Toby Matejovsky
 
Avro Data | Washington DC HUG
Avro Data | Washington DC HUGAvro Data | Washington DC HUG
Avro Data | Washington DC HUG
Cloudera, Inc.
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
Krishna Gade
 
Building Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache KafkaBuilding Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache Kafka
Brian Ritchie
 
Avro intro
Avro introAvro intro
Avro intro
Randy Abernethy
 
Reliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at AirbnbReliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at Airbnb
DataWorks Summit/Hadoop Summit
 
Apache Avro and You
Apache Avro and YouApache Avro and You
Apache Avro and You
Eric Wendelin
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
airisData
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streams
jimriecken
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
Jeffrey Breen
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
GetInData
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
confluent
 
Rsplit apply combine
Rsplit apply combineRsplit apply combine
Rsplit apply combine
Michelle Darling
 
Event Driven Architecture
Event Driven ArchitectureEvent Driven Architecture
Event Driven Architecture
Stefan Norberg
 
Apache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePersonApache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePerson
LivePerson
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]
LivePerson
 

Viewers also liked (17)

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Avro Data | Washington DC HUG
Avro Data | Washington DC HUGAvro Data | Washington DC HUG
Avro Data | Washington DC HUG
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
Building Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache KafkaBuilding Event-Driven Systems with Apache Kafka
Building Event-Driven Systems with Apache Kafka
 
Avro intro
Avro introAvro intro
Avro intro
 
Reliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at AirbnbReliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at Airbnb
 
Apache Avro and You
Apache Avro and YouApache Avro and You
Apache Avro and You
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streams
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
 
Rsplit apply combine
Rsplit apply combineRsplit apply combine
Rsplit apply combine
 
Event Driven Architecture
Event Driven ArchitectureEvent Driven Architecture
Event Driven Architecture
 
Apache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePersonApache Avro and Messaging at Scale in LivePerson
Apache Avro and Messaging at Scale in LivePerson
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]
 

Similar to Type safe, versioned, and rewindable stream processing with Apache {Avro, Kafka} and Scala

14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
Athens Big Data
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
HostedbyConfluent
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
Travis Oliphant
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
Samuel Kerrien
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
José Román Martín Gil
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 
Node.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scaleNode.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scale
Dmytro Semenov
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
NetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmapNetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmap
Ruslan Meshenberg
 
Openstack overview thomas-goirand
Openstack overview thomas-goirandOpenstack overview thomas-goirand
Openstack overview thomas-goirand
OpenCity Community
 
Change data capture
Change data captureChange data capture
Change data capture
Ron Barabash
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
ScyllaDB
 
What's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon ValleyWhat's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon Valley
Ceph Community
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
StampedeCon
 
Evolving ALLSTOCKER: Agile increments with Pharo Smalltalk
Evolving ALLSTOCKER: Agile increments with Pharo SmalltalkEvolving ALLSTOCKER: Agile increments with Pharo Smalltalk
Evolving ALLSTOCKER: Agile increments with Pharo Smalltalk
ESUG
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 

Similar to Type safe, versioned, and rewindable stream processing with Apache {Avro, Kafka} and Scala (20)

14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Node.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scaleNode.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scale
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
NetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmapNetflixOSS meetup lightning talks and roadmap
NetflixOSS meetup lightning talks and roadmap
 
Openstack overview thomas-goirand
Openstack overview thomas-goirandOpenstack overview thomas-goirand
Openstack overview thomas-goirand
 
Change data capture
Change data captureChange data capture
Change data capture
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
What's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon ValleyWhat's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon Valley
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
Evolving ALLSTOCKER: Agile increments with Pharo Smalltalk
Evolving ALLSTOCKER: Agile increments with Pharo SmalltalkEvolving ALLSTOCKER: Agile increments with Pharo Smalltalk
Evolving ALLSTOCKER: Agile increments with Pharo Smalltalk
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 

Recently uploaded

Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...Superpower Your Apache Kafka Applications Development with Complementary Open...
Superpower Your Apache Kafka Applications Development with Complementary Open...
Paul Brebner
 
Going AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applicationsGoing AOT: Everything you need to know about GraalVM for Java applications
Going AOT: Everything you need to know about GraalVM for Java applications
Alina Yurenko
 
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom KittEnhanced Screen Flows UI/UX using SLDS with Tom Kitt
Enhanced Screen Flows UI/UX using SLDS with Tom Kitt
Peter Caitens
 
Hands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion StepsHands-on with Apache Druid: Installation & Data Ingestion Steps
Hands-on with Apache Druid: Installation & Data Ingestion Steps
servicesNitor
 
Refactoring legacy systems using events commands and bubble contexts
Refactoring legacy systems using events commands and bubble contextsRefactoring legacy systems using events commands and bubble contexts
Refactoring legacy systems using events commands and bubble contexts
Michał Kurzeja
 
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptxOperational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
Operational ease MuleSoft and Salesforce Service Cloud Solution v1.0.pptx
sandeepmenon62
 
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻
🏎️Tech Transformation: DevOps Insights from the Experts 👩‍💻

Type safe, versioned, and rewindable stream processing with Apache {Avro, Kafka} and Scala

  • 1. Type-safe, Versioned, and Rewindable Stream Processing with Apache {Avro, Kafka} and Scala -=[ confoo.ca ]=- Thursday February 19th 2015 Hisham Mardam-Bey Mate1 Inc.
  • 2. Overview ● Who is this guy? + quick Mate1 intro ● Before message queues ● How we use message queues ● Some examples
  • 3. Who is this guy? ● Linux user and developer since 1996 ● Started out hacking on Enlightenment ○ X11 window manager ● Worked with OpenBSD ○ building embedded network gear ● Did a whole bunch of C followed by Ruby ● Working with the JVM since 2007 ● Lately enjoying Erlang and Haskell; FP-FTW! (= github: mardambey twitter: codewarrior
  • 4. Mate1: quick intro ● Online dating, since 2003, based in Montreal ● Initially team of 3, around 40 now ● Engineering team has 13 geeks / geekettes ● We own and run our own hardware ○ fun! ○ mostly… https://github.com/mate1
  • 5. Some of our features... ● Lots of communication, chatting, push notifs ● Searching, matching, recommendations, geo-location features ● Lists of... friends, blocks, people interested, more ● News & activity feeds, counters, rating
  • 6. Before message queues ● Events via DAOs into MySQL ○ More data, more events lead to more latency ○ Or build an async layer around DAOs ■ Surely better solutions exist! ● Logs rsync’ed into file servers and Hadoop ○ Once every 24 hours ● MySQL Data partitioned functionally ○ Application layer sharding ● Custom MySQL replication for BI servers ○ Built fan-in replication for MySQL ● Data processed through Java, Jython, SQL
  • 7. Message queues ● Apache Kafka: fast, durable, distributed ● Stored data as JSON, in plain text ● Mapped JSON to Scala classes manually ● Used Kafka + Cassandra a lot ○ low latency reactive system (push, not pull) ○ used them to build: ■ near real time data / events feeds ■ live counters ■ lots of lists ● This was awesome; but we had some issues and wanted some improvements.
  • 8. Issues / improvements ● Did not want to keep manually marshalling data; potential mistakes -> type safety ● Code gets complicated when maintaining backward compatibility -> versioning ● Losing events is costly if a bug creeps out into production -> rewindable ● Wanted to save time and reuse certain logic and parts of the system -> reusable patterns ○ more of an improvement than an issue
  • 9. Type-safe ● Avoid stringified types, maps (no structure) ● Used Apache Avro for serialization: ○ Avro provides JSON / binary ser/de ○ Avro provides structuring and type safety ● Mapped Avro to Java/Scala classes ● Effectively tied: ○ Kafka topic <-> Avro schema <-> POJO ● Producers / consumers now type-safe and compile time checked
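The "Kafka topic <-> Avro schema <-> POJO" binding on slide 9 can be illustrated with a topic handle that carries its record type. This is only a minimal sketch under assumptions: `Topic`, `Producer`, `WebLog` and `MsgSent` are hypothetical names, the producer is an in-memory stand-in, and no Avro or Kafka code is involved.

```scala
object TypedTopics {
  // Stand-ins for classes generated from Avro schemas (hypothetical record types).
  final case class WebLog(url: String, status: Int)
  final case class MsgSent(from: Long, to: Long)

  // A topic handle carries the type of record it holds.
  // It is invariant in A, so only exactly that record type is accepted.
  final case class Topic[A](name: String)

  val WebLogs: Topic[WebLog]   = Topic("WEB_LOGS")
  val MsgsSent: Topic[MsgSent] = Topic("MSG_SENT")

  // In-memory stand-in for a producer; a real one would serialize and send to Kafka.
  final class Producer {
    private val sent = scala.collection.mutable.ListBuffer.empty[(String, Any)]

    // The record type must match the topic's type parameter, checked at compile time.
    def send[A](topic: Topic[A], record: A): Unit = sent += (topic.name -> record)

    def log: List[(String, Any)] = sent.toList
  }
}
```

With this shape, `producer.send(TypedTopics.WebLogs, TypedTopics.MsgSent(1L, 2L))` is rejected by the compiler, which is the kind of mistake manual JSON marshalling would only reveal at runtime.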
  • 10. Versioning, why? ● All was fine… until we had to alter schemas! ● Distributed producers means: ○ multiple versions of the data being generated ● Distributed consumers means: ○ multiple versions of the data being processed ● Rolling upgrades are the only way in prod ● Came up with a simple data format
  • 11. Simple (extensible) data format ● magic: byte identifying data format / version ● schemaId: version of the schema to use ● data: plain text / binary bytes ○ ex: JSON encoded data ● assumption: schema name = Kafka topic
        ----------------------
        | magic    | 1 byte  |
        | schemaId | 2 bytes |
        | data     | N bytes |
        ----------------------
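The three-field layout on slide 11 (1-byte magic, 2-byte schemaId, N bytes of data) could be framed and unframed roughly as follows. The slides do not give the magic value or the byte order, so `Magic = 0x01` and big-endian encoding are assumptions, as are the names `Envelope`, `encode` and `decode`.

```scala
import java.nio.ByteBuffer

object Envelope {
  // Hypothetical magic value; the slides do not say which byte was used.
  val Magic: Byte = 0x01

  final case class Decoded(magic: Byte, schemaId: Short, data: Array[Byte])

  // Frame a payload as: magic (1 byte) | schemaId (2 bytes, big-endian assumed) | data (N bytes).
  def encode(schemaId: Short, data: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 2 + data.length)
    buf.put(Magic).putShort(schemaId).put(data)
    buf.array()
  }

  // Split a framed message back into its parts; frames shorter than the header are rejected.
  def decode(bytes: Array[Byte]): Either[String, Decoded] =
    if (bytes.length < 3) Left(s"frame too short: ${bytes.length} bytes")
    else {
      val buf      = ByteBuffer.wrap(bytes)
      val magic    = buf.get()
      val schemaId = buf.getShort()
      val data     = new Array[Byte](buf.remaining())
      buf.get(data)
      Right(Decoded(magic, schemaId, data))
    }
}
```

A consumer would read the header first, then use the topic name plus `schemaId` to pick the schema for decoding `data`.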
  • 12. Schema loading ● Load schemas based on: ○ Kafka topic name (ex: WEB_LOGS, MSG_SENT, ...) ○ Schema ID / version (ex: 0, 1, 2) ● How do we store / fetch schemas? ○ local file system ○ across the network (database? some repository?) ● Decided to integrate AVRO-1124 ○ a few patches in a Jira ticket ○ not part of mainstream Avro
  • 13. Avro Schema Repository & Resolution ● What is an Avro schema repository? ○ HTTP based repo, originally filesystem backed ● AVRO-1124: integrated (and now improved) ○ Back on Github (Avro + AVRO-1124) ■ https://github.com/mate1/avro ○ Also a WIP fork into a standalone project ■ https://github.com/schema-repo/schema-repo ● Avro has schema resolution / evolution ○ provides rules guaranteeing version compatibility ○ allows for data to be decoded using multiple schemas (old and new)
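To make the lookup and resolution ideas on slides 12 and 13 concrete, here is a simplified sketch with no Avro dependency. `Schema`, `Repo` and `resolve` are hypothetical stand-ins: the repository is an in-memory map keyed by (topic, schemaId), and resolution is reduced to two of Avro's rules for records, namely that writer fields unknown to the reader are ignored and that reader fields missing from the data take their default.

```scala
object SchemaRepoSketch {
  // Simplified stand-in for a record schema: field name -> optional default value.
  final case class Schema(fields: Map[String, Option[String]])

  // In-memory stand-in for the HTTP schema repository, keyed by (topic, schemaId).
  final class Repo(schemas: Map[(String, Short), Schema]) {
    def lookup(topic: String, schemaId: Short): Option[Schema] =
      schemas.get((topic, schemaId))
  }

  // Project a written record onto the reader's schema:
  // fields the reader does not know are dropped, missing fields take their default,
  // and a missing field without a default is an error.
  def resolve(written: Map[String, String], reader: Schema): Either[String, Map[String, String]] =
    reader.fields.foldLeft[Either[String, Map[String, String]]](Right(Map.empty)) {
      case (Right(acc), (name, default)) =>
        written.get(name).orElse(default) match {
          case Some(value) => Right(acc + (name -> value))
          case None        => Left(s"no value or default for field '$name'")
        }
      case (failed, _) => failed
    }
}
```

Real Avro resolution covers more cases (type promotion, aliases, unions, enums), so this only shows the shape of the idea.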
  • 14. Rolling upgrades, how? ● Make new schema available in repository ● Rolling producer upgrades ○ produce old and new version of data ● Rolling consumer upgrades ○ consumers consume old and new version of data ● Eventually... ○ producers produce new version (now current) ○ consumers consume new version (now current)
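During the rolling upgrade described on slide 14, a consumer has to accept both the old and the new version of a message. A minimal sketch of that dispatch follows, assuming a hypothetical `MsgSent` event whose version 2 added a `device` field. Payloads are comma-separated text purely for illustration; the talk used Avro.

```scala
object RollingUpgradeSketch {
  // Hypothetical current (v2) shape of the event; v1 lacked the `device` field.
  final case class MsgSent(from: Long, to: Long, device: String)

  // Build the current shape, failing if the numeric fields do not parse.
  private def parse(from: String, to: String, device: String): Option[MsgSent] =
    for {
      f <- from.toLongOption
      t <- to.toLongOption
    } yield MsgSent(f, t, device)

  // Decode by schema version; old messages are upgraded with a default device.
  // Unknown versions or malformed payloads yield None.
  def decode(schemaId: Short, payload: String): Option[MsgSent] =
    (schemaId.toInt, payload.split(",")) match {
      case (1, Array(f, t))    => parse(f, t, "unknown")
      case (2, Array(f, t, d)) => parse(f, t, d)
      case _                   => None
    }
}
```

Once every producer emits version 2 and the retained version 1 data has aged out, the version 1 branch can be removed.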
  • 15. Rewindable ● Why? ○ Re-process data due to downstream data loss ○ Buggy code causes faulty data / statistics ○ Rebuild downstream state after system crash or restart ● How? ○ We take advantage of Kafka design ○ Let’s take a closer look at that...
  • 16. Kafka Consumers and Offsets ● Kafka consumers manage their offsets ○ Offsets not managed by the broker ○ Data is not deleted upon consumption ○ Offsets stored in Zookeeper, usually (<= 0.8.1.1) ■ This changed with Kafka 0.8.2.0! Finally! ● Kafka data retention policies ○ time / size based retention ○ key based compaction ■ infinite retention! ● Need to map offsets to points in time ○ Allows for resetting offsets to a point in time
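Slide 16 notes the need to map offsets to points in time so that a consumer can be reset to a given moment. One possible way to do that is a sparse, sorted index of (timestamp, offset) pairs. The sketch below is an assumption about how such an index might look, not the tooling from the talk; `OffsetIndex`, `record` and `rewindTo` are hypothetical names, and it assumes timestamps do not decrease as offsets increase.

```scala
import scala.collection.immutable.TreeMap

object OffsetIndexSketch {
  // Sparse index: timestamp (epoch millis) -> offset of a message observed at that time.
  final case class OffsetIndex(entries: TreeMap[Long, Long] = TreeMap.empty[Long, Long]) {

    // Keep the earliest offset recorded for a given timestamp.
    def record(timestampMs: Long, offset: Long): OffsetIndex =
      if (entries.contains(timestampMs)) this
      else copy(entries = entries.updated(timestampMs, offset))

    // Offset to rewind to for a target time: the newest indexed entry at or before it.
    // This may replay some messages older than the target, but skips none at or after it.
    def rewindTo(timestampMs: Long): Option[Long] =
      entries.rangeTo(timestampMs).lastOption.map(_._2)
  }
}
```

Replaying slightly too much is the safer side to err on here, since the goal on slide 15 is to rebuild or repair downstream state.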
  • 17. Currently, manual rewinding ● 2 types of Kafka consumers: ○ ZK based, one event at a time ○ MySQL based, batch processing ■ Kafka + MySQL offset store + ZFS = transactional rollbacks ■ Used to transactionally get data into MySQL ● Working on tools to automate the process ○ Specifically to take advantage of 0.8.2.0’s offset management API
  • 18. Reusable ● Abstracted out some patterns, like: ○ Enrichment ○ Filtering ○ Splitting / Routing ○ Merging ● Let’s see how we use them...
  • 19. Reusable [diagram; labels as extracted] System Events Hadoop Device Detection MySQL App Events WebLog Consumer Enricher PushNotif Consumer Router XMPP Consumer APN Consumer GCM Consumer GeoIP Service EmailNotif Consumer Filter X msgs / hr / user MTA Service Internet Batch Consumers NearRealTime Consumers Dashboards Kafka XMPP Kafka Apple Kafka Google Kafka Kafka Enriched WebLog InboxCache Consumer Redis Web Servers Cache
  • 20. Fin! That’s all folks (= Thanks! Questions?
  • 21. Reusable ● Emerging patterns
        ● Enrichment
          abstract class Enricher[Input <: SpecificRecord, Output <: SpecificRecord] {
            def enrich(input: Input): Output
          }
        ● Filtering
          abstract class Filter[Input <: SpecificRecord] {
            def filter(input: Input): Option[Input]
          }
  • 22. Reusable ● More patterns
        ● Splitting
          abstract class Splitter2[Input <: SpecificRecord, Output <: SpecificRecord] {
            def split(input: Input): Tuple2[Output, Output]
          }
        ● Merging
          abstract class Merger2[Input1 <: SpecificRecord, Input2 <: SpecificRecord, Output <: SpecificRecord] {
            def merge(input1: Input1, input2: Input2): Output
          }
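As a rough illustration of how the `Enricher` and `Filter` abstractions from slide 21 might be used for the web log case, the sketch below filters and then enriches a list of records. Several things are assumptions: `SpecificRecord` is a local marker trait standing in for `org.apache.avro.specific.SpecificRecord` so the example has no dependencies, the record types and the static-asset rule are invented, and a lookup table stands in for the GeoIP service.

```scala
object PatternsSketch {
  // Marker standing in for org.apache.avro.specific.SpecificRecord (no Avro dependency here).
  trait SpecificRecord

  // The two abstractions as shown on slide 21.
  abstract class Enricher[Input <: SpecificRecord, Output <: SpecificRecord] {
    def enrich(input: Input): Output
  }
  abstract class Filter[Input <: SpecificRecord] {
    def filter(input: Input): Option[Input]
  }

  // Hypothetical record types; real ones would be generated from Avro schemas.
  final case class WebLog(ip: String, path: String) extends SpecificRecord
  final case class EnrichedWebLog(ip: String, path: String, country: String) extends SpecificRecord

  // Drops requests for static assets (an invented rule, for illustration only).
  object NoStaticAssets extends Filter[WebLog] {
    def filter(input: WebLog): Option[WebLog] =
      if (input.path.startsWith("/static/")) None else Some(input)
  }

  // Adds a country using a lookup table that stands in for a GeoIP service.
  final class GeoEnricher(countryByIp: Map[String, String]) extends Enricher[WebLog, EnrichedWebLog] {
    def enrich(input: WebLog): EnrichedWebLog =
      EnrichedWebLog(input.ip, input.path, countryByIp.getOrElse(input.ip, "unknown"))
  }

  // Filter first, then enrich whatever survives.
  def pipeline(geo: GeoEnricher)(logs: List[WebLog]): List[EnrichedWebLog] =
    logs.flatMap(log => NoStaticAssets.filter(log)).map(geo.enrich)
}
```

Because each stage is typed by its input and output records, a stage can only be chained after one that produces the record type it expects.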
  • 23. Reusable ● Usage examples: ○ Enrich web logs ■ GeoIP ■ User-Agent, mobile device details ○ Push notifications message router / scheduler ■ OS specific notifications ■ A/B tests ○ News feed type systems ○ Cache maintenance ■ Users’ inbox, friend lists ■ Consumable data by time interval (Redis)
  • 24. Data Pipeline Diagram (partial) [diagram; labels as extracted] App servers Web servers Other services Event Manager Event Manager Event Manager Kafka Kafka Kafka ZK ZK Consumers Consumers C* C* C* Play Play SOLR Redis Ejabberd Ejabberd Ejabberd APN NRT search Geo-location TTL flags transient data