A primer on building real-time data-driven products
www.mapflat.com
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group
Who’s talking?
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
Why stream processing?
● Increasing number of data-driven features
● 90+% fed by batch processing
○ Simpler, better tooling
○ 1+ hour data reaction time
● Stream processing for
○ 100 ms - 1 hour reaction time
○ Decoupled, asynchronous microservices
[Figure: data flow diagram. Labels: User content; Professional content; Ads / partners; User behaviour; Systems; Ads; System diagnostics; Recommendations; Data-based features; Curated content; Pushing; Business intelligence; Experiments; Exploration]
The organic past
● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from transformation bugs
[Figure: organically grown architecture. Labels: Service (x3); App (x3); DB; Poll; Queue; Aggregate logs; NFS; Hourly dump; Data warehouse; ETL; Queue; NFS; scp; DB; HTTP]
The unified log
● Publish data in streams
● Replicated, sharded append-only log
● Pub / sub with history
○ Kafka, Google Pub/Sub, AWS Kinesis
● Tap to data lake for batch processing
[Figure: unified log. Labels: Ads, Search, Feed; App (x3); Stream (x3); Data lake]
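The "pub / sub with history" idea can be sketched with a toy in-memory log. This is only an illustration: the `UnifiedLog` class, its methods and the topic name are hypothetical and do not correspond to the Kafka, Pub/Sub or Kinesis APIs. It assumes each topic is a list that is only ever appended to, and that every consumer keeps track of its own read offset.

```python
from collections import defaultdict


class UnifiedLog:
    """Toy append-only log with named topics (hypothetical, not a real client API)."""

    def __init__(self):
        self._topics = defaultdict(list)

    def append(self, topic, event):
        """Append an event and return its offset within the topic."""
        records = self._topics[topic]
        records.append(event)
        return len(records) - 1

    def read(self, topic, offset=0):
        """Return all events from the given offset onwards; history is retained."""
        return list(self._topics[topic][offset:])


log = UnifiedLog()
log.append("search", {"user": "u1", "query": "shoes"})
log.append("search", {"user": "u2", "query": "bikes"})

# A consumer that starts late still sees the full history.
late_consumer_view = log.read("search", offset=0)

# A consumer that remembers its own offset only reads what is new.
new_events = log.read("search", offset=1)
```

Because reading never removes anything, producers and consumers do not need to be running at the same time, which is what makes the tap to the data lake possible.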
Stream processing
● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovers from link failures
● Replay on job bug fix
[Figure: stream processing pipeline. Labels: Ads, Search, Feed; App (x3); Stream (x9); Job (x6); Data lake; Business intelligence]
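The "replay on job bug fix" point can be illustrated with a small sketch. It assumes the input stream is retained and that a job is a pure function from input event to output event; the job functions, field names and the bug itself are made up for the example.

```python
def job_v1(event):
    # Buggy version: reports cents as if they were whole currency units.
    return {"user": event["user"], "amount": event["amount_cents"]}


def job_v2(event):
    # Fixed version: converts cents to currency units.
    return {"user": event["user"], "amount": event["amount_cents"] / 100}


def run(job, stream):
    """Apply a job to every event in a retained input stream."""
    return [job(event) for event in stream]


# The input stream is kept in the log, so it can be read again from the start.
input_stream = [
    {"user": "u1", "amount_cents": 250},
    {"user": "u2", "amount_cents": 1000},
]

output_v1 = run(job_v1, input_stream)  # incorrect results already published
output_v2 = run(job_v2, input_stream)  # replay with the fixed job replaces them
```

The important property is that the fix needs no cooperation from the producers: the corrected output is derived entirely from history that is already in the log.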
Databases vs streams
● Applications need current state in a database.
● Database state should match the stream.
● Which one holds the truth?
A. Dual write. Simple but fragile.
B. Change data capture
C. Event sourcing / Command Query Responsibility Segregation (CQRS)
[Figure: variants A, B and C. Labels: Stream (x3); Service (x3); Job]
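A minimal sketch of option C, assuming the stream holds the truth and the current state is derived by folding over the events. The event types, field names and the `apply_event` function are hypothetical and chosen only for illustration.

```python
from functools import reduce


def apply_event(state, event):
    """Return a new state with one event applied; unknown event types are ignored."""
    balances = dict(state)
    account = event["account"]
    if event["type"] == "deposited":
        balances[account] = balances.get(account, 0) + event["amount"]
    elif event["type"] == "withdrawn":
        balances[account] = balances.get(account, 0) - event["amount"]
    return balances


# The event stream is the source of truth.
events = [
    {"type": "deposited", "account": "a", "amount": 100},
    {"type": "withdrawn", "account": "a", "amount": 30},
    {"type": "deposited", "account": "b", "amount": 5},
]

# The database content is a derived view that can be rebuilt at any time.
current_state = reduce(apply_event, events, {})
```

With this arrangement the question "which holds truth?" has a single answer: if the database and the stream disagree, the database is rebuilt from the stream.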
Stream processing building blocks
● Aggregate
○ Calculate time windows
○ Aggregate state (in memory / local database / shared database)
● Filter
○ Slim down stream
○ Privacy, security concerns
● Join
○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
○ Join streams within time windows, e.g. click-through rate
● Transform
○ Bring data into the same “shape” or schema
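The building blocks compose naturally as a chain of small functions. The sketch below uses plain Python generators rather than any stream processing framework, and assumes integer timestamps in seconds, 60-second tumbling windows and a hypothetical in-memory IP-to-country table standing in for a geo IP lookup.

```python
from collections import Counter

COUNTRY_BY_IP = {"10.0.0.1": "SE", "10.0.0.2": "NO"}  # hypothetical lookup table
WINDOW_SECONDS = 60

raw_events = [
    {"ts": 5, "ip": "10.0.0.1", "type": "click", "email": "a@example.com"},
    {"ts": 42, "ip": "10.0.0.2", "type": "view", "email": "b@example.com"},
    {"ts": 61, "ip": "10.0.0.1", "type": "click", "email": "a@example.com"},
    {"ts": 75, "ip": "10.0.0.2", "type": "click", "email": "b@example.com"},
]


def keep_clicks(events):
    """Filter: slim down the stream to click events."""
    return (e for e in events if e["type"] == "click")


def drop_private_fields(events):
    """Filter/transform: remove fields with privacy concerns."""
    return ({k: v for k, v in e.items() if k != "email"} for e in events)


def enrich_with_country(events):
    """Join: enrich each event from a static dataset."""
    return ({**e, "country": COUNTRY_BY_IP.get(e["ip"], "unknown")} for e in events)


def count_per_window(events):
    """Aggregate: count events per (window start, country), state kept in memory."""
    counts = Counter()
    for e in events:
        window_start = e["ts"] // WINDOW_SECONDS * WINDOW_SECONDS
        counts[(window_start, e["country"])] += 1
    return dict(counts)


clicks_per_window = count_per_window(
    enrich_with_country(drop_private_fields(keep_clicks(raw_events)))
)
```

Here the aggregate state lives in process memory; a local or shared database would take its place in a job that must survive restarts.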
Stream processing technologies
● Spark Streaming
○ Ideal if you are already using Spark, same model
○ Bridges the gaps between data scientists and data engineers, and between batch and stream
● Kafka Streams
○ Library - new, positions itself as a lightweight alternative
○ Tightly coupled to Kafka
● Others
○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
Egress
● Update database table, e.g. for polling dashboard
● Create service index table n+1. Notify service to switch.
● Post to external web service
● Push stream to client
[Figure: egress. Labels: Stream (x2); Job (x2); Service]
Pushing streams to clients
● Reactive streams
○ Akka streams, …
● Standard for back pressure
○ Protects consumers
● Suited for streams to terminal
[Figure: stream pushed to client. Labels: Increasing resources; Decreasing resources; Job; Service; App; Stream (x2)]
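Back pressure can be demonstrated with a bounded buffer between a fast producer and a consumer. This is a simplification: Reactive Streams implementations signal demand without blocking threads, whereas this sketch uses a blocking `queue.Queue` from the Python standard library. The buffer size and the function names are arbitrary choices for the example.

```python
import queue
import threading

BUFFER_SIZE = 2
buffer = queue.Queue(maxsize=BUFFER_SIZE)
DONE = object()  # sentinel that marks the end of the stream

received = []
max_observed = 0


def producer(n):
    for i in range(n):
        buffer.put(i)  # blocks while the buffer is full, slowing the producer down
    buffer.put(DONE)


def consumer():
    global max_observed
    while True:
        max_observed = max(max_observed, buffer.qsize())
        item = buffer.get()
        if item is DONE:
            break
        received.append(item)


producer_thread = threading.Thread(target=producer, args=(10,))
consumer_thread = threading.Thread(target=consumer)
producer_thread.start()
consumer_thread.start()
producer_thread.join()
consumer_thread.join()
```

The consumer is protected because the amount of undelivered data can never exceed the buffer size, no matter how fast the producer is.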
Real life challenges
● Events are delayed
● Events arrive out of order
● Events are duplicated
● Software changes
● Software has bugs
● Machines fail
Tooling, patterns, components do not provide sufficient solutions.
You need to plan and be aware.
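Two of these challenges, duplicates and out-of-order arrival, can be sketched as follows. The sketch assumes that producers attach a unique `id` and an `event_time` (in seconds) to every event; both field names are hypothetical. Counting by event time rather than arrival time makes the result independent of arrival order.

```python
def deduplicate(events):
    """Drop events whose id has been seen before."""
    seen = set()  # unbounded here; a real job would need to expire old ids
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            yield event


def count_by_event_time_window(events, window_seconds):
    """Count events per window, keyed by when they happened, not when they arrived."""
    counts = {}
    for event in events:
        start = event["event_time"] // window_seconds * window_seconds
        counts[start] = counts.get(start, 0) + 1
    return counts


arrived = [
    {"id": "a", "event_time": 10},
    {"id": "c", "event_time": 70},
    {"id": "b", "event_time": 50},  # arrives out of order
    {"id": "a", "event_time": 10},  # duplicate delivery
]

counts = count_by_event_time_window(deduplicate(arrived), window_seconds=60)
```

The sketch does not decide when a window is complete, which is the hard part with delayed events and the reason the planning mentioned above is needed.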
Architectural patterns
Strategies for handling imperfection - bugs, late events, volumes
[Figure: four architectural patterns, with the labels shown for each]
● Lambda: Stream; Stream pipeline; Batch pipeline; Data lake; Service
● Kappa: Stream; Stream job version n; Stream job version n+1; Switch; Stream; Data lake; Service
● Dataflow: Stream; Stream pipeline; Service; Late events; Updates
● Delta: Stream; Stream pipeline; Batch pipeline; Data lake; Service
● Further labels: Online; Nearline; Offline
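For the Lambda pattern, the serving side combines a complete but stale batch result with a fresh but partial stream result. The sketch below assumes both pipelines produce per-key counts and that the stream view only covers events since the last batch run; the view contents and the `merge_views` function are hypothetical.

```python
def merge_views(batch_view, realtime_view):
    """Combine batch counts with the counts accumulated since the last batch run."""
    merged = dict(batch_view)
    for key, value in realtime_view.items():
        merged[key] = merged.get(key, 0) + value
    return merged


batch_view = {"article-1": 1000, "article-2": 250}  # from the batch pipeline
realtime_view = {"article-1": 12, "article-3": 4}   # from the stream pipeline

served = merge_views(batch_view, realtime_view)
```

When the next batch run finishes, its output replaces `batch_view` and the stream view is reset, so errors in the stream pipeline, such as missed late events, are corrected by the batch pipeline.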
There is always a schema
● Schema on write
+ Stronger safety net - catch bugs earlier
- Requires upfront schema design before data can be received
- Synchronised deployment of whole pipeline
● Schema on read
+ Allows data to be captured as is
+ Easier to add/change fields
- More work to keep data consistent
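The trade-off can be made concrete with a small sketch. It assumes a hypothetical schema of two typed fields and plain Python lists as storage; the only difference between the two approaches is where `validate` is called.

```python
REQUIRED_FIELDS = {"user_id": str, "amount": int}  # hypothetical schema


def validate(record):
    """Raise ValueError unless every schema field is present with the right type."""
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), expected_type):
            raise ValueError(f"field {name!r} missing or not {expected_type.__name__}")
    return record


def write_with_schema(storage, record):
    """Schema on write: reject bad records before they are stored."""
    storage.append(validate(record))


def read_with_schema(storage):
    """Schema on read: everything was stored; sort out valid records when reading."""
    valid, invalid = [], []
    for record in storage:
        try:
            valid.append(validate(record))
        except ValueError:
            invalid.append(record)
    return valid, invalid


good = {"user_id": "u1", "amount": 5}
bad = {"user_id": "u2", "amount": "five"}

# Schema on write: the producer learns about the bug immediately.
strict_store = []
write_with_schema(strict_store, good)
try:
    write_with_schema(strict_store, bad)
    rejected = False
except ValueError:
    rejected = True

# Schema on read: the data is captured as is, and the consumer deals with it.
lenient_store = [good, bad]
valid, invalid = read_with_schema(lenient_store)
```

In both cases the same schema exists; it is only enforced at different points in the pipeline.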
Schema representation
● Avro, JSON Schema
● Records should include schema
○ Bundled
○ Id + schema registry
● Evolution must be planned
○ E.g. backward-compatible changes are allowed; incompatible changes go to a new topic.
○ Plan for replay of old messages
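The "id + schema registry" option and a backward-compatible change can be sketched together. The registry here is a plain dictionary and the message envelope is JSON; neither reflects the Avro format or any real schema registry API. The sketch assumes a schema is a mapping from field name to default value, and that version 2 only adds a field with a default.

```python
import json

REQUIRED = object()  # marker for fields that have no default

# Hypothetical registry: schema id -> {field name: default value}
SCHEMAS = {
    1: {"user_id": REQUIRED},
    2: {"user_id": REQUIRED, "country": "unknown"},  # adds a field with a default
}


def encode(schema_id, payload):
    """Wrap the payload with the id of the schema it was written with."""
    if schema_id not in SCHEMAS:
        raise KeyError(f"unknown schema id {schema_id}")
    return json.dumps({"schema_id": schema_id, "payload": payload})


def decode(message, reader_schema_id):
    """Read a message with the reader's schema, filling in defaults for new fields."""
    payload = json.loads(message)["payload"]
    record = {}
    for field, default in SCHEMAS[reader_schema_id].items():
        if field in payload:
            record[field] = payload[field]
        elif default is REQUIRED:
            raise ValueError(f"missing required field {field!r}")
        else:
            record[field] = default
    return record


old_message = encode(1, {"user_id": "u1"})
new_message = encode(2, {"user_id": "u2", "country": "SE"})

# Replaying the topic with the newest schema handles old and new messages alike.
replayed = [decode(m, reader_schema_id=2) for m in (old_message, new_message)]
```

Adding a required field without a default would make `decode` fail on old messages during replay, which is the kind of incompatible change that calls for a new topic.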
Thank you. Questions?
Credits:
Øyvind Løkling, Schibsted Media Group
● Content inspiration
Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images
Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).