SlideShare a Scribd company logo
1 of 17
Download to read offline
www.mapflat.com
A primer on building real-time
data-driven products
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group
www.mapflat.com
Who’s talking?
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
www.mapflat.com
Why stream processing?
● Increasing number of
data-driven features
● 90+% fed by batch processing
○ Simpler, better tooling
○ 1+ hour data reaction time
● Stream processing for
○ 100ms - 1 hour reaction
○ Decoupled, asynchronous
microservices
User content
Professional
content
Ads / partners
User
behaviour
Systems
Ads
System
diagnostics
Recommendations
Data-based
features
Curated
content
Pushing
Business
intelligence
Experiments
Exploration
www.mapflat.com
The organic past
● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from
transformation bugs
Service Service Service
App App App
DB
Poll
Queue
Aggregate
logs
NFS
Hourly dump
Data
warehouse
ETL
Queue
NFS
scp
DB
HTTP
www.mapflat.com
● Publish data in streams
● Replicated, sharded
append-only log
● Pub / sub with history
○ Kafka, Google Pub/Sub,
AWS Kinesis
● Tap to data lake for batch
processing
Unified log
The unified log
Ads Search Feed
App App App
StreamStream Stream
Data lake
www.mapflat.com
● Decoupled
producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovers from link failures
● Replay on job bug fix
Stream processing
Job
Ads Search Feed
App App App
StreamStream Stream
Stream Stream Stream
Job
Job
Stream Stream Stream
Job
Job
Data lake
Business
intelligence
Job
www.mapflat.com
● Applications need current state in DB.
● Should match stream.
● Which holds truth?
A. Dual write. Simple & fragile.
B. Change data capture
C. Event sourcing /
Command & Query Responsibility Segregation
Databases vs streams
Stream
Service
Stream
Service
Stream
Service
Job
A
B
C
www.mapflat.com
Stream processing building blocks
● Aggregate
○ Calculate time windows
○ Aggregate state (in memory / local database / shared database)
● Filter
○ Slim down stream
○ Privacy, security concerns
● Join
○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
○ Join streams within time windows, e.g. click-through rate
● Transform
○ Bring data into same “shape”, schema
www.mapflat.com
Stream processing technologies
● Spark Streaming
○ Ideal if you are already using Spark, same model
○ Bridges gap between data science / data engineers, batch and stream
● Kafka Streams
○ Library - new, positions itself as a lightweight alternative
○ Tightly coupled to Kafka
● Others
○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
www.mapflat.com
● Update database table, e.g. for
polling dashboard
● Create service index table n+1.
Notify service to switch.
● Post to external web service
● Push stream to client
Egress
Service
Stream Stream
Job Job
www.mapflat.com
Pushing streams to clients
● Reactive streams
○ Akka streams, …
● Standard for back pressure
○ Protects consumers
● Suited for streams to terminal Increasing
resources
Decreasing
resources
Job
Service
App
Stream
Stream
www.mapflat.com
Real life challenges
● Events are delayed
● Events arrive out of order
● Events are duplicated
● Software changes
● Software has bugs
● Machines fail
Tooling, patterns, components do not provide sufficient solutions.
You need to plan and be aware.
www.mapflat.com
Architectural patterns
Strategies for handling imperfection - bugs, late events, volumes
Stream Stream pipeline
Service
Batch pipeline
Lambda
Data lake
Kappa
Stream Stream job version n
Stream job version n+1
Data lake
Service
Stream
Switch
Dataflow
Stream Stream pipeline Service
Late events Updates
Delta
Stream Stream pipeline
Data lake
Service
Batch pipeline
Online
Nearline
Offline
www.mapflat.com
There is always a schema
● Schema on write
+ Stronger safety net - catch bugs earlier
- Requires upfront schema design before data can be received
- Synchronised deployment of whole pipeline
● Schema on read
+ Allows data to be captured as is
+ Easier to add/change fields
- More work to keep data consistent
www.mapflat.com
Schema representation
● Avro, JSON Schema
● Records should include schema
○ Bundled
○ Id + schema registry
● Evolution must be planned
○ E.g. Backward compatible changes allowed; Incompatible -> new topic.
○ Plan for replay of old messages
www.mapflat.com
Secret tip: miscount!
● Sparse data structures
○ Trade little accuracy, 1-3%
○ Gain space, factor 100 - 10000
● Basic approximate building blocks
○ Approximate filter (Bloom filter), unique item counters (HyperLogLog), top lists (TopK),
per-item counter (Count-Min Sketch), percentiles (T-digest), nearest neighbours
○ Sparse memories
● Sufficient for machine learning
○ Collaborative filtering - recommendations
○ Clustering
○ Outlier detection
○ Similarity search
www.mapflat.com
Thank you. Questions?
Credits:
Øyvind Løkling, Schibsted Media Group
● Content inspiration
Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images
Tracey Saxby, Integration and Application Network, University of Maryland
Center for Environmental Science (ian.umces.edu/imagelibrary/).

More Related Content

What's hot

10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Taro L. Saito
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniquesLars Albertsson
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Tin Ho
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Superset druid realtime
Superset druid realtimeSuperset druid realtime
Superset druid realtimearupmalakar
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFlink Forward
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 

What's hot (20)

10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Superset druid realtime
Superset druid realtimeSuperset druid realtime
Superset druid realtime
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and Bytes
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 

Viewers also liked

JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines,  API, Messaging and Stream ProcessingJustGiving – Serverless Data Pipelines,  API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines, API, Messaging and Stream ProcessingLuis Gonzalez
 
Tapjoy: Building a Real-Time Data Science Service for Mobile Advertising
Tapjoy: Building a Real-Time Data Science Service for Mobile AdvertisingTapjoy: Building a Real-Time Data Science Service for Mobile Advertising
Tapjoy: Building a Real-Time Data Science Service for Mobile AdvertisingSingleStore
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Timothy Spann
 
Data Pipelines Made Simple with Apache Kafka
Data Pipelines Made Simple with Apache KafkaData Pipelines Made Simple with Apache Kafka
Data Pipelines Made Simple with Apache Kafkaconfluent
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317Nan Zhu
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...confluent
 
Pantallazos ovas
Pantallazos ovasPantallazos ovas
Pantallazos ovasdalumaca77
 
Античність
 Античність Античність
Античністьenka2017
 
Slideshare la mediacion
Slideshare la mediacionSlideshare la mediacion
Slideshare la mediacionTania martin
 
Transformaciones lineales y método de gauss.
Transformaciones lineales y método de gauss.Transformaciones lineales y método de gauss.
Transformaciones lineales y método de gauss.CharlesJMorris
 
Slideshare la mediacion
Slideshare la mediacionSlideshare la mediacion
Slideshare la mediacionTania martin
 
Manual de prevencion_de_riesgos_mop
Manual de prevencion_de_riesgos_mopManual de prevencion_de_riesgos_mop
Manual de prevencion_de_riesgos_mopMarion Alvarado
 
Nuevas formas de liderazgo en eguipos
Nuevas formas de liderazgo en eguiposNuevas formas de liderazgo en eguipos
Nuevas formas de liderazgo en eguiposarlen mendieta
 

Viewers also liked (20)

JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines,  API, Messaging and Stream ProcessingJustGiving – Serverless Data Pipelines,  API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
 
Tapjoy: Building a Real-Time Data Science Service for Mobile Advertising
Tapjoy: Building a Real-Time Data Science Service for Mobile AdvertisingTapjoy: Building a Real-Time Data Science Service for Mobile Advertising
Tapjoy: Building a Real-Time Data Science Service for Mobile Advertising
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms
 
Data Pipelines Made Simple with Apache Kafka
Data Pipelines Made Simple with Apache KafkaData Pipelines Made Simple with Apache Kafka
Data Pipelines Made Simple with Apache Kafka
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
 
Pantallazos ovas
Pantallazos ovasPantallazos ovas
Pantallazos ovas
 
Античність
 Античність Античність
Античність
 
Slideshare la mediacion
Slideshare la mediacionSlideshare la mediacion
Slideshare la mediacion
 
Leccion.antitusigenos
Leccion.antitusigenosLeccion.antitusigenos
Leccion.antitusigenos
 
Transformaciones lineales y método de gauss.
Transformaciones lineales y método de gauss.Transformaciones lineales y método de gauss.
Transformaciones lineales y método de gauss.
 
Slideshare la mediacion
Slideshare la mediacionSlideshare la mediacion
Slideshare la mediacion
 
25 Μαρτίου 1821
25 Μαρτίου 182125 Μαρτίου 1821
25 Μαρτίου 1821
 
Entrevista2
Entrevista2Entrevista2
Entrevista2
 
Entrevista narcis
Entrevista narcisEntrevista narcis
Entrevista narcis
 
Orden HFP/255/2017, de 21 de marzo,
Orden HFP/255/2017, de 21 de marzo,Orden HFP/255/2017, de 21 de marzo,
Orden HFP/255/2017, de 21 de marzo,
 
Manual de prevencion_de_riesgos_mop
Manual de prevencion_de_riesgos_mopManual de prevencion_de_riesgos_mop
Manual de prevencion_de_riesgos_mop
 
Nuevas formas de liderazgo en eguipos
Nuevas formas de liderazgo en eguiposNuevas formas de liderazgo en eguipos
Nuevas formas de liderazgo en eguipos
 
Key to Social Media Success
Key to Social Media SuccessKey to Social Media Success
Key to Social Media Success
 

Similar to Build Real-Time Data Products with Stream Processing

Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Adrianos Dadis
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...Flink Forward
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Processing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processProcessing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processJampp
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsPetr Novotný
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
 
Sistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeSistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeGlobant
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 

Similar to Build Real-Time Data Products with Stream Processing (20)

Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Gcp dataflow
Gcp dataflowGcp dataflow
Gcp dataflow
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Processing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processProcessing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the process
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Sistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeSistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta Lake
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 

More from Lars Albertsson

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processingLars Albertsson
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisisLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platformLars Albertsson
 

More from Lars Albertsson (20)

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Data democratised
Data democratisedData democratised
Data democratised
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 

Recently uploaded

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 

Recently uploaded (20)

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 

Build Real-Time Data Products with Stream Processing

  • 1. www.mapflat.com A primer on building real-time data-driven products Lars Albertsson, independent consultant Øyvind Løkling, Schibsted Media Group
  • 2. www.mapflat.com Who’s talking? ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (very large machines) ● Google (Hangouts, productivity) ● Recorded Future (NLP startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat - independent data engineering consultant
  • 3. www.mapflat.com Why stream processing? ● Increasing number of data-driven features ● 90+% fed by batch processing ○ Simpler, better tooling ○ 1+ hour data reaction time ● Stream processing for ○ 100ms - 1 hour reaction ○ Decoupled, asynchronous microservices User content Professional content Ads / partners User behaviour Systems Ads System diagnostics Recommendations Data-based features Curated content Pushing Business intelligence Experiments Exploration
  • 4. www.mapflat.com The organic past ● Many paths ● Synchronous ● Link failure -> chain failure ● Heterogeneous ● Difficult to recover from transformation bugs Service Service Service App App App DB Poll Queue Aggregate logs NFS Hourly dump Data warehouse ETL Queue NFS scp DB HTTP
  • 5. www.mapflat.com ● Publish data in streams ● Replicated, sharded append-only log ● Pub / sub with history ○ Kafka, Google Pub/Sub, AWS Kinesis ● Tap to data lake for batch processing Unified log The unified log Ads Search Feed App App App StreamStream Stream Data lake
  • 6. www.mapflat.com ● Decoupled producers/consumers ○ In source/deployment ○ In space ○ In time ● Publish results to log ● Recovers from link failures ● Replay on job bug fix Stream processing Job Ads Search Feed App App App StreamStream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Data lake Business intelligence Job
  • 7. www.mapflat.com ● Applications need current state in DB. ● Should match stream. ● Which holds truth? A. Dual write. Simple & fragile. B. Change data capture C. Event sourcing / Command & Query Responsibility Segregation Databases vs streams Stream Service Stream Service Stream Service Job A B C
  • 8. www.mapflat.com Stream processing building blocks ● Aggregate ○ Calculate time windows ○ Aggregate state (in memory / local database / shared database) ● Filter ○ Slim down stream ○ Privacy, security concerns ● Join ○ Enrich by joining with datasets, e.g. geo IP lookup, demographics ○ Join streams within time windows, e.g. click-through rate ● Transform ○ Bring data into same “shape”, schema
  • 9. www.mapflat.com Stream processing technologies ● Spark Streaming ○ Ideal if you are already using Spark, same model ○ Bridges gap between data science / data engineers, batch and stream ● Kafka Streams ○ Library - new, positions itself as a lightweight alternative ○ Tightly coupled to Kafka ● Others ○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
  • 10. www.mapflat.com ● Update database table, e.g. for polling dashboard ● Create service index table n+1. Notify service to switch. ● Post to external web service ● Push stream to client Egress Service Stream Stream Job Job
  • 11. www.mapflat.com Pushing streams to clients ● Reactive streams ○ Akka streams, … ● Standard for back pressure ○ Protects consumers ● Suited for streams to terminal Increasing resources Decreasing resources Job Service App Stream Stream
  • 12. www.mapflat.com Real life challenges ● Events are delayed ● Events arrive out of order ● Events are duplicated ● Software changes ● Software has bugs ● Machines fail Tooling, patterns, components do not provide sufficient solutions. You need to plan and be aware.
  • 13. www.mapflat.com Architectural patterns Strategies for handling imperfection - bugs, late events, volumes Stream Stream pipeline Service Batch pipeline Lambda Data lake Kappa Stream Stream job version n Stream job version n+1 Data lake Service Stream Switch Dataflow Stream Stream pipeline Service Late events Updates Delta Stream Stream pipeline Data lake Service Batch pipeline Online Nearline Offline
  • 14. www.mapflat.com There is always a schema ● Schema on write + Stronger safety net - catch bugs earlier - Requires upfront schema design before data can be received - Synchronised deployment of whole pipeline ● Schema on read + Allows data to be captured as is + Easier to add/change fields - More work to keep data consistent
  • 15. www.mapflat.com Schema representation ● Avro, JSON Schema ● Records should include schema ○ Bundled ○ Id + schema registry ● Evolution must be planned ○ E.g. Backward compatible changes allowed; Incompatible -> new topic. ○ Plan for replay of old messages
  • 16. www.mapflat.com Secret tip: miscount! ● Sparse data structures ○ Trade little accuracy, 1-3% ○ Gain space, factor 100 - 10000 ● Basic approximate building blocks ○ Approximate filter (Bloom filter), unique item counters (HyperLogLog), top lists (TopK), per-item counter (Count-Min Sketch), percentiles (T-digest), nearest neighbours ○ Sparse memories ● Sufficient for machine learning ○ Collaborative filtering - recommendations ○ Clustering ○ Outlier detection ○ Similarity search
  • 17. www.mapflat.com Thank you. Questions? Credits: Øyvind Løkling, Schibsted Media Group ● Content inspiration Confluent, LinkedIn, Google, Netflix, Apache Samza ● Images Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/).