Moving Towards a
Streaming Architecture
Garry Turkington (CTO), Gabriele Modena (Data Scientist)
Improve Digital
Real-time adtech
SSP (supply-side platform)
http://improvedigital.com/
@ImproveDigital
Data sources
 A geographically distributed fleet of ad servers pushes out several types of data,
primarily:
- 1-minute metrics packets
- 15-minute full log files
 Previously we used an ad hoc system to process the former and Hadoop for the latter
Conceptualizing the data
 We think of event streams when considering the operational system
 But we have this sliced into many log files for practical reasons
 Recreating the stream-based view gets us back to what is happening
 And brings data creation and processing closer together
How did we get here
 Need to process more data, faster
- Reporting
- Decision Making
 Evolution of the system to meet scalability requirements
 Stream processing by definition
 It is changing the way we think about data collection and distribution
 Not everything is obvious
 It opened new doors
Let’s take a step back
 Aggregates grow in number and size
 Keep a copy in HDFS
 Run a job, then go and make coffee
 Generate new / other aggregates / ad hoc datasets
 Analytics data warehouse (column store), denormalized data model
 BI, reporting
 Hadoop for logs and event processing (e.g. ETL, simulation, learning, optimization)
Hadoop workflows
Incremental processing
 Ingest data every e.g. 24 hours (/data/raw/event/d=YYYY-MM-DD); see the sketch below
 Scheduling and partitioning are straightforward
 Schema migrations: Avro + HCatalog
 Fixed-length time windows
- What is a good length?
- OK if it takes 1 hour to process 15 minutes of data
 Need to reprocess data
 It can be made efficient (e.g. DataFu Hourglass)
 MapReduce
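A minimal sketch of how a batch job can enumerate one of these daily partitions, assuming the /data/raw/event/d=YYYY-MM-DD convention above; the class and method names are illustrative, not our production code.

import java.io.IOException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper: resolve and list one daily partition so a scheduled
// batch job can work against a fixed-length time window.
public class DailyPartition {
  private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

  public static Path partitionFor(LocalDate day) {
    // Matches the /data/raw/event/d=YYYY-MM-DD layout described above
    return new Path("/data/raw/event/d=" + DAY.format(day));
  }

  public static FileStatus[] listPartition(Configuration conf, LocalDate day) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.listStatus(partitionFor(day));
  }
}

A daily workflow would typically run against yesterday's partition once the window has closed, which is where the fixed-length window questions above come from.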
A more interactive system
We added tools to circumvent MapReduce
- Latency, really
- Still a “batch” mindset
Cloudera Impala (+ Parquet)
- Fast, familiar
Move computation closer to data
- Spark (among other things) for prototyping
- Python/R integration, batch and “streaming”
Outcome
 Multiple overlapping data sources, processing tools
- point-to-point
 Different pipelines for analytics and production models
- Divide
- Where is the “correct” data?
All in all
 Data collection and distribution is critical
 We are still looking at “yesterday’s” data
 We need a different level of abstraction
Challenges of near-realtime data
 Instructive to ask why near-realtime data analysis is hard
 In the Hadoop batch world we’d be happy processing 24 hours of data in 4 hours
 So why do we flinch at processing 24 hours of data in 24 hours?
 What allows batch workloads to be processed in sub-linear time:
- Scheduling
- Parallelization
- Partitioning
The streaming view
 Consider these from the streaming perspective:
- Scheduling is challenging because of unknown and variable data rates
- Parallelization breaks our nice logical model of a single data stream
- Partitioning can’t be based simply on file size / number of blocks
- The method of partitioning the stream is fundamental
Kafka
 We use Kafka (http://kafka.apache.org) for all data we view as streams
 It is resilient, performant and provides a very clear topic-based interface
 Producers write to topics, consumers read from topics (see the producer sketch below)
 Topics can be partitioned based on a producer-provided key
 Provides at-least-once message delivery semantics
 Messages are persistently stored – topics can be re-read
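As a minimal sketch of the producer side, this snippet writes a keyed message to a topic with the standard Kafka Java client; the topic name, key, payload and broker address are illustrative assumptions, not our production setup.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MetricsProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092"); // illustrative broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // The key (here an ad-server id) determines the partition, so all
      // records from one server land in the same topic partition.
      producer.send(new ProducerRecord<>("metrics-1min", "server-42", "{\"requests\":1200}"));
    }
  }
}

Because delivery is at-least-once, downstream consumers have to tolerate occasional duplicates; this comes up again in the stream job output discussion below.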
Samza
 We use Samza (http://samza.incubator.apache.org) for stream processing
 Analogous to MapReduce atop YARN atop HDFS
 We have Samza atop YARN atop Kafka
Apache Samza
 Kafka for streaming
 YARN for resource management and execution
 Samza API for processing
 Sweet spot: seconds to minutes
[Layer diagram: Samza API atop YARN atop Kafka]
Samza API
A simple call-back API that should be familiar to MapReduce developers (a worked example follows the interfaces):
public interface StreamTask {
  // Called once for every incoming message
  void process(IncomingMessageEnvelope envelope,
               MessageCollector collector,
               TaskCoordinator coordinator);
}
public interface WindowableTask {
  // Called periodically, on a configurable window interval
  void window(MessageCollector collector,
              TaskCoordinator coordinator);
}
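A purely illustrative sketch of a task implementing both interfaces: it counts messages per key and emits the running counts to an output topic on each window tick. The output stream name, the String key/value assumption and the class name are ours, not part of the Samza API.

import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

// Counts messages per key and flushes the partial counts on every window tick.
public class KeyCountTask implements StreamTask, WindowableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "key-counts");
  private final Map<String, Long> counts = new HashMap<>();

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String key = (String) envelope.getKey();
    counts.merge(key, 1L, Long::sum);
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, entry.getKey(), entry.getValue().toString()));
    }
    counts.clear();
  }
}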
Kafka/Samza integration
 A Samza job comprises multiple tasks
 Each Kafka topic partition is consumed by only one Samza task
 Each YARN container runs multiple Samza tasks
 The output of a task is often written to a different topic
 Overall applications are then constructed by composition of jobs
Stream job output
 You’ve processed the data, now what?
 Need to consider how the output of each stream partition will be processed
 A downstream destination with an idempotent interface is ideal (see the sketch below)
 Otherwise, try to express the output as unique state
 Hardest is when the outputs need to be applied in sequence
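A toy sketch of what an idempotent downstream interface means here: writes are absolute “set” operations keyed by a unique identifier, so re-applying the same record after an at-least-once redelivery leaves the state unchanged. The class and key scheme are illustrative assumptions.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative idempotent sink: applying the same (key, value) record twice
// has the same effect as applying it once, so duplicate delivery is harmless.
public class IdempotentSink {
  private final Map<String, String> state = new ConcurrentHashMap<>();

  public void apply(String key, String value) {
    // An absolute "set" rather than an increment makes redelivery safe
    state.put(key, value);
  }

  public String get(String key) {
    return state.get(key);
  }
}

This is also why, in the aggregation comparison below, the “state change” form of output (set key k to x) is easier for consumers to handle than the “increment” form.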
Example: log data
 It’s a common model: server process rolls a log every x minutes
 Each resultant file is ingested into Kafka and read by multiple Samza jobs
 Each Samza task receives a series of potentially interleaved log chunks
 Consider a task receiving data from servers S1, S2 and S3 across time periods
T1..T3
 Its view may become something like:
S1[T1..T2], S2[T1..T2], S1[T2..T3], S3[T1..T2]…
Example - Streaming aggregation
 Input is a stream of records with the form (key, value)
 Output is a series of (key, aggregate)
 3 approaches to producing these outputs:
- Each task pushes local counts downstream
- Each task pushes to another stream aggregated by a different job
- Each task uses local state to produce final aggregate values
Comparing stream aggregation
 Option 1: each stream partition outputs local values for each key it sees across a
time window
- Each partial result is an update record (e.g. increment key k at time t by x)
- Downstream consumer has the responsibility of combining partial results
 Option 2: aggregate partial results in a second stream processing job
- The output becomes a series of state changes
(set the value of key k at time t to x)
Comparing stream aggregation contd
 Option 3 - push the state into the stream processing tasks
- Samza tasks can hold persistent local state
- A local key/value store, backed by a changelog Kafka topic
- Ensure that all records for each key are sent to the same partition
- The state processing task can then create unique state aggregates locally (sketched below)
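A sketch of option 3 using Samza’s local state support; the store name, the String/Long types and the assumption that the store (and its changelog) are declared in the job configuration are ours, and that configuration is omitted.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Option 3: keep the aggregate in task-local state. Because all records for a
// given key arrive at the same partition (and so the same task), the local
// count is the unique, authoritative aggregate for that key.
public class LocalStateAggregator implements StreamTask, InitableTask {
  private KeyValueStore<String, Long> counts;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "aggregate-store" is assumed to be declared in the job configuration,
    // backed by a changelog topic so it can be restored after a failure
    counts = (KeyValueStore<String, Long>) context.getStore("aggregate-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String key = (String) envelope.getKey();
    Long current = counts.get(key);
    counts.put(key, current == null ? 1L : current + 1L);
  }
}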
A side effect of Kafka
 A firehose for event data
- Can tap into with Samza
- Can tap into with Spark
 Unifies how we access and publish data
- Streams are defined by application / use case
- These can come from other teams
- Single entry point, multiple outputs
Analytics on streams
Incremental processing pipelines as a basis
Tried to reuse logic
 Not that simple
Conceptualization vs. implementation
Time windows
 Temporal constraints
 When is data ready? (a toy readiness check is sketched after this list)
- The key is defining “ready”, which is weaker than “complete”
- Enough to make an informed decision (this varies case by case)
- We can update the dataset to improve a confidence interval
 When are we done processing?
- Tradeoff between timeliness and accuracy
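To make “ready” (as opposed to “complete”) concrete, here is a toy readiness check: a window is declared ready once a configurable fraction of the known servers has reported data for it. The server/window identifiers and the threshold are illustrative assumptions.

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy readiness check: a time window is "ready" once enough servers have
// reported, even if it is not yet "complete".
public class WindowReadiness {
  private final Set<String> knownServers;
  private final double readyFraction;
  private final Map<String, Set<String>> reportedByWindow = new HashMap<>();

  public WindowReadiness(Set<String> knownServers, double readyFraction) {
    this.knownServers = knownServers;
    this.readyFraction = readyFraction;
  }

  public void record(String windowId, String serverId) {
    reportedByWindow.computeIfAbsent(windowId, w -> new HashSet<>()).add(serverId);
  }

  public boolean isReady(String windowId) {
    int reported = reportedByWindow.getOrDefault(windowId, Collections.<String>emptySet()).size();
    return reported >= readyFraction * knownServers.size();
  }
}

Results computed at “ready” can be revised later as stragglers arrive, which is the timeliness versus accuracy tradeoff above.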
Schema evolution
 Schema changes appear within / across time windows
- Not a solved problem
 Stop streaming, propagate changes, replay or resume
 SAMZA-317
- Avro SerDe
- Schema stream
Challenges
 Data completeness
 Window length
 Tradeoff between finding answers / satisfying temporal constraints
 Need to adjust our approach (representation and data structures)
- approximate
- probabilistic
- iterations to convergence
 This is a work in progress
Model-based outlier detection
Previously:
Batch and real-time datasets = different approaches
Now:
Same dataset
Model-based anomaly detection
 Reuse knowledge and infrastructure
What is normal?
 How do we set thresholds?
 Literature review (q-digest, t-digest); a toy thresholding sketch follows below
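A deliberately naive illustration of quantile-based thresholding: a value is flagged when it exceeds an empirical quantile of recent observations. The exact sort is for clarity only; on a real stream a summary structure such as q-digest or t-digest (referenced above) would take its place, and the window size and quantile level are arbitrary assumptions.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Toy outlier detector: flag a value if it exceeds the q-th quantile of the
// last N observations.
public class QuantileThreshold {
  private final Deque<Double> window = new ArrayDeque<>();
  private final int maxSize;
  private final double quantile;

  public QuantileThreshold(int maxSize, double quantile) {
    this.maxSize = maxSize;
    this.quantile = quantile;
  }

  public boolean isOutlier(double value) {
    boolean outlier = window.size() >= maxSize && value > currentThreshold();
    window.addLast(value);
    if (window.size() > maxSize) {
      window.removeFirst();
    }
    return outlier;
  }

  private double currentThreshold() {
    List<Double> sorted = new ArrayList<>(window);
    Collections.sort(sorted);
    int index = (int) Math.floor(quantile * (sorted.size() - 1));
    return sorted.get(index);
  }
}

For example, new QuantileThreshold(10000, 0.999) would flag roughly the top 0.1% of recent values, which is one simple way of answering “what is normal?”.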
Testing and alerting
 Some features are hard to test
 Functionally correct but unexpected consequences
 unexpected ~ hard to predict
 advertising is a complex system
 Bugs are outliers we can detect (sort of)
Testing as predictive modelling
 Monitoring (next to testing)
 Make software monitorable
 Think of testing as an experiment
 Establish short, continuous, feedback loops
Implications
 How do we use real-time output?
- “live” dashboards
- Instrumentation and alerting (predict failure)
- feed optimization models
- back propagating methods into ETL
 How we think about feedback loops
 Make data more consumable
Conclusion
 Streaming is part of the broader system
 We are fine tuning how the two fit together
 Data collection and distribution is important
 Publishing results follows from that
 Kafka + Samza
 A data and processing model that fits our applications well
 Conceptualization vs. implementation
 Real time != incremental
 A certain degree of uncertainty
 Tradeoffs
