Implement a scalable statistical
aggregation system using Akka
Scala by the Bay, 12 Nov 2016
Stanley Nguyen, Vu Ho
Email Security @ Symantec Singapore
The system
Provides a service to answer time-series analytical questions such as
COUNT, TOP-K, SET MEMBERSHIP, and CARDINALITY on a dynamic set
of data streams, using a statistical approach.
Motivation
 The system collects data from multiple sources in streaming log
format
 Some common questions in Email Anti-Abuse system
 Most frequent Items (IP, domain, sender, etc.)
 Number of unique items
 Have we seen an item before?
=> Need to be able to answer such questions in a timely manner
Data statistics
 6K email logs/second
 One email log is flattened out into subevents
 IP, sender, sender domain, etc.
 Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
Total: ~200K messages/second
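The jump from 6K logs/second to ~200K messages/second is the fan-out of one log into per-field subevents. A minimal sketch of that flattening step (the field names, the `SubEvent` shape, and the `1L` value are our own illustration, mirroring the (msg-type, @timestamp, key, value) format used later in the deck):

```scala
// Hypothetical shape of the flattening step (field names are our own
// illustration): one email log fans out into one
// (msg-type, @timestamp, key, value) subevent per tracked field.
final case class EmailLog(ip: String, sender: String, senderDomain: String, timestamp: Long)
final case class SubEvent(msgType: String, timestamp: Long, key: String, value: Long)

def flatten(log: EmailLog): Seq[SubEvent] = Seq(
  SubEvent("ip", log.timestamp, log.ip, 1L),
  SubEvent("sender", log.timestamp, log.sender, 1L),
  SubEvent("sender-domain", log.timestamp, log.senderDomain, 1L)
)

val events = flatten(EmailLog("1.2.3.4", "a@example.com", "example.com", 1478908800L))
// Three subevents from one log; with more tracked fields and a copy per
// time range (5 min, 1 h, 4 h, ...), 6K logs/s grows to ~200K msg/s.
```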
Challenges
 Our system needs to be
 Responsive
 Space efficient
 Reactive
 Extensible
 Scalable
 Resilient
Sketching data structures
 How many times have we seen a certain IP?
 Count Min Sketch (CMS): Counting things + TopK
 How many unique senders have we seen yesterday?
 HyperLogLog (HLL): Set cardinality
 Did we see a certain IP last month?
 Bloom Filter (BF): Set membership
SPACE / SPEED
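For illustration, here is a toy count-min sketch in plain Scala. The talk relies on streamlib / Twitter Algebird for production-grade implementations; this sketch only shows the update/estimate mechanics behind COUNT and TOP-K, and all names and parameters are our own:

```scala
import scala.util.hashing.MurmurHash3

// Toy count-min sketch (illustration only): `depth` hash rows over
// `width` counters. Estimates never undercount; they can only
// overcount when hash collisions occur.
final class CountMinSketch(depth: Int, width: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, seeded by the row index.
  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) % width
    if (h < 0) h + width else h
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // The minimum over all rows is the tightest upper bound we hold.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

val cms = new CountMinSketch(depth = 4, width = 1024)
Seq("1.2.3.4", "1.2.3.4", "5.6.7.8").foreach(ip => cms.add(ip))
println(cms.estimate("1.2.3.4")) // at least 2; exactly 2 unless hashes collide
```

This is the space/speed trade-off on the slide: a few KB of counters answer frequency queries in O(depth) time, at the cost of approximate answers.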
What is available
 Data structures for finding cardinality (i.e. counting things), set
membership, and top-k elements – solved by using streamlib /
Twitter Algebird
What we try to solve
 A dynamic, reactive, distributed system for answering cardinality,
set membership, and top-k queries
Sketching data structures
 Responsive
 Space efficient
 Reactive
 Extensible
 Scalable
 Resilient
Akka Actor
BACK PRESSURE?
Akka Stream
GraphDSL
FLOW-SHAPE NODE
Using GraphDSL
Events are normalized to (msg-type, @timestamp, key, value)
GraphDSL - Limitations
Our design – Dynamic stream
Merge Hub
 Provided by Akka Stream: allows a dynamic set of (TCP)
producers to connect at run time
Splitter Hub
 Splits the stream based on event type to a dynamic set of
downstream consumers.
 Consumers are actors which implement the CMS, BF, HLL, etc. logic.
 Not available in akka-stream; we implemented it ourselves.
Splitter Hub API
 Similar to akka-stream’s built-in BroadcastHub, but with a different
back-pressure implementation.
 SplitterHub.source can be supplied with a predicate (selector) function
to return a filtered subset of the data.
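Since SplitterHub itself is custom code not shown in the deck, its selector contract can be sketched without Akka: each consumer registers a predicate and receives only matching elements, and registration can happen at any time. Everything below (the class, its methods, the buffer-as-consumer stand-in) is an illustrative model, not the real hub:

```scala
import scala.collection.mutable

// Toy, non-Akka model of the SplitterHub contract (illustrative names):
// each consumer registers a selector predicate and receives only the
// elements it matches; consumers may register at any time.
final class ToySplitterHub[T] {
  private var consumers = List.empty[(T => Boolean, mutable.Buffer[T])]

  // Register a consumer; returns the buffer it will be fed through.
  def source(selector: T => Boolean): mutable.Buffer[T] = {
    val buf = mutable.Buffer.empty[T]
    consumers = (selector, buf) :: consumers
    buf
  }

  // Fan one upstream element out to every matching consumer.
  def push(elem: T): Unit =
    for ((selector, buf) <- consumers if selector(elem)) buf += elem
}

val hub = new ToySplitterHub[(String, String)] // (msg-type, key)
val ipEvents     = hub.source(_._1 == "ip")
val domainEvents = hub.source(_._1 == "sender-domain")

Seq(("ip", "1.2.3.4"), ("sender-domain", "example.com"), ("ip", "5.6.7.8"))
  .foreach(hub.push)

// ipEvents now holds both "ip" events; domainEvents the one domain event.
```

The real SplitterHub additionally propagates back pressure from each consumer to the single upstream, which this toy model does not attempt.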
Splitter Hub’s Implementation
Splitter Hub
 The Source can be materialized any number of times; each
materialization creates a new consumer which can be registered with the
hub, and then receives items matching the selector function from the
upstream.
Consumers can be added at run time
Consumers
 Can be either local or remote.
 Managed by a coordination actor.
 Each implements a specific data structure (CMS/BF/HLL) for a particular
event type and a specific time range.
 Responsibilities:
 Answer a specific query (e.g. a COUNT query forwarded by the coordinator).
 Regularly persist (snapshot) a serialization of the internal data structure,
such as the count-min table.
 Responsive
 Space efficient
 Reactive
 Extensible
 Scalable
 Resilient
Scaling out
 What if the data does not fit in one machine?
 What if a server crashes?
 How do we maintain back pressure end-to-end?
Scaling out
Akka stream TCP
 Flow control is handled by the kernel (back pressure, reliable delivery).
 For each worker, we create a source for each message type it is
responsible for, using the SplitterHub source() API.
 Each source is connected to a TCP connection and sent to the worker.
 Back pressure is maintained across the network.
Master-Worker communication
Master Failover
 The Coordinator is a single point of failure.
 Run multiple Coordinator actors as a Cluster Singleton (one active at a time).
 Workers communicate with the master (heartbeat) using Cluster Client.
Worker Failover
 Workers persist all events to a DB journal + snapshots.
 Akka Persistence.
 Redis for storing the journal + snapshots.
 When a worker goes down, its keys are re-distributed.
 The master then redirects traffic to other workers.
 CMS actors are restored on the new worker from snapshot + journal.
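The snapshot + journal recovery step can be sketched without Akka Persistence (the types, names, and counting logic below are illustrative): restore the last snapshot, then replay only the journaled events with a higher sequence number.

```scala
// Toy illustration of snapshot + journal recovery (the talk uses Akka
// Persistence with a Redis journal; this only shows the recovery rule):
// state = snapshot, then replay every journaled event after it.
final case class Snapshot(counts: Map[String, Long], upToSeqNr: Long)
final case class Journaled(seqNr: Long, key: String)

def recover(snapshot: Snapshot, journal: Seq[Journaled]): Map[String, Long] =
  journal.filter(_.seqNr > snapshot.upToSeqNr).foldLeft(snapshot.counts) {
    case (counts, ev) => counts.updated(ev.key, counts.getOrElse(ev.key, 0L) + 1L)
  }

val snap    = Snapshot(Map("1.2.3.4" -> 10L), upToSeqNr = 3L)
val journal = Seq(Journaled(2L, "1.2.3.4"), Journaled(4L, "1.2.3.4"), Journaled(5L, "5.6.7.8"))
val restored = recover(snap, journal)
// Event 2 is already inside the snapshot, so only events 4 and 5 replay:
// restored == Map("1.2.3.4" -> 11, "5.6.7.8" -> 1)
```

Snapshots bound replay time on restart, which matters for the 2000+ msg/second journaled throughput shown in the benchmark.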
Benchmark
 Akka-stream on a single node: 100K+ msg/second (one msg-type)
 Akka-stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
 Akka-stream on a remote node (remote TCP) with Akka persistent journal: 2000+ msg/second (one msg-type)
Conclusion
 Our system is
 Responsive
 Reactive
 Scalable
 Resilient
 Future work:
 Make workers metric-agnostic
 Scale out the master
 Exactly-once delivery for workers
 More flexible filters using SplitterHub
Q&A



Editor's Notes

  1. Thanks for coming. Introduction: we work in Symantec's Email Security team, protecting our customers against all types of email abuse. We will present our system.
  2. What does our system do? It is a service that answers analytical questions such as COUNT, TOPK, SET MEMBERSHIP, CARDINALITY for a fixed time interval such as the last 5 minutes or the last hour, using a statistical approach. "Statistical" means that instead of storing everything in a database and running queries on it, we only store a statistical representation, which gives approximate answers. Streaming data instead of offline data; rates of thousands of messages per second and millisecond latency.
  3. To prevent email abuse, we have a system that collects email logs; a log contains sender, IP, domains and other metadata. All data is streamed in and the volume can be quite large, so we need to digest and process it quickly. Upon collecting the data, we extract the useful information from it in order to identify threats. Common questions we face: the most frequent domain or sender within the last 5 minutes / 1 hour / 1 day; the number of unique domains; have we seen this IP in the last 24 hours? To detect and stop abuse as soon as possible, we must answer such questions in a timely and automated manner.
  4. We handle about 6K email logs/second. Each email log is a complicated object; we keep track of individual fields and aggregate per fixed time interval, for a total of ~200K messages/second.
  5. Challenges Responsive => rate thousand of queries per second and with millisecond latency. Secondly, Terabytes of data ~> space efficient In addition, responsive even under sudden increased load in traffic extensible because dynamically add support for new data stream Scalable multiple nodes and resilient in the case of failure.
  6. responsive and space efficient ~> sketching data structure. “summary of a dataset” storing the whole dataset vs a statistical representation. No exact answers but only approximations. But can tweak the parameters. common sketching are Count-Min-Sketch, BloomFilter or HyperLogLog.
  7. streamlib or twitter algebird available But they don’t enable an easy way to implement a dynamic, reactive, distributed system out of the box..
  8. Before diving in detail, how to use sketching data strutures. streamlib library for demo. A single stream of IP address + how many times: Initializing with some parameters. each element ~> update the count-min instance. Get the answer.
  9. Summary sketching data structure => responsive and space efficient because a small memory footprint. ?reactive and extensible.
  10. We use actors. 1. For a single IP stream, one actor is enough: a CMS actor. 2. To support several different types of messages across multiple time ranges, we need several consumer actors. 3. There are also several producers. 4. To send data from producers to consumers, we add a master actor that receives each message and passes it to a set of downstream actors. The challenge here is flow control.
  11. Reactive = Akka Streams, which implements the Reactive Streams standard. We use GraphDSL to represent the computation as a graph: we define a Source node that sends data, a Sink node that receives it, and intermediate nodes which we call “Flows”. GraphDSL is a powerful abstraction to model how data passes from a Source through a number of computation stages.
  12. If we use GraphDSL, how do we model our problem? We have different data streams, each carrying a complicated object, which we transform into a simpler format suitable for sketching: (key, value) pairs. We merge all the streams into one big data stream, then split it to a set of downstream consumers, each performing an aggregation, and finally merge the results into a Sink.
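The flattening step might look like the sketch below, which turns one log object into the (msg-type, @timestamp, key, value) tuples shown on the slide. The `EmailLog` shape and field names are assumptions for illustration, not the production schema.

```scala
// Hypothetical log shape; real logs carry more metadata.
case class EmailLog(timestamp: Long, ip: String, sender: String, domain: String)

// Flatten one complicated log object into simple sub-events, one per
// tracked field, in (msgType, timestamp, key, value) form.
def flatten(log: EmailLog): Seq[(String, Long, String, Long)] = Seq(
  ("ip",     log.timestamp, log.ip,     1L),
  ("sender", log.timestamp, log.sender, 1L),
  ("domain", log.timestamp, log.domain, 1L)
)
```

This fan-out per field (and per time range downstream) is why 6K logs per second become roughly 200K messages per second.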
  13. The limitation of GraphDSL: the connections and the number of nodes in the graph must be known and specified upfront. E.g. for Broadcast the number of consumers, and for Merge the number of producers, are fixed and cannot be changed at run time. We don't know the number of consumers and producers in advance, so we need a way to allow consumers and producers to connect and disconnect dynamically.
  14. What does our design use instead? Dynamic streams: more specifically, a MergeHub plus a SplitterHub. MergeHub is part of the dynamic stream support added to Akka recently, and it addresses the limitation of GraphDSL: with a MergeHub and a SplitterHub there is no need to specify how many consumers or producers there are, and they can join at run time. We support different types of producers, e.g. Kafka or TCP; over TCP, multiple streams come in via multiple TCP connections. Several consumers connect to the SplitterHub and get data from it.
  15. MergeHub is provided by Akka Streams, so you don't have to implement it yourself. Using it is simple: materialize it to a Sink and stream your data to that Sink. For example, we use it to merge the data streams from multiple incoming TCP connections.
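A minimal sketch of that usage, assuming Akka 2.4.10+ (where `MergeHub` was introduced) on the classpath; the producers here are plain in-memory Sources standing in for the talk's per-connection TCP streams.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{MergeHub, Sink, Source}

object MergeHubDemo extends App {
  implicit val system = ActorSystem("demo")
  implicit val mat = ActorMaterializer()

  // Consumer side: in the real system this would feed the SplitterHub.
  val consumer = Sink.foreach[String](println)

  // Materializing MergeHub.source yields a Sink to which any number of
  // producers can attach at run time.
  val toHub: Sink[String, akka.NotUsed] =
    MergeHub.source[String](perProducerBufferSize = 16).to(consumer).run()

  // Two dynamic producers joining the merged stream.
  Source(List("ip=1.2.3.4", "ip=5.6.7.8")).runWith(toHub)
  Source.single("sender=a@b.com").runWith(toHub)
}
```

Each `runWith(toHub)` is a new dynamic producer; nothing about the graph had to be declared upfront.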
  16. After merging, we need to split the data to a set of downstream consumers based on event type: each consumer should receive only a specified subset of the messages. The responsibility of the SplitterHub is to let you specify which subset of the data is sent to a particular consumer. A SplitterHub is not available in akka-stream, so we have to implement it ourselves.
  17. Our custom SplitterHub takes a selector function that acts as a filter over the single merged stream. The implementation is similar to BroadcastHub, with a few differences: it requires a selector function, and instead of broadcasting every element it splits the stream based on event type, so it back-pressures differently.
  18. A dynamic set of consumers reads data from a fixed-size buffer. When an upstream element is available, it is pushed into the buffer. When a new consumer joins, it starts from a particular offset and moves forward. On demand, it checks the element at its current offset: if it doesn't match the selector, it jumps to the next offset; if it matches, the element is pushed to the consumer's input port.
  19. How do you use it? The key thing is to specify a selector function. You can then materialize the source as many times as you want; each materialization creates and registers a consumer, and that consumer receives only the upstream items that match its selector function. Consumers can be added or removed at run time.
  20. We have already explained the producers and the merge and split stages; how is the computation handled? The “consumer actors” do the actual computation, and each is responsible for answering a specific query. They are managed by a coordination actor, which also keeps the reference to the SplitterHub. A query goes to the coordination actor and is forwarded to the corresponding actor.
  21. In summary, we use sketching data structures to make our system both responsive and space efficient. We use Akka Streams to make our system reactive. And we use dynamic streams to make sure we can add support for a new type of data at run time, without shutting the system down for a new deployment. My colleague will now explain how to make the system scalable and resilient.
  22. So far the problem is solved on a single machine. Sending data to a remote system poses several challenges: how to maintain back pressure between two remote entities, and how to maintain back pressure from multiple sources to multiple consumers. The naive way, actor to actor, requires ACK, PULL, etc., which is not easy to implement and error prone. Using ActorPublisher/ActorSubscriber is not recommended, since the “request” message can be lost. We have seen that the SplitterHub allows us to add new consumers dynamically, and this is where it really shines, because we can split traffic to new workers as they join.
  23. The Coordinator actor on the master node subscribes to cluster events and gets notified when a member of the cluster is Up. A new worker connects to the cluster and registers itself with the Coordinator actor. The SplitterHub then creates a new source and filters the traffic through it using source(T => Boolean). The new source flows data into a TCP connection, which in turn forwards everything to the remote worker. When a worker goes down, the Coordinator actor updates the selector function so that the data flows to a different worker. After receiving the data, the worker pipes it to its CMS actor using mapAsync; this is one way to connect an actor to a stream, with back pressure handled by the actor mailbox plus an ACK future. Finally, the CMS actor is a persistent actor: in case of failure, a new CMS actor can be started on any node and recover its state from the persistence database.
  24. Since the active master can run on any node, a Cluster Client is used to communicate with the master, giving us actor location transparency. The Cluster Client automatically reconnects to a new master when failover happens. The master doesn't subscribe to cluster events to know whether a node is up; instead it listens for registration messages from the workers.