In this guest webinar by Kevin Webber, we cover the entire architecture of a Reactive system, from a responsive UI implemented with Vue.js, to a fully event sourced collection of microservices implemented with Java, Lagom, Cassandra, and Kafka.
For the full recording, visit: https://www.lightbend.com/blog/full-stack-reactive-in-practice-webinar
Real Time Machine Learning Visualization With Spark (Chester Chen)
Training a machine learning model involves a lot of experimentation, so we need a way to visualize the training process.
We presented a system to enable real time machine learning visualization with Spark:
-- Gives visibility into the training of a model
-- Allows us to monitor the convergence of the algorithms during training
-- Can stop the iterations when convergence is good enough.
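The talk describes a live, Spark-specific visualization system; as a rough stand-in for where such convergence data can come from, here is a minimal Scala sketch (not the speaker's code) that pulls the per-iteration training loss out of a Spark ML training summary — the kind of signal a real-time monitor would chart. The app name is a placeholder; the sample data file ships with Spark distributions.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

object ConvergenceMonitor extends App {
  val spark = SparkSession.builder
    .appName("convergence-demo")   // hypothetical app name
    .master("local[*]")
    .getOrCreate()

  // Any (label, features) DataFrame works; this sample file ships with Spark.
  val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

  val lr = new LogisticRegression().setMaxIter(100).setTol(1e-6)
  val model = lr.fit(training)

  // objectiveHistory holds the training loss at each iteration; streaming
  // these values to a live chart shows whether the algorithm is converging.
  model.summary.objectiveHistory.zipWithIndex.foreach { case (loss, i) =>
    println(f"iteration $i%3d  loss $loss%.6f")
  }
  spark.stop()
}
```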
Data Transformations on Ops Metrics using Kafka Streams (Srividhya Ramachandr...) - confluent
How Priceline uses Kafka Streams to save TBs on the daily licenses of our monitoring systems. Kafka Streams powers a big part of our analytics and monitoring pipelines and delivers operational metrics transformations in real time. All logs and operational metrics from the APIs of Priceline’s products flow into Kafka and are ingested into our monitoring system, Splunk, for alerting and monitoring. We have implemented data transformations, aggregations and summarizations using Kafka Streams to eliminate PCI/PII violations in the log data, and to aggregate metrics so that we avoid ingesting sub-second readings and ingest only at the granularity we need (a sketch of such a windowed rollup follows). We will cover the need for custom Serdes and custom partitioners, and why we don’t use the Confluent Schema Registry. You will also learn how Priceline uses a self-service model to configure its streams, topics and consumers using the Data Collection Console, our UI for managing the Kafka streaming pipelines.
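As a hedged illustration of that rollup idea (my sketch, not Priceline's code), the following Kafka Streams Scala snippet aggregates sub-second metric readings into one-minute totals before they reach the ingest-licensed system. Topic names are made up, and `ofSizeWithNoGrace` assumes Kafka 3.x.

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object MetricsRollup extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-rollup")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  // "raw-metrics": key = metric name, value = reading (hypothetical topic).
  builder.stream[String, Long]("raw-metrics")
    .groupByKey
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    .reduce(_ + _)                 // roll sub-second readings up to 1-minute totals
    .toStream
    .map((windowedKey, total) => (windowedKey.key, total)) // drop the window wrapper
    .to("metrics-1m")              // only minute-level granularity is ingested

  new KafkaStreams(builder.build(), props).start()
}
```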
Ceilometer is a tool that collects usage and performance data, while Heat orchestrates complex deployments on top of OpenStack. Heat aims to autoscale its deployments, scaling up when they're running hot and scaling back when idle.
Ceilometer can access decisive data and trigger the appropriate actions in Heat. Where these two OpenStack projects meet, value is created: an alarming API in Ceilometer and its consumption in Heat.
Slides presented at the Fall OpenStack Design Summit in Hong Kong
Kafka, Killer of Point-to-Point Integrations (Lucian Lita) - confluent
With 60+ products and over 24% of the US GDP flowing through it, system integration is a tough problem for Intuit. Seasonality, scale, and massive peaks in products like TurboTax, QuickBooks, and Mint.com add extra layers of difficulty when building shared data services around transaction and user graphs, clickstream processing, a/b testing, and personalization. To reduce complexity and latency, we’ve implemented Kafka as the backbone across these data services. This allows us to asynchronously trigger relevant processing, elegantly scaling up and down as needed around peaks, all without the need for point-to-point integrations.
In this talk, we share what we’ve learned about Kafka at Intuit and describe our data services architecture. We found that Kafka is invaluable in achieving a scalable, clean architecture, allowing engineering teams to focus less on integration and more on product development.
Monitoring Large-Scale Apache Spark Clusters at Databricks (Anyscale)
At Databricks, we manage Apache Spark clusters for customers to run various production workloads. In this talk, we share our experiences in building a real-time monitoring system for thousands of Spark nodes, including the lessons we learned and the value we’ve seen from our efforts so far.
This was part of a talk presented at the #monitorSF Meetup held at Databricks HQ in SF.
Streaming ETL to Elastic with Apache Kafka and KSQL (confluent)
Companies are recognizing the importance of a low-latency, scalable, fault-tolerant data backbone in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, enabling low-latency analytics, event-driven architectures and the population of multiple downstream systems. These data pipelines can be built using configuration alone.
In this talk we’ll see how easy it is to stream data from sources such as databases and into Kafka using the Kafka Connect API. We’ll use KSQL to filter, aggregate and join it to other data, and then stream this enriched data from Kafka out into targets such as Elasticsearch. All of this can be accomplished without a single line of code!
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin... (Databricks)
Yelp’s ad platform handles millions of ad requests every day. To generate ad metrics and analytics in real time, they built their ad event tracking and analysis pipeline on top of Spark Streaming. It allows Yelp to manage a large number of active ad campaigns and greatly reduce over-delivery. It also enables them to share ad metrics with advertisers in a more timely fashion.
This session will start with an overview of the entire pipeline and then focus on two specific challenges in the event consolidation part of the pipeline that Yelp had to solve. The first challenge is about joining multiple data sources together to generate a single stream of ad events that feeds into various downstream systems. That involves solving several problems that are unique to real-time applications, such as windowed processing and handling of event delays. The second challenge concerns state management across code deployments and application restarts. Throughout the session, the speakers will share best practices for the design and development of large-scale Spark Streaming pipelines for production environments.
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams (Lightbend)
Audience: Architects, Data Scientists, Developers
Technical level: Introductory
From home intrusion detection, to self-driving cars, to keeping data center operations healthy, Machine Learning (ML) has become one of the hottest topics in software engineering today. While much of the focus has been on the actual creation of the algorithms used in ML, the less talked-about challenge is how to serve these models in production, often utilizing real-time streaming data.
The traditional approach to model serving is to treat the model as code, which means that ML implementation has to be continually adapted for model serving. As the amount of machine learning tools and techniques grows, the efficiency of such an approach is becoming more questionable. Additionally, machine learning and model serving are driven by very different quality of service requirements; while machine learning is typically batch, dealing with scalability and processing power, model serving is mostly concerned with performance and stability.
In this webinar with O’Reilly author and Lightbend Principal Architect, Boris Lublinsky, we will define an alternative approach to model serving, based on treating the model itself as data. Using popular frameworks like Akka Streams and Apache Flink, Boris will review how to implement this approach, explaining how it can help you:
* Achieve complete decoupling between the model implementation for machine learning and model serving, enforcing better standardization of your model serving implementation.
* Enable dynamic updates of the served model without having to restart the system.
* Utilize TensorFlow and PMML as model representations for building a “real-time updatable” model serving architecture.
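A minimal Akka Streams sketch of the "model as data" idea (my illustration under stated assumptions, not Lublinsky's implementation): model updates arrive on the same stream as the records to score, and a stateful stage swaps in the newest model without restarting the system. The event types and in-memory sources are hypothetical stand-ins for Kafka topics.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Concat, Flow, Sink, Source}

// Two event types on one stream: model updates and data to score (made up).
sealed trait Event
final case class ModelUpdate(model: Double => Double) extends Event
final case class Record(x: Double) extends Event

object ModelAsData extends App {
  implicit val system: ActorSystem = ActorSystem("model-serving")

  val modelUpdates: Source[Event, _] = Source.single(ModelUpdate(x => x * 2)) // stand-in for a "models" topic
  val records: Source[Event, _]      = Source(1 to 5).map(i => Record(i.toDouble)) // stand-in for the data topic

  // statefulMapConcat keeps the latest model as mutable state: a ModelUpdate
  // swaps it in without restarting the stream, a Record is scored with it.
  val serve = Flow[Event].statefulMapConcat { () =>
    var current: Option[Double => Double] = None
    {
      case ModelUpdate(m) => current = Some(m); Nil
      case Record(x)      => current.map(m => m(x)).toList // unscored if no model yet
    }
  }

  // Concat keeps the demo deterministic: the model arrives before the data.
  Source.combine(modelUpdates, records)(Concat(_))
    .via(serve)
    .runWith(Sink.foreach(score => println(s"score: $score")))
}
```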
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ... (HostedbyConfluent)
In any enterprise or cloud application, task scheduling is a key requirement. Highly available and fault-tolerant task scheduling helps us meet our business goals.
A classic task scheduling infrastructure is typically backed by databases. The instances/services that perform the scheduling load the task definitions from the database into memory and perform the scheduling.
This kind of infrastructure creates issues such as stateful services, inability to scale the services horizontally, and proneness to frequent failures. If the state of these services is not maintained well, it may lead to inconsistency and integrity issues.
To mitigate these issues, we will explore a highly available and fault-tolerant task scheduling infrastructure using Kafka, Kafka Streams, and State Stores.
Time Series Analysis Using an Event Streaming Platform (Dr. Mirko Kämpf)
Advanced time series analysis (TSA) requires very special data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and safety protection.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation, we will look at typical data aggregation patterns and investigate how to apply analysis algorithms in the cloud. Finally, we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent Cloud.
Big data reactive streams and OSGi - M Rulli (mfrancis)
OSGi Community Event 2017 Presentation by Matteo Rulli [FlairBit]
One of the basic requirements for enabling big-data analytics is a rational and effective approach to data ingestion. In long-running projects the need arises to evolve the domain model, and this potentially affects data quality. As a consequence, the concept of versioning is crucial to keeping data-centric systems consistent: the importance of service dynamicity and good modularity support in a sound data ingestion workflow implementation can hardly be overestimated.
This talk demonstrates how to combine OSGi declarative services and OSGi robust versioning support to enable complex data ingestion use cases such as serialization upcasting, domain and data models segregation and events versioning. Both Akka and Cassandra are offered as OSGi services to materialize big-data processing workflows with no pain.
Streaming Transformations - Putting the T in Streaming ETL (confluent)
Speaker: Nick Dearden, Director of Engineering, Confluent
We’ll discuss how to leverage some of the more advanced transformation capabilities available in both KSQL and Kafka Connect, including how to chain them together into powerful combinations for handling tasks such as data-masking, restructuring and aggregations. Using KSQL, you can deliver the streaming transformation capability easily and quickly.
This is part 3 of 3 in Streaming ETL - The New Data Integration series.
Watch the recording: https://videos.confluent.io/watch/en56Qt3KAdrpQ4JE5EZNHj?.
At Spark Summit East in New York, we unveil PowerStream, an Internet of Things (IoT) simulation with visualizations and alerts based on real-time data from 2 million sensors across global wind farms.
At Improve Digital we collect and store large volumes of machine-generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMSs (MySQL), columnar stores (Infobright, Impala+Parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
Nowadays Akka is a popular choice for building distributed systems - there are a lot of case studies and successful examples in the industry.
But it still can be hard to switch to actor-based systems, because most of the tutorials and documentation don't show the way to assemble a real application using actors, especially in microservices environment.
Actor is a powerful abstraction in the message-driven environments, but it can be challenging to use familiar patterns and methodologies. At the same time, message-driven nature of actors is the biggest advantage that can be used for Reactive systems and microservices.
I want to share my experience and show how Domain-Driven Design and Enterprise Integration Patterns can be leveraged to design and build fine-grained microservices with synchronous and asynchronous communication. I'll focus on the core Akka functionality, but also explain how advanced features like Akka Persistence and Akka Cluster Sharding can be used together for achieving incredible results.
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis (Helena Edelson)
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv (Amazon Web Services)
"Low-latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing analytics on moving data using Amazon Kinesis and EMR/Spark Streaming, and share some best practices and real-world examples."
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...GeeksLab Odessa
4.6.16 AI&BigData Lab
Upcoming events: goo.gl/I2gJ4H
Как устроить анализ данных 40 млн. человек за 5 лет так, чтобы это выглядело почти в реальном времени.
Building a Reactive System with Akka - Workshop @ O'Reilly SAConf NYC (Konrad Malawski)
Intense 3 hour workshop covering Akka Actors, Cluster, Streams, HTTP and more. Including very advanced patterns.
Presented with Henrik Engstrom at O'Reilly Software Architecture Conference in New York City in 2017
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... (Helena Edelson)
O'Reilly webcast with myself and Evan Chan on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics (SingleStore)
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics: Novus, DigitalOcean, Akamai.
Building Predictive Applications with Real-Time Data Pipelines and Streamliner. Eric Frenkiel, CEO and Co-Founder, MemSQL
Big Data Streams Architectures. Why? What? How? (Anton Nazaruk)
With the current zoo of technologies and the different ways they interact, it's a big challenge to architect a system (or adapt an existing one) that conforms to low-latency big-data analysis requirements. Apache Kafka, and the Kappa Architecture in particular, are attracting more and more attention relative to the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams are proving to be a synergy in the big data world.
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System (Accumulo Summit)
Timely was born to visualize and analyze metric data at a scale untenable for existing solutions. We're returning to talk about what we've achieved over the past year, provide a detailed look into production architecture and discuss additional features added within the past year including alerting and support for external analytics.
– Speakers –
Drew Farris
Chief Technologist, Booz Allen Hamilton
Drew Farris is a software developer and technology consultant at Booz Allen Hamilton where he helps his clients solve problems related to large-scale analytics, distributed computing and machine learning. He is a member of the Apache Software Foundation and a contributing author to Manning Publications’ “Taming Text” and the Booz Allen Hamilton “Field Guide to Data Science”.
Bill Oley
Senior Lead Engineer, Booz Allen Hamilton
Bill Oley is a senior lead software engineer at Booz Allen Hamilton where he helps his clients analyze and solve problems related to large scale data ingest, storage, retrieval, and analysis. He is particularly interested in improving visibility into large scale systems by making actionable metrics scalable and usable. He has 16 years of experience designing and developing fault-tolerant distributed systems that operate on continuous streams of data. He holds a bachelor's degree in computer science from the United States Naval Academy and a master's degree in computer science from The Johns Hopkins University.
— More Information —
For more information see http://www.accumulosummit.com/
Streaming ETL with Apache Kafka and KSQL (Nick Dearden)
Companies new and old are all recognizing the importance of a low-latency, scalable, fault-tolerant data backbone - in the form of the Apache Kafka streaming platform. With Kafka developers can integrate multiple systems and data sources to enable low-latency analytics, event-driven architectures, and the population of downstream systems. What's more, these data pipelines can be built using configuration alone.
In this talk, we'll see how easy it is to capture a stream of data changes in real-time from a database such as MySQL into Kafka using the Kafka Connect framework and then use KSQL to filter, aggregate and join it to other data, and finally stream the results from Kafka out into multiple targets such as Elasticsearch and MySQL. All of this can be accomplished without a single line of Java code!
Building Continuous Application with Structured Streaming and Real-Time Data ... (Databricks)
One of the biggest challenges in data science is to build a continuous data application which delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains in how to connect them with various continuous and real-time data sources, guaranteeing the responsiveness and reliability of data applications.
In this talk, Nan and Arijit will summarize their experiences learned from serving the real-time Spark-based data analytic solutions on Azure HDInsight. Their solution seamlessly integrates Spark and Azure EventHubs which is a hyper-scale telemetry ingestion service enabling users to ingress massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics.
They’ll cover three topics: bridging the gap between the data communication models of Spark and the data source, adapting Spark to the rate control and message addressing of the data source, and the co-design of fault-tolerance mechanisms. This talk will share insights on how to build continuous data applications with Spark and increase the availability of connectors for Spark and different real-time data sources.
Porting a Streaming Pipeline from Scala to Rust (Evan Chan)
How we at Conviva ported a streaming data pipeline in months from Scala to Rust. What are the important human and technical factors in our port, and what did we learn?
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...DataStax Academy
Typesafe did a survey of Spark usage last year and found that a large percentage of Spark users combine it with Cassandra and Kafka. This talk focuses on streaming data scenarios that demonstrate how these three tools complement each other for building robust, scalable, and flexible data applications. Cassandra provides resilient and scalable storage, with flexible data format and query options. Kafka provides durable, scalable collection of streaming data with message-queue semantics. Spark provides very flexible analytics, everything from classic SQL queries to machine learning and graph algorithms, running in a streaming model based on "mini-batches", offline batch jobs, or interactive queries. We'll consider best practices and areas where improvements are needed.
This paper covers our experience of building real-time pipelines for financial data, the various open source libraries we experimented with and the impacts we saw in a very brief time.
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 (Databricks)
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
ReactiveSummeriserAkka-ScalaByBay2016
1. Implement a scalable statistical aggregation system using Akka
Scala by the Bay, 12 Nov 2016
Stanley Nguyen, Vu Ho
Email Security @ Symantec Singapore
2. The system
Provides a service to answer time-series analytical questions such as COUNT, TOPK, SET MEMBERSHIP and CARDINALITY on a dynamic set of data streams, using a statistical approach.
3. Motivation
The system collects data from multiple sources in streaming log format.
Some common questions in an Email Anti-Abuse system:
Most frequent items (IP, domain, sender, etc.)
Number of unique items
Have we seen an item before?
=> Need to be able to answer such questions in a timely manner
4. Data statistics
6K email logs/second
One email log is flattened out into subevents:
IP, sender, sender domain, etc.
Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
Total: ~200K messages/second
5. Challenges
Our system needs to be
Responsive
Space efficient
Reactive
Extensible
Scalable
Resilient
6. Sketching data structures
How many times have we seen a certain IP?
Count-Min Sketch (CMS): counting things + top-k
How many unique senders have we seen yesterday?
HyperLogLog (HLL): set cardinality
Did we see a certain IP last month?
Bloom Filter (BF): set membership
Trade-off: SPACE / SPEED
7. What is available vs. what we try to solve
What is available: implementing the data structures for finding cardinality (i.e. counting things), set membership and top-k elements – solved by using streamlib / Twitter Algebird.
What we try to solve: implementing a dynamic, reactive, distributed system for answering cardinality (i.e. counting things), set membership and top-k queries.
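For the "what is available" side, here is a minimal sketch of the three structures using Twitter Algebird (one of the two libraries named on the slide); keys and parameters are illustrative only, not the talk's configuration.

```scala
import com.twitter.algebird._

// Requires the "com.twitter" %% "algebird-core" dependency.
object SketchDemo extends App {
  // Count-Min Sketch: approximate per-key counts (and heavy hitters).
  val cmsMonoid = CMS.monoid[String](eps = 0.001, delta = 1e-8, seed = 1)
  val cms = cmsMonoid.create(Seq("1.2.3.4", "1.2.3.4", "5.6.7.8"))
  println(cms.frequency("1.2.3.4").estimate) // ~2

  // HyperLogLog: approximate set cardinality.
  val hllMonoid = new HyperLogLogMonoid(bits = 12)
  val hll = Seq("a@x.com", "b@y.com", "a@x.com")
    .map(s => hllMonoid.create(s.getBytes("UTF-8")))
    .reduce(hllMonoid.plus(_, _))
  println(hll.approximateSize.estimate) // ~2 unique senders

  // Bloom filter: approximate set membership (no false negatives).
  val bfMonoid = BloomFilter[String](numEntries = 10000, fpProb = 0.01)
  val bf = bfMonoid.create("1.2.3.4")
  println(bf.contains("1.2.3.4").isTrue) // true
}
```

Because all three are monoids, partial sketches built on different nodes can simply be merged, which is what makes them a good fit for the distributed aggregation described next.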
16. Splitter Hub
Splits the stream based on event type to a dynamic set of downstream consumers.
Consumers are actors which implement the CMS, BF, HLL, etc. logic.
Not available in akka-stream.
17. Splitter Hub API
Similar to the built-in akka-stream BroadcastHub; differs in its back-pressure implementation.
[[SplitterHub]].source can be supplied with a predicate/selector function to return a filtered subset of data.
19. Splitter Hub
The [[Source]] can be materialized any number of times: each materialization creates a new consumer which can be registered with the hub, and then receives items matching the selector function from the upstream.
Consumers can be added at run time.
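SplitterHub itself is the speakers' custom stage, so the API above is theirs; a rough stock-Akka approximation of "one hub, many selective consumers attachable at run time" combines BroadcastHub with a per-consumer filter, as in this sketch. As the slide notes, the back-pressure semantics differ: BroadcastHub delivers every element to every consumer and back-pressures on the slowest one.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{BroadcastHub, Keep, Sink, Source}

object HubDemo extends App {
  implicit val system: ActorSystem = ActorSystem("hub-demo")

  case class Event(kind: String, payload: String)

  // Run the producer once; the materialized Source can then be materialized
  // any number of times, each materialization registering a new consumer.
  val hub: Source[Event, _] =
    Source(List(Event("ip", "1.2.3.4"), Event("sender", "a@x.com"), Event("ip", "5.6.7.8")))
      .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.right)
      .run()

  // Each consumer filters for its event type -- a poor man's selector function.
  hub.filter(_.kind == "ip").runWith(Sink.foreach(e => println(s"ip consumer: $e")))
  hub.filter(_.kind == "sender").runWith(Sink.foreach(e => println(s"sender consumer: $e")))
}
```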
20. Consumers
Can be either local or remote.
Managed by a coordination actor.
Each implements a specific data structure (CMS/BF/HLL) for a particular event type over a specific time range.
Responsibility:
Answer a specific query.
Persist a serialization of the internal data structure (such as the count-min table) regularly.
[Diagram: a COUNT-QUERY is forwarded via the coordinator to the consumer's ref; the consumer saves snapshots]
24. Akka Stream TCP
Handled by the kernel (back-pressure, reliable).
For each worker, we create a source for each message type it is responsible for, using the SplitterHub source() API.
Connect each source to a TCP connection and send to the worker.
Back-pressure is maintained across the network.
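A minimal sketch of that per-worker TCP leg using Akka Streams' Tcp support; host, port and the newline framing are placeholder choices, not the talk's. TCP's own flow control is what carries the back-pressure across the network.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Source, Tcp}
import akka.util.ByteString

object TcpPush extends App {
  implicit val system: ActorSystem = ActorSystem("tcp-push")

  // One stream per worker: elements flow into a TCP connection; the kernel's
  // flow control propagates back-pressure from the remote worker to the source.
  val connection = Tcp(system).outgoingConnection("worker-host", 9090) // hypothetical address

  Source(1 to 1000)
    .map(n => ByteString(s"msg-$n\n"))  // naive newline framing for the sketch
    .via(connection)                    // bytes out; the worker's replies come back in
    .runForeach(bytes => println(bytes.utf8String))
}
```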
26. Master Failover
The Coordinator is a single point of failure.
Run multiple Coordinator actors as a Cluster Singleton.
Workers communicate with the master (heartbeat) using Cluster Client.
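A sketch of what such a Cluster Singleton setup typically looks like in classic Akka (the Coordinator body and all names are placeholders, not the talk's code): exactly one Coordinator runs in the cluster, and if its node dies, it is re-created on the oldest surviving node.

```scala
import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import akka.cluster.singleton.{ClusterSingletonManager, ClusterSingletonManagerSettings,
                               ClusterSingletonProxy, ClusterSingletonProxySettings}

// Placeholder for the talk's coordination actor.
class Coordinator extends Actor {
  def receive: Receive = { case msg => println(s"coordinating: $msg") }
}

object SingletonSetup extends App {
  val system = ActorSystem("cluster") // assumes cluster config in application.conf

  // The manager guarantees a single Coordinator instance across the cluster.
  system.actorOf(
    ClusterSingletonManager.props(Props[Coordinator](), PoisonPill, ClusterSingletonManagerSettings(system)),
    "coordinator")

  // Others address it through a proxy that always resolves the current instance.
  val proxy = system.actorOf(
    ClusterSingletonProxy.props("/user/coordinator", ClusterSingletonProxySettings(system)),
    "coordinatorProxy")

  proxy ! "register-worker"
}
```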
27. Worker Failover
Workers persist all events to a DB journal + snapshots.
Akka Persistence, with Redis for storing the journal + snapshots.
When a worker is down, its keys are re-distributed.
The master then redirects traffic to other workers.
CMS actors are restored on a new worker from the snapshot + journal.
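A simplified sketch of a snapshotting persistent actor in classic Akka Persistence. The real system persists a count-min table to a Redis journal; here the state is a plain map and the persistence id, message types and snapshot cadence are made up.

```scala
import akka.persistence.{PersistentActor, SnapshotOffer}

final case class Seen(key: String)

class CmsActor extends PersistentActor {
  override def persistenceId: String = "cms-ip-5min" // hypothetical id

  private var counts = Map.empty[String, Long]

  override def receiveCommand: Receive = {
    case Seen(key) =>
      persist(Seen(key)) { ev =>          // journal the event before mutating state
        counts += ev.key -> (counts.getOrElse(ev.key, 0L) + 1)
        if (lastSequenceNr % 1000 == 0) saveSnapshot(counts) // periodic snapshots
      }
    case ("count", key: String) => sender() ! counts.getOrElse(key, 0L)
  }

  // On restart (possibly on another node) the state is rebuilt from the
  // latest snapshot plus the events journaled after it.
  override def receiveRecover: Receive = {
    case SnapshotOffer(_, snap: Map[String, Long] @unchecked) => counts = snap
    case Seen(key) => counts += key -> (counts.getOrElse(key, 0L) + 1)
  }
}
```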
28. Benchmark
Akka-stream on a single node: 100K+ msg/second (one msg-type)
Akka-stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
Akka-stream on a remote node (remote TCP) with an Akka Persistence journal: 2,000+ msg/second (one msg-type)
29. Conclusion
Our system is:
Responsive
Reactive
Scalable
Resilient
Future work:
Make workers metric-agnostic
Scale out the master
Exactly-once delivery for workers
More flexible filtering using SplitterHub
Thanks for coming. Introduction.
Working in the Symantec Email Security team,
protecting our customers against all types of email abuse.
Presenting about our system.
What does our system do?
a service that answers analytical questions such as COUNT, TOPK, SET MEMBERSHIP, CARDINALITY for a fixed time interval such as last 5 minutes or last hour.
Uses a statistical approach.
“statistical” means ~> instead of storing everything in a database and running queries on it, we only store a statistical representation ~> approximate answers.
Streaming data instead of offline data: rates of thousands of messages per second and millisecond latencies.
To prevent email abuse,
we have a system that collects email logs; a log ~> sender, IP, domains and other metadata.
All data is streamed in and the volume can be quite large, so we need to be able to digest and process it quickly.
Upon collecting the data ~> extract the useful information from it in order to identify threats. Common questions we face:
Most frequent domain, sender within the last 5 minutes / 1 hour / 1 day
Number of unique domains
Have we seen this IP in the last 24 hours?
To detect and stop abuse as soon as possible ~> answer such questions in a timely and automated manner.
We handle about 6K email logs per second.
Each email log -> a complicated object.
We keep track of individual fields ->
~ aggregated per fixed time interval.
Total: 200K messages per second.
Challenges
Responsive => rates of thousands of queries per second, with millisecond latency.
Secondly, terabytes of data ~> space efficient.
In addition, responsive even under a sudden increase in traffic.
Extensible, because we dynamically add support for new data streams.
Scalable across multiple nodes, and resilient in the case of failure.
Responsive and space efficient ~> sketching data structures.
A sketch is a “summary of a dataset”:
storing the whole dataset vs. a statistical representation.
No exact answers, only approximations, but you can tweak the parameters.
Common sketches are Count-Min Sketch, BloomFilter and HyperLogLog.
Libraries like streamlib or Twitter Algebird are available,
but they don’t provide an easy way to implement a dynamic, reactive, distributed system out of the box.
Before diving into detail, how to use sketching data structures.
We use the streamlib library for the demo.
A single stream of IP addresses + how many times we have seen each:
initialize with some parameters,
for each element ~> update the count-min instance,
get the answer.
Summary:
sketching data structures => responsive and space efficient, because of a small memory footprint.
What about reactive and extensible?
We use actors.
1. A single IP stream ~> one actor is enough: a CMS actor.
2. Supporting several different types of messages for multiple time ranges ~> several consumer actors.
3. Producers ~> several producers.
4. Sending data from producers to consumers => a master actor (receives messages and passes them to a set of downstream actors).
Challenges ~> flow control.
Reactive = Akka Streams -> the Reactive Streams standard.
GraphDSL to represent = a graph.
Define a Source node -> sends, a Sink node -> receives, + intermediate nodes which we call “Flow”.
GraphDSL = a powerful abstraction to model how data passes from a Source through a number of computation stages.
If GraphDSL, how do we model it?
Different data streams + a complicated object ~> transform to something simpler for sketching. Format = (key, value).
Merge all streams ~> one big data stream.
Split ~> a set of downstream consumers <- aggregation.
Merged to a Sink.
Limitation of GraphDSL:
connections and # of nodes in the graph => must be known/specified upfront.
E.g. for Broadcast the # of consumers, and for Merge the # of producers, are fixed and cannot be changed at run time.
We don’t know the # of consumers + producers => need a way to allow consumers or producers to connect/disconnect.
Our design, instead of using dynamic streams? Ask the audience?
More specifically, we use MergeHub + SplitterHub.
MergeHub = part of dynamic streams, supported by akka recently.
It addresses the limitations of GraphDSL.
With MergeHub and SplitterHub there is no need to specify how many; consumers or producers join at run time.
Different types of producers = Kafka or TCP.
For TCP, multiple streams come in via multiple TCP connections.
Several consumers connect and get data from the SplitterHub.
MergeHub is provided by akka-stream; you don’t have to implement it yourself. Using it is simple:
materialize it to a Sink + stream your data to the Sink.
Example -> merge different data streams from multiple incoming TCP connections.
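A minimal MergeHub usage sketch along those lines, with in-memory stand-in sources instead of real TCP connections:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{MergeHub, Sink, Source}

object MergeDemo extends App {
  implicit val system: ActorSystem = ActorSystem("merge-demo")

  // Materialize the hub once: we get back a Sink that any number of
  // producers can attach to at run time.
  val toHub: Sink[String, _] =
    MergeHub.source[String](perProducerBufferSize = 16)
      .to(Sink.foreach(println))
      .run()

  // Two dynamic producers; in the talk these would be incoming TCP connections.
  Source(List("ip=1.2.3.4", "ip=5.6.7.8")).runWith(toHub)
  Source(List("sender=a@x.com")).runWith(toHub)
}
```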
After merging => we need to split the data into a set of downstream consumers based on event type.
One consumer should receive only a specified subset of messages.
The responsibility of the SplitterHub is to let you specify which subset of data is sent to a particular consumer.
SplitterHub is not available in akka-stream, so we had to implement it ourselves.
A custom SplitterHub.
The selector function acts as a sort of filter for a single stream.
The implementation is similar to BroadcastHub, with a few differences:
it requires a selector function;
secondly, instead of broadcasting, it splits based on event type => back-pressures differently.
A dynamic set of consumers reads data from a fixed-size buffer.
When an upstream element is available => push it into the buffer.
When a new consumer joins, it starts from a particular offset + moves forward.
On demand => check the current offset.
If it doesn’t match, jump to the next offset.
If it matches => push to the input port of the consumer.
How to use it?
Key thing: specify a selector function.
Materialize the source as many times as you want.
Each materialization -> creates + registers a consumer => the consumer receives the items from the upstream which match the selector function.
Consumers can be added or removed at run time.
We’ve already explained the producer side: merge + split.
How is the computation handled within a “consumer actor”?
It does the actual computation + is responsible for answering a specific query: Ex….
Managed by a coordination actor ~> keeps the reference to the Splitter Hub.
A query => coordination actor => forwarded to the corresponding actor.
In summary, we use sketching data structures to make our system both responsive and space efficient.
We use Akka Streams to make our system reactive.
And we use dynamic streams to make sure we can add support for a new type of data at run time, without shutting the system down for a new deployment.
My colleague will explain more about how to make the system scalable and resilient.
The problem is solved on a single machine. Sending to a remote system poses several challenges:
How to maintain back-pressure between 2 remote entities
How to maintain back-pressure between multiple sources and multiple consumers
The naive way, i.e. Actor to Actor => requires ACK, PULL, etc. => not easy to implement, error-prone
Using ActorPublisher/ActorSubscriber => not recommended, since the “request” message can be lost
We have seen that the SplitterHub allows us to add new consumers dynamically; this is where it really shines, because we can split traffic to new workers as they join.
The Coordinator actor on the Master node subscribes to cluster events and gets notified when a member of the cluster is Up.
The new worker connects to the cluster and registers itself with the Coordinator actor.
The SplitterHub creates a new source and filters the traffic through it using source(T => Boolean).
The new source flows data into the TCP connection, which in turn forwards it all to the remote worker.
When a worker goes down, the Coordinator actor updates the selector function to allow data to flow to a different worker.
After receiving the data, the worker pipes it to its CMS actor using mapAsync; this is one way to connect an actor to a stream. Back-pressure is handled by the actor mailbox + an ACK future.
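A sketch of that mapAsync + ask + Ack pattern (the actor body and message types are illustrative): each element becomes an ask whose future must complete before more than `parallelism` elements are in flight, so the actor's pace back-pressures the stream.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.stream.scaladsl.{Sink, Source}
import akka.util.Timeout
import scala.concurrent.duration._

final case class Update(key: String)
case object Ack

// Stand-in for the CMS actor: acknowledge each update after applying it.
class CmsStub extends Actor {
  def receive: Receive = { case Update(k) => /* update the sketch here */ sender() ! Ack }
}

object PipeToActor extends App {
  implicit val system: ActorSystem = ActorSystem("pipe")
  implicit val timeout: Timeout = 3.seconds

  val cms = system.actorOf(Props[CmsStub]())

  // mapAsync bounds in-flight asks: unanswered Acks back-pressure the stream.
  Source(List("1.2.3.4", "5.6.7.8"))
    .mapAsync(parallelism = 4)(ip => cms ? Update(ip))
    .runWith(Sink.ignore)
}
```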
Finally, the CMS actor is a persistent actor. In case of failure, a new CMS actor can be started on any node and recover its state from the persistence database.
Since the active master can run on any node, Cluster Client is used to communicate with the master => actor location transparency. Cluster Client automatically reconnects to a new master when failover happens.
The master doesn’t subscribe to cluster events to know if a node is up; instead it listens for registration messages from workers.