Justin and I gave this talk at QCon SF 2014 about Mantis, a stream processing system that features a reactive programming API, auto scaling, and stream locality.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yyaHb8.
The authors discuss Netflix's new stream processing system that supports a reactive programming model, allows auto scaling, and is capable of processing millions of messages per second. Filmed at qconsf.com.
Danny Yuan is an architect and software developer in Netflix’s Platform Engineering team. Justin Becker is Senior Software Engineer at Netflix.
Distributed real-time stream processing: why and how (Petr Zapletal)
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also get a theoretical introduction and explore common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. Stream-based applications, which include trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
Talk held at FrOSCon 2013 on 24 August 2013 in Sankt Augustin, Germany.
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
PHP Backends for Real-Time User Interaction using Apache Storm (DECK36)
Engaging users in real-time is the topic of our times. Whether it’s a game, a shop, or a content network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation on how to use it with PHP. That way, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams. During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten. In order to follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).
Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how it integrates well with Apache Cassandra.
There is also a segue into an example project that uses Storm and Cassandra to implement a scalable reactive web crawler.
http://github.com/tjake/stormscraper
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel... (Dan Halperin)
Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk will focus on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. Google will discuss how these APIs enable a Beam data processing pipeline runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale up and down cluster size as a job’s workload changes. Together these APIs and techniques enable Apache Beam runners to efficiently use computing resources without compromising on performance or correctness. Practical examples and a demonstration of Beam will be included.
These slides are from a brief seminar I gave in a Ph.D. exam, "Perspective in Parallel Computing" (held by Prof. Marco Danelutto), at the University of Pisa (Italy).
They are a rapid introduction to Apache Storm and how it relates to classical algorithmic skeleton parallel frameworks.
There have been plenty of “explaining EXPLAIN” type talks over the years, which provide a great introduction to it. They often also cover how to identify a few of the more common issues through it. EXPLAIN is a deep topic though, and to do a good introduction talk, you have to skip over a lot of the tricky bits. As such, this talk will not be a good introduction to EXPLAIN, but instead a deeper dive into some of the things most don’t cover. The idea is to start with some of the more complex and unintuitive calculations needed to work out the relationships between operations, rows, threads, loops, timings, buffers, CTEs and subplans. Most popular tools handle at least several of these well, but there are cases where they don’t that are worth being conscious of and alert to. For example, we’ll have a look at whether certain numbers are averaged per-loop or per-thread, or both. We’ll also cover a resulting rounding issue or two to be on the lookout for. Finally, some per-operation timing quirks are worth looking out for where CTEs and subqueries are concerned, for example CTEs that are referenced more than once. As time allows, we can also look at a few rarer issues that can be spotted via EXPLAIN, as well as a few more gotchas that we’ve picked up along the way. This includes things like spotting when the query is JIT, planning, or trigger time dominated, spotting the signs of table and index bloat, issues like lossy bitmap scans or index-only scans fetching from the heap, as well as some things to be aware of when using auto_explain.
Apache Ignite is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time. But, did you know it provides streaming and complex event processing (CEP)? In this hands-on demonstration we will take Apache Ignite’s Streaming and CEP features for a test drive. We will start with an example streaming use case then demonstrate how to implement each component in Apache Ignite. Finally we will show how to connect a dashboard application to Apache Ignite to display the results.
This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
Big Data Day LA 2016 / Big Data Track - Portable Stream and Batch Processing w... (Data Con LA)
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
At Improve Digital we collect and store large volumes of machine-generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMSs (MySQL), columnar stores (Infobright, Impala+Parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
Measure anything, measure everything.
Effortless monitoring with Statsd, Collectd and Graphite can increase software development productivity and quality at the same time.
Building Continuous Application with Structured Streaming and Real-Time Data ... (Databricks)
One of the biggest challenges in data science is to build a continuous data application which delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains in how to connect them with various continuous and real-time data sources, guaranteeing the responsiveness and reliability of data applications.
In this talk, Nan and Arijit will summarize lessons learned from serving real-time Spark-based data analytics solutions on Azure HDInsight. Their solution seamlessly integrates Spark and Azure Event Hubs, a hyper-scale telemetry ingestion service that enables users to ingest massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics.
They'll cover three topics: bridging the gap between the data communication models of Spark and the data source, adapting Spark to the rate control and message addressing of the data source, and the co-design of fault-tolerance mechanisms. This talk will share insights on how to build continuous data applications with Spark and increase the availability of connectors between Spark and different real-time data sources.
Counting with Prometheus (CloudNativeCon+KubeCon Europe 2017, Brian Brazil)
Counters are one of the two core metric types in Prometheus, allowing for tracking of request rates, error ratios and other key measurements. Learn why they are designed the way they are, how client libraries implement them, and how rate() works.
If you'd like more information about Prometheus, contact us at prometheus@robustperception.io
Join this video course on Udemy. Click the link below:
https://www.udemy.com/mastering-rtos-hands-on-with-freertos-arduino-and-stm32fx/?couponCode=SLIDESHARE
>> The Complete FreeRTOS Course with Programming and Debugging <<
"The Biggest objective of this course is to demystifying RTOS practically using FreeRTOS and STM32 MCUs"
STEP-by-STEP guide to port/run FreeRTOS using development setup which includes,
1) Eclipse + STM32F4xx + FreeRTOS + SEGGER SystemView
2) FreeRTOS+Simulator (For windows)
Demystifying the complete architecture-related (ARM Cortex-M) code of FreeRTOS, which will massively help you port this kernel to any target hardware of your choice.
Just like you can't defeat the laws of physics, there are natural laws that ultimately decide software performance. Even the latest technology beta is still bound by Newton's laws, and you can't change the speed of light, even in the cloud!
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni... (MLconf)
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
C* Summit 2013: Time is Money (Jake Luciani and Carl Yeksigian, DataStax Academy)
This session will focus on our approach to building a scalable TimeSeries database for financial data using Cassandra 1.2 and CQL3. We will discuss how we deal with a heavy mix of reads and writes as well as how we monitor and track performance of the system.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is ...
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
51. One problem we need to solve:
detect movies that are failing
52. Do it fast → limit impact, fix early
Do it at scale → for all permutations
Do it cheap → cost detect <<< serve
53. Work through the details for how
to solve this problem in Mantis
54. Goal is to highlight unique and
interesting design features
55. … begin with batch approach,
the non-Mantis approach
Batch algorithm, runs every N minutes
for (Play play : playAttempts()) {
Stats movieStats = getStats(play.movieId);
updateStats(movieStats, play);
if (movieStats.failRatio > THRESHOLD) {
alert(play.movieId, movieStats.failRatio, play.timestamp);
}
}
56. First problem, each run requires
reads + writes to data store per run
Batch algorithm, runs every N minutes
for (Play play : playAttempts()) {
Stats movieStats = getStats(play.movieId);
updateStats(movieStats, play);
if (movieStats.failRatio > THRESHOLD) {
alert(play.movieId, movieStats.failRatio, play.timestamp);
}
}
57. For Mantis we don’t want to pay that
cost, in latency or storage
Batch algorithm, runs every N minutes
for (Play play : playAttempts()) {
Stats movieStats = getStats(play.movieId);
updateStats(movieStats, play);
if (movieStats.failRatio > THRESHOLD) {
alert(play.movieId, movieStats.failRatio, play.timestamp);
}
}
58. Next problem, “pull” model great for
batch processing, bit awkward for
stream processing
Batch algorithm, runs every N minutes
for (Play play : playAttempts()) {
Stats movieStats = getStats(play.movieId);
updateStats(movieStats, play);
if (movieStats.failRatio > THRESHOLD) {
alert(play.movieId, movieStats.failRatio, play.timestamp);
}
}
59. By definition, batch processing
requires batches. How do I chunk
my data? Or, how often do I run?
Batch algorithm, runs every N minutes
for (Play play : playAttempts()) {
Stats movieStats = getStats(play.movieId);
updateStats(movieStats, play);
if (movieStats.failRatio > THRESHOLD) {
alert(play.movieId, movieStats.failRatio, play.timestamp);
}
}
60. For Mantis, prefer “push” model,
natural approach to data-in-motion
processing
Batch algorithm, runs every N minutes
for (Play play : playAttempts()) {
Stats movieStats = getStats(play.movieId);
updateStats(movieStats, play);
if (movieStats.failRatio > THRESHOLD) {
alert(play.movieId, movieStats.failRatio, play.timestamp);
}
}
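The pull-vs-push contrast can be sketched in plain Java. This is a minimal illustration of the push idea, not the actual Mantis API: each play attempt is handled as it arrives, state lives in memory, and there is no per-run read/write to a data store. All names here (PushDetector, onPlayAttempt) are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the "push" model -- NOT the Mantis API. Events are
// pushed into the detector one at a time; per-movie stats are kept in
// memory instead of being read from and written to a data store per run.
class PushDetector {
    static final double THRESHOLD = 0.5; // alert when fail ratio exceeds this

    static class Stats {
        long attempts;
        long failures;
        double failRatio() {
            return attempts == 0 ? 0.0 : (double) failures / attempts;
        }
    }

    private final Map<String, Stats> statsByMovie = new HashMap<>();

    // Called once per event; returns true when this event pushes the
    // movie's failure ratio over the threshold.
    boolean onPlayAttempt(String movieId, boolean failed) {
        Stats s = statsByMovie.computeIfAbsent(movieId, k -> new Stats());
        s.attempts++;
        if (failed) s.failures++;
        return s.failRatio() > THRESHOLD;
    }
}
```

A real job would also need windowing so old attempts age out, which the deck gets to later.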
61. For our “push” API we decided
to use Reactive Extensions (Rx)
62. Two reasons for choosing Rx:
theoretical, practical
1. Observable is a natural abstraction for
stream processing, Observable = stream
2. Rx already leveraged throughout the
company
63. So, what is an Observable?
A sequence of events, aka a stream
Observable<String> o =
Observable.just("hello",
"qcon", "SF");
o.subscribe(x -> System.out.println(x));
64. What can I do with an Observable?
Apply operators → New observable
Subscribe → Observer of data
Operators, familiar lambda functions
map(), flatMap(), scan(), ...
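The operator idea can be previewed with plain JDK streams. This is a rough analogy only (JDK streams are pull-based, while Rx Observables push), included just to show the shape of operator chaining; the class and method names are invented.

```java
import java.util.List;
import java.util.stream.Collectors;

// Rough analogy only: applying map() to a stream yields a new stream,
// without mutating the source -- the same operator shape Rx uses on
// Observables, though java.util.stream pulls rather than pushes.
class OperatorDemo {
    static List<Integer> lengths(List<String> words) {
        return words.stream()
                .map(String::length) // transform each element
                .collect(Collectors.toList());
    }
}
```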
65. What is the connection with Mantis?
In Mantis, a job (code-to-run) is the
collection of operators applied to a
sourced observable where the
output is sinked to observers
66. Think of a “source” observable as
the input to a job.
Think of a “sink” observer as the
output of the job.
67. Let’s refactor previous problem
using Mantis API terminology
Source: Play attempts
Operators: Detection logic
Sink: Alerting service
68. Sounds OK, but how will this scale?
With the pull model we have the luxury of
requesting data at specified rates/times
Analogous to drinking
from a straw
69. In contrast, push is the firehose
No explicit control to limit the
flow of data
70. In Mantis, we solve this problem
by scaling horizontally
71. Horizontal scale is accomplished
by arranging operators into
logical “stages”, explicitly by job
writer or implicitly with fancy
tooling (future work)
72. A stage is a boundary between
computations. The boundary
may be a network boundary,
process boundary, etc.
73. So, to scale, Mantis job is really,
Source → input observable
Stage(s) → operators
Sink → output observer
74. Let’s refactor previous problem to
follow the Mantis job API structure
MantisJob
.source(Netflix.PlayAttempts())
.stage({ // detection logic })
.sink(Alerting.email())
75. We need to provide a computation
boundary to scale horizontally
MantisJob
.source(Netflix.PlayAttempts())
.stage({ // detection logic })
.sink(Alerting.email())
76. For our problem, scale is a function
of the number of movies being tracked
MantisJob
.source(Netflix.PlayAttempts())
.stage({ // detection logic })
.sink(Alerting.email())
77. Let’s create two stages: one producing
groups of movies, the other running detection
MantisJob
.source(Netflix.PlayAttempts())
.stage({ // groupBy movieId })
.stage({ // detection logic })
.sink(Alerting.email())
78. OK, computation logic is split, how is the
code scheduled to resources for execution?
MantisJob
.source(Netflix.PlayAttempts())
.stage({ // groupBy movieId })
.stage({ // detection logic })
.sink(Alerting.email())
79. One, you tell the Mantis Scheduler explicitly
at submit time: number of instances, CPU
cores, memory, disk, network, per instance
80. Two, Mantis Scheduler learns how to
schedule job (work in progress)
Looks at previous run history
Looks at history for source input
Over/under provision, auto adjust
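The submit-time resource spec from option one can be pictured as a small value object. This is a hypothetical illustration with invented field names, not the real Mantis scheduler API.

```java
// Hypothetical per-instance resource spec of the kind described above
// (instance count, CPU cores, memory, ...). Field names are invented,
// not the real Mantis scheduler API.
class StageResourceSpec {
    final int instances;
    final double cpuCoresPerInstance;
    final int memoryMBPerInstance;

    StageResourceSpec(int instances, double cpuCoresPerInstance, int memoryMBPerInstance) {
        this.instances = instances;
        this.cpuCoresPerInstance = cpuCoresPerInstance;
        this.memoryMBPerInstance = memoryMBPerInstance;
    }

    // Total resources the scheduler must find across the cluster for this stage.
    double totalCores() { return instances * cpuCoresPerInstance; }
    long totalMemoryMB() { return (long) instances * memoryMBPerInstance; }
}
```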
82. Computation is split, code is scheduled,
how is data transmitted over stage
boundary?
Worker Worker ?
83. Depends on the service level agreement
(SLA) for the Job, transport is pluggable
Worker Worker ?
84. Decision is usually a trade-off
between latency and fault tolerance
Worker Worker ?
85. A “weak” SLA job might trade-off fault
tolerance for speed, using TCP as the
transport
Worker Worker ?
86. A “strong” SLA job might trade-off speed
for fault tolerance, using a queue/broker as
a transport
Worker Worker ?
87. Forks and joins require data partitioning,
collecting over boundaries
Worker
Worker
Worker
Fork Worker
Worker
Worker
Join
Need to partition data Need to collect data
88. Mantis has native support for partitioning,
collecting over scalars (T) and groups (K,V)
Worker
Worker
Worker
Fork Worker
Worker
Worker
Join
Need to partition data Need to collect data
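The "partition data" side of a fork can be sketched as a stable key-to-worker assignment: events sharing a key (e.g. movieId) must land on the same downstream worker so per-key state stays local. This is a minimal sketch assuming a simple hash scheme; the real transport may partition differently.

```java
// Sketch of key-based partitioning across a stage boundary: the same key
// always maps to the same downstream worker. Hashing scheme is an
// illustrative assumption, not the Mantis implementation.
class Partitioner {
    static int workerFor(String key, int numWorkers) {
        // floorMod keeps the index non-negative even for negative hashCodes
        return Math.floorMod(key.hashCode(), numWorkers);
    }
}
```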
89. Let’s refactor the job to include an SLA; for
the detection use case we prefer low latency
MantisJob
.source(Netflix.PlayAttempts())
.stage({ // groupBy movieId })
.stage({ // detection logic })
.sink(Alerting.email())
.config(SLA.weak())
90. The job is scheduled and running; what
happens when the input data’s volume
changes?
Previous scheduling decision may not hold
Prefer not to over provision, goal is for cost
insights <<< product
91. Good news, Mantis Scheduler has the
ability to grow and shrink (autoscale) the
cluster and jobs
92. The cluster can scale up/down for two
reasons: more/fewer jobs (demand), or the jobs
themselves are growing/shrinking
For cluster we can use submit pending
queue depth as a proxy for demand
For jobs we use backpressure as a proxy to
grow shrink the job
93. Backpressure is “build up” in a system
Imagine we have a two-stage Mantis job;
the second stage is performing a complex
calculation, causing data to queue up in the
previous stage
We can use this signal to increase nodes at
the second stage
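The backpressure-driven grow/shrink decision might look like the following toy rule, using queue build-up ahead of a stage as the signal; the thresholds and step size are invented for illustration.

```java
// Toy autoscaling rule based on the backpressure signal: if data is
// queueing up ahead of a stage, add a worker; if the stage is idle,
// remove one. Thresholds and step sizes are illustrative assumptions.
class BackpressureScaler {
    static int desiredWorkers(int currentWorkers, int queuedItems, int highWaterMark) {
        if (queuedItems > highWaterMark) {
            return currentWorkers + 1; // grow: data is backing up
        }
        if (queuedItems == 0 && currentWorkers > 1) {
            return currentWorkers - 1; // shrink: stage is idle
        }
        return currentWorkers;         // hold steady
    }
}
```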
94. Having touched on key points of the Mantis
architecture, we want to show a complete job
definition
99. Simple detection algorithms
Windows for 10 minutes
or 1000 play events
Reduce the window to
an experiment, update
counts
Filter out if less than
threshold
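The windowed detection sketched on this slide can be modeled as a tumbling count window. This is a minimal sketch assuming a 1000-event window and omitting the 10-minute time trigger; names and sizes are illustrative, not the Mantis operators.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the count-window variant of the detection algorithm: buffer
// up to 1000 play events, then emit the window's failure ratio and
// filter by threshold. The 10-minute time trigger is omitted here.
class WindowedDetector {
    static final int WINDOW_SIZE = 1000;
    static final double THRESHOLD = 0.5;

    private final List<Boolean> window = new ArrayList<>();

    // Returns true when a full window closes with a fail ratio over threshold.
    boolean onEvent(boolean failed) {
        window.add(failed);
        if (window.size() < WINDOW_SIZE) return false;
        long failures = window.stream().filter(f -> f).count();
        boolean alert = (double) failures / window.size() > THRESHOLD;
        window.clear(); // tumbling window: start fresh
        return alert;
    }
}
```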