ACM DEBS 2015: Realtime
Streaming Analytics
Srinath Perera
Sriskandarajah Suhothayan
WSO2 Inc.
Data Analytics (Big Data)
o Scientists have been doing this for
25 years with MPI (1991),
using special hardware
o It took off with Google’s
MapReduce paper (2004);
Apache Hadoop, Hive, and a
whole ecosystem followed.
o Later Spark emerged, and it is
o But, processing takes time.
Value of Some Insights degrade
o For some use cases (e.g. stock
markets, traffic, surveillance,
patient monitoring) the value
of insights degrades very
quickly with time.
o E.g. stock markets and speed of
o We need technology that can produce outputs fast
o Static Queries, but need very fast output (Alerts, Realtime
o Dynamic and Interactive Queries ( Data exploration)
▪Realtime Analytics are not new
- Active Databases (2000+)
- Stream processing (Aurora, Borealis
(2005+) and later Storm)
- Distributed Streaming Operators (e.g. Database research topic around
- CEP Vendor Roadmap ( from http:
Data Analytics Landscape
Realtime Interactive Analytics
o Usually done to support
interactive queries
o Index data to make it
readily accessible so
you can respond to queries
fast (e.g. Apache Drill).
o Tools like Druid, VoltDB and
SAP Hana can do this with all
data in memory to make
things really fast.
Realtime Streaming Analytics
o Process data without storing it (as data comes in)
o Queries are fixed ( Static)
o Triggers when given conditions are met.
o Technologies
o Stream Processing ( Apache Storm, Apache Samza)
o Complex Event Processing/CEP (WSO2 CEP, Esper,
o MicroBatches ( Spark Streaming)
Realtime Football Analytics
● Video:
● More Info:
Why Realtime Streaming Analytics
o Reason 1: Usual advantages
o Give us better understanding
o Give us better vocabulary to teach and
o Tools can implement them
o ..
o Reason 2: Under the theme of realtime analytics, a lot of
people get carried away with the word count
example. Patterns show that word count is just the tip of
the iceberg.
Earlier Work on Patterns
o Patterns from SQL ( project, join, filter etc)
o Event Processing Technical Society’s (EPTS)
reference architecture
o higher-level patterns such as tracking, prediction and
learning, in addition to low-level operators that
come from SQL-like languages.
o Esper’s Solution Patterns Document (50 patterns)
o Coral8 White Paper
Basic Patterns
o Pattern 1: Preprocessing ( filter, transform, enrich,
project .. )
o Pattern 2: Alerts and Thresholds
o Pattern 3: Simple Counting and Counting with Windows
o Pattern 4: Joining Event Streams
o Pattern 5: Data Correlation, Missing Events, and
Erroneous Data
Patterns for Handling Trends
o Pattern 7: Detecting Temporal Event Sequence
o Pattern 8: Tracking ( track something over space or
o Pattern 9: Detecting Trends (rise, fall, turn, triple
o Pattern 13: Online Control
Mixed Patterns
o Pattern 6: Interacting with Databases
o Pattern 10: Running the same Query in Batch and
Realtime Pipelines
o Pattern 11: Detecting and switching to Detailed Analysis
o Pattern 12: Using a Machine Learning Model
Realtime Streaming
Analytics Tools
Implementing Realtime Analytics
o It is tempting to write custom code; a filter looks very
easy. But it quickly gets too complex. Don’t!
o Option 1: Stream Processing (e.g. Storm). Kind of
works. It is like Map Reduce, you have to write code.
o Option 2: Spark Streaming - more compact than
Storm, but cannot do some stateful operations.
o Option 3: Complex Event Processing - compact, SQL
like language, fast
Stream Processing
o Program a set of processors and wire them up; data
flows through the graph.
o A middleware framework handles data flow,
distribution, and fault tolerance (e.g. Apache Storm,
o Processors may be in the same machine or multiple
Writing a Storm Program
o Write Spout(s)
o Write Bolt(s)
o Wire them up
o Run
Write Bolts
We will use a shorthand
like on the left to explain
public static class WordCount extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // ... do something ...
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
Wire up and Run
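TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
// Sentences are distributed randomly across the split bolts.
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("spout");
// The same word always goes to the same count bolt.
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));
Config conf = new Config();
if (args != null && args.length > 0) {
    // Submit to a cluster, using the first argument as the topology name.
    StormSubmitter.submitTopology(
        args[0], conf, builder.createTopology());
} else {
    // Run in-process for local testing.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf,
        builder.createTopology());
}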
Complex Event Processing
Micro Batches ( e.g. Spark
o Process data in small batches,
and then combine results for
final results (e.g. Spark)
o Works for simple aggregates,
but tricky to do this for complex
operations (e.g. Event
o Can do it with MapReduce as
well if the deadlines are not too
o SQL-like data processing
languages (e.g. Apache Hive)
o Since many understand SQL,
Hive made large-scale data
processing (Big Data) accessible
to many
o Expressive, short, and sweet.
o Define core operations that
cover 90% of problems
o Let experts dig in when they
SQL Like Query Languages
o Easy to follow from SQL
o Expressive, short, and sweet.
o Define core operations that cover 90% of problems
o Let experts dig in when they like!
CEP = SQL for Realtime
Code and other details
o Sample code - https://github.
o pack http://svn.wso2.
o docs- https://docs.wso2.
o Apache Storm -
o We have packs in a pendrive
Pattern 1: Preprocessing
o What? Cleanup and prepare data via operations like
filter, project, enrich, split, and transformations
o Usecases?
o From a Twitter data stream, we extract the author,
timestamp, and location fields and then filter
them based on the location of the author.
o From a temperature stream, we extract the
temperature and room number of the sensor and
filter by them.
from TempStream [ roomNo > 245 and roomNo <= 365]
select roomNo, temp
insert into ServerRoomTempStream ;
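For comparison, here is a minimal sketch of the same filter-and-project step in plain Java, with no Storm or Siddhi dependency. The class names TempEvent, RoomTemp, and Preprocess are hypothetical and exist only for this illustration; the room range is taken from the query above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event type mirroring TempStream.
final class TempEvent {
    final long deviceID;
    final int roomNo;
    final double temp;

    TempEvent(long deviceID, int roomNo, double temp) {
        this.deviceID = deviceID;
        this.roomNo = roomNo;
        this.temp = temp;
    }
}

// Hypothetical event type mirroring ServerRoomTempStream.
final class RoomTemp {
    final int roomNo;
    final double temp;

    RoomTemp(int roomNo, double temp) {
        this.roomNo = roomNo;
        this.temp = temp;
    }
}

final class Preprocess {
    // Filter: keep rooms in the range (245, 365]. Project: drop deviceID.
    static List<RoomTemp> serverRoomTemps(List<TempEvent> events) {
        List<RoomTemp> out = new ArrayList<>();
        for (TempEvent e : events) {
            if (e.roomNo > 245 && e.roomNo <= 365) {
                out.add(new RoomTemp(e.roomNo, e.temp));
            }
        }
        return out;
    }
}
```

Even this small case needs two type declarations and a loop, which is the boilerplate the three-line query avoids.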
In Storm
In CEP ( Siddhi)
Architecture of WSO2 CEP
CEP Event Adapters
Support for several transports (network access)
● Thrift
● Kafka
● Websocket
Supports database writes using Map messages
● Cassandra
Supports custom event adaptors via its pluggable architecture!
Stream Definition (Data Model)
'name':'', 'version':'1.0.0',
'nickName': 'Soft_Drink_Sales', 'description': 'Soft drink sales',
define stream TempStream
(deviceID long, roomNo int, temp double);
from TempStream
select roomNo, temp
insert into OutputStream ;
Inferred Streams
from TempStream
select roomNo, temp
insert into OutputStream ;
define stream OutputStream
(roomNo int, temp double);
from TempStream
select roomNo, temp, 'C' as scale
insert into OutputStream ;
define stream OutputStream
(roomNo int, temp double, scale string);
from TempStream
select deviceID, roomNo, avg(temp) as avgTemp
insert into OutputStream ;
from TempStream
select concat(deviceID, '-', roomNo) as uid,
toFahrenheit(temp) as tempInF,
'F' as scale
insert into OutputStream ;
from TempStream
select roomNo, temp
insert into RoomTempStream ;
from TempStream
select deviceID, temp
insert into DeviceTempStream ;
Pattern 2: Alerts and Thresholds
o What? Detect a condition and generate alerts
based on it (e.g. Alarm on high
o These alerts can be based on a simple value or
more complex conditions such as rate of increase
o Usecases?
o Raise alert when vehicle going too fast
o Alert when a room is too hot
Filter Alert
from TempStream [ roomNo > 245 and roomNo <= 365
and temp > 40 ]
select roomNo, temp
insert into AlertServerRoomTempStream ;
Pattern 3: Simple Counting and
Counting with Windows
o What? aggregate functions like Min, Max,
Percentiles, etc
o Often they can be counted without storing any
o Most useful when used with a window
o Usecases?
o Most metrics need a time bound so we can
compare ( errors per day, transactions per
o Linux Load Average gives us an idea of the overall
trend by reporting the last 1m, 5m, and 15m averages.
Types of windows
o Sliding windows vs. Batch (tumbling) windows
o Time vs. Length windows
Also supports
o Unique window
o First unique window
o External time window
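The difference between a sliding and a batch (tumbling) window can be sketched in plain Java using length windows. This is only an illustration under assumed names (SlidingLengthAvg and BatchLengthAvg are hypothetical), not how Siddhi implements its windows.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding length window: emits an updated average on every event.
final class SlidingLengthAvg {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingLengthAvg(int size) {
        this.size = size;
    }

    double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst(); // expire the oldest event
        }
        return sum / window.size();
    }
}

// Batch (tumbling) length window: emits once per full batch, then resets.
final class BatchLengthAvg {
    private final int size;
    private int count = 0;
    private double sum = 0.0;

    BatchLengthAvg(int size) {
        this.size = size;
    }

    // Returns the batch average when the batch completes, otherwise null.
    Double add(double value) {
        sum += value;
        count++;
        if (count < size) {
            return null;
        }
        double avg = sum / size;
        sum = 0.0;
        count = 0;
        return avg;
    }
}
```

A time window follows the same idea, except that events expire by timestamp rather than by position in the queue.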
In Storm
In CEP (Siddhi)
from TempStream
select roomNo, avg(temp) as avgTemp
insert into HotRoomsStream ;
Sliding Time Window
from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
insert all events into AvgRoomTempStream ;
Group By
from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into HotRoomsStream ;
Batch Time Window
from TempStream#window.timeBatch(5 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into HotRoomsStream ;
Pattern 4: Joining Event Streams
o What? Create a new event stream by joining
multiple streams
o Complication comes with time. So need at least
one window
o Often used with a window
o Usecases?
o To detect when a player has kicked the ball in
a football game.
o To correlate TempStream and the state of the
regulator and trigger control commands
Join with Storm
In CEP (Siddhi)
define stream TempStream
(deviceID long, roomNo int, temp double);
define stream RegulatorStream
(deviceID long, roomNo int, isOn bool);
from TempStream[temp > 30.0]#window.time(1 min) as T
join RegulatorStream[isOn == false]#window.length(1) as R
on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream ;
In CEP (Siddhi)
Pattern 5: Data Correlation, Missing
Events, and Erroneous Data
o What? find correlations and use that to detect and
handle missing and erroneous Data
o Use Cases?
o Detecting a missing event (e.g., Detect a
customer request that has not been responded to
within 1 hour of its reception)
o Detecting erroneous data (e.g., Detecting failed
sensors using a set of sensors that monitor
overlapping regions. We can use those
redundant data to find erroneous sensors and
remove those data from further processing)
Missing Event in Storm
Missing Event in CEP
In CEP (Siddhi)
from RequestStream#window.time(1h)
insert expired events into ExpiryStream
from r1=RequestStream->r2=Response[] or
select as id ...
insert into AlertStream having having == null;
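The idea behind missing-event detection can be sketched in plain Java: remember every request, forget it when its response arrives, and raise an alert for anything still pending after the timeout. The class and method names below are hypothetical and the sketch does not reproduce the Siddhi query above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MissingResponseDetector {
    private final long timeoutMillis;
    // Pending requests in arrival order: request id -> arrival time.
    private final Map<String, Long> pending = new LinkedHashMap<>();

    MissingResponseDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void onRequest(String id, long timestamp) {
        pending.put(id, timestamp);
    }

    void onResponse(String id) {
        pending.remove(id);
    }

    // Returns ids of requests still unanswered after the timeout
    // and stops tracking them.
    List<String> expire(long now) {
        List<String> alerts = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (now - entry.getValue() >= timeoutMillis) {
                alerts.add(entry.getKey());
                it.remove();
            }
        }
        return alerts;
    }
}
```

In this sketch the caller is assumed to invoke expire periodically, for example from a timer; a CEP engine does the equivalent through its time window.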
Pattern 6: Interacting with Databases
o What? Combine realtime data against historical
o Use Cases?
o On a transaction, looking up the customer age
using ID from customer database to detect fraud
o Checking a transaction against blacklists and
whitelists in the database
o Receive an input from the user (e.g., the daily
discount amount may be updated in the
database, and the query will then pick it up
automatically without human intervention).
In Storm
Querying Databases
In CEP (Siddhi)
Event Table
define table CardUserTable (name string, cardNum long) ;
@from(eventtable = 'rdbms' , = 'CardDataSource' , = 'UserTable', caching.algorithm = 'LRU')
define table CardUserTable (name string, cardNum long)
Cache types supported
● Basic: A size-based algorithm based on FIFO.
● LRU (Least Recently Used): The least recently used event is dropped
when cache is full.
● LFU (Least Frequently Used): The least frequently used event is dropped
when cache is full.
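As an illustration of the LRU policy, here is a minimal sketch built on the standard LinkedHashMap in access order. LruCache is a hypothetical name, and nothing here is meant to describe how WSO2 CEP implements its event table cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache: an access-ordered LinkedHashMap that evicts the
// least recently used entry once the capacity is exceeded.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

With a capacity of 2, reading an entry before inserting a third one keeps that entry alive and evicts the other, which is the behaviour that separates LRU from the FIFO-based Basic cache.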
Join : Event Table
define stream Purchase (price double, cardNo long, place string);
define table CardUserTable (name string, cardNum long) ;
from Purchase#window.length(1) join CardUserTable
on Purchase.cardNo == CardUserTable.cardNum
select Purchase.cardNo as cardNo, CardUserTable.name as name,
Purchase.price as price
insert into PurchaseUserStream ;
Insert : Event Table
define stream FraudStream (price double, cardNo long, userName string);
define table BlacklistedUserTable (name string, cardNum long) ;
from FraudStream
select userName as name, cardNo as cardNum
insert into BlacklistedUserTable ;
Update : Event Table
define stream LoginStream (userID string,
islogin bool, loginTime long);
define table LastLoginTable (userID string, time long) ;
from LoginStream
select userID, loginTime as time
update LastLoginTable
on LoginStream.userID == LastLoginTable.userID ;
Pattern 7: Detecting Temporal
Event Sequence Patterns
o What? Detect a temporal sequence of events or
conditions arranged in time
o Use Cases?
o Detect suspicious activities like small transaction
immediately followed by a large transaction
o Detect ball possession in a football game
o Detect suspicious financial patterns like large buy
and sell behaviour within a small time period
In Storm
In CEP (Siddhi)
define stream Purchase (price double, cardNo long,place string);
from every (a1 = Purchase[price < 100] -> a3= ..) ->
a2 = Purchase[price >10000 and a1.cardNo == a2.cardNo]
within 1 day
select a1.cardNo as cardNo, a2.price as price, as place
insert into PotentialFraud ;
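A hand-written equivalent shows the state a sequence detector has to keep. The following plain-Java sketch is a simplification: it remembers only the most recent small purchase per card, whereas the query's "every" keyword starts a new match for each one. SmallThenLargeDetector is a hypothetical name; the thresholds and the one-day limit are taken from the query.

```java
import java.util.HashMap;
import java.util.Map;

final class SmallThenLargeDetector {
    private static final long ONE_DAY_MILLIS = 24L * 60 * 60 * 1000;
    // Card number -> time of the most recent small purchase on that card.
    private final Map<Long, Long> lastSmallPurchase = new HashMap<>();

    // Returns true when this purchase completes the suspicious sequence:
    // a purchase below 100 followed within one day by one above 10000.
    boolean onPurchase(long cardNo, double price, long timestamp) {
        if (price < 100) {
            lastSmallPurchase.put(cardNo, timestamp);
            return false;
        }
        if (price > 10000) {
            // Consume the pending small purchase so it matches only once.
            Long smallAt = lastSmallPurchase.remove(cardNo);
            return smallAt != null && timestamp - smallAt <= ONE_DAY_MILLIS;
        }
        return false;
    }
}
```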
Pattern 8: Tracking
o What? Track something over space and time and detect given conditions.
o Use Cases?
o Tracking a fleet of vehicles, making sure that
they adhere to speed limits, routes, and Geo-
o Tracking wildlife, making sure they are alive (they
will not move if they are dead) and making sure
they will not go out of the reservation.
o Tracking airline luggage and making sure they
have not been sent to wrong destinations
o Tracking a logistics network and figuring out
bottlenecks and unexpected conditions.
TFL: Traffic Analytics
Built using TFL ( Transport for London) open data feeds.
Pattern 9: Detecting Trends
o What? Detect an overall trend over time.
o Useful in stock markets, SLA enforcement, auto
scaling, predictive maintenance
o Use Cases?
o Rise, Fall of values and Turn (switch from rise to
a fall)
o Outliers - deviate from the current trend by a
large value
o Complex trends like “Triple Bottom” and “Cup
and Handle” [17].
Trend in Storm
Build and apply a state machine
In CEP (Siddhi)
from t1=TempStream,
t2=TempStream [(isNull(t2[last].temp) and t1.temp<temp) or
(t2[last].temp < temp and not(isNull(t2[last].temp)))]+
within 5 min
select t1.temp as initialTemp,
t2[last].temp as finalTemp
insert into IncreasingHotRoomsStream ;
In CEP (Siddhi)
partition with (roomNo of TempStream)
begin
from t1=TempStream,
t2=TempStream [(isNull(t2[last].temp) and t1.temp<temp)
or (t2[last].temp < temp and not(isNull(t2[last].temp)))]+
within 5 min
select t1.temp as initialTemp,
t2[last].temp as finalTemp
insert into IncreasingHotRoomsStream ;
end;
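The state machine behind a rising-trend query can be sketched in plain Java as a tracker for the current run of strictly increasing readings. RisingRun is a hypothetical name, and the sketch makes one simplifying assumption: when a reading breaks the run or falls outside the time limit, a new run starts from that reading.

```java
final class RisingRun {
    private final long windowMillis;
    private boolean started = false;
    private double initialTemp;
    private double lastTemp;
    private long startTime;
    private int length = 0;

    RisingRun(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Feeds one reading and returns the length of the current
    // strictly rising run, counting this reading.
    int add(double temp, long timestamp) {
        boolean continues = started
                && temp > lastTemp
                && timestamp - startTime <= windowMillis;
        if (!continues) {
            // Start a new run from this reading.
            started = true;
            initialTemp = temp;
            startTime = timestamp;
            length = 0;
        }
        lastTemp = temp;
        length++;
        return length;
    }

    double initialTemp() {
        return initialTemp;
    }

    double finalTemp() {
        return lastTemp;
    }
}
```

To mirror the partitioned query, a caller would keep one RisingRun per room, for example in a map keyed by roomNo, and would decide how long a run must be before it counts as a trend.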
Detecting Trends in Real Life
o Paper “A Complex Event Processing
Toolkit for Detecting Technical Chart
Patterns” (HPBC 2015) used the idea to
identify stock chart patterns
o Used kernel regression for smoothing
and detected maxima and minima.
o Then any pattern can be written as a
temporal event sequence.
Pattern 10: Lambda Architecture
o What? Runs the same query in both realtime and
batch pipelines. This uses realtime analytics to fill
the lag in batch analytics results.
o Also called “Lambda Architecture”, a term coined by Nathan
Marz. See also Jay Kreps’s “Questioning the Lambda Architecture”
o Use Cases?
o For example, if batch processing takes 15
minutes, results would always lag 15 minutes
behind the current data. Here realtime processing
fills the gap.
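How the two pipelines combine can be sketched in plain Java for a simple per-key count. LambdaView and its methods are hypothetical names, and the sketch assumes that each batch result covers exactly the events the realtime side has counted so far; a real system has to handle events that arrive while the batch job is running.

```java
import java.util.HashMap;
import java.util.Map;

final class LambdaView {
    // Batch view: complete up to the last batch run.
    private Map<String, Long> batchCounts = new HashMap<>();
    // Realtime view: events seen since the last batch run.
    private final Map<String, Long> realtimeCounts = new HashMap<>();

    // Realtime pipeline: count each event as it arrives.
    void onEvent(String key) {
        realtimeCounts.merge(key, 1L, Long::sum);
    }

    // Batch pipeline finished: its result replaces the old batch view,
    // and the realtime delta it now covers is discarded.
    void onBatchResult(Map<String, Long> freshBatchCounts) {
        batchCounts = new HashMap<>(freshBatchCounts);
        realtimeCounts.clear();
    }

    // Query: batch result plus the realtime delta.
    long query(String key) {
        return batchCounts.getOrDefault(key, 0L)
                + realtimeCounts.getOrDefault(key, 0L);
    }
}
```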
Lambda Architecture. How?
Pattern 11: Detecting and switching
to Detailed Analysis
o What? detect a condition that suggests some
anomaly, and further analyze it using historical data.
o Use Cases?
o Use basic rules to detect Fraud (e.g., large transaction),
then pull out all transactions done against that credit
card for a larger time period (e.g., 3 months data) from
batch pipeline and run a detailed analysis
o While monitoring weather, detect conditions like high
temperature or low pressure in a given region, and then
start a high resolution localized forecast for that region.
o Detect good customers (e.g., through expenditure of
more than $1000 within a month), and then run a
detailed model to assess the potential of offering a deal.
Pattern 11: How?
Pattern 12: Using a Machine
Learning Model
o What? The idea is to train a model (often a
Machine Learning model), and then use it with the
Realtime pipeline to make decisions
o For example, you can build a model using R, export it as
PMML (Predictive Model Markup Language) and use it
within your realtime pipeline.
o Use Cases?
o Fraud Detection
o Segmentation
o Predict Churn
Predictive Analytics
o Build models and use
them with WSO2 CEP,
BAM and ESB using
upcoming WSO2
Machine Learner Product
( 2015 Q2)
o Build model using R,
export them as PMML,
and use within WSO2 CEP
o Call R Scripts from CEP
In CEP (Siddhi)
PMML Model
from TrasnactionStream
timestamp, amount, ip)
insert into PotentialFraudsStream;
Pattern 13: Online Control
o What? Control something Online. These would
involve problems like current situation awareness,
predicting next value(s), and deciding on corrective
o Use Cases?
o Autopilot
o Self-driving
o Robotics
Fraud Demo
Scaling & HA for Pattern
So how do we scale a system?
o Vertical Scaling
o Horizontal Scaling
Vertical Scaling
Horizontal Scaling
E.g. Calculate Mean
Horizontal Scaling ...
How about scaling median ?
If & only if we can partition !
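Why the mean scales out easily while the median does not can be shown with a short plain-Java sketch. PartialMean and Medians are hypothetical names. Each node reduces its share of the data to a (sum, count) pair, and the pairs combine exactly; no comparably small per-node summary exists for the median.

```java
import java.util.Arrays;

// Partial aggregate for the mean: small, and exact when combined.
final class PartialMean {
    final double sum;
    final long count;

    PartialMean(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Computed independently on each node over its share of the data.
    static PartialMean of(double[] values) {
        double s = 0.0;
        for (double v : values) {
            s += v;
        }
        return new PartialMean(s, values.length);
    }

    // Computed on the node that merges the partial results.
    PartialMean combine(PartialMean other) {
        return new PartialMean(sum + other.sum, count + other.count);
    }

    double mean() {
        return sum / count;
    }
}

final class Medians {
    // Exact median; needs all values in one place.
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

For example, with one node holding {1, 2, 3} and another holding {4, 100}, combining the partial means gives the exact overall mean of 22. The per-node medians are 2 and 52, and their median is 27, while the true median of all five values is 3. That is why the median scales only when the data can be partitioned so that each median is computed entirely within one partition.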
Scalable Realtime solutions ...
Spark Streaming
o Supports distributed processing
o Runs micro batches
o Does not support pattern & sequence detection
Apache Storm
o Supports distributed processing
o Stream processing engine
Why not use Apache Storm ?
o Supports distributed processing
o Supports Partitioning
o Extendable
o Opensource
o Need to write Java code
o Need to start from basic principles ( & data structures )
o Adapting to change is slow
o No support to govern artifacts
WSO2 CEP += Apache Storm
o Supports distributed processing
o Supports Partitioning
o Extendable
o Opensource
o No need to write Java code (Supports SQL like query language)
o No need to start from basic principles (Supports high level
o Adapting to change is fast
o Govern artifacts using Toolboxes
o etc ...
How do we scale?
Scaling with Storm
Siddhi QL
define stream StockStream
(symbol string, volume int, price double);
@name('Filter Query')
from StockStream[price > 75]
select *
insert into HighPriceStockStream ;
@name('Window Query')
from HighPriceStockStream#window.time(10 min)
select symbol, sum(volume) as sumVolume
insert into ResultStockStream ;
Siddhi QL - with partition
define stream StockStream
(symbol string, volume int, price double);
@name('Filter Query')
from StockStream[price > 75]
select *
insert into HighPriceStockStream ;
@name('Window Query')
partition with (symbol of HighPriceStockStream)
begin
from HighPriceStockStream#window.time(10 min)
select symbol, sum(volume) as sumVolume
insert into ResultStockStream ;
end;
Siddhi QL - distributed
define stream StockStream
(symbol string, volume int, price double);
@name('Filter Query')
@dist(parallel='3')
from StockStream[price > 75]
select *
insert into HighPriceStockStream ;
@name('Window Query')
@dist(parallel='2')
partition with (symbol of HighPriceStockStream)
begin
from HighPriceStockStream#window.time(10 min)
select symbol, sum(volume) as sumVolume
insert into ResultStockStream ;
end;
On Storm UI
High Availability
HA / Persistence
o Option 1: Side by side
o Recommended
o Takes 2X hardware
o Gives zero down time
o Option 2: Snapshot and restore
o Uses less HW
o Will lose events between snapshots
o Downtime during recovery
o ** In some scenarios you can use event tables to keep intermediate state
Siddhi Extensions
● Function extension
● Aggregator extension
● Window extension
● Transform extension
Siddhi Query : Function Extension
from TempStream
select deviceID, roomNo,
custom:toKelvin(temp) as tempInKelvin,
'K' as scale
insert into OutputStream ;
Siddhi Query : Aggregator Extension
from TempStream
select deviceID, roomNo, temp,
custom:stdev(temp) as stdevTemp,
'C' as scale
insert into OutputStream ;
Siddhi Query : Window Extension
from TempStream
#window.custom:lastUnique(roomNo,2 min)
select *
insert into OutputStream ;
Siddhi Query : Transform Extension
from XYZSpeedStream
select velocity, direction
insert into SpeedStream ;
