DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics

ACM DEBS 2015: Realtime
Streaming Analytics
Patterns
Srinath Perera
Sriskandarajah Suhothayan
WSO2 Inc.

Data Analytics ( Big Data)
o Scientists are doing this for
25 year with MPI (1991)
using special Hardware
o Took off with Google’s
MapReduce paper (2004),
Apache Hadoop, Hive and
whole ecosystem created.
o Later Spark emerged, and it is
faster.
o But, processing takes time.

Value of Some Insights degrade
Fast!
o For some usecases ( e.g. stock
markets, traffic, surveillance,
patient monitoring) the value
of insights degrade very
quickly with time.
o E.g. stock markets and speed of
light
oo We need technology that can produce outputs fast
o Static Queries, but need very fast output (Alerts, Realtime
control)
o Dynamic and Interactive Queries ( Data exploration)

History
▪Realtime Analytics are not new
either!!
- Active Databases (2000+)
- Stream processing (Aurora, Borealis
(2005+) and later Storm)
- Distributed Streaming Operators (e.
g. Database research topic around
2005)
- CEP Vendor Roadmap ( from http:
//www.complexevents.
com/2014/12/03/cep-tooling-
market-survey-2014/)

Realtime Interactive Analytics
o Usually done to support
interactive queries
o Index data to make them
them readily accessible so
you can respond to queries
fast. (e.g. Apache Drill)
o Tools like Druid, VoltDB and
SAP Hana can do this with all
data in memory to make
things really fast.

Realtime Streaming Analytics
o Process data without Streaming ( As data some in)
o Queries are fixed ( Static)
o Triggers when given conditions are met.
o Technologies
o Stream Processing ( Apache Storm, Apache Samza)
o Complex Event Processing/CEP (WSO2 CEP, Esper,
StreamBase)
o MicroBatches ( Spark Streaming)

Realtime Football Analytics
● Video: https://www.youtube.com/watch?v=nRI6buQ0NOM
● More Info: http://www.slideshare.net/hemapani/strata-2014-
talktracking-a-soccer-game-with-big-data

Why Realtime Streaming Analytics
Patterns?
o Reason 1: Usual advantages
o Give us better understanding
o Give us better vocabulary to teach and
communicate
o Tools can implement them
o ..
o Reason 2: Under theme realtime analytics, lot of
people get too much carried away with word count
example. Patterns shows word count is just tip of
the iceberg.

Earlier Work on Patterns
o Patterns from SQL ( project, join, filter etc)
o Event Processing Technical Society’s (EPTS)
reference architecture
o higher-level patterns such as tracking, prediction and
learning in addition to low-level operators that
comes from SQL like languages.
o Esper’s Solution Patterns Document (50 patterns)
o Coral8 White Paper

Basic Patterns
o Pattern 1: Preprocessing ( filter, transform, enrich,
project .. )
o Pattern 2: Alerts and Thresholds
o Pattern 3: Simple Counting and Counting with
Windows
o Pattern 4: Joining Event Streams
o Pattern 5: Data Correlation, Missing Events, and
Erroneous Data

Patterns for Handling Trends
o Pattern 7: Detecting Temporal Event Sequence
Patterns
o Pattern 8: Tracking ( track something over space or
time)
o Pattern 9: Detecting Trends ( rise, fall, turn, tipple
bottom)
o Pattern 13: Online Control

Mixed Patterns
o Pattern 6: Interacting with Databases
o Pattern 10: Running the same Query in Batch and
Realtime Pipelines
o Pattern 11: Detecting and switching to Detailed
Analysis
o Pattern 12: Using a Machine Learning Model

Realtime Streaming
Analytics Tools

Implementing Realtime Analytics
o tempting to write a custom code. Filter look very
easy. Too complex!! Don’t!
o Option 1: Stream Processing (e.g. Storm). Kind of
works. It is like Map Reduce, you have to write code.
o Option 2: Spark Streaming - more compact than
Storm, but cannot do some stateful operations.
o Option 3: Complex Event Processing - compact, SQL
like language, fast

Stream Processing
o Program a set of processors and wire them up, data
flows though the graph.
o A middleware framework handles data flow,
distribution, and fault tolerance (e.g. Apache Storm,
Samza)
o Processors may be in the same machine or multiple
machines

Writing a Storm Program
o Write Spout(s)
o Write Bolt(s)
o Wire them up
o Run

Write Bolts
We will use a shorthand
like on the left to explain
public static class WordCount extends BaseBasicBolt {
@Override
public void execute(Tuple tuple, BasicOutputCollector
collector) {
.. do something …
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer
declarer) {
declarer.declare(new Fields("word", "count"));
}
}

Wire up and Run
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
.shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12)
.fieldsGrouping("split", new Fields("word"));
Config conf = new Config();
if (args != null && args.length > 0) {
conf.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar(
args[0], conf, builder.createTopology());
}else {
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf,
builder.createTopology());
...
}
}

Micro Batches ( e.g. Spark
Streaming)
o Process data in small batches,
and then combine results for
final results (e.g. Spark)
o Works for simple aggregates,
but tricky to do this for complex
operations (e.g. Event
Sequences)
o Can do it with MapReduce as
well if the deadlines are not too
tight.

o A SQL like data processing
languages (e.g. Apache Hive)
o Since many understand SQL,
Hive made large scale data
processing Big Data accessible
to many
o Expressive, short, and sweet.
o Define core operations that
covers 90% of problems
o Let experts dig in when they
like!
SQL Like Query Languages

o Easy to follow from SQL
o Expressive, short, and sweet.
o Define core operations that covers 90% of problems
o Let experts dig in when they like!
CEP = SQL for Realtime
Analytics

Code and other details
o Sample code - https://github.
com/suhothayan/DEBS-2015-Realtime-Analytics-
Patterns
o WSO2 CEP
o pack http://svn.wso2.
org/repos/wso2/people/suho/packs/cep/4.0.0
/debs2015/wso2cep-4.0.0-SNAPSHOT.zip
o docs- https://docs.wso2.
com/display/CEP400/WSO2+Complex+Event+Processor+
Documentation
o Apache Storm - https://storm.apache.org/
o We have packs in a pendrive

Pattern 1: Preprocessing
o What? Cleanup and prepare data via operations like
filter, project, enrich, split, and transformations
o Usecases?
o From twitter data stream: we extract author,
timestamp and location fields and then filter
them based on the location of the author.
o From temperature stream we expect
temperature & room number of the sensor and
filter by them.

Filter
from TempStream [ roomNo > 245 and roomNo <= 365]
select roomNo, temp
insert into ServerRoomTempStream ;
In Storm
In CEP ( Siddhi)

CEP Event Adapters
Support for several transports (network access)
● SOAP
● HTTP
● JMS
● SMTP
● SMS
● Thrift
● Kafka
● Websocket
● MQTT
Supports database writes using Map messages
● Cassandra
● RDBMs
Supports custom event adaptors via its pluggable architecture!

Stream Definition (Data Model)
{
'name':'soft.drink.coop.sales', 'version':'1.0.0',
'nickName': 'Soft_Drink_Sales', 'description': 'Soft drink sales',
'metaData':[
{'name':'region','type':'STRING'}
],
'correlationData':[
{'name':’transactionID’,'type':'STRING'}
],
'payloadData':[
{'name':'brand','type':'STRING'},
{'name':'quantity','type':'INT'},
{'name':'total','type':'INT'},
{'name':'user','type':'STRING'}
]
}

Projection
define stream TempStream
(deviceID long, roomNo int, temp double);
from TempStream
select roomNo, temp
insert into OutputStream ;

Inferred Streams
from TempStream
select roomNo, temp
define stream OutputStream
(roomNo int, temp double);

Enrich
from TempStream
select roomNo, temp,‘C’ as scale
insert into OutputStream
define stream OutputStream
(roomNo int, temp double, scale string);
from TempStream
select deviceID, roomNo, avg(temp) as avgTemp

Transformation
from TempStream
select concat(deviceID, ‘-’, roomNo) as uid,
toFahrenheit(temp) as tempInF,
‘F’ as scale

Split
from TempStream
select roomNo, temp
insert into RoomTempStream ;
from TempStream
select deviceID, temp
insert into DeviceTempStream ;

Pattern 2: Alerts and Thresholds
o What? detects a condition and generates alerts
based on a condition. (e.g. Alarm on high
temperature).
o These alerts can be based on a simple value or
more complex conditions such as rate of increase
etc.
o Usecases?
o Raise alert when vehicle going too fast
o Alert when a room is too hot

Filter Alert
from TempStream [ roomNo > 245 and roomNo <= 365
and temp > 40 ]
select roomNo, temp
insert into AlertServerRoomTempStream ;

Pattern 3: Simple Counting and
Counting with Windows
o What? aggregate functions like Min, Max,
Percentiles, etc
o Often they can be counted without storing any
data
o Most useful when used with a window
o Usecases?
o Most metrics need a time bound so we can
compare ( errors per day, transactions per
second)
o Linux Load Average give us an idea of overall
trend by reporting last 1m, 3m, and 5m mean.

Types of windows
o Sliding windows vs. Batch (tumbling) windows
o Time vs. Length windows
Also supports
o Unique window
o First unique window
o External time window

Aggregation
In CEP (Siddhi)
from TempStream
select roomNo, avg(temp) as avgTemp
insert into HotRoomsStream ;

Sliding Time Window
from TempStream#window.time(1 min)
insert all events into AvgRoomTempStream ;

Group By
from TempStream#window.time(1 min)
group by roomNo
insert all events into HotRoomsStream ;

Batch Time Window
from TempStream#window.timeBatch(5 min)
group by roomNo
insert all events into HotRoomsStream ;

Pattern 4: Joining Event Streams
o What? Create a new event stream by joining
multiple streams
o Complication comes with time. So need at least
one window
o Often used with a window
o Usecases?
o To detecting when a player has kicked the ball in
a football game .
o To correlate TempStream and the state of the
regulator and trigger control commands

Join
define stream RegulatorStream
(deviceID long, roomNo int, isOn bool);
In CEP (Siddhi)

Join
define stream RegulatorStream
(deviceID long, roomNo int, isOn bool);
from TempStream[temp > 30.0]#window.time(1 min) as T
join RegulatorStream[isOn == false]#window.length(1) as R
on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, ‘start’ as action
insert into RegulatorActionStream ;
In CEP (Siddhi)

Pattern 5: Data Correlation, Missing
Events, and Erroneous Data
o What? find correlations and use that to detect and
handle missing and erroneous Data
o Use Cases?
o Detecting a missing event (e.g., Detect a
customer request that has not been responded
within 1 hour of its reception)
o Detecting erroneous data (e.g., Detecting failed
sensors using a set of sensors that monitor
overlapping regions. We can use those
redundant data to find erroneous sensors and
remove those data from further processing)

Missing Event in CEP
In CEP (Siddhi)
from RequestStream#window.time(1h)
insert expired events into ExpiryStream
from r1=RequestStream->r2=Response[id=r1.id] or
r3=ExpiryStream[id=r1.id]
select r1.id as id ...
insert into AlertStream having having r2.id == null;

Pattern 6: Interacting with Databases
o What? Combine realtime data against historical
data
o Use Cases?
o On a transaction, looking up the customer age
using ID from customer database to detect fraud
(enrichment)
o Checking a transaction against blacklists and
whitelists in the database
o Receive an input from the user (e.g., Daily
discount amount may be updated in the
database, and then the query will pick it
automatically without human intervention).

In CEP (Siddhi)
Event Table
define table CardUserTable (name string, cardNum long) ;
@from(eventtable = 'rdbms' , datasource.name = ‘CardDataSource’ ,
table.name = ‘UserTable’, caching.algorithm’=‘LRU’)
define table CardUserTable (name string, cardNum long)
Cache types supported
● Basic: A size-based algorithm based on FIFO.
● LRU (Least Recently Used): The least recently used event is dropped
when cache is full.
● LFU (Least Frequently Used): The least frequently used event is dropped
when cache is full.

Join : Event Table
define stream Purchase (price double, cardNo long, place string);
define table CardUserTable (name string, cardNum long) ;
from Purchase#window.length(1) join CardUserTable
on Purchase.cardNo == CardUserTable.cardNum
select Purchase.cardNo as cardNo,
CardUserTable.name as name,
Purchase.price as price
insert into PurchaseUserStream ;

Insert : Event Table
define stream FraudStream (price double, cardNo long, userName
string);
define table BlacklistedUserTable (name string, cardNum long) ;
from FraudStream
select userName as name, cardNo as cardNum
insert into BlacklistedUserTable ;

Update : Event Table
define stream LoginStream (userID string,
islogin bool, loginTime long);
define table LastLoginTable (userID string, time long) ;
from LoginStream
select userID, loginTime as time
update LastLoginTable
on LoginStream.userID == LastLoginTable.userID ;

Pattern 7: Detecting Temporal
Event Sequence Patterns
o What? detect a temporal sequence of events or
condition arranged in time
o Use Cases?
o Detect suspicious activities like small transaction
immediately followed by a large transaction
o Detect ball possession in a football game
o Detect suspicious financial patterns like large buy
and sell behaviour within a small time period

In CEP (Siddhi)
Pattern
define stream Purchase (price double, cardNo long,place string);
from every (a1 = Purchase[price < 100] -> a3= ..) ->
a2 = Purchase[price >10000 and a1.cardNo == a2.cardNo]
within 1 day
select a1.cardNo as cardNo, a2.price as price, a2.place as place
insert into PotentialFraud ;

Pattern 8: Tracking
o What? detecting an overall trend over time
o Use Cases?
o Tracking a fleet of vehicles, making sure that
they adhere to speed limits, routes, and Geo-
fences.
o Tracking wildlife, making sure they are alive (they
will not move if they are dead) and making sure
they will not go out of the reservation.
o Tracking airline luggage and making sure they
have not been sent to wrong destinations
o Tracking a logistic network and figuring out
bottlenecks and unexpected conditions.

TFL: Traffic Analytics
Built using TFL ( Transport for London) open data feeds.
http://goo.gl/9xNiCm http://goo.gl/04tX6k

Pattern 9: Detecting Trends
o What? tracking something over space and time and
detects given conditions.
o Useful in stock markets, SLA enforcement, auto
scaling, predictive maintenance
o Use Cases?
o Rise, Fall of values and Turn (switch from rise to
a fall)
o Outliers - deviate from the current trend by a
large value
o Complex trends like “Triple Bottom” and “Cup
and Handle” [17].

Trend in Storm
Build and apply an state machine

In CEP (Siddhi)
Sequence
from t1=TempStream,
t2=TempStream [(isNull(t2[last].temp) and t1.temp<temp) or
(t2[last].temp < temp and not(isNull(t2[last].temp))]+
within 5 min
select t1.temp as initialTemp,
t2[last].temp as finalTemp,
t1.deviceID,
t1.roomNo
insert into IncreaingHotRoomsStream ;

In CEP (Siddhi)
Partition
partition by (roomNo of TempStream)
begin
from t1=TempStream,
t2=TempStream [(isNull(t2[last].temp) and t1.temp<temp)
or (t2[last].temp < temp and not(isNull(t2[last].temp))]+
within 5 min
select t1.temp as initialTemp,
t2[last].temp as finalTemp,
t1.deviceID,
t1.roomNo
insert into IncreaingHotRoomsStream ;
end;

Detecting Trends in Real Life
o Paper “A Complex Event Processing
Toolkit for Detecting Technical Chart
Patterns” (HPBC 2015) used the idea to
identify stock chart patterns
o Used kernel regression for smoothing
and detected maxima’s and minimas.
o Then any pattern can be written as a
temporal event sequence.

Pattern 10: Lambda Architecture
o What? runs the same query in both relatime and
batch pipelines. This uses realtime analytics to fill
the lag in batch analytics results.
o Also called “Lambda Architecture”. See Nathen
Marz’s “Questioning the Lambda Architecture”
o Use Cases?
o For example, if batch processing takes 15
minutes, results would always lags 15 minutes
from the current data. Here realtime processing
fill the gap.

Pattern 11: Detecting and switching
to Detailed Analysis
o What? detect a condition that suggests some
anomaly, and further analyze it using historical data.
o Use Cases?
o Use basic rules to detect Fraud (e.g., large transaction),
then pull out all transactions done against that credit
card for a larger time period (e.g., 3 months data) from
batch pipeline and run a detailed analysis
o While monitoring weather, detect conditions like high
temperature or low pressure in a given region, and then
start a high resolution localized forecast for that region.
o Detect good customers (e.g., through expenditure of
more than $1000 within a month, and then run a
detailed model to decide the potential of offering a deal).

Pattern 12: Using a Machine
Learning Model
o What? The idea is to train a model (often a
Machine Learning model), and then use it with the
Realtime pipeline to make decisions
o For example, you can build a model using R, export it as
PMML (Predictive Model Markup Language) and use it
within your realtime pipeline.
o Use Cases?
o Fraud Detection
o Segmentation
o Predict Churn

Predictive Analytics
o Build models and use
them with WSO2 CEP,
BAM and ESB using
upcoming WSO2
Machine Learner Product
( 2015 Q2)
o Build model using R,
export them as PMML,
and use within WSO2 CEP
o Call R Scripts from CEP
queries

In CEP (Siddhi)
PMML Model
from TrasnactionStream
#ml:applyModel(‘/path/logisticRegressionModel1.xml’,
timestamp, amount, ip)
insert into PotentialFraudsStream;

Pattern 13: Online Control
o What? Control something Online. These would
involve problems like current situation awareness,
predicting next value(s), and deciding on corrective
actions.
o Use Cases?
o Autopilot
o Self-driving
o Robotics

Scaling & HA for Pattern
Implementations

So how we scale a system ?
o Vertical Scaling
o Horizontal Scaling

Horizontal Scaling
E.g. Calculate Mean

Horizontal Scaling ...
E.g. Calculate Mean

How about scaling median ?

How about scaling median ?
If & only if we can partition !

Scalable Realtime solutions ...
Spark Streaming
o Supports distributed processing
o Runs micro batches
o Not supports pattern & sequence detection

Scalable Realtime solutions ...
Spark Streaming
o Runs micro batches
o Not supports pattern & sequence detection
Apache Storm
o Stream processing engine

Why not use Apache Storm ?
Advantages
o Supports Partitioning
o Extendable
o Opensource
Disadvantages
o Need to write Java code
o Need to start from basic principles ( & data structures )
o Adoption for change is slow
o No support to govern artifacts

WSO2 CEP += Apache Storm
Advantages
o Supports Partitioning
o Extendable
o Opensource
Disadvantages
o No need to write Java code (Supports SQL like query language)
o No need to start from basic principles (Supports high level
language)
o Adoption for change is fast
o Govern artifacts using Toolboxes
o etc ...

Siddhi QL
define stream StockStream
(symbol string, volume int, price double);
@name(‘Filter Query’)
from StockStream[price > 75]
select *
insert into HighPriceStockStream ;
@name(‘Window Query’)
from HighPriceStockStream#window.time(10 min)
select symbol, sum(volume) as sumVolume
insert into ResultStockStream ;

Siddhi QL - with partition
@name(‘Filter Query’)
select *
insert into HighPriceStockStream ;
partition with (symbol of HighPriceStockStream)
begin
end;

Siddhi QL - distributed
@name(Filter Query’)
@dist(parallel= ‘3')
select *
insert into HightPriceStockStream ;
@dist(parallel= ‘2')
partition with (symbol of HighPriceStockStream)
begin
end;

HA / Persistence
o Option 1: Side by side
o Recommended
o Takes 2X hardware
o Gives zero down time
o Option 2: Snapshot and restore
o Uses less HW
o Will lose events between snapshots
o Downtime while recovery
o ** Some scenarios you can use event tables to keep intermediate state

Siddhi Extensions
● Function extension
● Aggregator extension
● Window extension
● Transform extension

Siddhi Query : Function Extension
from TempStream
select deviceID, roomNo,
custom:toKelvin(temp) as tempInKelvin,
‘K’ as scale

Siddhi Query : Aggregator Extension
from TempStream
select deviceID, roomNo, temp
custom:stdev(temp) as stdevTemp,
‘C’ as scale

Siddhi Query : Window Extension
from TempStream
#window.custom:lastUnique(roomNo,2 min)
select *

Siddhi Query : Transform Extension
from XYZSpeedStream
#transform.custom:getVelocityVector(v,vx,vy,vz)
select velocity, direction
insert into SpeedStream ;

DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics

Similar to DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics (20)

More from Sriskandarajah Suhothayan

More from Sriskandarajah Suhothayan (10)

Recently uploaded

Recently uploaded (20)

DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics