2. Data Analytics (Big Data)
o Scientists have been doing this for 25 years with MPI (1991), using specialized hardware.
o Took off with Google's MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem were created.
o Later Spark emerged, and it is faster.
o But processing still takes time.
3. Value of Some Insights Degrades Fast!
o For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time.
o E.g. stock markets and the speed of light
o We need technology that can produce outputs fast
o Static queries that need very fast output (alerts, realtime control)
o Dynamic and interactive queries (data exploration)
4. History
▪ Realtime analytics are not new either!!
- Active Databases (2000+)
- Stream processing (Aurora, Borealis (2005+), and later Storm)
- Distributed Streaming Operators (e.g. a database research topic around 2005)
- CEP Vendor Roadmap (from http://www.complexevents.com/2014/12/03/cep-tooling-market-survey-2014/)
6. Realtime Interactive Analytics
o Usually done to support interactive queries
o Index data to make them readily accessible so you can respond to queries fast (e.g. Apache Drill)
o Tools like Druid, VoltDB, and SAP HANA can do this with all data in memory to make things really fast.
7. Realtime Streaming Analytics
o Process data without storing it (as data comes in)
o Queries are fixed (static)
o Triggers when given conditions are met
o Technologies
o Stream Processing (Apache Storm, Apache Samza)
o Complex Event Processing/CEP (WSO2 CEP, Esper, StreamBase)
o Micro-batches (Spark Streaming)
8. Realtime Football Analytics
● Video: https://www.youtube.com/watch?v=nRI6buQ0NOM
● More Info: http://www.slideshare.net/hemapani/strata-2014-talktracking-a-soccer-game-with-big-data
9. Why Realtime Streaming Analytics Patterns?
o Reason 1: the usual advantages of patterns
o Give us a better understanding
o Give us a better vocabulary to teach and communicate
o Tools can implement them
o ...
o Reason 2: Under the theme of realtime analytics, a lot of people get carried away with the word count example. Patterns show that word count is just the tip of the iceberg.
10. Earlier Work on Patterns
o Patterns from SQL (project, join, filter, etc.)
o Event Processing Technical Society's (EPTS) reference architecture
o Higher-level patterns such as tracking, prediction, and learning, in addition to the low-level operators that come from SQL-like languages
o Esper's Solution Patterns document (50 patterns)
o Coral8 white paper
11. Basic Patterns
o Pattern 1: Preprocessing (filter, transform, enrich, project ...)
o Pattern 2: Alerts and Thresholds
o Pattern 3: Simple Counting and Counting with Windows
o Pattern 4: Joining Event Streams
o Pattern 5: Data Correlation, Missing Events, and Erroneous Data
12. Patterns for Handling Trends
o Pattern 7: Detecting Temporal Event Sequence Patterns
o Pattern 8: Tracking (track something over space or time)
o Pattern 9: Detecting Trends (rise, fall, turn, triple bottom)
o Pattern 13: Online Control
13. Mixed Patterns
o Pattern 6: Interacting with Databases
o Pattern 10: Running the same Query in Batch and Realtime Pipelines
o Pattern 11: Detecting and Switching to Detailed Analysis
o Pattern 12: Using a Machine Learning Model
16. Implementing Realtime Analytics
o It is tempting to write custom code; a filter looks very easy, but it quickly gets too complex. Don't!
o Option 1: Stream Processing (e.g. Storm). Kind of works. It is like MapReduce: you have to write code.
o Option 2: Spark Streaming. More compact than Storm, but cannot do some stateful operations.
o Option 3: Complex Event Processing. Compact, SQL-like language, fast.
17. Stream Processing
o Program a set of processors and wire them up; data flows through the graph.
o A middleware framework handles data flow, distribution, and fault tolerance (e.g. Apache Storm, Samza)
o Processors may be on the same machine or on multiple machines
18. Writing a Storm Program
o Write Spout(s)
o Write Bolt(s)
o Wire them up
o Run
19. Write Bolts
We will use a shorthand
like on the left to explain
public static class WordCount extends BaseBasicBolt {
@Override
public void execute(Tuple tuple, BasicOutputCollector
collector) {
.. do something …
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer
declarer) {
declarer.declare(new Fields("word", "count"));
}
}
20. Wire up and Run
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

Config conf = new Config();
if (args != null && args.length > 0) {
    conf.setNumWorkers(3);
    StormSubmitter.submitTopologyWithProgressBar(
        args[0], conf, builder.createTopology());
} else {
    conf.setMaxTaskParallelism(3);
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf,
        builder.createTopology());
    ...
}
22. Micro-batches (e.g. Spark Streaming)
o Process data in small batches, then combine the per-batch results into a final result (e.g. Spark)
o Works for simple aggregates, but tricky for complex operations (e.g. event sequences)
o Can be done with MapReduce as well, if the deadlines are not too tight
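The micro-batch idea from the bullets above can be sketched in plain Python (this is an illustrative sketch, not the Spark API; all names are hypothetical): only aggregates whose partial results compose, like (sum, count), can be merged across batches this way.

```python
# Illustrative micro-batching sketch: split a stream into small batches,
# compute a composable partial aggregate per batch, then merge them.

def micro_batches(events, batch_size):
    """Split an event stream into fixed-size batches."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def partial_aggregate(batch):
    """Per-batch partial result: (sum, count) composes across batches."""
    return (sum(batch), len(batch))

def combine(partials):
    """Merge the partial results into a global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

events = [10, 20, 30, 40, 50, 60]
partials = [partial_aggregate(b) for b in micro_batches(events, 2)]
print(combine(partials))  # global average over all batches
```

An event-sequence pattern cannot be expressed this way, because a match may span a batch boundary; that is the "tricky" case the slide mentions.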
23. SQL-Like Query Languages
o A SQL-like data processing language (e.g. Apache Hive)
o Since many understand SQL, Hive made large-scale Big Data processing accessible to many
o Expressive, short, and sweet
o Defines core operations that cover 90% of problems
o Lets experts dig in when they like!
24. CEP = SQL for Realtime Analytics
o Easy to follow from SQL
o Expressive, short, and sweet
o Defines core operations that cover 90% of problems
o Lets experts dig in when they like!
26. Code and other details
o Sample code: https://github.com/suhothayan/DEBS-2015-Realtime-Analytics-Patterns
o WSO2 CEP
o pack: http://svn.wso2.org/repos/wso2/people/suho/packs/cep/4.0.0/debs2015/wso2cep-4.0.0-SNAPSHOT.zip
o docs: https://docs.wso2.com/display/CEP400/WSO2+Complex+Event+Processor+Documentation
o Apache Storm: https://storm.apache.org/
o We have packs on a pen drive
27. Pattern 1: Preprocessing
o What? Clean up and prepare data via operations like filter, project, enrich, split, and transform
o Use cases?
o From a Twitter data stream, extract the author, timestamp, and location fields, and then filter events based on the location of the author
o From a temperature stream, extract the temperature and room number of the sensor, and filter by them
28. Filter
In CEP (Siddhi)
from TempStream [roomNo > 245 and roomNo <= 365]
select roomNo, temp
insert into ServerRoomTempStream;
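The same filter-and-project step can be sketched in plain Python for readers unfamiliar with Siddhi (a hypothetical sketch; the function and field names are made up to mirror the query above):

```python
# Illustrative sketch of the Siddhi filter above: keep readings from
# rooms 246-365 and project only the roomNo and temp fields.

def server_room_temps(temp_stream):
    for event in temp_stream:
        if 245 < event["roomNo"] <= 365:       # filter clause
            yield {"roomNo": event["roomNo"],  # projection (select)
                   "temp": event["temp"]}

readings = [
    {"deviceID": 1, "roomNo": 101, "temp": 22.0},
    {"deviceID": 2, "roomNo": 250, "temp": 30.5},
]
print(list(server_room_temps(readings)))  # only room 250 passes
```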
37. Pattern 2: Alerts and Thresholds
o What? Detects a condition and generates alerts based on it (e.g. an alarm on high temperature)
o These alerts can be based on a simple value or on more complex conditions, such as the rate of increase
o Use cases?
o Raise an alert when a vehicle is going too fast
o Alert when a room is too hot
38. Filter Alert
from TempStream [roomNo > 245 and roomNo <= 365 and temp > 40]
select roomNo, temp
insert into AlertServerRoomTempStream;
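Pattern 2 mentions both simple thresholds and rate-of-increase conditions; a hedged plain-Python sketch of the two (all names hypothetical, not any engine's API) shows the only extra machinery the second needs is remembering the previous reading per room:

```python
# Illustrative alerting sketch: a simple threshold alert plus a
# rate-of-increase alert based on the previous reading per room.

def temp_alerts(temp_stream, threshold=40.0, max_rise=5.0):
    last = {}  # roomNo -> previous temperature
    for e in temp_stream:
        room, temp = e["roomNo"], e["temp"]
        if temp > threshold:
            yield ("HIGH_TEMP", room, temp)
        prev = last.get(room)
        if prev is not None and temp - prev > max_rise:
            yield ("FAST_RISE", room, temp)
        last[room] = temp

stream = [{"roomNo": 1, "temp": 30.0},
          {"roomNo": 1, "temp": 38.0},   # rose by 8 degrees -> FAST_RISE
          {"roomNo": 1, "temp": 41.0}]   # over 40 -> HIGH_TEMP
print(list(temp_alerts(stream)))
```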
39. Pattern 3: Simple Counting and Counting with Windows
o What? Aggregate functions like Min, Max, Percentiles, etc.
o Often they can be computed without storing any data
o Most useful when used with a window
o Use cases?
o Most metrics need a time bound so we can compare (errors per day, transactions per second)
o The Linux load average gives an idea of the overall trend by reporting the last 1m, 5m, and 15m means
40. Types of windows
o Sliding windows vs. Batch (tumbling) windows
o Time vs. Length windows
Also supports
o Unique window
o First unique window
o External time window
45. Batch Time Window
from TempStream#window.timeBatch(5 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into HotRoomsStream ;
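The `timeBatch(5 min)` query above groups events into fixed, non-overlapping buckets. A minimal plain-Python sketch of that tumbling-window average (hypothetical names; timestamps in milliseconds):

```python
# Illustrative batch (tumbling) time window: bucket events into
# non-overlapping 5-minute windows and average the temperature per room.

from collections import defaultdict

def tumbling_avg(events, window_ms=5 * 60 * 1000):
    # window index -> roomNo -> list of temperatures
    buckets = defaultdict(lambda: defaultdict(list))
    for e in events:
        bucket = e["ts"] // window_ms  # which 5-minute window
        buckets[bucket][e["roomNo"]].append(e["temp"])
    return {(b, room): sum(ts) / len(ts)
            for b, rooms in buckets.items()
            for room, ts in rooms.items()}

events = [{"ts": 0,       "roomNo": 1, "temp": 20.0},
          {"ts": 60_000,  "roomNo": 1, "temp": 30.0},
          {"ts": 360_000, "roomNo": 1, "temp": 40.0}]  # next window
print(tumbling_avg(events))
```

A sliding window differs only in that each event belongs to every window covering it, so results are emitted per event rather than per bucket.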
46. Pattern 4: Joining Event Streams
o What? Create a new event stream by joining multiple streams
o Complications come with time, so the join needs at least one window
o Often used with a window
o Use cases?
o Detecting when a player has kicked the ball in a football game
o Correlating TempStream with the state of the regulator to trigger control commands
49. Join
In CEP (Siddhi)
define stream TempStream (deviceID long, roomNo int, temp double);
define stream RegulatorStream (deviceID long, roomNo int, isOn bool);

from TempStream[temp > 30.0]#window.time(1 min) as T
  join RegulatorStream[isOn == false]#window.length(1) as R
  on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream;
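The intent of the join above can be sketched in plain Python (a simplified, hypothetical sketch: the `length(1)` window is approximated as "remember the latest off-state regulator event per room", and all names are made up):

```python
# Illustrative stream-join sketch: hot temperature readings are joined
# against the most recent "regulator off" event for the same room.

def regulator_actions(temp_events, regulator_events):
    latest_off = {}  # roomNo -> deviceID of the latest "off" regulator event
    for r in regulator_events:
        if not r["isOn"]:
            latest_off[r["roomNo"]] = r["deviceID"]
    actions = []
    for t in temp_events:
        if t["temp"] > 30.0 and t["roomNo"] in latest_off:  # join on roomNo
            actions.append((t["roomNo"], latest_off[t["roomNo"]], "start"))
    return actions

temps = [{"roomNo": 2, "temp": 35.0}]
regulators = [{"roomNo": 2, "deviceID": 9, "isOn": False}]
print(regulator_actions(temps, regulators))  # prints [(2, 9, 'start')]
```

The windows matter because without them "join" is undefined over unbounded streams: each side must bound which past events are still eligible partners.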
50. Pattern 5: Data Correlation, Missing Events, and Erroneous Data
o What? Find correlations and use them to detect and handle missing and erroneous data
o Use cases?
o Detecting a missing event (e.g. detect a customer request that has not been responded to within 1 hour of its reception)
o Detecting erroneous data (e.g. detecting failed sensors using a set of sensors that monitor overlapping regions; the redundant data can be used to find erroneous sensors and remove their readings from further processing)
52. Missing Event in CEP
In CEP (Siddhi)
from RequestStream#window.time(1h)
insert expired events into ExpiryStream;

from r1=RequestStream -> r2=Response[id=r1.id] or r3=ExpiryStream[id=r1.id]
select r1.id as id ...
insert into AlertStream having r2.id == null;
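The timeout logic behind the query above can be sketched in plain Python (an illustrative sketch, not Siddhi semantics; names like `missing_responses` are hypothetical): a request is flagged if no matching response arrives within the timeout.

```python
# Illustrative missing-event sketch: flag requests with no matching
# response within a 1-hour timeout (timestamps in milliseconds).

HOUR_MS = 60 * 60 * 1000

def missing_responses(requests, responses, timeout_ms=HOUR_MS):
    responded = {r["id"]: r["ts"] for r in responses}  # id -> response time
    alerts = []
    for req in requests:
        resp_ts = responded.get(req["id"])
        if resp_ts is None or resp_ts - req["ts"] > timeout_ms:
            alerts.append(req["id"])
    return alerts

requests = [{"id": "a", "ts": 0}, {"id": "b", "ts": 0}]
responses = [{"id": "a", "ts": 10_000}]
print(missing_responses(requests, responses))  # request "b" was never answered
```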
53. Pattern 6: Interacting with Databases
o What? Combine realtime data with historical data
o Use cases?
o On a transaction, look up the customer's age by ID from the customer database to detect fraud (enrichment)
o Check a transaction against blacklists and whitelists in the database
o Receive an input from the user (e.g. the daily discount amount may be updated in the database, and the query will pick it up automatically without human intervention)
55. In CEP (Siddhi)
Event Table
define table CardUserTable (name string, cardNum long);

@from(eventtable = 'rdbms', datasource.name = 'CardDataSource',
      table.name = 'UserTable', caching.algorithm = 'LRU')
define table CardUserTable (name string, cardNum long);

Cache types supported
● Basic: A size-based algorithm based on FIFO.
● LRU (Least Recently Used): The least recently used event is dropped when the cache is full.
● LFU (Least Frequently Used): The least frequently used event is dropped when the cache is full.
56. Join : Event Table
define stream Purchase (price double, cardNo long, place string);
define table CardUserTable (name string, cardNum long) ;
from Purchase#window.length(1) join CardUserTable
on Purchase.cardNo == CardUserTable.cardNum
select Purchase.cardNo as cardNo,
CardUserTable.name as name,
Purchase.price as price
insert into PurchaseUserStream ;
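Conceptually, the event-table join above is a stream enriched from a keyed lookup table. A plain-Python sketch under that reading (dict as the table; all names hypothetical):

```python
# Illustrative event-table join: enrich each purchase with the card
# holder's name from a lookup table keyed by card number.

def enrich_purchases(purchases, table):
    for p in purchases:
        name = table.get(p["cardNo"])
        if name is not None:  # inner join: drop purchases with unknown cards
            yield {"cardNo": p["cardNo"],
                   "name": name,
                   "price": p["price"]}

card_user_table = {5123: "Alice", 7789: "Bob"}  # cardNum -> name
purchases = [{"price": 42.5, "cardNo": 5123, "place": "London"}]
print(list(enrich_purchases(purchases, card_user_table)))
```

The caching algorithms on the previous slide (FIFO, LRU, LFU) govern which entries of this table stay in memory when it is backed by an RDBMS too large to cache fully.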
57. Insert : Event Table
define stream FraudStream (price double, cardNo long, userName
string);
define table BlacklistedUserTable (name string, cardNum long) ;
from FraudStream
select userName as name, cardNo as cardNum
insert into BlacklistedUserTable ;
58. Update : Event Table
define stream LoginStream (userID string,
islogin bool, loginTime long);
define table LastLoginTable (userID string, time long) ;
from LoginStream
select userID, loginTime as time
update LastLoginTable
on LoginStream.userID == LastLoginTable.userID ;
59. Pattern 7: Detecting Temporal Event Sequence Patterns
o What? Detect a temporal sequence of events or conditions arranged in time
o Use cases?
o Detect suspicious activities, like a small transaction immediately followed by a large transaction
o Detect ball possession in a football game
o Detect suspicious financial patterns, like large buy and sell behaviour within a small time period
61. In CEP (Siddhi)
Pattern
define stream Purchase (price double, cardNo long,place string);
from every (a1 = Purchase[price < 100] -> a3= ..) ->
a2 = Purchase[price >10000 and a1.cardNo == a2.cardNo]
within 1 day
select a1.cardNo as cardNo, a2.price as price, a2.place as place
insert into PotentialFraud ;
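The small-purchase-then-large-purchase pattern above can be sketched as a tiny state machine in plain Python (illustrative only; real CEP engines compile such patterns into automata, and all names here are hypothetical):

```python
# Illustrative temporal-sequence sketch: a small purchase followed by a
# large purchase on the same card within one day raises an alert.

DAY_MS = 24 * 60 * 60 * 1000

def potential_fraud(purchases, small=100.0, large=10_000.0, window_ms=DAY_MS):
    pending = {}  # cardNo -> timestamp of the last small purchase
    alerts = []
    for p in purchases:
        card, price, ts = p["cardNo"], p["price"], p["ts"]
        if price < small:
            pending[card] = ts  # partial match: wait for a large purchase
        elif price > large and card in pending and ts - pending[card] <= window_ms:
            alerts.append((card, price, p["place"]))
            del pending[card]
    return alerts

stream = [{"cardNo": 1, "price": 5.0,      "ts": 0,      "place": "web"},
          {"cardNo": 1, "price": 20_000.0, "ts": 60_000, "place": "web"}]
print(potential_fraud(stream))
```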
62. Pattern 8: Tracking
o What? detecting an overall trend over time
o Use Cases?
o Tracking a fleet of vehicles, making sure that
they adhere to speed limits, routes, and Geo-
fences.
o Tracking wildlife, making sure they are alive (they
will not move if they are dead) and making sure
they will not go out of the reservation.
o Tracking airline luggage and making sure they
have not been sent to wrong destinations
o Tracking a logistic network and figuring out
bottlenecks and unexpected conditions.
63. TFL: Traffic Analytics
Built using TFL ( Transport for London) open data feeds.
http://goo.gl/9xNiCm http://goo.gl/04tX6k
64. Pattern 9: Detecting Trends
o What? Track something over space and time and detect given conditions
o Useful in stock markets, SLA enforcement, auto scaling, and predictive maintenance
o Use cases?
o Rise and fall of values, and turns (a switch from a rise to a fall)
o Outliers: values that deviate from the current trend by a large amount
o Complex trends like "Triple Bottom" and "Cup and Handle" [17]
66. In CEP (Siddhi)
Sequence
from t1=TempStream,
     t2=TempStream[(isNull(t2[last].temp) and t1.temp < temp) or
                   (t2[last].temp < temp and not(isNull(t2[last].temp)))]+
within 5 min
select t1.temp as initialTemp, t2[last].temp as finalTemp,
       t1.deviceID, t1.roomNo
insert into IncreasingHotRoomsStream;
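Stripped of the engine machinery, the rising-temperature sequence is "report the first and last value of each maximal increasing run". A plain-Python sketch of that core idea (hypothetical names; no time window, for brevity):

```python
# Illustrative trend-detection sketch: find maximal strictly increasing
# runs in a series of readings and report (initialTemp, finalTemp).

def rising_runs(temps):
    """Yield (initial, final) for each maximal increasing run of length >= 2."""
    runs, start = [], 0
    for i in range(1, len(temps) + 1):
        # the run ends at the series end or when the value stops increasing
        if i == len(temps) or temps[i] <= temps[i - 1]:
            if i - start >= 2:  # at least one actual increase
                runs.append((temps[start], temps[i - 1]))
            start = i
    return runs

print(rising_runs([20.0, 22.0, 25.0, 24.0, 26.0]))
```

The Siddhi version adds what this sketch omits: a 5-minute bound on each run, and (with the partition on the next slide) an independent run per room.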
67. In CEP (Siddhi)
Partition
partition by (roomNo of TempStream)
begin
  from t1=TempStream,
       t2=TempStream[(isNull(t2[last].temp) and t1.temp < temp) or
                     (t2[last].temp < temp and not(isNull(t2[last].temp)))]+
  within 5 min
  select t1.temp as initialTemp, t2[last].temp as finalTemp,
         t1.deviceID, t1.roomNo
  insert into IncreasingHotRoomsStream;
end;
68. Detecting Trends in Real Life
o The paper "A Complex Event Processing Toolkit for Detecting Technical Chart Patterns" (HPBC 2015) used this idea to identify stock chart patterns
o It used kernel regression for smoothing and detected maxima and minima
o Then any pattern can be written as a temporal event sequence
69. Pattern 10: Lambda Architecture
o What? Run the same query in both realtime and batch pipelines, using realtime analytics to fill the lag in batch analytics results
o Also called the "Lambda Architecture" (Nathan Marz); see also Jay Kreps's "Questioning the Lambda Architecture"
o Use cases?
o For example, if batch processing takes 15 minutes, its results always lag 15 minutes behind the current data; realtime processing fills the gap
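The gap-filling can be sketched in a few lines of Python (a hedged, hypothetical sketch of the serving layer, not any framework's API): the batch layer's result covers events up to some horizon, and the realtime layer counts only what is newer.

```python
# Illustrative Lambda-architecture serving sketch: combine a stale batch
# result with a realtime count covering only the batch lag.

def serve_count(batch_count, realtime_events, batch_horizon_ts):
    """batch_count covers events up to batch_horizon_ts; the realtime
    layer fills in everything newer than that horizon."""
    realtime_count = sum(1 for e in realtime_events if e["ts"] > batch_horizon_ts)
    return batch_count + realtime_count

events = [{"ts": 5}, {"ts": 12}, {"ts": 20}]
# the batch layer has processed up to ts=10 and counted 1 event
print(serve_count(1, events, 10))
```

Running the *same* query in both pipelines is what keeps the two partial answers consistent enough to add together.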
71. Pattern 11: Detecting and Switching to Detailed Analysis
o What? Detect a condition that suggests an anomaly, and further analyze it using historical data
o Use cases?
o Use basic rules to detect fraud (e.g. a large transaction), then pull all transactions made with that credit card over a longer period (e.g. 3 months of data) from the batch pipeline and run a detailed analysis
o While monitoring weather, detect conditions like high temperature or low pressure in a given region, and then start a high-resolution localized forecast for that region
o Detect good customers (e.g. through expenditure of more than $1000 within a month), and then run a detailed model to decide whether to offer them a deal
73. Pattern 12: Using a Machine Learning Model
o What? Train a model (often a machine learning model), and then use it in the realtime pipeline to make decisions
o For example, you can build a model using R, export it as PMML (Predictive Model Markup Language), and use it within your realtime pipeline
o Use cases?
o Fraud detection
o Segmentation
o Predicting churn
74. Predictive Analytics
o Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2)
o Build a model using R, export it as PMML, and use it within WSO2 CEP
o Call R scripts from CEP queries
75. In CEP (Siddhi)
PMML Model
from TransactionStream
#ml:applyModel('/path/logisticRegressionModel1.xml',
               timestamp, amount, ip)
insert into PotentialFraudsStream;
76. Pattern 13: Online Control
o What? Control something Online. These would
involve problems like current situation awareness,
predicting next value(s), and deciding on corrective
actions.
o Use Cases?
o Autopilot
o Self-driving
o Robotics
86. Scalable Realtime Solutions: Spark Streaming
o Supports distributed processing
o Runs micro-batches
o Does not support pattern & sequence detection
87. Scalable Realtime Solutions: Apache Storm
o Supports distributed processing
o Stream processing engine
88. Why not use Apache Storm?
Advantages
o Supports distributed processing
o Supports partitioning
o Extensible
o Open source
Disadvantages
o Need to write Java code
o Need to start from basic principles (& data structures)
o Adapting to changes is slow
o No support to govern artifacts
89. WSO2 CEP += Apache Storm
Advantages
o Supports distributed processing
o Supports partitioning
o Extensible
o Open source
Disadvantages addressed
o No need to write Java code (supports a SQL-like query language)
o No need to start from basic principles (supports a high-level language)
o Adapting to changes is fast
o Artifacts governed using toolboxes
o etc.
99. HA / Persistence
o Option 1: Side by side
o Recommended
o Takes 2X hardware
o Gives zero downtime
o Option 2: Snapshot and restore
o Uses less hardware
o Will lose events between snapshots
o Downtime during recovery
o ** In some scenarios you can use event tables to keep intermediate state
101. Siddhi Query : Function Extension
from TempStream
select deviceID, roomNo,
       custom:toKelvin(temp) as tempInKelvin,
       'K' as scale
insert into OutputStream;
102. Siddhi Query : Aggregator Extension
from TempStream
select deviceID, roomNo, temp,
       custom:stdev(temp) as stdevTemp,
       'C' as scale
insert into OutputStream;
103. Siddhi Query : Window Extension
from TempStream
#window.custom:lastUnique(roomNo, 2 min)
select *
insert into OutputStream;
104. Siddhi Query : Transform Extension
from XYZSpeedStream
#transform.custom:getVelocityVector(v,vx,vy,vz)
select velocity, direction
insert into SpeedStream ;