Scalable Realtime
Analytics with
declarative, SQL like,
Complex Event
Processing Scripts
Srinath Perera
Director, Research WSO2
Apache Member
(@srinath_perera)
srinath@wso2.com
(Batch) Analytics
Scientists have been doing this for 25 years with
MPI (1991) on special hardware.
It took off with Google’s MapReduce
paper (2004); Apache Hadoop, Hive, and a
whole ecosystem followed.
It was successful, so here we are!
But, processing takes time.
Value of Some Insights Degrades Fast!
For some use cases (e.g. stock markets, traffic, surveillance, patient
monitoring), the value of insights degrades very quickly with time.
- E.g. stock markets and speed of light
We need technology that can produce
outputs fast
- Static Queries, but need very fast output
(Alerts, Realtime control)
- Dynamic and Interactive Queries ( Data
exploration)
History
Realtime analytics is not new either!
- Active Databases (2000+)
- Stream processing (Aurora, Borealis (2005+),
and later Storm)
- Distributed Streaming Operators (a
database research topic around 2005)
- CEP vendor roadmap (from
http://www.complexevents.com/2014/12/03/cep-tooling-market-survey-2014/)
Realtime Analytics Tools
I. Stream Processing
Program a set of processors and wire them up; data flows through
the graph.
A middleware framework handles data flow, distribution, and fault
tolerance (e.g. Apache Storm, Samza)
Processors may be in the same machine or multiple machines
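As a rough illustration of "processors wired into a graph", the sketch below chains two Python generators in a single process. The processor names (`drop_invalid`, `tag_alerts`), the `wire` helper, and the sample readings are all hypothetical; a framework such as Apache Storm or Samza would additionally handle distribution and fault tolerance, which this sketch does not attempt.

```python
def drop_invalid(stream):
    """Processor 1: discard readings that carry no temperature."""
    for event in stream:
        if event.get("temp") is not None:
            yield event


def tag_alerts(stream, threshold=30.0):
    """Processor 2: annotate each reading with an alert flag."""
    for event in stream:
        yield {**event, "alert": event["temp"] > threshold}


def wire(source, *processors):
    """Connect processors in order, so events flow from one to the next."""
    stream = iter(source)
    for processor in processors:
        stream = processor(stream)
    return stream


# Hypothetical sample input for illustration only.
readings = [
    {"roomNo": 1, "temp": 22.5},
    {"roomNo": 2, "temp": None},
    {"roomNo": 3, "temp": 31.0},
]
processed = list(wire(readings, drop_invalid, tag_alerts))
```

Here the graph is a simple line of two nodes; a real topology could fan out or merge streams, and each node could run on a different machine.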
II. Complex Event Processing
III. Micro Batch
Process data in small batches, then
combine the partial results into the final result
(e.g. Spark)
Works for simple aggregates, but
tricky to do this for complex
operations (e.g. Event Sequences)
Can do it with MapReduce as well if
the deadlines are not too tight.
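The micro-batch idea for a simple aggregate can be sketched as follows: count failures per batch, then fold each partial count into a running total. This is a plain-Python illustration under assumed event fields (`service`, `status`), not Spark code.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Cut an event iterable into consecutive lists of at most batch_size items."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_failures(batch):
    """Partial result for one batch: failure count per service."""
    return Counter(e["service"] for e in batch if e["status"] == "failed")


def running_failure_counts(events, batch_size=2):
    """Combine per-batch partial counts into a running total after each batch."""
    total = Counter()
    for batch in micro_batches(events, batch_size):
        total.update(count_failures(batch))
        yield dict(total)


# Hypothetical sample events for illustration only.
events = [
    {"service": "auth", "status": "failed"},
    {"service": "auth", "status": "ok"},
    {"service": "db", "status": "failed"},
    {"service": "auth", "status": "failed"},
    {"service": "db", "status": "ok"},
]
snapshots = list(running_failure_counts(events, batch_size=2))
```

Counts combine easily because they are additive across batches. An event sequence that starts in one batch and ends in the next has no such simple merge step, which is why complex operations are harder in this model.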
IV. OLAP Style In Memory Computing
Usually done to support interactive
queries
Index data to make it
readily accessible so you can respond
to queries fast (e.g. Apache Drill).
Tools like Druid, VoltDB and SAP
Hana can do this with all data in
memory to make things really fast.
Realtime Analytics Patterns
Simple counting (e.g. failure count)
Counting with Windows ( e.g. failure count every hour)
Preprocessing: filtering, transformations (e.g. data cleanup)
Alerts , thresholds (e.g. Alarm on high temperature)
Data Correlation, Detect missing events, detecting erroneous data
(e.g. detecting failed sensors)
Joining event streams (e.g. detect a hit on soccer ball)
Merge with data in a database, collect, update data conditionally
Realtime Analytics Patterns (contd.)
Detecting Event Sequence Patterns (e.g. small transaction followed
by large transaction)
Tracking: follow a related entity’s state in space, time, etc. (e.g.
location of airline baggage, vehicles, wildlife)
Detecting trends: rise, turn, fall, outliers, complex trends like triple
bottom, etc. (e.g. algorithmic trading, SLA, load balancing)
Learning a Model (e.g. Predictive maintenance)
Predicting next value and corrective actions (e.g. automated car)
Apache Hive
A SQL like data processing language
Since many understand SQL, Hive
made large-scale (Big Data)
processing accessible to many
Expressive, short, and sweet.
Define core operations that cover 90%
of problems
Lets experts dig in when they like!
(Batch Processing, Hive)
(Realtime Analytics, X)
What is X?
CEP = SQL for Realtime Analytics
Easy to follow from SQL
Expressive, short, and sweet.
Define core operations that cover 90% of
problems
Lets experts dig in when they like!
Let's look at the core operations.
Operators: Filters
Assume a temperature stream
Here weather:convertFtoC() is a
user defined function. They are
used to extend the language.
define stream TempStream (ts long, roomNo int, temp double);
from TempStream[weather:convertFtoC(temp) > 30.0
and roomNo != 2043]
select roomNo, temp
insert into HotRoomsStream;
Usecases:
- Alerts , thresholds (e.g. Alarm on
high temperature)
- Preprocessing: filtering,
transformations (e.g. data cleanup)
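To make the filter semantics concrete, here is a small Python sketch of the same logic: apply a user defined conversion, keep events that pass the condition, and project the selected fields. The function names and sample events are assumptions for illustration; this is not how the CEP engine is implemented.

```python
def convert_f_to_c(temp_f):
    """Stand-in for the user defined function weather:convertFtoC()."""
    return (temp_f - 32.0) * 5.0 / 9.0


def hot_rooms(temp_stream, threshold_c=30.0, excluded_room=2043):
    """Emit (roomNo, temp) for events above the threshold, skipping one room."""
    for event in temp_stream:
        is_hot = convert_f_to_c(event["temp"]) > threshold_c
        if is_hot and event["roomNo"] != excluded_room:
            yield {"roomNo": event["roomNo"], "temp": event["temp"]}


# Hypothetical sample events; temperatures are in Fahrenheit.
temp_stream = [
    {"ts": 1, "roomNo": 101, "temp": 95.0},    # 35 C, kept
    {"ts": 2, "roomNo": 2043, "temp": 100.0},  # hot, but excluded room
    {"ts": 3, "roomNo": 102, "temp": 70.0},    # about 21 C, dropped
]
hot = list(hot_rooms(temp_stream))
```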
Operators: Windows and Aggregation
Support many window types
- Batch Windows, Sliding windows, Custom windows
Usecases
- Simple counting (e.g. failure count)
- Counting with Windows ( e.g. failure count every hour)
from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert into HotRoomsStream;
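A sliding time window with an average can be sketched in Python as below. The sketch assumes timestamps in milliseconds that arrive in order, averages per room (an assumption about the query's intent), and expires old events only when a new event arrives rather than on a timer. The class name is hypothetical.

```python
from collections import defaultdict, deque


class SlidingTimeAverage:
    """Per-room average temperature over the last window_ms milliseconds."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.events = deque()            # (ts, roomNo, temp) in arrival order
        self.sums = defaultdict(float)   # running sum per room
        self.counts = defaultdict(int)   # running count per room

    def on_event(self, ts, room_no, temp):
        # Drop events that have slid out of the window.
        while self.events and self.events[0][0] <= ts - self.window_ms:
            _, old_room, old_temp = self.events.popleft()
            self.sums[old_room] -= old_temp
            self.counts[old_room] -= 1
        # Add the new event and report the updated average for its room.
        self.events.append((ts, room_no, temp))
        self.sums[room_no] += temp
        self.counts[room_no] += 1
        return room_no, self.sums[room_no] / self.counts[room_no]


window = SlidingTimeAverage(window_ms=60_000)
first = window.on_event(0, 1, 20.0)
second = window.on_event(30_000, 1, 30.0)
third = window.on_event(70_000, 1, 40.0)   # the event at ts=0 has expired
```

Keeping running sums avoids rescanning the window on every event; a long-running version would need to guard against floating point drift.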
Operators: Patterns
Models a "followed by" relation: e.g.
event A followed by event B
Very powerful tool for tracking
and detecting patterns
from every (a1 = TempStream)
-> a2 = TempStream[temp > a1.temp + 5]
within 1 day
select a2.ts as ts, a2.temp - a1.temp as diff
insert into HotDayAlertStream;
Usecases
- Detecting Event Sequence Patterns
- Tracking
- Detect trends
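The "followed by" pattern can be approximated in Python as follows: every event opens a partial match, and a later event that is more than 5 degrees warmer within one day completes it. This is a simplified sketch of the idea under assumed millisecond timestamps; the exact matching semantics of a CEP engine may differ.

```python
DAY_MS = 24 * 60 * 60 * 1000


def detect_rises(events, min_rise=5.0, within_ms=DAY_MS):
    """Yield a match when a later reading exceeds an earlier one by min_rise."""
    pending = []  # earlier (ts, temp) readings still waiting for a match
    for ts, temp in events:
        still_pending = []
        for start_ts, start_temp in pending:
            if ts - start_ts > within_ms:
                continue  # the time bound passed, drop this partial match
            if temp > start_temp + min_rise:
                yield {"ts": ts, "diff": temp - start_temp}
            else:
                still_pending.append((start_ts, start_temp))
        pending = still_pending
        pending.append((ts, temp))  # "every": each event starts a new match


# Hypothetical readings as (timestamp in ms, temperature).
readings = [
    (0, 20.0),
    (1_000, 22.0),
    (2_000, 26.0),             # 6 degrees above the reading at ts=0
    (DAY_MS + 1_500, 40.0),    # the ts=1000 reading has expired by now
]
alerts = list(detect_rises(readings))
```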
Operators: Joins
Join two data streams based on a condition and windows
Usecases
- Data Correlation, Detect missing events, detecting erroneous data
- Joining event streams
from TempStream[temp > 30.0]#window.time(1 min) as T
join RegulatorStream[isOn == false]#window.length(1) as R
on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream;
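A windowed join keeps recent events from each side and matches a new arrival against the other side's window. The Python sketch below mirrors the query above under some assumptions: timestamps are in milliseconds and arrive in order, the time window holds hot readings for one minute, and the length-1 window holds only the most recent "off" regulator event. The class and method names are hypothetical.

```python
from collections import deque


class HotRoomRegulatorJoin:
    """Join hot temperature readings with the latest switched-off regulator."""

    def __init__(self, window_ms=60_000):
        self.window_ms = window_ms
        self.hot_temps = deque()  # (ts, roomNo) for readings above 30.0
        self.last_off = None      # most recent regulator event with isOn == false

    def _expire(self, now):
        while self.hot_temps and self.hot_temps[0][0] <= now - self.window_ms:
            self.hot_temps.popleft()

    def on_temp(self, ts, room_no, temp):
        self._expire(ts)
        if temp <= 30.0:
            return []  # filtered out before the window
        self.hot_temps.append((ts, room_no))
        reg = self.last_off
        if reg is not None and reg["roomNo"] == room_no:
            return [{"roomNo": room_no, "deviceID": reg["deviceID"], "action": "start"}]
        return []

    def on_regulator(self, ts, room_no, device_id, is_on):
        self._expire(ts)
        if is_on:
            return []  # filtered out before the window
        self.last_off = {"roomNo": room_no, "deviceID": device_id}
        return [
            {"roomNo": room_no, "deviceID": device_id, "action": "start"}
            for _, hot_room in self.hot_temps
            if hot_room == room_no
        ]


join = HotRoomRegulatorJoin()
out1 = join.on_regulator(0, 7, "reg-7", False)
out2 = join.on_temp(1_000, 7, 35.0)
out3 = join.on_temp(2_000, 8, 36.0)
out4 = join.on_regulator(100_000, 8, "reg-8", False)  # hot readings expired
```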
Operators: Access Data from the Disk
Event tables allow users to map a database to a window and join a
data stream with the window
Usecases
- Merge with data in a database, collect, update data conditionally
define stream TempStream (ts long, temp double);
define table HistTempTable (day long, avgT double);
from TempStream#window.length(1) join HistTempTable
on getDayOfYear(ts) == HistTempTable.day and temp > avgT
select ts, temp
insert into PurchaseUserStream;
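Joining a stream with stored data can be sketched in Python with an in-memory SQLite table standing in for the event table. The sketch assumes `ts` is a Unix timestamp in milliseconds (UTC) and that the intent is to compare each reading's `temp` with the stored average for that day of the year; the table contents are made up for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def day_of_year(ts_ms):
    """Stand-in for getDayOfYear(): day number (1..366) of a UTC timestamp."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).timetuple().tm_yday


# Hypothetical historical averages keyed by day of year.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE HistTempTable (day INTEGER PRIMARY KEY, avgT REAL)")
db.executemany("INSERT INTO HistTempTable VALUES (?, ?)", [(1, 5.0), (2, 6.0)])


def above_historical_average(temp_stream):
    """Emit readings that are warmer than the stored average for their day."""
    for ts, temp in temp_stream:
        row = db.execute(
            "SELECT avgT FROM HistTempTable WHERE day = ?", (day_of_year(ts),)
        ).fetchone()
        if row is not None and temp > row[0]:
            yield ts, temp


DAY_MS = 86_400_000
readings = [
    (0, 7.0),              # day 1, above 5.0
    (DAY_MS, 5.5),         # day 2, below 6.0
    (2 * DAY_MS, 50.0),    # day 3, no history stored
]
warm = list(above_historical_average(readings))
```

One lookup per event is simple but can become the bottleneck at high event rates, so caching the table in memory is a common design choice.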
Revisit Patterns
Predictive Analytics
- Build models and use them with
WSO2 CEP, BAM and ESB using the
upcoming WSO2 Machine Learner
product (2015 Q2)
- Build models using R, export them as
PMML, and use them within WSO2 CEP
- Call R scripts from CEP queries
- Regression and anomaly detection
operators in CEP
Case Study: Realtime Soccer Analysis
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
TFL Traffic Analysis
Built using TFL (Transport for London) open data feeds.
http://goo.gl/04tX6k
http://goo.gl/9xNiCm
Great, Does it Scale?
Idea I: Network of CEP Nodes
For scaling, we arrange CEP
processing nodes in a graph, as in
stream processing.
The graph can be implemented
using a stream processing engine
like Apache Storm.
Idea II: Compile SQL like Queries to a
Network of CEP Nodes
from TempStream[temp > 33]
insert into HighTempStream;
from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
How do We partition the Data to scale
up the Analysis?
Let's follow MapReduce
MapReduce does not scale by itself; it asks users to break
the problem into many small independent problems.
Idea III: Let the Users specify Parallelism
The language includes parallel constructs:
partitions, pipelines, distributed
operators
Assign each partition to a different
node, and partition the data accordingly
define partition on TempStream.region {
from TempStream[temp > 33]
insert into HighTempStream;
}
from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
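One way to picture "assign each partition to a different node" is hash partitioning on the partition key. The Python sketch below routes events by `region` to one of a fixed number of nodes and runs the filter independently on each. The routing function, the node count, and the sample events are assumptions for illustration, not the product's actual scheme.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick a node for a partition key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_by_region(events, num_nodes):
    """Group events by the node responsible for their region."""
    per_node = defaultdict(list)
    for event in events:
        per_node[node_for(event["region"], num_nodes)].append(event)
    return per_node


def high_temps(events, threshold=33):
    """The filter each node runs on its own share of the data."""
    return [e for e in events if e["temp"] > threshold]


# Hypothetical sample events, ordered by timestamp.
events = [
    {"ts": 1, "region": "north", "temp": 35},
    {"ts": 2, "region": "south", "temp": 30},
    {"ts": 3, "region": "east", "temp": 40},
    {"ts": 4, "region": "north", "temp": 31},
    {"ts": 5, "region": "south", "temp": 36},
]
per_node = partition_by_region(events, num_nodes=3)

# Merging the per-node outputs gives the same result as one big filter.
merged = sorted(
    (e for node_events in per_node.values() for e in high_temps(node_events)),
    key=lambda e: e["ts"],
)
```

Because the filter looks at one event at a time, any partitioning gives the same answer. Operators that keep state per key, such as windows or patterns, rely on all events for a key reaching the same node, which is what key-based routing guarantees.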
Handling Ordering
When the data is processed in
parallel, output might be generated
out of order.
Due to lack of a global time, we
cannot trigger windows and other
time sensitive constructs
Solution: the current time needs to
be propagated through the graph
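One possible way to propagate time, sketched below in Python, is for each node to remember the latest timestamp seen from every upstream node and to treat the minimum of those as its current time. A window fires only once that minimum passes the window's end. This resembles the watermark or punctuation technique from stream processing and is an assumed design, not necessarily what WSO2 CEP does; the class and its methods are hypothetical, and timestamps from each upstream are assumed to be non-decreasing.

```python
class MergingNode:
    """Computes the max value per fixed window over several upstream inputs."""

    def __init__(self, upstream_ids, window_ms):
        self.latest = {u: None for u in upstream_ids}  # last ts seen per upstream
        self.window_ms = window_ms
        self.buffer = []                 # (ts, value) not yet assigned to a fired window
        self.next_window_end = window_ms

    def current_time(self):
        """Node time is the slowest upstream's time, or None until all have reported."""
        if any(t is None for t in self.latest.values()):
            return None
        return min(self.latest.values())

    def on_event(self, upstream_id, ts, value=None):
        """Handle an event; value=None means a pure time tick carrying no data."""
        self.latest[upstream_id] = ts
        if value is not None:
            self.buffer.append((ts, value))
        return self._fire_ready_windows()

    def _fire_ready_windows(self):
        fired = []
        now = self.current_time()
        while now is not None and now >= self.next_window_end:
            end = self.next_window_end
            start = end - self.window_ms
            values = [v for ts, v in self.buffer if start <= ts < end]
            self.buffer = [(ts, v) for ts, v in self.buffer if ts >= end]
            fired.append((end, max(values) if values else None))
            self.next_window_end += self.window_ms
        return fired


node = MergingNode(["a", "b"], window_ms=1_000)
r1 = node.on_event("a", 100, 34.0)
r2 = node.on_event("b", 1_200, 36.0)  # "b" is ahead, but "a" is still at 100
r3 = node.on_event("a", 900, 39.0)
r4 = node.on_event("a", 1_100)        # tick: both upstreams are now past 1000
```

The time tick lets a quiet upstream advance the clock without sending data, so windows are not held back indefinitely by an idle partition.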
Putting Everything Together
WSO2 CEP & Big Data Platform
CEP = SQL for Realtime Analytics
Easy to follow from SQL
Expressive, short, sweet and fast!!
Define core operations that cover 90% of
problems
Lets experts dig in when they like!
And it Scales!!
Questions?
Visit us at Booth 1025
http://wso2.com/landing/strata-hadoop-world-ca-2015/
