An Introduction to the WSO2
Analytics Platform
Srinath Perera
VP Research WSO2, Apache Member
(@srinath_perera)
srinath@wso2.com
Collect Data
• One Sensor API to publish events
- REST, Thrift, Java, JMS, Kafka
- Java clients, JavaScript clients*
• First you define streams (think of a stream as an infinite table in a SQL DB)
• Then publish events via the Sensor API
“Publish once, process any way you like”
Collecting Data: Example
• Java example: create and send events
• Events are sent asynchronously
• See the client given in http://goo.gl/vIJzqc for more info
// Create the agent and the asynchronous publisher
Agent agent = new Agent(agentConfiguration);
AsyncDataPublisher publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );
// Define stream
StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
definition.addPayloadData("sid", STRING);
...
// Initialize stream
publisher.addStreamDefinition(definition);
...
// Send events
Event event = new Event();
event.setPayloadData(eventData);
publisher.publish(STREAM_NAME, VERSION, event);
Data Collection Examples
• Collect data from inbuilt agents in WSO2 products, Tomcat, etc.
• Collect your log data via Logstash
• Collect JVM and JMX stats via an agent
• Ingest data from message queues such as JMS or Kafka
• Pull data from an RSS feed, or scrape a web page
• Write a custom agent to collect data from your system and push it to DAS (WSO2 Data Analytics Server)
Photo credit http://www.torange.us/ CC license
Analysis: Batch Analytics
• Batch analytics reads data from disk (or some other storage) and processes it record by record
• “MapReduce” is the most widely used technology for batch analytics
– Apache Hadoop
– Apache Spark: 30X faster and much more flexible
• Analytics (min, max, average, correlation, histograms; may join or group data in many ways)
• Key Performance Indicators (KPIs)
– E.g. profit per square foot for retail
• Presented as a dashboard
SQL-like Queries: Spark SQL
• Since many people understand SQL, Hive made large-scale (Big Data) processing accessible to many
• Expressive, short, and sweet
• Defines core operations that cover 90% of problems
• Lets experts dig in when they like (via User Defined Functions)
insert overwrite table BusSpeed
select getHour(ts) as hour, avg(v) as avgV, busID
from BusStream group by busID, getHour(ts);
Spark SQL Query
• Count entries with a non-empty username, grouped by username and ordered by count (top 10)
SELECT username, COUNT(*) AS count FROM wikiData WHERE
username <> '' GROUP BY username ORDER BY count DESC
LIMIT 10
Use Case: API Usage
• Looks at different API calls by country
• Designed to draw attention to which APIs are used and where
Value of Some Insights Degrades Fast!
• For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time
• We need technology that can produce outputs fast
• Static queries that need very fast output (alerts, realtime control)
• Dynamic and interactive queries (data exploration)
Realtime Analytics: Complex Event
Processing
CEP Queries 1
• Calculate the average temperature over a 1-minute sliding window, grouped by roomNo
define stream TempStream (roomNo string, temp double);
from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into AvgRoomTempStream;
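To make the sliding window concrete, here is a minimal Java sketch of the per-room average that the query above describes. It is an illustration under stated assumptions, not how the CEP engine is implemented: the class `SlidingRoomAverage` and its method names are hypothetical, timestamps are assumed to arrive in non-decreasing order, and output is produced only when an event arrives (the sketch does not emit anything when an event expires).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a per-room sliding time window, not the CEP engine's implementation.
final class SlidingRoomAverage {
    private static final class Reading {
        final long timestampMs;
        final double temp;
        Reading(long timestampMs, double temp) {
            this.timestampMs = timestampMs;
            this.temp = temp;
        }
    }

    private final long windowMs;
    private final Map<String, Deque<Reading>> windows = new HashMap<>();
    private final Map<String, Double> sums = new HashMap<>();

    SlidingRoomAverage(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    // Adds one event and returns the room's average over the last windowMs milliseconds.
    // Assumes timestamps are non-decreasing per room.
    double onEvent(String roomNo, double temp, long timestampMs) {
        Deque<Reading> window = windows.computeIfAbsent(roomNo, k -> new ArrayDeque<Reading>());
        window.addLast(new Reading(timestampMs, temp));
        double sum = sums.getOrDefault(roomNo, 0.0) + temp;
        // Expire readings that have fallen out of the window; the newest reading always stays.
        while (timestampMs - window.peekFirst().timestampMs >= windowMs) {
            sum -= window.removeFirst().temp;
        }
        sums.put(roomNo, sum);
        return sum / window.size();
    }
}
```

Each room keeps its own queue, which mirrors the `group by roomNo` clause: a reading for one room never affects another room's average.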
CEP Queries 2
• Using data from a football game
• The kick stream shows kicks on the ball by players
• Ball possession: a hit by one player, followed by any number of hits by the same player, followed by a hit by someone else
from every k1 = KickStream,
KickStream[playerid == k1.playerid]*,
KickStream[playerid != k1.playerid]
select ..
insert into BallPosessionStream;
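The possession pattern above can also be read as a small state machine. The following Java sketch is a simplification for illustration, not the engine's matching semantics: `BallPossessionTracker` and `Possession` are hypothetical names, and the sketch reports one possession per uninterrupted run of kicks by the same player, whereas a pattern started with `every` may begin a new match at each kick.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-written state machine, not how the CEP engine evaluates the query.
final class BallPossessionTracker {
    // One completed possession: who held the ball and for how many consecutive kicks.
    static final class Possession {
        final String playerId;
        final int kicks;
        Possession(String playerId, int kicks) {
            this.playerId = playerId;
            this.kicks = kicks;
        }
    }

    private String currentPlayer;   // null until the first kick arrives
    private int kicks;
    private final List<Possession> completed = new ArrayList<Possession>();

    // Feed kicks in arrival order.
    void onKick(String playerId) {
        if (playerId.equals(currentPlayer)) {
            kicks++;                 // same player again: possession continues
            return;
        }
        if (currentPlayer != null) {
            // A different player kicked: the previous possession is complete.
            completed.add(new Possession(currentPlayer, kicks));
        }
        currentPlayer = playerId;
        kicks = 1;
    }

    List<Possession> completed() {
        return completed;
    }
}
```

Note that the possession still in progress is not reported until another player touches the ball, which matches the last step of the pattern.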
People Tracking via BLE
• Track people through BLE via
triangulation
• Higher level logic via Complex
Event Processing
• Traffic Monitoring
• Smart retail
• Airport management
Realtime Soccer Analysis
Watch at:
https://www.youtube.com/watch?v=nRI6buQ0NOM
Scaling CEP Queries on top of Storm
▪ Accepts CEP queries with hints about how to partition streams
▪ Partitions the streams, builds an Apache Storm topology running CEP nodes as Storm Spouts, and runs it. See http://goo.gl/pP3kdX for more info.
CEP Queries on Storm
• @dist(parallel='4') asks to run the query on 4 nodes
• Use a partition definition to split the data so the pieces can run in parallel
define partition on TempStream.region {
@dist(parallel='4')
from TempStream[temp > 33]
insert into HighTempStream;
}
from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
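The idea behind partitioning by `region` can be sketched in plain Java: hash the partition key to pick one of the parallel workers, so that all events for a region reach the same worker, and let each worker apply the filter to its own share. This is a rough sketch under assumptions, not the Storm topology the platform builds; `RegionPartitioner`, `TempEvent`, `workerFor`, `route` and `highTemps` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hash partitioning by key. Class and method names are hypothetical.
final class RegionPartitioner {
    static final class TempEvent {
        final String region;
        final double temp;
        TempEvent(String region, double temp) {
            this.region = region;
            this.temp = temp;
        }
    }

    // Maps a region to a worker index in [0, parallelism); a region always gets the same worker.
    static int workerFor(String region, int parallelism) {
        return Math.floorMod(region.hashCode(), parallelism);
    }

    // Splits the stream into one list per worker, keyed by region.
    static List<List<TempEvent>> route(List<TempEvent> events, int parallelism) {
        List<List<TempEvent>> perWorker = new ArrayList<List<TempEvent>>();
        for (int i = 0; i < parallelism; i++) {
            perWorker.add(new ArrayList<TempEvent>());
        }
        for (TempEvent event : events) {
            perWorker.get(workerFor(event.region, parallelism)).add(event);
        }
        return perWorker;
    }

    // The filter each worker applies to its own partition (temp > 33 in the query above).
    static List<TempEvent> highTemps(List<TempEvent> partition, double threshold) {
        List<TempEvent> result = new ArrayList<TempEvent>();
        for (TempEvent event : partition) {
            if (event.temp > threshold) {
                result.add(event);
            }
        }
        return result;
    }
}
```

Because the filter needs no state shared across regions, the workers can run independently; only the final hourly maximum has to see the merged output.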
Interactive Analytics
• The best way to explore data is by asking ad hoc questions
• Interactive analytics (search) lets you query the system and receive fast results (<10s)
• Shows data in context (e.g. by grouping events from the same transaction together)
• Built using Lucene-based indexes
SparkSQL> SELECT * FROM TWITTER_DATA
Predictive Analytics
• Can you “write a program to drive a car”?
• Machine learning
• Takes in lots of examples and builds a program that matches those examples
• We call that program a “model”
• Lots of tools
- R (statistical language)
- scikit-learn (Python)
- Apache Spark’s MLBase and Apache Mahout (Java)
Predictive Analytics in DAS
• Building models
– With the WSO2 Machine Learner product via a wizard (powered by MLlib)
– Build models using R and export them as PMML
• Built models can be used with both WSO2 CEP and ESB
Using the Model
• Within CEP
from InputStream#ml:predict('/../diabetes-model', 'double')
select *
insert into PredictionStream;
• Within ESB
<predict>
<model storage-location="../downloaded-ml-model"/>
<features>
<feature name="SI2" expression="$body/features/SI2"/>
..
</features>
<predictionOutput property="result"/>
</predict>
WSO2 Machine Learner
• Upload or select data
• Explore the data
• Train a Machine learning
model
WSO2 Machine Learner
• Compare Results
• Understand why
• Iterate
Supported Algorithms
• Deep learning based classification (H2O’s Stacked Autoencoders Classifier)
• Classification and regression algorithms: Decision Trees, Linear Regression, Lasso Regression, SVM, Naïve Bayes
• K-Means clustering for unsupervised learning on your data
• Anomaly detection using the K-Means algorithm to identify fraud, network penetration and other difficult scenarios
• Recommendations engine (Collaborative Filtering algorithm)
Predict Wait Time in the Airport
• Predicts the time to go through the airport
• Real-time updates and events to passengers
• Lets the airport manage by allocating resources
Predict Promising Customers
• A typical website can get millions of users
• Only a very small fraction converts
• For each user, we know what they access, where they work, country, browser, OS, etc.
• The problem is to predict which users will convert
• Used Logistic Regression, Random Forest, Survival Modeling, etc.
Predict Super Bowl
• Predicted 7 of the 11 games
• Done with the Random Forest algorithm
• Even the ones we missed are instructive
See Yuda’s post: Predicting the Super Bowl with Machine Learning
Anomaly Detection: Markov Models
• Can model the probability of a sequence
• Given a sequence, can predict its likelihood and use that to detect anomalies
• Implemented with WSO2 CEP
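A minimal Java sketch of this idea follows, using a first-order Markov chain: transition probabilities are estimated from example sequences, and a new sequence is scored by its average log-probability per transition. Everything here is an assumption made for illustration and not the WSO2 CEP implementation: the class `MarkovSequenceScorer`, the add-one smoothing, and the caller-chosen threshold.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: first-order Markov chain with add-one smoothing (an assumption).
final class MarkovSequenceScorer {
    private final Map<String, Map<String, Integer>> transitions = new HashMap<>();
    private final Map<String, Integer> outgoing = new HashMap<>();
    private final Set<String> states = new HashSet<>();

    // Counts the state-to-state transitions seen in one example sequence.
    void train(List<String> sequence) {
        states.addAll(sequence);
        for (int i = 0; i + 1 < sequence.size(); i++) {
            String from = sequence.get(i);
            String to = sequence.get(i + 1);
            transitions.computeIfAbsent(from, k -> new HashMap<String, Integer>())
                       .merge(to, 1, Integer::sum);
            outgoing.merge(from, 1, Integer::sum);
        }
    }

    // Estimated P(to | from); smoothing gives unseen transitions a small non-zero probability.
    double probability(String from, String to) {
        Map<String, Integer> row = transitions.get(from);
        int count = (row == null) ? 0 : row.getOrDefault(to, 0);
        int total = outgoing.getOrDefault(from, 0);
        int vocabulary = Math.max(states.size(), 1);
        return (count + 1.0) / (total + vocabulary);
    }

    // Average log-probability per transition; lower means less likely under the model.
    double averageLogLikelihood(List<String> sequence) {
        if (sequence.size() < 2) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 0; i + 1 < sequence.size(); i++) {
            sum += Math.log(probability(sequence.get(i), sequence.get(i + 1)));
        }
        return sum / (sequence.size() - 1);
    }

    // Flags a sequence whose score falls below a threshold chosen by the caller.
    boolean isAnomalous(List<String> sequence, double threshold) {
        return averageLogLikelihood(sequence) < threshold;
    }
}
```

Averaging per transition keeps the score comparable across sequences of different lengths; the threshold would normally be tuned on sequences known to be normal.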
Anomaly Detection: Clustering
• Use clustering to identify normal behavior as clusters
• Consider points away from all clusters as anomalies
• A point is considered away from a cluster if it is outside the 99th percentile line for that cluster
• Included in WSO2 ML
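The percentile rule can be sketched in Java as follows. This is an illustration under assumptions, not the WSO2 ML implementation: the class `ClusterAnomalyDetector` is hypothetical, the centroids are assumed to come from a clustering step (for example k-means) run elsewhere, distance is Euclidean, and the percentile boundary uses the nearest-rank rule over the training points assigned to each cluster.

```java
import java.util.Arrays;

// Illustrative only: names and the nearest-rank percentile rule are assumptions.
final class ClusterAnomalyDetector {
    private final double[][] centroids;
    private final double[] thresholds;  // per-cluster distance at the chosen percentile

    // Centroids are assumed to come from a clustering step (e.g. k-means) run elsewhere.
    ClusterAnomalyDetector(double[][] centroids, double[][] training, double percentile) {
        this.centroids = centroids;
        this.thresholds = new double[centroids.length];
        int[] owner = new int[training.length];
        int[] sizes = new int[centroids.length];
        for (int i = 0; i < training.length; i++) {
            owner[i] = nearest(training[i]);
            sizes[owner[i]]++;
        }
        for (int c = 0; c < centroids.length; c++) {
            double[] distances = new double[sizes[c]];
            int n = 0;
            for (int i = 0; i < training.length; i++) {
                if (owner[i] == c) {
                    distances[n++] = distance(training[i], centroids[c]);
                }
            }
            Arrays.sort(distances);
            // Nearest-rank percentile: smallest distance covering `percentile` % of the cluster.
            int rank = (int) Math.ceil(percentile / 100.0 * n) - 1;
            thresholds[c] = (n == 0) ? 0.0 : distances[Math.max(0, Math.min(n - 1, rank))];
        }
    }

    // A point is anomalous when it lies outside the percentile boundary of every cluster.
    boolean isAnomaly(double[] point) {
        for (int c = 0; c < centroids.length; c++) {
            if (distance(point, centroids[c]) <= thresholds[c]) {
                return false;
            }
        }
        return true;
    }

    private int nearest(double[] point) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance(point, centroids[c]) < distance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }
}
```

With a 99th percentile boundary, roughly 1% of the training points of each cluster sit on or beyond the line, so the percentile directly controls how strict the detector is.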
Communicate: Dashboards
• Dashboards give an “overall idea” at a glance (e.g. a car dashboard)
– Boring when everything is good!
• Build your own dashboard
– WSO2 DAS supports a gadget generation wizard
– You can write your own gadgets using D3 and JavaScript
Gadget Generation Wizard
• Starts with data in tabular format
• Map each column to a dimension in your plot, like X, Y, color, point size, etc.
• Create a chart with a few clicks
Powered by the VizGrammar library, which uses Vega underneath (see https://github.com/wso2/VizGrammar)
Communicate: Alerts
▪ Done with CEP Queries
▪ Last Mile
- Email, SMS
- Push notifications to a UI
- Pager
- Trigger physical Alarm
Real Life Use Cases
▪ Cisco (OEM the platform with Cisco solutions, Health, Smart Parking)
▪ Experian (Digital Marketing) - see video
▪ Pacific Controls (Smart City Platform, vehicle tracking, building monitoring) - see video
▪ Throttling and Anomaly Detection (by a group of telco companies)
▪ API Analytics (13+ customers)
No battle plan survives
contact with the enemy
--Helmuth von Moltke
Key Differentiators
• Open source, under the Apache 2 license
• “Publish data once, analyze it any way you like” experience
• Flexible packaging or as a scalable cluster
• Rich, extensible, SQL-like configuration language
• Compact, easy-to-learn syntax addressing complex requirements such as time windows, patterns, and sequences, which would be complex to develop in a programming language such as Java
• Rich set of data connectors, which can be easily extended
More Information
▪ Introducing WSO2 Analytics Platform: Note for Architects, https://iwringer.wordpress.com/2015/03/18/introducing-wso2-analytics-platform-note-for-architects/
▪ WSO2 Data Analytics Server, http://wso2.com/products/data-analytics-server/
▪ WSO2 Complex Event Processor, http://wso2.com/products/complex-event-processor/
▪ WSO2 Machine Learner, http://wso2.com/products/machine-learner/
Thank You
