Introduction to WSO2 Data Analytics Platform

An Introduction to the WSO2 Analytics Platform
Srinath Perera
VP Research WSO2, Apache Member
(@srinath_perera)
srinath@wso2.com

Collect Data
• One Sensor API to publish events
- REST, Thrift, Java, JMS, Kafka
- Java clients, JavaScript clients*
• First you define streams (think of a stream as an infinite table in a SQL database)
• Then publish events via the Sensor API

“Publish once, process any way you like”

Collecting Data: Example
• Java example: create and send events
• Events are sent asynchronously
• See the client given in http://goo.gl/vIJzqc for more info
// Initialize Stream
Agent agent = new Agent(agentConfiguration);
publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );

// Define Stream
StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
definition.addPayloadData("sid", STRING);
...
publisher.addStreamDefinition(definition);
...

// Send events
Event event = new Event();
event.setPayloadData(eventData);
publisher.publish(STREAM_NAME, VERSION, event);
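
To illustrate what "events are sent asynchronously" means, here is a minimal Python sketch. It is not the WSO2 client API: `AsyncPublisher`, its `send` callback and the sample stream name are all hypothetical. The point is only that `publish` enqueues the event and returns at once, while a background thread does the delivery.

```python
import queue
import threading


class AsyncPublisher:
    """Hypothetical stand-in for an asynchronous event publisher (not the WSO2 API)."""

    def __init__(self, send):
        self._send = send                  # callable that delivers one event
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def publish(self, stream_name, version, payload):
        """Enqueue the event and return immediately."""
        self._queue.put((stream_name, version, payload))

    def _drain(self):
        # Runs on the background thread: deliver events in arrival order.
        while True:
            event = self._queue.get()
            if event is None:              # sentinel: stop the worker
                break
            self._send(event)

    def close(self):
        """Deliver everything still queued, then stop the background thread."""
        self._queue.put(None)
        self._worker.join()


# Usage: collect "delivered" events in a list instead of sending them over TCP.
delivered = []
publisher = AsyncPublisher(delivered.append)
publisher.publish("sample_stream", "1.0.0", {"sid": "sensor-1"})
publisher.close()
```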

Data Collection Examples
• Collect data from built-in agents in WSO2 products, Tomcat, etc.
• Collect your log data via Logstash
• Collect JVM and JMX stats via an agent
• Ingest data from message queues such as JMS or Kafka
• Pull data from an RSS feed, or scrape a web page
• Write a custom agent to collect data from your system and push it to DAS
Photo credit http://www.torange.us/ CC license

Analysis: Batch Analytics
• Batch analytics reads data from disk (or some other storage) and processes it record by record
• “MapReduce” is the most widely used technology for batch analytics
– Apache Hadoop
– Apache Spark (reported as up to 30X faster than Hadoop, and much more flexible)
• Analytics (min, max, average, correlation, histograms; may join or group data in many ways)
• Key Performance Indicators (KPIs)
– E.g. profit per square foot for retail
• Presented as a dashboard

SQL like Queries: Spark SQL
• Since many people understand SQL, Hive made large-scale Big Data processing accessible to many
• Expressive, short, and sweet
• Defines core operations that cover 90% of problems
• Lets experts dig in when they like (via User Defined Functions)
insert overwrite table BusSpeed
select getHour(ts) as hour, avg(v) as avgV, busID
from BusStream group by busID, getHour(ts);

Spark SQL Query
• Count the entries where username is not empty, grouped by username and ordered by the count
SELECT username, COUNT(*) AS count FROM wikiData WHERE
username <> '' GROUP BY username ORDER BY count DESC
LIMIT 10
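
The query above can be tried without a Spark cluster. The sketch below runs the same SQL text against an in-memory SQLite table, which is only a stand-in for Spark SQL. The `title` column and the sample rows are assumptions made up for illustration.

```python
import sqlite3


def top_contributors(rows, limit=10):
    """Run the slide's query on an in-memory SQLite table (stand-in for Spark SQL)."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only `username` comes from the slide; `title` is illustrative.
    conn.execute("CREATE TABLE wikiData (username TEXT, title TEXT)")
    conn.executemany("INSERT INTO wikiData VALUES (?, ?)", rows)
    query = (
        "SELECT username, COUNT(*) AS count FROM wikiData "
        "WHERE username <> '' GROUP BY username "
        "ORDER BY count DESC LIMIT ?"
    )
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result


# Made-up sample data: one anonymous edit (empty username) is filtered out.
sample_edits = [
    ("alice", "A"), ("bob", "B"), ("alice", "C"), ("", "D"),
    ("alice", "E"), ("bob", "F"), ("carol", "G"),
]
top = top_contributors(sample_edits)
```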

Use Case: API Usage
• Looking at API calls by country
• Designed to draw attention to which APIs are used and where

The Value of Some Insights Degrades Fast!
• For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring) the value of insights degrades very quickly with time.
• We need technology that can produce outputs fast
• Static queries, but need very fast output (alerts, realtime control)
• Dynamic and interactive queries (data exploration)

Realtime Analytics: Complex Event Processing

CEP Queries 1
• Calculate the average temperature over a 1-minute sliding window, grouped by roomNo
define stream TempStream (roomNo string, temp double);
from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into AvgRoomTempStream;
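
For readers new to sliding windows, the following Python sketch shows roughly what the query above computes. It is a simplification, not how the CEP engine is implemented: it emits an average only when a reading arrives (the query also reacts when events expire), and the class and method names are made up for illustration.

```python
from collections import defaultdict, deque


class SlidingAverage:
    """Per-room average of the readings seen in the last `window_seconds`."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)   # roomNo -> deque of (timestamp, temp)
        self.sums = defaultdict(float)     # roomNo -> running sum of temps in window

    def add(self, timestamp, room_no, temp):
        """Add one reading and return the current average for that room."""
        window = self.events[room_no]
        window.append((timestamp, temp))
        self.sums[room_no] += temp
        # Drop readings that have fallen out of the time window.
        while window and window[0][0] <= timestamp - self.window_seconds:
            _, old_temp = window.popleft()
            self.sums[room_no] -= old_temp
        return self.sums[room_no] / len(window)


# Usage with timestamps in seconds.
averages = SlidingAverage(window_seconds=60.0)
first = averages.add(0, "r1", 20.0)
second = averages.add(30, "r1", 30.0)
third = averages.add(70, "r1", 40.0)    # the reading from t=0 has expired
other_room = averages.add(70, "r2", 10.0)
```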

CEP Queries 2
• Uses data from a football game
• The kick stream shows kicks on the ball by players
• Ball possession is a hit by one player, followed by any number of hits by the same player, followed by a hit by someone else
from every k1 = KickStream,
KickStream[playerid == k1.playerid]*,
KickStream[playerid != k1.playerid]
select ..
insert into BallPosessionStream;
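
The pattern above can be read as a small state machine. The Python sketch below is a simplified illustration: it emits one possession record per unbroken run of kicks by the same player, whereas `every` in the query starts a new match at each kick. The tuple layout and function name are assumptions.

```python
def ball_possessions(kicks):
    """Yield (player_id, start_ts, end_ts, hits) for each run of kicks by one
    player that is ended by a kick from a different player.

    `kicks` is an iterable of (timestamp, player_id) in time order.
    """
    current = None  # [player_id, start_ts, hits] for the run in progress
    for ts, player_id in kicks:
        if current is None:
            current = [player_id, ts, 1]
        elif player_id == current[0]:
            current[2] += 1                # same player hits again
        else:
            # Someone else hit the ball: the previous possession is over.
            yield (current[0], current[1], ts, current[2])
            current = [player_id, ts, 1]


# Usage: the final run by p1 is still open, so it is not emitted.
sample_kicks = [(1, "p1"), (2, "p1"), (3, "p2"), (5, "p1")]
possessions = list(ball_possessions(sample_kicks))
```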

People Tracking via BLE
• Track people through BLE via triangulation
• Higher-level logic via Complex Event Processing
• Traffic Monitoring
• Smart retail
• Airport management

Realtime Soccer Analysis
Watch at:
https://www.youtube.com/watch?v=nRI6buQ0NOM

Scaling CEP Queries on top of Storm
▪ Accepts CEP queries with hints about how to partition streams
▪ Partitions the streams, builds an Apache Storm topology that runs CEP nodes as Storm bolts, and runs it. See http://goo.gl/pP3kdX for more info.

CEP Queries on Storm
• @dist(parallel='4') asks to run the query on 4 nodes
• Use a partition definition to break up the data so the queries can run in parallel
define partition on TempStream.region {
@dist(parallel='4')
from TempStream[temp > 33]
insert into HighTempStream;
}
from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
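
The idea behind the partitioned query can be sketched in plain Python: split events by region, filter each partition in parallel, then merge the results and take the maximum. This is only an illustration under stated assumptions. It uses local threads rather than Storm, treats the whole input as a single one-hour window, and the event dictionaries and function names are made up.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_region(events, parallel=4):
    """Route each event to one of `parallel` buckets by a stable hash of its region."""
    buckets = [[] for _ in range(parallel)]
    for event in events:
        index = zlib.crc32(event["region"].encode("utf-8")) % parallel
        buckets[index].append(event)
    return buckets


def high_temps(bucket):
    """The per-partition filter: keep readings above 33 degrees."""
    return [e for e in bucket if e["temp"] > 33]


def max_high_temp(events, parallel=4):
    """Filter the partitions in parallel, merge them, and return the maximum."""
    buckets = partition_by_region(events, parallel)
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        merged = [e for part in pool.map(high_temps, buckets) for e in part]
    return max((e["temp"] for e in merged), default=None)


# Usage with made-up readings.
readings = [
    {"region": "north", "temp": 31.0},
    {"region": "south", "temp": 36.5},
    {"region": "east", "temp": 40.5},
    {"region": "south", "temp": 34.0},
]
hottest = max_high_temp(readings)
```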

Interactive Analytics
• The best way to explore data is by asking ad-hoc questions
• Interactive analytics (search) lets you query the system and receive fast results (<10s)
• Shows data in context (e.g. by grouping events from the same transaction together)
• Built using Lucene-based indexes
SparkSQL> SELECT * FROM TWITTER_DATA

Predictive Analytics
• Can you “write a program to drive a car”?
• Machine learning
• Takes in a lot of examples and builds a program that matches those examples
• We call that program a “model”
• Lots of tools
- R (statistical language)
- scikit-learn (Python)
- Apache Spark’s MLlib and Apache Mahout (Java)

Predictive Analytics in DAS
• Building models
– With the WSO2 Machine Learner product via a wizard (powered by MLlib)
– Build models using R and export them as PMML
• Built models can be used with both WSO2 CEP and ESB

Using the Model
• Within CEP
from InputStream#ml:predict('/../diabetes-model', 'double')
select *
insert into PredictionStream;
• Within ESB
<predict>
<model storage-location="../downloaded-ml-model"/>
<features>
<feature name="SI2" expression="$body/features/SI2"/>
..
</features>
<predictionOutput property="result"/>
</predict>

WSO2 Machine Learner
• Upload or select data
• Explore the data
• Train a machine learning model

WSO2 Machine Learner
• Compare Results
• Understand why
• Iterate

Supported Algorithms
• Deep learning based classification (H2O’s Stacked Autoencoders Classifier)
• Classification and regression algorithms: Decision Trees, Linear Regression, Lasso Regression, SVM, Naïve Bayes
• K-Means clustering for unsupervised learning on your data
• Anomaly detection using the K-Means algorithm to identify fraud, network penetration and other difficult scenarios
• Recommendations Engine (Collaborative Filtering Algorithm)

Predict Wait Time at the Airport
• Predicting the time to go through the airport
• Real-time updates and events to passengers
• Lets the airport manage wait times by allocating resources

Predict Promising Customers
• A typical website can get millions of users
• Only a very small fraction converts
• For each user, we know what they access, where they work, country, browser, OS, etc.
• The problem is to predict which users will convert
• Used Logistic Regression, Random Forest, Survival Modeling, etc.

Predict Super Bowl
• Predicted 7 of the 11 games
• Done with the Random Forest algorithm
• Even the ones we missed are instructive
See Yuda’s post: Predicting the Super Bowl with Machine Learning

Anomaly Detection: Markov Models
• Can model the probability of a sequence
• Given a sequence, can predict its likelihood, and use that to detect anomalies
• Implemented with WSO2 CEP
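
As a rough illustration of the idea (not the WSO2 CEP implementation), the Python sketch below estimates first-order transition probabilities from example sequences and scores a new sequence by its average log-likelihood per step. The smoothing constant, the sample states and the function names are assumptions. A scored sequence must contain at least two states, all seen during training.

```python
import math
from collections import defaultdict


def train_transitions(sequences, smoothing=1.0):
    """Estimate P(next | previous) from example sequences, with additive smoothing."""
    states = sorted({s for seq in sequences for s in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev in states:
        total = sum(counts[prev].values()) + smoothing * len(states)
        probs[prev] = {nxt: (counts[prev][nxt] + smoothing) / total for nxt in states}
    return probs


def avg_log_likelihood(seq, probs):
    """Average log-probability per transition; lower means more unusual."""
    steps = list(zip(seq, seq[1:]))
    return sum(math.log(probs[a][b]) for a, b in steps) / len(steps)


# Usage: train on made-up "normal" sessions, then score two new ones.
normal_sessions = [["login", "browse", "buy", "logout"]] * 20
model = train_transitions(normal_sessions)
usual_score = avg_log_likelihood(["login", "browse", "buy", "logout"], model)
odd_score = avg_log_likelihood(["login", "buy", "login", "buy"], model)
# A threshold on the score (chosen from data) would flag `odd_score` as an anomaly.
```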

Anomaly Detection: Clustering
• Use clustering to identify normal behavior as clusters
• Consider points away from all clusters as anomalies
• A point is considered away from a cluster if it is outside the 99th percentile line for that cluster
• Included in WSO2 ML
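
The percentile rule above can be sketched in a few lines of Python. This is an illustration under assumptions, not the WSO2 ML code: the cluster centroids are taken as already computed (for example by K-Means), the "percentile line" is read as the 99th percentile of the distances from training points to their nearest centroid, and all names are made up.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def percentile_thresholds(points, centroids, pct=99.0):
    """Per cluster, the `pct`-th percentile (nearest-rank) of member distances."""
    per_cluster = [[] for _ in centroids]
    for p in points:
        index, distance = nearest(p, centroids)
        per_cluster[index].append(distance)
    thresholds = []
    for distances in per_cluster:
        distances.sort()
        rank = min(len(distances), max(1, math.ceil(pct * len(distances) / 100.0)))
        thresholds.append(distances[rank - 1] if distances else 0.0)
    return thresholds


def is_anomaly(point, centroids, thresholds):
    """A point is anomalous if it lies beyond its nearest cluster's threshold."""
    index, distance = nearest(point, centroids)
    return distance > thresholds[index]


# Usage with two made-up clusters of "normal" points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
training = [(0.01 * i, 0.0) for i in range(1, 101)]
training += [(10.0 + 0.01 * i, 10.0) for i in range(1, 101)]
limits = percentile_thresholds(training, centroids)
```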

Communicate: Dashboards
• Dashboards give an overall idea at a glance (e.g. a car dashboard)
– Boring when everything is good!
• Build your own dashboard
– WSO2 DAS supports a gadget generation wizard
– You can write your own gadgets using D3 and JavaScript

Gadget Generation Wizard
• Starts with data in tabular format
• Map each column to a dimension in your plot, such as X, Y, color, point size, etc.
• Create a chart with a few clicks
Powered by the VizGrammar library, which uses Vega underneath (see https://github.com/wso2/VizGrammar)

Communicate: Alerts
▪ Done with CEP Queries
▪ Last Mile
- Email, SMS
- Push notifications to a UI
- Pager
- Trigger physical Alarm

Real Life Use Cases
▪ Cisco (OEMs the platform with Cisco solutions, Health, Smart Parking)
▪ Experian (Digital Marketing) - see video
▪ Pacific Controls (Smart City Platform, vehicle tracking, building monitoring) - see video
▪ Throttling and Anomaly Detection (by a group of Telco companies)
▪ API Analytics (13+ customers)
“No battle plan survives contact with the enemy” -- Helmuth von Moltke

Key Differentiators
• Open source, under the Apache 2 license
• Publish data once, analyze it any way you like
• Flexible packaging: run it as a single server or as a scalable cluster
• Rich, extensible, SQL-like configuration language
• Compact, easy-to-learn syntax that addresses complex requirements such as time windows, patterns and sequences, which would be complex to develop in a programming language such as Java
• Rich set of data connectors, which can be easily extended

More Information
▪ Introducing WSO2 Analytics Platform: Note for Architects, https://iwringer.wordpress.com/2015/03/18/introducing-wso2-analytics-platform-note-for-architects/
▪ WSO2 Data Analytics Server, http://wso2.com/products/data-analytics-server/
▪ WSO2 Complex Event Processor, http://wso2.com/products/complex-event-processor/
▪ WSO2 Machine Learner, http://wso2.com/products/machine-learner/
