Today’s highly connected world is flooding businesses with big and fast-moving data. The ability to trawl this data ocean and identify actionable insights can deliver a competitive advantage to any organization. The WSO2 Analytics Platform enables businesses to do just that by providing batch, real-time, interactive and predictive analysis capabilities all in one place.
In this tutorial we will:
• Plug the WSO2 Analytics Platform into some common business use cases
• Showcase the numerous capabilities of the platform
• Demonstrate how to collect data, analyze it, make predictions, and communicate results effectively
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Data Needs
1. WSO2 Analytics Platform: The One Stop Shop for All Your Data Needs
Sinthuja Rajendran
Associate Technical Lead, WSO2
Nirmal Fernando
Associate Technical Lead, WSO2
2. WSO2 Analytics Platform
WSO2 Analytics Platform uniquely combines real-time, interactive,
batch, and predictive analytics to turn data from IoT, mobile and
Web apps into actionable insights
4. WSO2 Data Analytics Server
• Fully open-source solution for building systems and applications that
collect and analyze both real-time and persisted data and communicate the
results.
• Part of WSO2 Big Data Analytics Platform
• High performance data capture framework
• Highly available and scalable by design
• Pre-built Data Agents for WSO2 products
6. Data Processing Pipeline
Collect Data
• Define a schema for the data
• Send events to the batch
and/or real-time pipeline
• Publish events
Analyze
• Spark SQL for batch analytics
• Siddhi Query Language for real-time analytics
• Predictive models for machine learning
Communicate
• Alerts
• Dashboards
• APIs
8. Data Model
{
  "name": "stream.name",
  "version": "1.0.0",
  "nickName": "stream nick name",
  "description": "description of the stream",
  "metaData": [
    {"name": "meta_data_1", "type": "STRING"}
  ],
  "correlationData": [
    {"name": "correlation_data_1", "type": "STRING"}
  ],
  "payloadData": [
    {"name": "payload_data_1", "type": "BOOL"},
    {"name": "payload_data_2", "type": "LONG"}
  ]
}
● Data published conforming to a strongly typed data stream
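The stream definition above groups each event's attributes into meta, correlation, and payload sections, and DAS identifies a stream by its name plus version. A minimal plain-Java sketch of such an event — illustrative only, not the actual WSO2 databridge `Event` class:

```java
// Illustrative sketch of a strongly typed event (not the WSO2 databridge API).
public class StreamEvent {
    final String streamId;          // e.g. "stream.name:1.0.0"
    final Object[] metaData;        // e.g. { "meta_data_1" }
    final Object[] correlationData; // e.g. { "correlation_data_1" }
    final Object[] payloadData;     // e.g. { true, 42L } matching BOOL, LONG

    public StreamEvent(String streamId, Object[] meta, Object[] corr, Object[] payload) {
        this.streamId = streamId;
        this.metaData = meta;
        this.correlationData = corr;
        this.payloadData = payload;
    }

    // DAS refers to a stream as "<name>:<version>"
    public static String streamId(String name, String version) {
        return name + ":" + version;
    }
}
```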
12. Data Persistence
● Data Abstraction Layer to enable pluggable data connectors
○ RDBMS, Cassandra, HBase, custom..
● Analytics Tables
○ The data persistence entity in WSO2 Data Analytics Server
○ Provides a backend data source agnostic way of storing and retrieving data
○ Allows applications to be written so that they do not depend on a specific data
source, e.g. JDBC (RDBMS), Cassandra APIs, etc.
○ WSO2 DAS provides a standard REST API for accessing the Analytics Tables
● Analytics Record Stores
○ An Analytics Record Store stores a specific set of Analytics Tables
○ Event persistence can be configured to choose which Analytics Record Store is used for
storing incoming events
○ Analytics Tables share a single namespace; the target record store is specified only at
table creation time
○ Useful for creating Analytics Tables whose data is stored in multiple target databases
13. Batch Analytics Engine
● Powered by Apache Spark, with up to 30x higher performance than Hadoop
● Parallel and distributed, with optimized in-memory processing
● Scalable script-based analytics written in an easy-to-learn, SQL-like
query language powered by Spark SQL
● Built-in interactive web interface for ad-hoc query execution
● Scheduled query script execution with HA/failover support
● Run Spark on a single node, in a Spark-embedded Carbon server cluster, or
connect to an external Spark cluster
15. Spark Queries (cont..)
SELECT */<column_names> from <temp_table>;
Eg:
select house_id, household_id, plug_id, max(value) - min(value) as usage,
compositeID(house_id, household_id, plug_id) as composite_id
from debsData where property = false
group by house_id, household_id, plug_id;
Select Queries
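To see what the example query computes: per (house, household, plug) group, usage is max(value) - min(value). The plain-Java sketch below is illustrative; `compositeID` in the query is a DAS function, mimicked here by simple string concatenation.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative equivalent of the Spark query: group rows by a composite
// (house, household, plug) key and compute max(value) - min(value).
public class PlugUsage {
    // each row: {house_id, household_id, plug_id, value}
    public static Map<String, Double> usage(List<double[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> (int) r[0] + ":" + (int) r[1] + ":" + (int) r[2], // compositeID stand-in
                Collectors.collectingAndThen(Collectors.toList(), g -> {
                    DoubleSummaryStatistics s =
                            g.stream().mapToDouble(r -> r[3]).summaryStatistics();
                    return s.getMax() - s.getMin();            // usage per group
                })));
    }
}
```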
16. Spark Queries (cont..)
INSERT INTO/OVERWRITE TABLE <table_name> <SELECT_query>
Eg:
INSERT OVERWRITE TABLE PlugUsage
select house_id, household_id, plug_id, max(value) - min(value) as usage,
compositeID(house_id, household_id, plug_id) as composite_id
from debsData where property = false
group by house_id, household_id, plug_id;
Insert Queries
17. Functions supported by Spark
● Query statements, including SELECT, GROUP BY, ORDER BY, SORT BY, etc.
● All Hive operators, including relational operators, arithmetic operators, logical operators, complex type
constructors, mathematical functions, and string functions.
● User defined functions (UDF)
● User defined aggregation functions (UDAF)
● User defined serialization formats (SerDes)
● Window functions
● Joins
● Sub-queries
● Sampling
● Explain
● Partitioned tables including dynamic partition insertion
● View
● All Hive DDL Functions, such as CREATE TABLE, ALTER TABLE, etc.
18. Create UDF Functions
● Apache Spark allows UDFs (User Defined Functions) to be created if you
want to use a feature that is not available in Spark by default.
● WSO2 DAS has an abstraction layer for generic Spark UDFs,
which makes it convenient to introduce UDFs to the server.
19. Eg:
package org.wso2.customUDFs;

public class StringConcatonator {
    /**
     * This UDF returns the concatenation of two strings.
     */
    public String concat(String firstString, String secondString) {
        return firstString + secondString;
    }
}
• Add the following to DAS_HOME/repository/conf/analytics/spark/spark-udf-config.xml:
<udf-configuration>
<custom-udf-classes>
<class-name>org.wso2.customUDFs.StringConcatonator</class-name>
...
</custom-udf-classes>
</udf-configuration>
20. Publishing events from Spark
• After running analytics with Spark, the result data
can be published to an event stream
CREATE TEMPORARY TABLE <table_name>
USING org.wso2.carbon.analytics.spark.event.EventStreamProvider
OPTIONS (receiverURL "<das_receiver_url>",
authURL "<das_receiver_auth_url>",
username "<user_name>",
password "<password>",
streamName "<stream_name>",
version "<stream_version>",
description "<description>",
nickName "<nick_name>",
payload "<payload>"
);
24. ● The idea is to give the overall picture at a glance
(e.g. a car dashboard)
● Supports personalization: you can build
your own dashboard
● Also the entry point for drill-down
● How to build?
○ Dashboard via Google Gadgets and
content via HTML5 + JavaScript
○ Use WSO2 Dashboard Server to build
a dashboard (or JSP/PHP)
○ Use charting libraries like Vega or D3
Dashboard
25. ● Start with data in tabular format
● Map each column to a dimension in your plot, such as X, Y,
color, or point size
● Also supports drill-downs
● Create a chart with a few clicks
Gadget Generation Wizard
29. Interactive Analytics
● Full text data indexing support powered by Apache Lucene
● Drill down search support
● Distributed data indexing
○ Designed to support scalability
● Near real time data indexing and retrieval
○ Data indexed immediately as received
31. Activity Monitoring
• Correlate the collected messages based on the activity_id in
the metadata of each event
• Trace the transaction path with a Lucene query, even when the
events reside in different tables
38. What’s Real-time Analytics?...
Real-time Analytics in Complex Event Processing
• Gather data from multiple sources
• Correlate data streams over time
• Find interesting occurrences
• And Notify
• All in Real-time !
41. Real-time Execution
• Process in streaming fashion
(one event at a time)
• Execution logic written as Execution Plans
• Execution Plan
– An isolated logical execution unit
– Includes a set of queries, and relates to multiple input and
output event streams
– Executed using dedicated WSO2 Siddhi engine
45. define stream SoftDrinkSales
(region string, brand string, quantity int,
price double);
from SoftDrinkSales
select brand, avg(price*quantity) as avgCost, 'USD' as currency
insert into AvgCostStream;
from AvgCostStream
select brand, toEuro(avgCost) as avgCost, 'EURO' as currency
insert into OutputStream;
Enriching Streams
Using Functions
Siddhi Query ...
46. define stream SoftDrinkSales
(region string, brand string, quantity int,
price double);
from SoftDrinkSales[region == 'USA' and quantity > 99]
select brand, price, quantity
insert into WholeSales ;
from SoftDrinkSales#window.time(1 hour)
select region, brand, avg(quantity) as avgQuantity
group by region, brand
insert into LastHourSales ;
Filtering
Aggregation over 1 hour
Other supported window types:
timeBatch(), length(), lengthBatch(), etc.
Siddhi Query (Filter & Window) ...
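The 1-hour window aggregation above admits a simple mental model: only events whose timestamp lies within the last hour contribute to the average. Siddhi maintains such windows incrementally as events arrive; the illustrative sketch below simply recomputes the result over a list.

```java
import java.util.List;

// Illustrative model of a 1-hour sliding time window average
// (Siddhi keeps this incrementally; here we just recompute it).
public class HourWindowAvg {
    static final long HOUR_MS = 3_600_000L;

    // events: {timestampMillis, quantity}
    public static double avgQuantity(List<long[]> events, long now) {
        return events.stream()
                .filter(e -> now - e[0] < HOUR_MS) // keep events inside the window
                .mapToLong(e -> e[1])
                .average()
                .orElse(0.0);                      // empty window -> 0
    }
}
```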
47. define stream Purchase (price double, cardNo long,place string);
from every (a1 = Purchase[price < 10]) ->
a2 = Purchase[price > 10000 and a1.cardNo == a2.cardNo]
within 1 day
select a1.cardNo as cardNo, a2.price as price, a2.place as place
insert into PotentialFraud ;
Siddhi Query (Pattern) ...
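In plain terms, the pattern flags a purchase under 10 that is followed, within a day, by a purchase over 10000 on the same card (a classic card-testing fraud signal). The sketch below is illustrative only: Siddhi evaluates patterns incrementally as events arrive, whereas this version re-scans an ordered list.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative scan for the fraud pattern: small purchase, then a very
// large purchase on the same card within one day.
public class FraudPattern {
    static final long DAY_MS = 86_400_000L;

    // purchases ordered by time; each row: {timestampMillis, price, cardNo}
    public static List<Long> potentialFraudCards(List<double[]> purchases) {
        List<Long> flagged = new ArrayList<>();
        for (int i = 0; i < purchases.size(); i++) {
            double[] a1 = purchases.get(i);
            if (a1[1] >= 10) continue;                 // need a small purchase first
            for (int j = i + 1; j < purchases.size(); j++) {
                double[] a2 = purchases.get(j);
                boolean sameCard = a1[2] == a2[2];
                boolean withinDay = a2[0] - a1[0] <= DAY_MS;
                if (sameCard && withinDay && a2[1] > 10000) {
                    flagged.add((long) a2[2]);         // emit PotentialFraud
                    break;
                }
            }
        }
        return flagged;
    }
}
```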
48. define stream StockStream (symbol string, price double, volume int);
partition by (symbol of StockStream)
begin
from t1=StockStream,
t2=StockStream [(t2[last] is null and t1.price < price) or
(t2[last].price < price)]+
within 5 min
select t1.price as initialPrice, t2[last].price as finalPrice, t1.symbol
insert into IncreasingMyStockPriceStream
end;
Siddhi Query (Trends & Partition)...
49. define table CardUserTable (name string, cardNum long);
@from(eventtable = 'rdbms', datasource.name = 'CardDataSource',
table.name = 'UserTable', caching.algorithm = 'LRU')
define table CardUserTable (name string, cardNum long)
Cache types supported
• Basic: A size-based algorithm based on FIFO.
• LRU (Least Recently Used): The least recently used event is dropped
when cache is full.
• LFU (Least Frequently Used): The least frequently used event is dropped
when cache is full.
Siddhi Query (Table) ...
Supported for RDBMS, In-Memory,
Analytics Table, Hazelcast
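The LRU policy described above can be sketched with the standard-library LinkedHashMap, whose access-order mode lets the least recently used entry be evicted once capacity is exceeded. This illustrates the policy only; it is not DAS's actual cache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an LRU cache: LinkedHashMap in access order plus an
// eviction rule that drops the eldest (least recently used) entry.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration order tracks use
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the LRU entry when the cache is full
    }
}
```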
50. define stream Purchase (price double, cardNo long, place string);
define stream CardUserStream (name string, cardNo long) ;
define table CardUserTable (name string, cardNum long) ;
from Purchase#window.length(1) join CardUserTable
on Purchase.cardNo == CardUserTable.cardNum
select Purchase.cardNo as cardNo, CardUserTable.name as name, Purchase.price as price
insert into PurchaseUserStream ;
from CardUserStream
select name, cardNo as cardNum
update CardUserTable
on CardUserTable.name == name ;
Similarly insert into and
delete are also supported!
Siddhi Query (Table) ...
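Conceptually, the stream/table join above is a per-event lookup: each Purchase event is matched against the card table on its card number and enriched with the holder's name. An illustrative sketch, with the table modeled as a plain Map (names are hypothetical):

```java
import java.util.Map;

// Illustrative model of the stream/table join: enrich a Purchase event
// with the card holder's name looked up by card number.
public class PurchaseEnricher {
    public static String enrich(long cardNo, double price, Map<Long, String> cardTable) {
        String name = cardTable.getOrDefault(cardNo, "unknown"); // join on cardNum
        return name + "," + cardNo + "," + price;                // PurchaseUserStream row
    }
}
```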
51. • Function extension
• Aggregator extension
• Window extension
• Stream Processor extension
define stream SalesStream (brand string, price double, currency string);
from SalesStream
select brand, custom:toUSD(price, currency) as priceInUSD
insert into OutputStream ;
Referred with namespaces
Siddhi Query (Extension) ...
52. • geo: Geographical processing
• nlp: Natural Language Processing (with Stanford NLP)
• ml: Running machine learning models of WSO2 Machine
Learner
• pmml: Running PMML models learned with R
• timeseries: Regression and time series
• math: Mathematical operations
• str: String operations
• regex: Regular expression
• ...
Siddhi Extensions
54. WSO2 CEP (Real-time) Scalability
Distributed Real-time = Siddhi + Apache Storm
Advantages over Apache Storm
• No need to write Java code (supports an SQL-like query language)
• Can be used with any programming language
• Can handle over a million tuples processed per second per
node
• Scalable and fault-tolerant; guarantees your data will be processed
• etc ...
62. Realtime Dashboard
• Dashboard
– Google Gadget
– HTML5 + JavaScript
• Support gadget
generation
– Using D3 and Vega
• Gather data for UI from
– WebSockets
– Polling
• Support Custom Gadgets
and Dashboards
63. Beyond Boundaries
• Expose analytics results
as API
– Mobile Apps, Third Party
• Provides
– Security, Billing,
– Throttling, Quotas & SLA
• How ?
– Write data to database from DAS
– Build Services via WSO2 Data Services Server
– Expose them as APIs via WSO2 API Manager
67. What is Predictive Analytics?...
Predictive Analytics in WSO2 Machine Learner
• Upload, pre-process, and explore data
• Create models, tune algorithms and make
predictions
• Integrate for better intelligence
69. WSO2 Machine Learner
• Guided UI to build machine
learning models
– Via Spark MLlib
– Via H2O.ai
• Run models using CEP, DAS
and ESB
• Run R scripts, regression, and anomaly detection in real time
71. ML Models
ML_Algo(Data) => Model
• The outcome of an ML algorithm is a model
– E.g. a classification algorithm generates a model that you can use to classify
data.
• The ML Wizard helps you create models
• These models can be published to the registry or downloaded
• They can then be applied in CEP, DAS, ESB, etc. for prediction