This document provides an overview of batch and interactive analytics. It defines batch analytics as processing stored data through time-consuming tasks, while interactive analytics allows ad-hoc querying of stored data for quick results. The document then outlines technologies used for batch and interactive analytics like Spark, Elasticsearch and Solr. It provides details on the WSO2 analytics architecture and how it supports both batch and interactive processing, alerts, and mixing of real-time and batch data. Example solutions like service monitoring, activity monitoring and log analysis are also presented.
2. Agenda
๏ Batch and Interactive Processing Defined
๏ Technologies used for Batch/Interactive Analytics
๏ WSO2 Analytics Architecture
๏ Solutions
๏ Demo
3. Let’s Break It Down...
๏ Batch Analytics:
Batch analytics is where the data is first stored, and later read back to perform a relatively time-consuming data processing task.
๏ Interactive Analytics:
Interactive analytics is where a stored data set can be queried in an ad-hoc manner to find useful information quickly.
Source: http://themarketingblog.ecornell.com/
4. Where Can We Use It?
๏ Service Statistics Generation
  ๏ Extracting KPIs: average response time, maximum latency, etc.
๏ Log Analysis
  ๏ Efficiently store and analyse logs to support comprehensive search operations
๏ Activity Monitoring
  ๏ Trace a workflow of events throughout a system. Useful for finding failed transactions, performance issues, etc.
๏ Solving Optimization Problems
  ๏ Analysing large amounts of past data to optimize parameters for an existing algorithm
Source: http://www.axentas.com/
10. Data Model
Data is published according to a strongly typed data stream definition:
{
'name': 'stream.name',
'version': '1.0.0',
'nickName': 'stream nickname',
'description': 'description of the stream',
'metaData':[
{'name':'meta_data_1','type':'STRING'}
],
'correlationData':[
{'name':'correlation_data_1','type':'STRING'}
],
'payloadData':[
{'name':'payload_data_1','type':'BOOL'},
{'name':'payload_data_2','type':'LONG'}
]
}
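To make the idea of a strongly typed stream concrete, here is a minimal Python sketch that checks an event against a definition shaped like the one above. It is not the DAS publisher API: the `validate_event` and `validate_section` helpers, the event layout (one list of values per section), and the mapping from stream types to Python types are all assumptions made for illustration.

```python
# Stream definition mirroring the slide (nickName/description omitted for brevity).
STREAM_DEFINITION = {
    "name": "stream.name",
    "version": "1.0.0",
    "metaData": [{"name": "meta_data_1", "type": "STRING"}],
    "correlationData": [{"name": "correlation_data_1", "type": "STRING"}],
    "payloadData": [
        {"name": "payload_data_1", "type": "BOOL"},
        {"name": "payload_data_2", "type": "LONG"},
    ],
}

# Assumed mapping from stream attribute types to Python types.
PYTHON_TYPES = {"STRING": str, "BOOL": bool, "LONG": int}


def validate_section(definition, section, values):
    """Return a list of error strings for one section of an event."""
    attributes = definition.get(section, [])
    if len(values) != len(attributes):
        return [f"{section}: expected {len(attributes)} values, got {len(values)}"]
    errors = []
    for attribute, value in zip(attributes, values):
        expected = PYTHON_TYPES[attribute["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for LONG.
        is_bool_as_long = attribute["type"] == "LONG" and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_long:
            errors.append(f"{section}.{attribute['name']}: expected {attribute['type']}")
    return errors


def validate_event(definition, event):
    """Hypothetical helper: collect type errors across all three sections."""
    errors = []
    for section in ("metaData", "correlationData", "payloadData"):
        errors.extend(validate_section(definition, section, event.get(section, [])))
    return errors
```

An event whose values match the declared order and types yields an empty error list; anything else is reported per attribute, which is the kind of check a typed stream makes possible before data is stored.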
11. WSO2 DAS - Batch Processing
๏ Powered by Apache Spark, which advertises 10 - 100x higher performance than Hadoop MapReduce
๏ Parallel and distributed, with optimized in-memory processing
๏ Can run on top of Hadoop YARN, Mesos or in standalone mode
๏ Scalable script-based analytics written in an easy-to-learn, SQL-like query language powered by Spark SQL
๏ Interactive built-in web interface (Spark Console) for ad-hoc query execution
๏ Scheduled query script execution with HA/FO (high availability/failover) support
๏ Run Spark on a single node, on a Spark-embedded Carbon server cluster, or connect to an external Spark cluster
๏ Custom UDF support
e.g.:
INSERT INTO TABLE UserTable SELECT userName, COUNT(DISTINCT orderID), SUM(quantity) FROM PhoneSalesTable
WHERE version = "1.0.0" GROUP BY userName;
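To show what this kind of aggregation computes, the sketch below runs an equivalent query on a few sample rows using Python's built-in `sqlite3` module in place of Spark SQL. The table schemas, the `orderCount` and `totalQuantity` column names, and the sample data are assumptions; SQLite also uses `INSERT INTO` rather than `INSERT INTO TABLE` and single-quoted string literals.

```python
import sqlite3


def summarize_sales(rows):
    """Aggregate (userName, orderID, quantity, version) rows per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PhoneSalesTable "
        "(userName TEXT, orderID TEXT, quantity INTEGER, version TEXT)"
    )
    # Assumed target schema; the slide does not show UserTable's columns.
    conn.execute(
        "CREATE TABLE UserTable "
        "(userName TEXT, orderCount INTEGER, totalQuantity INTEGER)"
    )
    conn.executemany("INSERT INTO PhoneSalesTable VALUES (?, ?, ?, ?)", rows)
    conn.execute(
        """
        INSERT INTO UserTable
        SELECT userName, COUNT(DISTINCT orderID), SUM(quantity)
        FROM PhoneSalesTable
        WHERE version = '1.0.0'
        GROUP BY userName
        """
    )
    result = conn.execute(
        "SELECT userName, orderCount, totalQuantity FROM UserTable ORDER BY userName"
    ).fetchall()
    conn.close()
    return result


SAMPLE_ROWS = [
    ("alice", "o1", 2, "1.0.0"),
    ("alice", "o1", 1, "1.0.0"),  # same order, second line item
    ("alice", "o2", 5, "1.0.0"),
    ("bob", "o3", 4, "1.0.0"),
    ("bob", "o4", 7, "2.0.0"),    # filtered out by the version predicate
]
```

Calling `summarize_sales(SAMPLE_ROWS)` returns one row per user: the number of distinct orders and the total quantity, counting only events published with stream version 1.0.0.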
12. Spark vs Hadoop MapReduce
๏ Hadoop MapReduce
  ๏ Supports only Map/Reduce, fine for single-pass computations
  ๏ High processing latency and inefficiencies caused by persisting intermediate results
  ๏ Hard to implement iterative algorithms
๏ Spark
  ๏ Resilient Distributed Dataset (RDD) based
  ๏ Supports more than just Map and Reduce functions
  ๏ Intermediate results kept in memory
  ๏ Lazy evaluation of data operations, allowing more optimization
  ๏ Allows developers to implement complex data operations in a DAG pattern
  ๏ In-memory/persisted mode operation, switching when required
  ๏ Simpler API
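The lazy-evaluation point can be illustrated with a toy Python class. This is not Spark's API: `LazyDataset` is a hypothetical stand-in that only records `map` and `filter` steps and runs them in a single pass when an action (`collect`) is called, which is the property that lets an engine plan the whole chain before touching any data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, executes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, executes nothing.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        """Action: stream each element through all recorded steps in one pass."""
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def traced_square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


pipeline = LazyDataset(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)      # still empty: nothing has executed yet
squares_of_odds = pipeline.collect()   # the recorded steps run here
```

No intermediate list is materialized between the `map` and the `filter`; each element flows through both steps before the next one is read.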
13. WSO2 DAS - Interactive Analytics Features
๏ Full text data indexing support powered by Apache Lucene
๏ Drill-down search support
๏ Distributed data indexing
๏ Designed to support scalability
๏ Near real-time data indexing and retrieval
๏ Data indexed immediately as received
๏ Distributed indexing implementation for scalability
๏ Index sharding with Lucene indices
๏ Data storage scalability achieved with the underlying database, e.g. HBase, Cassandra, RDBMS, etc.
e.g.:
log: "ERROR" AND (ip: "192.168.4.33" OR ip: "192.168.4.34") AND type: "HTTPD"
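The boolean structure of such a query maps naturally onto set operations over an inverted index. The sketch below is a toy, exact-match index in plain Python, not Lucene: the `TinyIndex` class and the sample log records are assumptions, and a real full-text index would tokenize the log message rather than match whole field values.

```python
from collections import defaultdict


class TinyIndex:
    """Toy exact-match inverted index: (field, value) -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._documents = {}

    def add(self, doc_id, document):
        self._documents[doc_id] = document
        for field, value in document.items():
            self._postings[(field, value)].add(doc_id)

    def term(self, field, value):
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get((field, value), set()))

    def get(self, doc_id):
        return self._documents[doc_id]


index = TinyIndex()
index.add(1, {"log": "ERROR", "ip": "192.168.4.33", "type": "HTTPD"})
index.add(2, {"log": "ERROR", "ip": "192.168.4.35", "type": "HTTPD"})
index.add(3, {"log": "INFO", "ip": "192.168.4.34", "type": "HTTPD"})
index.add(4, {"log": "ERROR", "ip": "192.168.4.34", "type": "HTTPD"})
index.add(5, {"log": "ERROR", "ip": "192.168.4.34", "type": "SSHD"})

# AND maps to set intersection, OR to set union.
matches = (
    index.term("log", "ERROR")
    & (index.term("ip", "192.168.4.33") | index.term("ip", "192.168.4.34"))
    & index.term("type", "HTTPD")
)
```

Because each term lookup is a direct dictionary access, the cost of the query depends on the size of the posting sets rather than on the total number of stored records, which is what makes ad-hoc search over indexed data quick.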
15. WSO2 DAS - Alerts
๏ Conditions can be detected via CEP queries
๏ Key is the “Last Mile”
  ๏ Email
  ๏ SMS
  ๏ Push notifications to a UI
  ๏ Pager
  ๏ Trigger a physical alarm
๏ How?
  ๏ Batch Analytics: Use WSO2’s custom Analytics Provider for Spark SQL to send records directly as events to an event stream
  ๏ Select the Email sender “Output Adaptor” from DAS, or send from DAS to ESB -> ESB Connectors
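The two halves of alerting, detecting a condition and delivering it over the "last mile", can be sketched in a few lines of Python. Nothing here is DAS or CEP API: the record layout, the 500 ms threshold, and the `find_breaches` and `dispatch` helpers are hypothetical, and the channels are in-memory stand-ins for real email or SMS senders.

```python
def find_breaches(records, threshold_ms):
    """Return alert messages for records whose response time exceeds the threshold."""
    return [
        f"{record['service']}: {record['response_ms']} ms exceeds {threshold_ms} ms"
        for record in records
        if record["response_ms"] > threshold_ms
    ]


def dispatch(alerts, channels):
    """Send every alert through every channel (the "last mile")."""
    for alert in alerts:
        for send in channels:
            send(alert)


# In-memory stand-ins for real delivery channels.
email_outbox, sms_outbox = [], []
channels = [
    lambda message: email_outbox.append("EMAIL " + message),
    lambda message: sms_outbox.append("SMS " + message),
]

records = [
    {"service": "orders", "response_ms": 120},
    {"service": "payments", "response_ms": 950},
]
dispatch(find_breaches(records, threshold_ms=500), channels)
```

Keeping detection separate from delivery means a new channel (a pager, a UI push) can be added without touching the condition logic.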
24. Solutions Supported with Batch/Interactive Analytics:
Activity Monitoring
● Activity monitoring tracks events from multiple nodes in a flow to understand a specific activity
○ e.g.:
■ A client initiates a web services request that travels through multiple ESBs and application servers and then returns. This flow is uniquely identified and visualized in DAS
○ Used for tracing messages and finding performance hotspots in the flow
○ Implemented using a correlation-ID-based mechanism and indexing
○ Upcoming: Mediator-level tracing and profiling in WSO2 ESB 5.0
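A correlation-ID-based trace can be sketched as grouping events by their shared ID and ordering each group by time. The Python below is illustrative only: the event fields (`correlation_id`, `node`, `timestamp_ms`), the sample data, and the `build_flows` and `slowest_hop` helpers are assumptions, not the DAS implementation.

```python
from collections import defaultdict


def build_flows(events):
    """Group events by correlation id and order each flow by timestamp."""
    flows = defaultdict(list)
    for event in events:
        flows[event["correlation_id"]].append(event)
    return {
        cid: sorted(flow, key=lambda e: e["timestamp_ms"])
        for cid, flow in flows.items()
    }


def slowest_hop(flow):
    """Return (from_node, to_node, elapsed_ms) for the largest gap between consecutive events."""
    hops = [
        (a["node"], b["node"], b["timestamp_ms"] - a["timestamp_ms"])
        for a, b in zip(flow, flow[1:])
    ]
    return max(hops, key=lambda hop: hop[2]) if hops else None


# Events from two interleaved requests, as they might arrive from several nodes.
EVENTS = [
    {"correlation_id": "req-1", "node": "esb-1", "timestamp_ms": 1000},
    {"correlation_id": "req-2", "node": "esb-1", "timestamp_ms": 1005},
    {"correlation_id": "req-1", "node": "app-server", "timestamp_ms": 1010},
    {"correlation_id": "req-1", "node": "esb-2", "timestamp_ms": 1250},
    {"correlation_id": "req-2", "node": "app-server", "timestamp_ms": 1020},
]
```

With the flows rebuilt, `slowest_hop(build_flows(EVENTS)["req-1"])` points at the hop where the request spent the most time, which is the kind of hotspot the slide describes.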
30. Solutions Supported with Batch/Interactive Analytics:
Log Analysis
● Log analysis toolbox
● Log event indexing
○ Uses the new DAS v3.x indexing support
○ Event attributes can be indexed so that logs can be searched by server, cluster and log type, and the log message itself can be indexed for full-text search
● Custom search queries using Lucene queries and regular expressions
● Logstash adaptor for log publishing