Simultaneous Analysis of Massive Data
Streams in Real-Time and Batch
Anjana Fernando
Technical Lead
WSO2
Agenda
• How massive data streams created
• How to receive
• How to store
• How to analyze, batch vs real-time
• WSO2 Big Data solution
• Demo
Massive Data Streams -> Data Streams with Big Data
What is Big Data?
❏ The 3 Vs
❏ Velocity
❏ Volume
❏ Variety
Where does it originate from?
• Machine logs
• Social media
• Archives
• Traffic information
• Weather data
• Sensor data (IoT)
What do I do with it?
Create intelligence..
• Should I take an umbrella to work today?
• What is the best route to go back home?
• What are the current market trends?
• Are my servers running healthily?
Protocols used to publish data..
• HTTP
• MQTT
• Zigbee
• Thrift
• Avro
• ProtoBuf
How to store the data?
• Relational databases
• Block data stores
-> HDFS
• Column oriented
-> HBase
-> Cassandra
• Document based
-> MongoDB
-> CouchDB
• In-Memory
-> VoltDB
A
C P
How to analyse data?
• Two options:
-> Batch processing: Schedule data processing jobs
and receive the processed data later
-> Real-time processing: The queries are executed
and the results are retrieved instantly
Analysing data..
• Batch processing
-> Apache Hadoop: Map/Reduce processing system
and a distributed file system
Analysing data..
• Batch processing - Data Warehouse
-> Apache Hive - Hadoop based framework for working
on large scale data stores with SQL-like queries
INSERT OVERWRITE TABLE UserTable SELECT userName, COUNT(DISTINCT
orderID),SUM(quantity) FROM PhoneSalesTable WHERE version= "1.0.0"
GROUP BY userName;
Analysing data..
• Batch processing - In-Memory Computing
-> Apache Spark - Functional programming model,
in-memory computing, claims 10x - 100x faster than
Hadoop
Analysing data..
• Real-time processing - Stream Processing
-> Apache Storm - Distributed and fault-tolerant
Spouts Bolts
Analysing data..
• Real-time processing - Complex Event Processing
-> WSO2 Siddhi:
Big Data Architecture with WSO2..
• Data Streams
{
'name':'phone.retail.shop',
'version':'1.0.0',
'nickName': 'Phone_Retail_Shop',
'description': 'Phone Sales',
'metaData':[
{'name':'clientType','type':'STRING'}
],
'payloadData':[
{'name':'brand','type':'STRING'},
{'name':'quantity','type':'INT'},
{'name':'total','type':'INT'},
{'name':'user','type':'STRING'}
]
}
The common stream format used in both CEP and BAM; The stream
definition contains the stream name, version and other attributes that
makes up the stream.
Big Data Architecture with WSO2..
• WSO2 BAM
-> Data Receiver - High performance binary format data
publishing with Apache Thrift, shared with WSO2 CEP
-> Data Storage - Cassandra for highly scalable data store
-> Data Analyzer - Hive based batch processing
Big Data Architecture with WSO2..
• WSO2 BAM..
-> Activity Monitoring: Implemented using a custom indexing
mechanism to instantly search for events of a specific activity in
the system
Big Data Architecture with WSO2..
• WSO2 BAM..
-> Incremental Data Processing - Customized Hive to support
incremental data processing:
@Incremental (name="salesAnalysis" , tables="PhoneSalesTable")
SELECT brandname,
Count(DISTINCT orderid),
Sum(quantity)
FROM phonesalestable
WHERE version = "1.0.0"
GROUP BY brandname;
Big Data Architecture with WSO2..
• WSO2 CEP
-> Same data receiver as BAM, where this is the point where the
same event is sent to both servers, where BAM for batch
processing and CEP for real-time processing of the same data
streams
-> Real-time in-memory processing, based on WSO2 Siddhi
engine, with data adapters for receiving and sending event with
different data types and transports, e.g. XML, JSON, Text, HTTP,
JMS, SMTP
Demo
Questions?
Thank you!

Simultaneous analysis of massive data streams in real time and batch

  • 1.
    Simultaneous Analysis ofMassive Data Streams in Real-Time and Batch Anjana Fernando Technical Lead WSO2
  • 2.
    Agenda • How massivedata streams created • How to receive • How to store • How to analyze, batch vs real-time • WSO2 Big Data solution • Demo
  • 3.
    Massive Data Streams-> Data Streams with Big Data
  • 4.
    What is BigData? ❏ The 3 Vs ❏ Velocity ❏ Volume ❏ Variety
  • 5.
    Where does itoriginate from? • Machine logs • Social media • Archives • Traffic information • Weather data • Sensor data (IoT)
  • 6.
    What do Ido with it? Create intelligence.. • Should I take an umbrella to work today? • What is the best route to go back home? • What are the current market trends? • Are my servers running healthily?
  • 7.
    Protocols used topublish data.. • HTTP • MQTT • Zigbee • Thrift • Avro • ProtoBuf
  • 8.
    How to storethe data? • Relational databases • Block data stores -> HDFS • Column oriented -> HBase -> Cassandra • Document based -> MongoDB -> CouchDB • In-Memory -> VoltDB A C P
  • 9.
    How to analysedata? • Two options: -> Batch processing: Schedule data processing jobs and receive the processed data later -> Real-time processing: The queries are executed and the results are retrieved instantly
  • 10.
    Analysing data.. • Batchprocessing -> Apache Hadoop: Map/Reduce processing system and a distributed file system
  • 11.
    Analysing data.. • Batchprocessing - Data Warehouse -> Apache Hive - Hadoop based framework for working on large scale data stores with SQL-like queries INSERT OVERWRITE TABLE UserTable SELECT userName, COUNT(DISTINCT orderID),SUM(quantity) FROM PhoneSalesTable WHERE version= "1.0.0" GROUP BY userName;
  • 12.
    Analysing data.. • Batchprocessing - In-Memory Computing -> Apache Spark - Functional programming model, in-memory computing, claims 10x - 100x faster than Hadoop
  • 13.
    Analysing data.. • Real-timeprocessing - Stream Processing -> Apache Storm - Distributed and fault-tolerant Spouts Bolts
  • 14.
    Analysing data.. • Real-timeprocessing - Complex Event Processing -> WSO2 Siddhi:
  • 16.
    Big Data Architecturewith WSO2.. • Data Streams { 'name':'phone.retail.shop', 'version':'1.0.0', 'nickName': 'Phone_Retail_Shop', 'description': 'Phone Sales', 'metaData':[ {'name':'clientType','type':'STRING'} ], 'payloadData':[ {'name':'brand','type':'STRING'}, {'name':'quantity','type':'INT'}, {'name':'total','type':'INT'}, {'name':'user','type':'STRING'} ] } The common stream format used in both CEP and BAM; The stream definition contains the stream name, version and other attributes that makes up the stream.
  • 17.
    Big Data Architecturewith WSO2.. • WSO2 BAM -> Data Receiver - High performance binary format data publishing with Apache Thrift, shared with WSO2 CEP -> Data Storage - Cassandra for highly scalable data store -> Data Analyzer - Hive based batch processing
  • 18.
    Big Data Architecturewith WSO2.. • WSO2 BAM.. -> Activity Monitoring: Implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system
  • 19.
    Big Data Architecturewith WSO2.. • WSO2 BAM.. -> Incremental Data Processing - Customized Hive to support incremental data processing: @Incremental (name="salesAnalysis" , tables="PhoneSalesTable") SELECT brandname, Count(DISTINCT orderid), Sum(quantity) FROM phonesalestable WHERE version = "1.0.0" GROUP BY brandname;
  • 20.
    Big Data Architecturewith WSO2.. • WSO2 CEP -> Same data receiver as BAM, where this is the point where the same event is sent to both servers, where BAM for batch processing and CEP for real-time processing of the same data streams -> Real-time in-memory processing, based on WSO2 Siddhi engine, with data adapters for receiving and sending event with different data types and transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP
  • 21.
  • 22.
  • 23.