WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-time and Batch


Published in: Technology

  • 1. Simultaneous Analysis of Massive Data Streams in Real-Time and Batch. Anjana Fernando, Technical Lead, WSO2
  • 2. Agenda • How massive data streams are created • How to receive them • How to store them • How to analyze them, batch vs real-time • WSO2 Big Data solution • Demo
  • 3. Massive Data Streams -> Data Streams with Big Data
  • 4. What is Big Data? ❏ The 3 Vs ❏ Velocity ❏ Volume ❏ Variety
  • 5. Where does it originate from? • Machine logs • Social media • Archives • Traffic information • Weather data • Sensor data (IoT)
  • 6. What do I do with it? Create intelligence.. • Should I take an umbrella to work today? • What is the best route to go back home? • What are the current market trends? • Are my servers running healthily?
  • 7. Protocols used to publish data.. • HTTP • MQTT • Zigbee • Thrift • Avro • ProtoBuf
  • 8. How to store the data? • Relational databases • Block data stores -> HDFS • Column oriented -> HBase -> Cassandra • Document based -> MongoDB -> CouchDB • In-Memory -> VoltDB [Figure labels: A, C, P; presumably the CAP theorem's Availability, Consistency and Partition tolerance]
  • 9. How to analyse data? • Two options: -> Batch processing: Schedule data processing jobs and receive the processed data later -> Real-time processing: The queries are executed and the results are retrieved instantly
  • 10. Analysing data.. • Batch processing -> Apache Hadoop: Map/Reduce processing system and a distributed file system
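The Map/Reduce model mentioned on this slide can be sketched in a few lines of plain Python. This is only an in-process illustration of the map, shuffle and reduce steps, not Hadoop's API; the record fields (`user`, `quantity`) and all function names are assumptions chosen to match the phone sales example used elsewhere in the deck.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    yield record["user"], record["quantity"]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three steps in sequence over an in-memory list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce functions run in parallel on different nodes and the shuffle moves data over the network; here everything runs sequentially to keep the data flow visible.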
  • 11. Analysing data.. • Batch processing - Data Warehouse -> Apache Hive - Hadoop based framework for working on large scale data stores with SQL-like queries
      INSERT OVERWRITE TABLE UserTable
      SELECT userName, COUNT(DISTINCT orderID), SUM(quantity)
      FROM PhoneSalesTable
      WHERE version = "1.0.0"
      GROUP BY userName;
  • 12. Analysing data.. • Batch processing - In-Memory Computing -> Apache Spark - Functional programming model, in-memory computing, claimed to be 10x - 100x faster than Hadoop
  • 13. Analysing data.. • Real-time processing - Stream Processing -> Apache Storm - Distributed and fault-tolerant; topologies are built from Spouts (stream sources) and Bolts (processing steps)
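The spout and bolt roles can be illustrated with Python generators. This is a conceptual sketch only and does not use the Storm API: `sales_spout`, `filter_bolt`, `count_bolt` and the event fields are hypothetical names, and nothing here is distributed or fault-tolerant.

```python
def sales_spout(raw_events):
    """Spout: the source of the stream; emits one tuple per incoming event."""
    for event in raw_events:
        yield (event["brand"], event["quantity"])


def filter_bolt(tuples, min_quantity):
    """Bolt: drops tuples whose quantity is below the given minimum."""
    for brand, quantity in tuples:
        if quantity >= min_quantity:
            yield (brand, quantity)


def count_bolt(tuples):
    """Bolt: keeps a running total per brand and emits it after every tuple."""
    totals = {}
    for brand, quantity in tuples:
        totals[brand] = totals.get(brand, 0) + quantity
        yield (brand, totals[brand])


def run_topology(raw_events):
    """Wire spout -> filter bolt -> count bolt and collect everything emitted."""
    return list(count_bolt(filter_bolt(sales_spout(raw_events), min_quantity=1)))
```

Each stage consumes tuples as they arrive and emits results immediately, which is the main contrast with the scheduled batch jobs on the earlier slides.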
  • 14. Analysing data.. • Real-time processing - Complex Event Processing -> WSO2 Siddhi:
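A typical complex event processing rule watches a sliding window of recent events and fires when a condition holds. The sketch below shows that idea in plain Python; it is not Siddhi query syntax, and the class name, window size and threshold are assumptions made for illustration.

```python
from collections import deque


class LengthWindowSum:
    """Fires when the sum over the last `size` events exceeds `threshold`."""

    def __init__(self, size, threshold):
        self.window = deque(maxlen=size)  # oldest event is evicted automatically
        self.threshold = threshold

    def on_event(self, quantity):
        """Process one event; return the window sum if the rule fires, else None."""
        self.window.append(quantity)
        total = sum(self.window)
        return total if total > self.threshold else None


def detect(quantities, size=3, threshold=10):
    """Feed a sequence of quantities through the rule, one event at a time."""
    rule = LengthWindowSum(size, threshold)
    return [rule.on_event(quantity) for quantity in quantities]
```

The state is held entirely in memory and updated per event, so a result is available as soon as the triggering event arrives.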
  • 15. Big Data Architecture with WSO2.. • Data Streams
      {
        'name':'',
        'version':'1.0.0',
        'nickName': 'Phone_Retail_Shop',
        'description': 'Phone Sales',
        'metaData':[
          {'name':'clientType','type':'STRING'}
        ],
        'payloadData':[
          {'name':'brand','type':'STRING'},
          {'name':'quantity','type':'INT'},
          {'name':'total','type':'INT'},
          {'name':'user','type':'STRING'}
        ]
      }
    The common stream format is used in both CEP and BAM. The stream definition contains the stream name, version and other attributes that make up the stream.
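One way to see what a stream definition buys you is to check incoming events against it. The Python sketch below copies the attribute lists from the slide and validates a dictionary of values; the `validate` helper and the mapping of `STRING`/`INT` to Python types are assumptions for illustration, not part of any WSO2 API.

```python
# Attribute lists copied from the stream definition on the slide.
STREAM_DEFINITION = {
    "name": "",  # left empty, exactly as on the slide
    "version": "1.0.0",
    "metaData": [{"name": "clientType", "type": "STRING"}],
    "payloadData": [
        {"name": "brand", "type": "STRING"},
        {"name": "quantity", "type": "INT"},
        {"name": "total", "type": "INT"},
        {"name": "user", "type": "STRING"},
    ],
}

# Assumed mapping from the definition's type names to Python types.
PYTHON_TYPES = {"STRING": str, "INT": int}


def validate(section, values, definition=STREAM_DEFINITION):
    """Return a list of problems found in `values` for the given section."""
    errors = []
    for attribute in definition[section]:
        name = attribute["name"]
        if name not in values:
            errors.append(f"{name}: missing")
        elif type(values[name]) is not PYTHON_TYPES[attribute["type"]]:
            errors.append(f"{name}: expected {attribute['type']}")
    return errors
```

For example, `validate("payloadData", {"brand": "X", "quantity": 2, "total": 200, "user": "bob"})` returns an empty list, while a string in place of `quantity` is reported as a type error.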
  • 16. Big Data Architecture with WSO2.. • WSO2 BAM -> Data Receiver - High performance binary format data publishing with Apache Thrift, shared with WSO2 CEP -> Data Storage - Cassandra as a highly scalable data store -> Data Analyzer - Hive based batch processing
  • 17. Big Data Architecture with WSO2.. • WSO2 BAM.. -> Activity Monitoring: Implemented using a custom indexing mechanism to instantly search for events of a specific activity in the system
  • 18. Big Data Architecture with WSO2.. • WSO2 BAM.. -> Incremental Data Processing - Customized Hive to support incremental data processing:
      @Incremental(name="salesAnalysis", tables="PhoneSalesTable")
      SELECT brandname, COUNT(DISTINCT orderid), SUM(quantity)
      FROM phonesalestable
      WHERE version = "1.0.0"
      GROUP BY brandname;
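The general idea behind incremental processing is to remember how far the previous run got and to fold only the new rows into the stored result. The Python sketch below shows that idea for the `SUM(quantity)` part of the query; it is an assumption about the concept, not a description of how BAM's customized Hive implements it, and `IncrementalSum` is a hypothetical name. `COUNT(DISTINCT ...)` is left out because it cannot be merged by simple addition.

```python
class IncrementalSum:
    """Keeps per-brand totals and a marker for the first unprocessed row."""

    def __init__(self):
        self.next_row = 0  # index of the first row not yet processed
        self.totals = {}   # brandname -> running SUM(quantity)

    def run(self, table):
        """Process only rows added since the last run; return how many were read."""
        new_rows = table[self.next_row:]
        for row in new_rows:
            brand = row["brandname"]
            self.totals[brand] = self.totals.get(brand, 0) + row["quantity"]
        self.next_row = len(table)
        return len(new_rows)
```

Each scheduled run then costs time proportional to the new data rather than to the whole table.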
  • 19. Big Data Architecture with WSO2.. • WSO2 CEP -> Same data receiver as BAM: this is the point where the same event is sent to both servers, to BAM for batch processing and to CEP for real-time processing of the same data streams -> Real-time in-memory processing, based on the WSO2 Siddhi engine, with data adapters for receiving and sending events in different data formats and over different transports, e.g. XML, JSON, Text, HTTP, JMS, SMTP
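The shared receiver described on this slide amounts to a fan-out: every event is handed both to a store for later batch analysis and to a real-time rule. A minimal Python sketch of that pattern follows; `Receiver`, `build_pipeline` and the alert threshold are hypothetical and stand in for the BAM and CEP servers only conceptually.

```python
class Receiver:
    """Delivers every published event to all subscribed handlers."""

    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        for handler in self.handlers:
            handler(event)


def build_pipeline(alert_quantity=10):
    """Wire one receiver to a batch store and a real-time check (both assumed)."""
    batch_store = []  # stands in for the batch side: keep everything for later
    alerts = []       # stands in for the real-time side: react immediately

    def store_for_batch(event):
        batch_store.append(event)

    def realtime_check(event):
        if event["quantity"] >= alert_quantity:
            alerts.append(event["brand"])

    receiver = Receiver()
    receiver.subscribe(store_for_batch)
    receiver.subscribe(realtime_check)
    return receiver, batch_store, alerts
```

Publishing once and subscribing twice means the producer does not need to know which kinds of analysis run downstream.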
  • 20. Demo
  • 21. Questions?
  • 22. Thank you!