BIG DATA ANALYTICS FOR
REAL TIME SYSTEMS
Kamalika Dutta Manasi Jayapal
Overview
• Introduction
• Big Data Analytics
• Real Time Systems
• Challenges of Real Time Analytics
• Technologies
• Tools
• Use Cases
• Future Work and Conclusion
Where does Big Data come from?
Courtesy: http://goo.gl/JWswfj
What makes it Big Data?
Courtesy: Oracle
VARIABILITY
Evolution of Big Data
1960s 1967
Automatic Data
Compression
1997
Information Explosion
Our Literature Survey!
Big Data Analytics
“Big data analytics is the process of examining large data sets to
uncover hidden patterns, unknown correlations, market trends,
customer preferences and other useful business information.”
• Predictive Analysis
• Text Analysis
• Data Mining
• Statistical Analysis
Courtesy: smartdatacollective.com
Sample Systems
Analytics & 3 V's
Courtesy: watalon.com
Real Time Systems
“A real-time system is one that processes information and produces a
response within a specified time, else risk severe consequences,
sometimes including failure.”
• Telecommunication Systems
• Anti-Lock Brakes in a Car
• Air Traffic Control System
• Weather Forecasting System
Courtesy: yourdon.com
Real-Time Analytics of Big Data
What is Happening?
[Figure: data rates and volumes (kilobytes/sec, megabytes/sec, gigabytes, terabytes, petabytes, exabytes) set against response times (milliseconds, seconds, minutes, hours), with regions labeled "Big Data" and "Real Time".]
Courtesy: infochimps.com
Challenges of Real Time Analytics
Expensive: Complex architecture, batch processing.
Semi-structured and Unstructured Data: New sources are unpredictable, and relational databases cannot handle them.
Market too Dynamic to Predict: Subscribers' preferences change, and competition accelerates the change.
Scalability: Requires sub-second response times at loads greater than a single server can handle.
Thinking Beyond Hadoop!
Manage and store huge volumes of any data: Hadoop File System, MapReduce
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Structure and control data: Data Warehousing
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
Understand and navigate federated big data sources: Federated Discovery and Navigation
Courtesy: IBM
Our Solution
• Do the impossible: Incorporate any kind of data
• Scale Big: Scale without any complexity
• Not Time Consuming: Seconds to Minutes
• Real Time: Try to analyze data without expensive data warehouse loads
Powerful Analytics, In Place, In Real Time.
Courtesy: slideshare.com
In-Memory Computing
In-memory computing primarily relies on keeping data in a server's RAM as a
means of processing at faster speeds. It uses a type of middleware software that
allows one to store data in RAM, across a cluster of computers, and process it in
parallel.
Courtesy: Stratecast
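The idea of holding data in RAM across several machines and processing it in parallel can be illustrated with a small sketch. This is only an approximation under stated assumptions: the "cluster" is simulated with in-process lists and a thread pool, and the names partition and parallel_sum are hypothetical, not taken from any in-memory computing product.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_nodes):
    """Spread records round-robin over n_nodes in-memory partitions.

    Each partition stands in for the RAM of one (simulated) cluster node.
    """
    parts = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        parts[i % n_nodes].append(rec)
    return parts


def parallel_sum(parts):
    """Let every simulated node sum its own partition, then combine."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(sum, parts))
    return sum(partials)


if __name__ == "__main__":
    parts = partition(list(range(1, 101)), n_nodes=4)
    print(parallel_sum(parts))
```

The point of the design is that no partition is read from disk during the computation; each worker touches only the data it already holds in memory.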
Stream Processing
Courtesy: EMC
• Stream-processing systems operate on continuous data streams, e.g., click streams on web pages, user request/query streams, monitoring events, notifications, etc.
• Stream processing delivers real-time analytic processing on constantly changing data in motion.
• Analyse first, store later!
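A minimal sketch of "analyse first, store later", assuming events arrive as timestamps in non-decreasing order: a sliding-window counter reports how many events (for example, clicks) fell within the last few seconds, and discards older events instead of persisting them. The class name SlidingWindowCounter is hypothetical.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events whose timestamp lies in the window (now - window, now]."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps currently inside the window

    def add(self, timestamp):
        """Register one event and return the current in-window count."""
        self.events.append(timestamp)
        # Drop events that have slid out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events)


if __name__ == "__main__":
    clicks = SlidingWindowCounter(window_seconds=10)
    for ts in [0, 5, 10, 12]:
        print(ts, clicks.add(ts))
```

Memory use is bounded by the number of events in one window, which is what lets the analysis keep up with data in motion.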
Complex Event Processing
Complex Event Processing (CEP) processes multiple event streams generated
within the enterprise to construct data abstractions and identify meaningful
patterns among those streams.
• Analytics across both real-time and historical data.
• Real-time event capture, filtering, pattern detection, matching, and aggregation.
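Pattern detection across events can be sketched as follows. The example is an assumption chosen for illustration, not a rule from the slides: it flags any key (say, a card number) for which one event type is followed by another within a time limit. The function name detect_sequence and the event names are hypothetical.

```python
def detect_sequence(events, first, second, within):
    """Find `first` followed by `second` for the same key within `within` seconds.

    events: iterable of (timestamp, key, kind), ordered by timestamp.
    Returns a list of (key, time_of_first, time_of_second) matches.
    """
    last_first = {}  # key -> timestamp of the most recent `first` event
    matches = []
    for ts, key, kind in events:
        if kind == first:
            last_first[key] = ts
        elif kind == second and key in last_first:
            if ts - last_first[key] <= within:
                matches.append((key, last_first[key], ts))
            del last_first[key]  # the pending `first` event is consumed
    return matches


if __name__ == "__main__":
    stream = [
        (0, "card1", "small_purchase"),
        (2, "card2", "small_purchase"),
        (5, "card1", "large_purchase"),
        (100, "card2", "large_purchase"),
    ]
    print(detect_sequence(stream, "small_purchase", "large_purchase", within=60))
```

Real CEP engines express such rules declaratively and handle many patterns at once; the sketch only shows the matching step for a single two-event pattern.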
Tools for Real Time Analytics
Big Data is NOT new, the Tools ARE!
IBM InfoSphere Streams
Kafka
• A high-performance distributed publish-subscribe messaging system.
• Designed for processing of real-time activity stream data.
• Initially developed at LinkedIn, now part of Apache.
• Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
• Fast
• Scalable
• Durable
• Fault-tolerant
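The publish-subscribe idea can be sketched without any broker. The following is not the Kafka API; it is a hypothetical in-memory Topic class that assumes a single append-only log in which each consumer tracks its own read offset, so several consumers can read the same messages independently.

```python
class Topic:
    """A toy append-only message log with per-consumer offsets."""

    def __init__(self):
        self.log = []      # published messages, in arrival order
        self.offsets = {}  # consumer name -> index of next unread message

    def publish(self, message):
        """Append a message and return its position in the log."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, consumer, max_messages=10):
        """Return up to max_messages unread messages for this consumer."""
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch


if __name__ == "__main__":
    clicks = Topic()
    for page in ["home", "search", "cart"]:
        clicks.publish(page)
    print(clicks.poll("dashboard", max_messages=2))
    print(clicks.poll("archiver"))
```

Because messages stay in the log after being read, a slow consumer does not block a fast one, and a restarted consumer can resume from its stored offset.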
Storm
• A highly distributed real-time computation system.
• Created at BackType, which was acquired by Twitter; later open-sourced.
• Twitter claims, “Over a million tuples processed per second per node.”
• Fast, Scalable, Reliable and Fault-tolerant.
• Stream: Unbounded sequence of tuples
• Primitives
  • Spouts: Pull messages from a source and emit them as a stream
  • Bolts: Perform core functions of stream computing
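The spout and bolt primitives can be mimicked with Python generators. This is a sketch of the dataflow idea only and does not use Storm's API; the function names sentence_spout, split_bolt and count_bolt are hypothetical, and everything runs in one process rather than on a cluster.

```python
def sentence_spout(sentences):
    """Spout: pull messages from a source and emit them as one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count)."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


if __name__ == "__main__":
    topology = count_bolt(split_bolt(sentence_spout(["big data", "big streams"])))
    for tup in topology:
        print(tup)
```

Chaining the generators plays the role of wiring a topology: each tuple flows through every stage as soon as it is emitted, with no batch boundary in between.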
Spark Streaming
• Was developed in the AMPLab at UC Berkeley.
• In-memory computing capabilities deliver speed.
• Low latency
• High throughput
• Fault tolerant
• New programming model:
  • Discretized streams (DStreams)
  • Resilient Distributed Datasets
Spark Streaming uses micro-batching to support continuous stream processing. It is
an extension of Spark, which is a batch-processing system.
Courtesy: Apache Spark
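Micro-batching can be sketched in plain Python, without the Spark API. The assumption here is that records arrive as (timestamp, value) pairs sorted by timestamp; they are cut into fixed-length intervals, and each interval is then processed as a small batch. The function names micro_batches and batch_sums are hypothetical.

```python
from itertools import groupby


def micro_batches(records, interval):
    """Group (timestamp, value) records into consecutive fixed-length batches.

    records must be sorted by timestamp; interval is the batch length in seconds.
    Yields (batch_index, [values]) for every non-empty interval.
    """
    def batch_index(record):
        return int(record[0] // interval)

    for index, group in groupby(records, key=batch_index):
        yield index, [value for _, value in group]


def batch_sums(records, interval):
    """Run an ordinary batch computation (a sum) on each micro-batch."""
    return [(index, sum(values)) for index, values in micro_batches(records, interval)]


if __name__ == "__main__":
    readings = [(0.1, 1), (0.5, 2), (1.2, 3), (2.9, 4)]
    print(batch_sums(readings, interval=1))
```

The trade-off is visible in the interval parameter: a shorter interval lowers latency, while a longer one gives each batch more data to amortize its overhead.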
Spring XD (XD=eXtreme Data)
• Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
• The Spring XD framework supports streams for the ingestion of event-driven data from a source to a sink, passing through any number of processors.
Courtesy: Infoq
Comparison of Tools (1)
Definition
  Spark Streaming: A fast, general-purpose cluster computing system.
  Apache Storm: A distributed real-time computation system.
  Spring XD: A unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
Implemented in
  Spark Streaming: Scala
  Apache Storm: Clojure
  Spring XD: Java
Programming API
  Spark Streaming: Scala, Java, Python
  Apache Storm: Java API, usable with any programming language.
  Spring XD: Java
Development
  Spark Streaming: A full top-level Apache project.
  Apache Storm: Undergoing Apache incubation.
  Spring XD: Spring project by Pivotal.
Processing Model
  Spark Streaming: Batch processing framework that also does micro-batching.
  Apache Storm: Stream processing framework that processes and dispatches messages as soon as they arrive.
  Spring XD: Unified platform for stream processing.
Fault Tolerance
  Spark Streaming: Recovery of lost work and restart of workers via the resource manager.
  Apache Storm: Workers and supervisors restart as if nothing had happened.
  Spring XD: Reassignment of work to a working container.
Comparison of Tools (2)
Data processing
  Spark Streaming: Messages are not lost and are delivered once (small-scale batching).
  Apache Storm: Keeps track of each and every record.
  Spring XD: Unacknowledged messages are retried until the container comes back.
Use Cases
  Spark Streaming:
  • Combines batch and stream processing (Lambda Architecture).
  • Machine Learning: Improve performance of iterative algorithms.
  • Power real-time dashboards.
  Apache Storm: Prevention of:
  • securities fraud
  • compliance violations
  • security breaches
  • network outage
  Spring XD:
  • Stream tweets to Hadoop for sentiment analysis.
  • High-throughput distributed data ingestion into HDFS from a variety of input sources.
  • Real-time analytics at ingestion time, e.g. gathering metrics and counting values.
Which tools are right for you?
Lambda Architecture
• In 2013, Nathan Marz and James Warren proposed the Lambda Architecture, which attempts to provide a methodology for building a Big Data system.
• Such a system would balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate pre-computed views, while simultaneously using real-time stream processing to provide dynamic views.
Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable real-time data systems. Manning Publications, 2015.
Courtesy: Trivadis
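How a query combines the pre-computed batch view with the dynamic real-time view can be sketched as below. This is a simplified illustration under stated assumptions: the workload is counting page views, the views are plain in-memory counters, and the names batch_view, speed_view and query are hypothetical rather than taken from the book.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute a complete, accurate view from all stored events."""
    return Counter(master_dataset)


def speed_view(recent_events):
    """Speed layer: cover only events that arrived after the last batch run."""
    return Counter(recent_events)


def query(batch, speed, key):
    """Serving layer: merge both views to answer with up-to-date totals."""
    return batch[key] + speed[key]


if __name__ == "__main__":
    stored = ["home", "cart", "home"]   # already processed by the batch layer
    recent = ["home", "search"]         # arrived since the last batch run
    batch, speed = batch_view(stored), speed_view(recent)
    print(query(batch, speed, "home"))
    print(query(batch, speed, "search"))
```

When the next batch run absorbs the recent events, the speed view for that period can be discarded, which keeps any approximation in the real-time path short-lived.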
Lambda Architecture Example
Courtesy: Trivadis
Use Cases
• Healthcare
  • Capture and analyze real-time data from medical monitors, alerting hospital staff to potential health problems before patients manifest clinical signs of infection or other issues.
  • Analyze privacy-protected streams of medical device data to detect early signs of disease and identify correlations among multiple patients.
• Finance
  • Analyze ticks, tweets, satellite imagery, weather trends, and any other type of data to inform trading algorithms in real time.
  • Apply fraud insights to take action in real time. Use analytics on streaming data to confidently differentiate legitimate actions, while preventing or interrupting suspicious actions and responding immediately to criminal patterns and activities.
Use Cases
• Government
  • Identify social program fraud within seconds based on program history, citizen profile, and geospatial data.
  • Identify items or patterns for deeper investigation in cybersecurity.
• Transport
  • Traffic managers can now respond quickly and accurately to relevant insights from real-time analytics drawn from data feeds and reports.
  • Telematics can provide data-in-motion such as vehicle speed, data relating to the transmission control system, braking, air bags, tire pressure and wiper speed, as well as geospatial and current environmental conditions data. Hence, automotive companies can strengthen customer relationships.
Use Cases
• Telecommunication
  • Improve customer profitability analysis, end-to-end visibility for new product rollouts, and real-time analysis to better serve network customers.
  • Perform capacity planning for mobile networks as new high-bandwidth services are introduced. Improve customer experience.
• Retail
  • See a product recurring in abandoned shopping carts. Run a promotion to close more sales of that product.
  • Evaluate sales performance in real time. Take measures now to achieve sales quotas.
  • An electronic coupon delivery service sends e-mails to customers with recommendations matched to their interests, derived from their location information, membership information, and information on nearby stores.
Courtesy: SAP
Future Work
• Increased Level of Merging
• Application of Social and Digital Media
• New Technologies
• Further Development of Telemetric Data
• Self Learning Systems
• Complex Statistical Methods
Conclusion
Resources, Privacy, Security, Time, Cost
“Consumer Data will be the biggest differentiator in the next two to three years.
Whoever unlocks the reams of data and uses it strategically, will win”
-Angela Ahrendts, CEO, Burberry
Questions?
