Big Data Analytics for Real Time Systems

Kamalika Dutta, Manasi Jayapal

Overview
 Introduction
 Big Data Analytics
 Real Time Systems
 Challenges of Real Time Analytics
 Technologies
 Tools
 Use Cases
 Future Work and Conclusion

Where does Big Data come from?
Courtesy: http://goo.gl/JWswfj

What makes it Big Data?
Courtesy: Oracle
VARIABILITY

Evolution of Big Data
1960s 1967
Automatic Data
Compression
1997
Information Explosion
Our Literature Survey!

Big Data Analytics
“Big data analytics is the process of examining large data sets to
uncover hidden patterns, unknown correlations, market trends,
customer preferences and other useful business information.”
 Predictive Analysis
 Text Analysis
 Data Mining
 Statistical Analysis
Courtesy: smartdatacollective.com

Sample Systems

Analytics & 3 V's
Courtesy: watalon.com

Real Time Systems
“A real-time system is one that processes information and produces a
response within a specified time, else risk severe consequences,
sometimes including failure.”
 Telecommunication Systems
 Anti-Lock Brakes in a Car
 Air Traffic Control System
 Weather Forecasting System
Courtesy: yourdon.com

Real-Time Analytics of Big Data
What is Happening?
(Figure: data rates and volumes, from kilobytes/sec up to exabytes, set against processing times, from milliseconds to hours, with regions labelled Big Data and Real Time.)
Courtesy: infochimps.com

Challenges of Real Time Analytics
Expensive: Complex architecture, batch processing
Semi-structured and Unstructured Data: New sources are unpredictable, and relational
databases cannot handle them
Market too Dynamic to Predict: Subscribers' preferences change, and competition
accelerates the change
Scalability: Sub-second response times are required under loads greater than a single
server can handle

Thinking Beyond Hadoop!
Manage & store huge volume of any data: Hadoop File System, MapReduce
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Structure and control data: Data Warehousing
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
Understand and navigate federated big data sources: Federated Discovery and Navigation
Courtesy: IBM

Our Solution
 Do the impossible: Incorporate any kind of data
 Scale Big: Scale without any complexity
 Not Time Consuming: Seconds to Minutes
 Real Time: Try to analyze data without expensive data warehouse loads
Powerful Analytics, In Place, In Real Time.
Courtesy: slideshare.com

In-Memory Computing
In-memory computing keeps data in a server's RAM, rather than on disk, so that it can be
processed faster. It relies on middleware that stores data in RAM across a cluster of
computers and processes it in parallel.
Courtesy: Stratecast
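
The idea of holding partitions of a data set in RAM and processing them in parallel can be sketched in a few lines. This is only an illustration under stated assumptions: threads in one process stand in for the computers of a cluster, and the names `partition`, `local_sum` and `parallel_total` are hypothetical, not part of any in-memory computing product.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_nodes):
    """Spread records round-robin over n_nodes in-memory partitions."""
    parts = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        parts[i % n_nodes].append(rec)
    return parts


def local_sum(part):
    """Work done by one (simulated) node on the data it already holds in RAM."""
    return sum(part)


def parallel_total(parts):
    """Run local_sum on every partition concurrently and combine the results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return sum(pool.map(local_sum, parts))


# Hypothetical data set: the readings 1..100, kept entirely in memory.
partitions = partition(list(range(1, 101)), n_nodes=4)
total = parallel_total(partitions)
```

Each partition is summed where it lives and only the small partial results are combined, which is the pattern that lets such systems avoid disk access on the critical path.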

Stream Processing
Courtesy: EMC
 Stream-processing systems operate on continuous data streams, e.g., click
streams on web pages, user request/query streams, monitoring events,
notifications, etc.
 Stream processing delivers real-time analytic processing on constantly changing
data in motion.
 Analyse first, store later!
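
A minimal sketch of "analyse first, store later", assuming a stream of numeric readings and a moving average as the analysis. The function name `analyse_then_store` and the sample values are hypothetical.

```python
from collections import deque


def analyse_then_store(stream, store, window=3):
    """Yield a moving average per event; each event is stored only after analysis."""
    recent = deque(maxlen=window)  # only the last `window` values stay in memory
    for value in stream:
        recent.append(value)
        average = sum(recent) / len(recent)  # analysis happens first ...
        store.append(value)                  # ... storage happens afterwards
        yield average


store = []
averages = list(analyse_then_store(iter([10, 20, 30, 40]), store))
```

Because the function is a generator, a result is available as soon as each event arrives, without waiting for the stream to end.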

Complex Event Processing
Complex Event Processing (CEP) processes multiple event streams generated
within the enterprise to construct data abstractions and identify meaningful
patterns among those streams.
 Analytics across both real-time and historical data.
 Real-time event capture, filtering, pattern detection, matching, and
aggregation.
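
Filtering, matching and pattern detection across two event streams might look like the sketch below. The scenario is an assumption chosen for illustration: several failed logins by one user followed by a transfer within a time window. The function name, thresholds and sample events are hypothetical.

```python
import heapq
from collections import defaultdict, deque


def detect_suspicious(login_failures, transfers, threshold=3, window=60):
    """Yield (user, time) when `threshold` failed logins precede a transfer
    by the same user within `window` seconds."""
    # Merge the two time-ordered streams into a single time-ordered stream.
    merged = heapq.merge(
        ((t, "fail", user) for t, user in login_failures),
        ((t, "transfer", user) for t, user in transfers),
    )
    recent_failures = defaultdict(deque)
    for t, kind, user in merged:
        failures = recent_failures[user]
        # Filtering: forget failures that have fallen out of the time window.
        while failures and t - failures[0] > window:
            failures.popleft()
        if kind == "fail":
            failures.append(t)
        elif len(failures) >= threshold:
            yield user, t  # pattern matched


login_failures = [(0, "ann"), (10, "ann"), (20, "ann"), (30, "bob")]
transfers = [(25, "ann"), (40, "bob"), (200, "ann")]
alerts = list(detect_suspicious(login_failures, transfers))
```

Only the transfer at time 25 follows three recent failures by the same user; the later transfer at time 200 is outside the window.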

Tools for Real Time Analytics
Big Data is NOT new; the Tools ARE!
IBM InfoSphere Streams

Kafka
 A high-performance distributed publish-subscribe messaging system.
 Designed for processing real-time activity stream data.
 Initially developed at LinkedIn, now part of Apache.
 Kafka works in combination with Apache Storm, Apache HBase and Apache
Spark for real-time analysis and rendering of streaming data.
 Fast
 Scalable
 Durable
 Fault-tolerant
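
The publish-subscribe idea can be illustrated with a toy append-only log in which every subscriber keeps its own read position. This is a plain-Python sketch, not the Kafka client API; the class `TopicLog` and its methods are hypothetical.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log for one topic (an illustration, not the Kafka API)."""

    def __init__(self):
        self._messages = []               # retained messages, in arrival order
        self._offsets = defaultdict(int)  # next offset to read, per subscriber

    def publish(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, subscriber, max_messages=10):
        """Return the subscriber's unread messages and advance its offset."""
        start = self._offsets[subscriber]
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch


log = TopicLog()
for event in ["view:home", "click:buy", "view:cart"]:
    log.publish(event)

dashboard = log.poll("dashboard")                # reads everything published so far
audit_first = log.poll("audit", max_messages=2)  # reads at its own pace
audit_rest = log.poll("audit")
```

Because messages stay in the log after being read, independent subscribers can consume the same activity stream at different speeds.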

Storm
 A highly distributed real-time computation system.
 Created at BackType, a company acquired by Twitter.
 Twitter claims, “Over a million tuples processed per second per node.”
 Fast, Scalable, Reliable and Fault-tolerant.
 Stream: Unbounded sequence of tuples
 Primitives
 Spouts: Pull messages
 Bolts: Perform core functions of stream computing
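
The spout and bolt primitives can be mimicked with Python generators. This single-process sketch only illustrates the concepts with a word count; it does not use Storm's API, and all function names are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: pull messages from a source and emit them as one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(tuples):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(tuples):
    """Bolt: keep a running count per word and emit (word, count) on each update."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and return the latest counts."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest


word_counts = run_topology(["big data", "Big streams"])
```

In a real deployment each spout and bolt would run as many parallel tasks across a cluster; here they run one after another in a single thread.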

Spark Streaming
 Was developed in the AMPLab at UC Berkeley.
 In-memory computing capabilities deliver speed.
 Low latency
 High throughput
 Fault tolerant
 New programming model:
 Discretized streams (DStreams)
 Resilient Distributed Datasets (RDDs)
Spark Streaming uses micro-batching to support continuous stream processing. It is
an extension of Spark, which is a batch-processing system.
Courtesy: Apache Spark
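
Micro-batching can be sketched without Spark: time-ordered records are cut into small batches of a fixed interval, and an ordinary batch computation runs on each one. The function `micro_batches` and the sample records are hypothetical and only illustrate the idea; this is not the Spark Streaming API.

```python
def micro_batches(records, batch_interval):
    """Group time-ordered (timestamp, value) records into consecutive batches.

    Intervals in which nothing arrives produce an empty batch.
    """
    batch, batch_end = [], None
    for timestamp, value in records:
        if batch_end is None:
            batch_end = timestamp + batch_interval  # first batch starts at first record
        while timestamp >= batch_end:
            yield batch  # close the current interval
            batch, batch_end = [], batch_end + batch_interval
        batch.append(value)
    if batch:
        yield batch  # flush the final, partially filled batch


records = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (3.2, "d")]
batches = list(micro_batches(records, batch_interval=1.0))
# A batch-style computation applied to every micro-batch: count the records.
counts_per_batch = [len(batch) for batch in batches]
```

The latency of such a design is bounded below by the batch interval, which is the trade-off against a system that handles each record as it arrives.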

Spring XD (XD=eXtreme Data)
 Spring XD is a unified, distributed, and extensible system for data ingestion,
real-time analytics, batch processing, and data export.
 The Spring XD framework supports streams that ingest event-driven data from a
source, pass it through any number of processors, and deliver it to a sink.
Courtesy: Infoq
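
The source, processor and sink roles can be sketched as a small pipeline. This is plain Python written for illustration, not Spring XD's own stream definition language; `run_stream` and the convention that a processor drops an item by returning None are assumptions.

```python
def run_stream(source, processors, sink):
    """Pass every item from `source` through each processor, then hand it to `sink`."""
    for item in source:
        for process in processors:
            item = process(item)
            if item is None:  # assumed convention: None means "drop this item"
                break
        else:
            sink(item)  # reached only if no processor dropped the item


collected = []
run_stream(
    source=["  hello ", "", "world"],
    processors=[str.strip, lambda s: s or None, str.upper],
    sink=collected.append,
)
```

Any number of processors, including none, can sit between the source and the sink.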

Comparison of Tools (1)
Definition
  Spark Streaming: A fast and general-purpose cluster computing system.
  Apache Storm: A distributed real-time computation system.
  Spring XD: A unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
Implemented in
  Spark Streaming: Scala
  Apache Storm: Clojure
  Spring XD: Java
Programming API
  Spark Streaming: Scala, Java, Python
  Apache Storm: Java API, and usable with any programming language.
  Spring XD: Java
Development
  Spark Streaming: A full top-level Apache project.
  Apache Storm: An Apache project undergoing incubation.
  Spring XD: Spring project by Pivotal.
Processing Model
  Spark Streaming: Batch processing framework that also does micro-batching.
  Apache Storm: Stream processing framework that processes and dispatches messages as soon as they arrive.
  Spring XD: Unified platform for stream processing.
Fault Tolerance
  Spark Streaming: Recovery of lost work and restart of workers via the resource manager.
  Apache Storm: Restart of workers and supervisors as if nothing had happened.
  Spring XD: Reassignment of work to a working container.

Comparison of Tools (2)
Data processing
  Spark Streaming: Messages are not lost and are delivered once (small-scale batching).
  Apache Storm: Keeps track of each and every record.
  Spring XD: Unacknowledged messages are retried until the container comes back.
Use Cases
  Spark Streaming:
  • Combines batch and stream processing (Lambda Architecture).
  • Machine Learning: Improve performance of iterative algorithms.
  • Power real-time dashboards.
  Apache Storm: Prevention of:
  • securities fraud
  • compliance violations
  • security breaches
  • network outage
  Spring XD:
  • Stream tweets to Hadoop for sentiment analysis.
  • High-throughput distributed data ingestion into HDFS from a variety of input sources.
  • Real-time analytics at ingestion time, e.g. gathering metrics and counting values.

Which tools are right for you?

Lambda Architecture
 In 2013, Nathan Marz and James Warren proposed the Lambda Architecture
that attempts to provide a methodology to build a Big Data system.
 Such a system would balance latency, throughput, and fault-tolerance by
using batch processing to provide comprehensive and accurate pre-computed
views, while simultaneously using real-time stream processing to provide
dynamic views.
Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable real-time data systems. O'Reilly Media, 2013.
Courtesy: Trivadis
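
A toy sketch of the idea: a batch layer recomputes an accurate view from all data, a speed layer keeps a view of what has arrived since, and queries merge the two. The class `LambdaCounter` is hypothetical and heavily simplified; in particular it assumes the batch run is instantaneous, so the real-time view can simply be cleared afterwards.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with a batch view, a real-time view and a merged query."""

    def __init__(self):
        self.master = []                # append-only master data set
        self.batch_view = Counter()     # precomputed from the master data set
        self.realtime_view = Counter()  # events not yet covered by the batch view

    def ingest(self, page):
        """New data is sent to both the batch layer and the speed layer."""
        self.master.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the real-time view."""
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, page):
        """Serving step: merge the batch view with the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]


views = LambdaCounter()
for page in ["home", "home", "cart"]:
    views.ingest(page)
views.run_batch()
views.ingest("home")  # arrives after the last batch run
home_views = views.query("home")
```

The query stays correct between batch runs because the real-time view covers exactly the events the batch view has not yet seen.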

Lambda Architecture Example
Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable real-time data systems. O'Reilly Media, 2013.
Courtesy: Trivadis

Use Cases
 Healthcare
 Capture and analyze real-time data from medical monitors,
alerting hospital staff to potential health problems before patients
manifest clinical signs of infection or other issues.
 Analyze privacy-protected streams of medical device data to
detect early signs of disease and identify correlations among multiple
patients.
 Finance
 Analyze ticks, tweets, satellite imagery, weather trends, and any
other type of data to inform trading algorithms in real time.
 Apply fraud insights to take action in real time. Use analytics on
streaming data to confidently differentiate legitimate actions,
while preventing or interrupting suspicious actions and responding
immediately to criminal patterns and activities.

Use Cases
 Government
 Identify social program fraud within seconds based on program
history, citizen profile, and geospatial data.
 Identify items or patterns for deeper investigation in cybersecurity.
 Transport
 Traffic managers can now respond quickly and accurately to
relevant insights from real-time analytics drawn from data feeds
and reports.
 Telematics can provide data-in-motion such as vehicle speed, data
relating to the transmission control system, braking, air bags, tire
pressure and wiper speed as well as geospatial and current
environmental conditions data. Hence, automotive companies can
strengthen customer relationships.

Use Cases
 Telecommunication
 Improve customer profitability analysis, end-to-end visibility for
new product rollouts and real-time analysis to better serve network
customers.
 Perform capacity planning for mobile networks as new
high-bandwidth services are introduced. Improve customer experience.
 Retail
 See a product recurring in abandoned shopping carts. Run a
promotion to close more sales of that product.
 Evaluate sales performance in real time. Take measures now to
achieve sales quotas.
 An electronic coupon delivery service sends e-mails to customers
with recommendations matched to their interests derived from
their location information, membership information, and
information on nearby stores.
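
The abandoned-cart case can be sketched as a small computation over cart events. The event format `(kind, cart_id, product)` and the function name are assumptions made for illustration; a production system would also need a timeout to decide when a cart counts as abandoned.

```python
from collections import Counter, defaultdict


def abandoned_products(events):
    """Count products sitting in carts that have not been checked out."""
    open_carts = defaultdict(list)
    for kind, cart_id, product in events:
        if kind == "add":
            open_carts[cart_id].append(product)
        elif kind == "checkout":
            open_carts.pop(cart_id, None)  # a purchased cart is no longer abandoned
    return Counter(p for items in open_carts.values() for p in items)


events = [
    ("add", "c1", "scarf"),
    ("add", "c2", "scarf"),
    ("add", "c2", "coat"),
    ("checkout", "c2", None),
    ("add", "c3", "scarf"),
]
abandoned = abandoned_products(events)
top_candidate = abandoned.most_common(1)
```

The most frequently abandoned product is the natural candidate for a promotion.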

Courtesy: SAP

Future Work
 Increased Level of Merging
 Application of Social and Digital Media
 New Technologies
 Further Development of Telemetric Data
 Self Learning Systems
 Complex Statistical Methods

Conclusion
Resources, Privacy, Security, Time, Cost
“Consumer Data will be the biggest differentiator in the next two to three years.
Whoever unlocks the reams of data and uses it strategically, will win”
-Angela Ahrendts, CEO, Burberry
