REAL TIME STREAMING
ANALYTICS
About me
• Padma Chitturi
• Analytics Lead @ Fractal Analytics
• Author of “Apache Spark for Data Science Cookbook”
• https://github.com/ChitturiPadma/SparkforDataScienceCookbook
Big Data use-cases across Industries
Banking
• Improve customer intelligence
• Reduce risk
• Identify fraud
• Marketing campaigns
Healthcare
• Optimal treatment
• Remote patient monitoring
• Detecting disease
• Personalized medicine
Manufacturing
• Product quality
• Detect machine failures
• Sales forecasting
• Market pricing & planning
Retail
• Customer behavior
• Buying patterns of customers
• Recommending products
• Maintain the inventory
Telecom
• Traffic control
• Customer experience
• Location-based services
• Precise marketing
Insurance
• Claims management
• Risk management
• Customer experience & insight
Airlines
• Providing travel offers
• Predicting flight delays
• Avoiding travel accidents
• Increasing security
Agriculture
• Precision agriculture
• Demand forecasting
• Reduce manpower
• Better farming decisions
Need for “Real Time” Analytics across Industries
• Fraud detection
• Connected car data
• Identity & protection services
• Click stream analysis
• Financial sales tracking
• Improving patient care
Overview of Spark
• In-memory cluster computing framework for processing
and analyzing large volumes of data.
• Key Features:
• Easy to use (expressive API for batch & real-time processing)
• Fast (provides in-memory persistence and optimizes disk seeks)
• General-purpose (supports batch, real-time and graph processing)
• Scalable (as the data grows, computational power can be increased by adding more nodes)
• Fault-tolerant (handles node failures without interrupting the application by relaunching tasks on nodes that hold a replicated copy of the data)
What is Spark Streaming?
• Extends Spark to large-scale stream processing
• Scales to 100s of nodes and achieves second-scale latencies
• Efficient and fault-tolerant stateful stream processing
• Integrates with Spark’s batch and interactive processing
• Provides a simple batch-like API for implementing
complex algorithms.
Discretized Stream Processing
• Run a streaming computation as a series of very small,
deterministic batch jobs.
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
• Batch sizes as low as ½ second, latency ~ 1 second
• Potential for combining batch processing and stream processing in the same system
Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results
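The discretized model above can be sketched without Spark at all. The following is a minimal illustration, assuming plain Scala collections in place of a live stream and of RDDs; `Event`, `chop` and `process` are hypothetical names introduced here, not part of the Spark API.

```scala
// Sketch only: plain Scala collections stand in for a live stream and for RDDs.
object MicroBatchSketch {
  // An event with an arrival time in milliseconds (hypothetical type).
  final case class Event(timeMs: Long, payload: String)

  // Chop the stream into consecutive batches of `batchMs` milliseconds.
  // Returns (batch index, events in that batch), ordered by batch index.
  // Intervals in which nothing arrived produce no batch in this sketch.
  def chop(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)

  // Run the same deterministic batch job on every batch, one result per batch.
  def process[A](batches: Seq[(Long, Seq[Event])])(job: Seq[Event] => A): Seq[(Long, A)] =
    batches.map { case (index, batch) => (index, job(batch)) }
}
```

For example, `MicroBatchSketch.process(MicroBatchSketch.chop(events, 1000))(_.size)` counts the events in each one-second batch.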
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
Diagram: Twitter Streaming API → tweets DStream (batch @ t, batch @ t+1, batch @ t+2); each batch is stored in memory as an RDD (immutable, distributed dataset)
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
• transformation: modify data in one DStream to create a new DStream
Diagram: flatMap is applied to each batch (t, t+1, t+2) of the tweets DStream to produce the hashTags DStream [#cat, #dog, … ]; new RDDs are created for every batch
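The slides do not show `getTags`. As a rough sketch, assuming a status is just its text (the real Twitter status is an object) and that a hashtag is any whitespace-separated word starting with `#`, the per-batch flatMap could look like this; `HashtagSketch` and `hashTags` are illustrative names.

```scala
object HashtagSketch {
  // Hypothetical stand-in for getTags: here a status is only its text.
  // Naive on purpose: trailing punctuation such as "#cat," is kept as-is.
  def getTags(status: String): Seq[String] =
    status.split("\\s+").toSeq.filter(w => w.startsWith("#") && w.length > 1)

  // One batch of tweets in, one batch of hashtags out:
  // flatMap flattens the per-tweet tag lists into a single sequence.
  def hashTags(batch: Seq[String]): Seq[String] = batch.flatMap(getTags)
}
```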
Example 2 – Count the hashtags over last 1 min
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
• sliding window operation: window(window length, sliding interval)
Diagram: a window of the given window length moves over the DStream of data by the sliding interval
Example 2 – Count the hashtags over last 1 min
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
Diagram: a sliding window moves over the hashTags batches (t-1, t, t+1, t+2, t+3); countByValue counts over all the data in the window to produce tagCounts
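What the window-then-count step computes can be shown with plain collections. This is a sketch under the assumption that the window length and sliding interval are expressed as numbers of batches; with a 1-second batch interval, `window(Minutes(1), Seconds(1))` would correspond to `windowLen = 60` and `slide = 1`. `WindowCountSketch` is a hypothetical name.

```scala
object WindowCountSketch {
  // batches(i) holds the hashtags that arrived in batch interval i.
  // At every `slide`-th batch, count each value over the last `windowLen` batches.
  // The first windows are shorter because fewer batches exist yet.
  def windowedCounts(batches: Seq[Seq[String]], windowLen: Int, slide: Int): Seq[Map[String, Int]] =
    (0 until batches.length by slide).map { end =>
      val start = math.max(0, end - windowLen + 1)
      val window = batches.slice(start, end + 1).flatten
      window.groupBy(identity).map { case (tag, hits) => (tag, hits.size) }
    }
}
```

This version recounts the whole window each time, which is simple but does repeated work when consecutive windows overlap heavily.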
Fault-tolerance
• RDDs remember the operations that
created them
• Batches of input data are replicated
in memory for fault-tolerance
• Data lost due to worker failure can be recomputed from the replicated input data
• Therefore, all transformed data is fault-tolerant
Diagram: the tweets RDD (input data replicated in memory) is transformed by flatMap into the hashTags RDD; lost partitions are recomputed on other workers
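The recovery idea, that a derived dataset keeps its parent data and the operation that created it, can be illustrated with a toy class. This is only a sketch: `Derived`, `lose` and `get` are hypothetical names, and an in-memory `Vector` stands in for replicated input partitions.

```scala
object LineageSketch {
  // A derived dataset remembers its input partitions and the operation that
  // created it, so any output partition can be rebuilt on demand.
  final class Derived[A, B](input: Vector[Seq[A]], op: Seq[A] => Seq[B]) {
    // Cached output partitions; None marks a partition lost with a failed worker.
    private var cache: Vector[Option[Seq[B]]] = input.map(p => Option(op(p)))

    // Simulate a worker failure that drops one output partition.
    def lose(partition: Int): Unit = {
      cache = cache.updated(partition, None)
    }

    // Return the partition, recomputing it from the input if it was lost.
    def get(partition: Int): Seq[B] = cache(partition).getOrElse {
      val rebuilt = op(input(partition))
      cache = cache.updated(partition, Option(rebuilt))
      rebuilt
    }
  }
}
```

Because the operation is deterministic, the recomputed partition equals the one that was lost.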
Fault-tolerance
• Spark Streaming program
t = ssc.twitterStream(“…”).map(…)
t.foreach(…)
• Spark Streaming program
t1 = ssc.twitterStream(“…”)
t2 = ssc.twitterStream(“…”)
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)
DStream Graph
Diagram: one DStream graph per program. Legend: T = Twitter Input DStream, M = Mapped DStream, FE = Foreach DStream (dummy DStream signifying an output operation)
DStream Graph → RDD Graphs → Spark Jobs
• Every batch interval, an RDD graph is computed from the DStream graph
• For each output operation, a Spark action is created
• For each action, a Spark job is created to compute it
Diagram: the DStream graph is turned into an RDD graph whose inputs are block RDDs (B) holding the data received in the last batch interval; the three output operations become three actions (A), giving 3 Spark jobs
Execution Model – Job Scheduling
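• Spark Streaming + Spark driver: Network Input Tracker, DStream Graph, Job Scheduler, Job Manager (job queue), Spark’s schedulers
• Spark workers: Block Manager; jobs are executed on worker nodes
Diagram: received data is held by the Block Manager on the workers; block IDs and the DStream graph are used by the Job Scheduler to build RDDs and jobs, which the Job Manager queues and passes to Spark’s schedulers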
RDD Checkpointing
Saving an RDD to HDFS to prevent the RDD graph from growing too large
• Done internally in Spark, transparent to the user program
• Done lazily: saved to HDFS the first time it is computed
red_rdd.checkpoint()
Diagram: the contents of red_rdd are saved to an HDFS file, transparently to all child RDDs
RDD Checkpointing
Stateful DStream operators can have infinite lineages
Diagram: each state (t-1, t, t+1, t+2, t+3) depends on that interval's data and on the previous state; states are periodically checkpointed to HDFS
Large lineages lead to …
• Large closure of the RDD object → large task sizes → high task launch times
• High recovery times under failure
• Periodic RDD checkpointing solves this
• Useful for iterative Spark programs as well
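The effect of periodic checkpointing on lineage length can be shown with a toy running total. This sketch does not use Spark: `State`, `lineageDepth` and `checkpointEvery` are hypothetical, and "checkpointing" is modelled simply as resetting the number of steps that would have to be replayed after a failure.

```scala
object CheckpointSketch {
  // State after each batch, plus how many lineage steps are needed to rebuild it.
  final case class State(total: Long, lineageDepth: Int)

  // Fold batch sums into a running total. Every `checkpointEvery` batches the
  // state counts as saved (lineage depth reset to 0), so recovery never has to
  // replay more than `checkpointEvery - 1` steps.
  def run(batchSums: Seq[Long], checkpointEvery: Int): Seq[State] =
    batchSums.zipWithIndex.scanLeft(State(0L, 0)) { case (prev, (sum, i)) =>
      val depth = if ((i + 1) % checkpointEvery == 0) 0 else prev.lineageDepth + 1
      State(prev.total + sum, depth)
    }.tail
}
```

Without the reset, the depth grows by one per batch, which is the unbounded lineage the slide warns about.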
Performance Tuning
• Increase Read parallelism
• Increase downstream processing parallelism
• Achieve a stable configuration that can sustain the streaming workload
• Optimize for low latency
• Tune memory settings and explore GC options
• Achieve fault tolerance
• Serialize objects efficiently
Analytics transforms the business
Institutionalization
Real time
Data
Sophistication
• Sharpen the saw
• Support strategic decisions
• Achieve breakthrough innovation
• Observe everything
• Fuse external data
• Leverage unstructured data
• Incorporate a “feedback” loop
• Explore AI
• Leverage unsupervised methods
• Build a data-driven culture
• Do systematic experimentation
• Forge a multidisciplinary team
• Operationalize decisions
• Reduce decision latency
• Increase contextual relevance
Disruption
Enabling Real-Time Analytics
Diagram labels: Sensors, Social
Machine Learning
• Deals with the “construction and study of systems that can learn from data”
• Seen as the building blocks for making computers learn to behave more intelligently
• Two phases in learning process – training & testing
• Two kinds of learning
• Unsupervised
• no labels in the training data
• Algorithms detect patterns in the data and group observations with similar characteristics together
• Supervised
• We have training data with correct labels
• Use training data to prepare the algorithm
• Then apply it to data without a correct label
Some types of algorithms
• Prediction
• Predicting a variable from data
• Classification
• Assigning observations to pre-defined classes
• Clustering
• Splitting observations into groups based on similarity
• Recommendation
• Predicting what people might like and uncovering relationships between items
Steps in Analytics Workload
• Data Collection
• Pre-processing the data (cleaning & data munging)
• Retrieve sample data from the actual population
• Descriptive statistics on the sample data
• Exploratory Data Analysis with Spark
• Uni-variate analysis
• Bivariate analysis
• Missing Value treatment
• Outlier detection
• Feature Engineering
• Apply machine learning models
• Optimize and fine-tune the model parameters
Sample Data
Types of labels:
• Denial of service (DoS – attack type)
• Normal
• Probe (attack)
• R2L (attack)
• U2R (attack)
Unsupervised Learning - Clustering
• Clustering is the assignment of a set of observations into subsets
(called clusters) so that observations in the same cluster are similar in
some sense.
• Find areas dense with data (and areas without data)
• Anomaly – far from any cluster
• Use labels, where available, to improve and interpret the clusters
Streaming K-means (K-means ++)
• Assign points to nearest center, update centers, iterate
• Goal: points close to nearest cluster center
• Must choose k = number of clusters
• “++” means a smarter choice of starting centers
Clustering – choosing parameters
• First plot the data with t-SNE, which shows its distribution in 2 dimensions and helps to judge whether the data can be clustered.
• Normalize the data before applying k-means, i.e. standardize each feature as z = (x − mean) / standard deviation
• Choose the value of k using the elbow method or PCA
• Convert categorical variables to numeric using one-hot encoding
Figures: t-SNE plot, elbow plot
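The two preprocessing steps named above, standardization and one-hot encoding, are small enough to sketch directly. This is a minimal illustration in plain Scala, assuming the population standard deviation is used; `PreprocessSketch` and the category values in the usage note are hypothetical.

```scala
object PreprocessSketch {
  // z-score: (x - mean) / standard deviation, using the population standard deviation.
  // A constant feature (standard deviation 0) is mapped to all zeros.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
    if (std == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - mean) / std)
  }

  // One-hot encoding: one 0/1 slot per known category, in the order given.
  // A value outside `categories` yields all zeros.
  def oneHot(categories: Seq[String], value: String): Seq[Double] =
    categories.map(c => if (c == value) 1.0 else 0.0)
}
```

For instance, with an assumed categorical feature taking the values `tcp`, `udp` and `icmp`, `oneHot(Seq("tcp", "udp", "icmp"), "udp")` yields `Seq(0.0, 1.0, 0.0)`.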
Streaming k-means
Approach:
• Start with k initial cluster centres.
• The centroids are updated with every incoming batch of data.
• The clusters drift over time and, after a certain stage, they stabilize.
• The model continuously learns new data patterns.
• Outliers are detected as anomalies.
Pros:
• More useful when the data points don't have labels associated with them.
• Simple to implement.
Cons:
• Not well suited to high-dimensional data.
Diagram: network data → Kafka → streaming k-means
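The approach above (assign each point to its nearest centre, fold the batch into the centres, flag distant points) can be sketched in plain Scala. The update used here is a decayed weighted mean, which is one common choice; the `decay` and `threshold` parameters and all names are assumptions for illustration, not the presenter's implementation.

```scala
object StreamingKMeansSketch {
  type Point = Vector[Double]

  // A cluster centre with the (decayed) number of points it has absorbed so far.
  final case class Center(pos: Point, weight: Double)

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Index of the centre closest to p (centres must be non-empty).
  def nearest(centers: Seq[Center], p: Point): Int =
    centers.indices.minBy(i => dist(centers(i).pos, p))

  // Fold one batch into the centres. `decay` in [0, 1] discounts old data:
  // 1.0 keeps all history, 0.0 uses only the newest batch.
  def update(centers: Seq[Center], batch: Seq[Point], decay: Double): Seq[Center] = {
    val assigned = batch.groupBy(p => nearest(centers, p))
    centers.zipWithIndex.map { case (c, i) =>
      assigned.get(i) match {
        case None => c.copy(weight = c.weight * decay)
        case Some(points) =>
          val m = points.size.toDouble
          val batchMean = points.transpose.map(_.sum / m).toVector
          val old = c.weight * decay
          val pos = c.pos.zip(batchMean).map { case (o, n) => (o * old + n * m) / (old + m) }
          Center(pos, old + m)
      }
    }
  }

  // Flag a point as an anomaly when it is farther than `threshold` from every centre.
  def isAnomaly(centers: Seq[Center], p: Point, threshold: Double): Boolean =
    dist(centers(nearest(centers, p)).pos, p) > threshold
}
```

Calling `update` once per incoming batch and carrying the returned centres forward gives the drifting-then-stabilizing behaviour described on the slide.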
“Offline” vs “Online” algorithms
Offline Learning
• Build models on static data
• Train algorithms on “batches” of data
• Use the model to make predictions on the incoming data stream
• Pros:
• Easy to analyze
• High accuracy
• Batch algorithms are quite accessible
• Cons:
• Unable to identify dynamic patterns
Online Learning
• Build the model on a live stream of data
• Training happens continuously on live data
• Use the model to both predict and learn on streaming data
• Pros:
• Model evolves continuously
• Identifies rapidly changing patterns in the data
• Cons:
• Streaming algorithms are not widely available
• Active area of research
Streaming SVM
In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
• Capable of reflecting changes in the dataset in real time.
• SVM is resistant to noise.
• It uses a high-dimensional space to separate the dataset.
• Prediction rate can be increased by scaling the Spark cluster.
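The slides do not say how the streaming SVM is trained. One way a linear SVM can learn from a stream is a stochastic sub-gradient step on the regularised hinge loss for each arriving example; the sketch below assumes that technique, and `rate`, `reg` and all names are illustrative rather than taken from the deck or from a Spark API.

```scala
object OnlineSvmSketch {
  // Linear model: predict +1 when w . x + b >= 0, otherwise -1.
  final case class Model(w: Vector[Double], b: Double) {
    def score(x: Vector[Double]): Double =
      w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
    def predict(x: Vector[Double]): Int = if (score(x) >= 0.0) 1 else -1
  }

  // One sub-gradient step on the L2-regularised hinge loss for a single
  // labelled example (label is +1 or -1).
  def step(m: Model, x: Vector[Double], label: Int, rate: Double, reg: Double): Model = {
    val shrunk = m.w.map(_ * (1.0 - rate * reg))
    if (label * m.score(x) < 1.0)
      Model(shrunk.zip(x).map { case (wi, xi) => wi + rate * label * xi }, m.b + rate * label)
    else
      Model(shrunk, m.b)
  }

  // Learn from one batch, example by example; the returned model is carried
  // forward to the next batch, so it keeps adapting as the data changes.
  def trainOn(m: Model, batch: Seq[(Vector[Double], Int)], rate: Double, reg: Double): Model =
    batch.foldLeft(m) { case (model, (x, y)) => step(model, x, y, rate, reg) }
}
```

The same model instance serves both roles named earlier for online learning: `predict` on incoming points and `trainOn` once their labels are known.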
Example of a Real-time Analytics environment
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

Real time streaming analytics

  • 2. About me • Padma Chitturi • Analytics Lead @ Fractal Analytics • Author of “Apache Spark for DataScience cookbook” • https://github.com/ChitturiPadma/SparkforDataScienceCookbook
  • 3. Big Data use-cases across Industries • Banking: improve customer intelligence, reduce risk, identify fraud, marketing campaigns • Healthcare: optimal treatment, remote patient monitoring, detecting disease, personalized medicine • Manufacturing: product quality, detect machine failures, sales forecasting, market pricing & planning • Retail: customer behavior, buying patterns of customers, recommending products, maintain the inventory • Telecom: traffic control, customer experience, location-based services, precise marketing • Insurance: claims management, risk management, customer experience & insight • Airlines: providing travel offers, predicting flight delays, avoiding travel accidents, increasing security • Agriculture: precision agriculture, demand forecasting, reduce manpower, better farming decisions
  • 4. Need for “Real Time” Analytics across Industries • Fraud detection • Connected Car Data • Identity & Protection Services • Click Stream Analysis • Financial Sales Tracking • Improving Patient Care
  • 5. Overview of Spark • In-memory cluster computing framework for processing and analyzing large volumes of data. • Key Features: • Easy to use (expressive API for batch & real-time processing) • Fast (provides in-memory persisting and optimizes disk seeks) • General-purpose (supports batch, real-time and graph processing) • Scalable (as the data grows, computational power can be increased by adding more nodes) • Fault-tolerant (handles node failures without interrupting the application by relaunching tasks on nodes that hold a replicated copy of the data)
  • 6. What is Spark Streaming? • Extends Spark for large-scale stream processing • Scales to 100s of nodes and achieves second-scale latencies • Efficient and fault-tolerant stateful stream processing • Integrates with Spark’s batch and interactive processing • Provides a simple batch-like API for implementing complex algorithms.
  • 7. Discretized Stream Processing • Run a streaming computation as a series of very small, deterministic batch jobs. • Chop up the live stream into batches of X seconds • Spark treats each batch of data as RDDs and processes them using RDD operations • Finally, the processed results of the RDD operations are returned in batches • Batch sizes as low as ½ second, latency ~ 1 second • Potential for combining batch processing and streaming processing in the same system • [Diagram: live data stream, Spark Streaming, batches of X seconds, Spark, processed results]
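The batching idea on this slide can be sketched without Spark at all. The following is a minimal plain-Scala illustration, not Spark's implementation: `Event`, `toBatches` and `processBatch` are hypothetical names introduced here, and the "batch job" is simply a word count.

```scala
// Plain-Scala sketch of discretized stream processing (no Spark dependency).
// Event, toBatches and processBatch are hypothetical names for illustration.
object MicroBatchSketch {
  final case class Event(timestampMs: Long, payload: String)

  // Chop a stream of events into consecutive batches of `batchMs` milliseconds.
  // Returns (batchIndex, events) pairs ordered by batch index.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)

  // A small deterministic "batch job": count the words in one batch.
  def processBatch(batch: Seq[Event]): Int =
    batch.map(_.payload.split("\\s+").count(_.nonEmpty)).sum

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100L, "a b"), Event(900L, "c"),
      Event(1200L, "d e f"), Event(2500L, "g")
    )
    // One result per batch, mirroring "processed results are returned in batches".
    toBatches(events, batchMs = 1000L).foreach { case (idx, batch) =>
      println(s"batch $idx: ${processBatch(batch)} words")
    }
  }
}
```

Because each batch is processed by an ordinary, deterministic function, the same code path could in principle serve both batch and streaming workloads, which is the point of the last bullet on the slide.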
  • 8. Example 1 – Get hashtags from Twitter • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) • [Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (t, t+1, t+2) is stored in memory as an RDD (immutable, distributed dataset)]
  • 9. Example 1 – Get hashtags from Twitter • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) • val hashTags = tweets.flatMap(status => getTags(status)) • hashTags is a new DStream • Transformation: modify data in one DStream to create another DStream • [Diagram: flatMap is applied to each batch (t, t+1, t+2) of the tweets DStream to produce the hashTags DStream [#cat, #dog, … ]; new RDDs are created for every batch]
  • 10. Example 2 – Count the hashtags over last 1 min • val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) • val hashTags = tweets.flatMap(status => getTags(status)) • val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() • window is a sliding window operation; Minutes(1) is the window length and Seconds(1) is the sliding interval • [Diagram: a DStream of data with the window length and sliding interval marked]
  • 11. Example 2 – Count the hashtags over last 1 min • val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() • countByValue counts over all the data in the window • [Diagram: a sliding window moves over the hashTags batches (t-1, t, t+1, t+2, t+3) to produce tagCounts]
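To make the window semantics concrete, here is a minimal plain-Scala sketch that counts hashtags over the last few batches. It is an illustration under stated assumptions, not Spark code: `getTags` is a hypothetical stand-in for the helper named on the slide, tweets are plain strings, the window is expressed in number of batches, and the sliding interval is fixed at one batch.

```scala
// Plain-Scala sketch of a sliding-window countByValue (no Spark dependency).
object WindowCountSketch {
  // Hypothetical stand-in for the slide's getTags: whitespace-separated words starting with '#'.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").filter(w => w.startsWith("#") && w.length > 1).toSeq

  // batches(i) holds the tweets received in batch i.
  // For every batch, count hashtags over the last `windowBatches` batches
  // (the window slides forward by one batch each time).
  def windowedCounts(batches: Seq[Seq[String]], windowBatches: Int): Seq[Map[String, Int]] =
    batches.indices.map { i =>
      val window = batches.slice(math.max(0, i - windowBatches + 1), i + 1)
      window.flatten
        .flatMap(getTags)
        .groupBy(identity)
        .map { case (tag, occurrences) => tag -> occurrences.size }
    }
}
```

With one-second batches, the slide's window(Minutes(1), Seconds(1)) would correspond to `windowBatches = 60` in this sketch.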
  • 12. Fault-tolerance • RDDs remember the operations that created them • Batches of input data are replicated in memory for fault-tolerance • Data lost due to worker failure can be recomputed from the replicated input data • Therefore, all transformed data is fault-tolerant • [Diagram: input data of the tweets RDD is replicated in memory; lost partitions of the hashTags RDD are recomputed on other workers via flatMap]
  • 13. Fault-tolerance • Spark Streaming program • t = ssc.twitterStream(“…”).map(…) • t.foreach(…) • t1 = ssc.twitterStream(“…”) • t2 = ssc.twitterStream(“…”) • t = t1.union(t2).map(…) • t.saveAsHadoopFiles(…) • t.map(…).foreach(…) • t.filter(…).foreach(…) • [Diagram: DStream graphs for the programs above; the legend includes Twitter Input DStream, Mapped DStream and Foreach DStream (a dummy DStream signifying an output operation)]
  • 14. DStream Graph -> RDD Graphs -> Spark Jobs • Every interval, an RDD graph is computed from the DStream graph • For each output operation, a Spark action is created • For each action, a Spark job is created to compute it • [Diagram: a DStream graph and the corresponding RDD graph, whose block RDDs hold the data received in the last batch interval; the example yields 3 Spark jobs]
  • 15. Execution Model – Job Scheduling • [Diagram: the Spark Streaming + Spark driver, showing the Network Input Tracker, DStream Graph, Job Scheduler, Job Manager with its job queue, and Spark’s schedulers; block IDs, RDDs and jobs pass between them; jobs are executed on the Spark worker nodes, where Block Managers hold the data received]
  • 16. RDD Checkpointing • Saving an RDD to HDFS to prevent the RDD graph from growing too large • Done internally in Spark, transparent to the user program • Done lazily: saved to HDFS the first time it is computed • red_rdd.checkpoint() • [Diagram: the contents of red_rdd are saved to an HDFS file, transparent to all child RDDs]
  • 17. RDD Checkpointing • Stateful DStream operators can have infinite lineages • Large lineages lead to … • Large closure of the RDD object → large task sizes → high task launch times • High recovery times under failure • Periodic RDD checkpointing solves this • Useful for iterative Spark programs as well • [Diagram: data and states at t-1, t, t+1, t+2, t+3, with states periodically saved to HDFS]
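Why periodic checkpointing bounds recovery work can be shown with a toy model. The sketch below is a simplification under stated assumptions, not Spark's internals: it treats the lineage of a stateful operator as a single counter that grows by one per batch and is cut back to zero whenever the state is checkpointed. `lineageLengths` is a hypothetical helper.

```scala
// Toy model of lineage growth for a stateful stream (no Spark dependency).
object LineageSketch {
  // Lineage length of the state after each batch when the state is checkpointed
  // every `checkpointEvery` batches. checkpointEvery = 0 means "never checkpoint".
  def lineageLengths(numBatches: Int, checkpointEvery: Int): Seq[Int] =
    (1 to numBatches).scanLeft(0) { (depth, batch) =>
      if (checkpointEvery > 0 && batch % checkpointEvery == 0) 0 // lineage is cut at a checkpoint
      else depth + 1                                             // one more step to replay on failure
    }.tail
}
```

Without checkpointing the lineage grows without bound, while with a checkpoint every N batches it never exceeds N - 1 steps in this model. In Spark Streaming itself, stateful operators need a checkpoint directory set on the StreamingContext via `ssc.checkpoint(directory)`.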
  • 18. Performance Tuning • Increase read parallelism • Increase downstream processing parallelism • Achieve a stable configuration that can sustain the streaming workload • Optimize for low latency • Tune memory settings and explore GC options • Achieve fault-tolerance • Serialize the objects efficiently
  • 19. Analytics transforms the business • Institutionalization • Real time • Data • Sophistication • Disruption • Sharpen the saw • Support strategic decisions • Achieve breakthrough innovation • Observe everything • Fuse external data • Leverage unstructured data • Incorporate a “feedback” loop • Explore AI • Leverage unsupervised methods • Build a data-driven culture • Do systematic experimentation • Forge a multidisciplinary team • Operationalize decisions • Reduce decision latency • Increase contextual relevance
  • 21. Machine Learning • Machine learning deals with the “construction and study of systems that can learn from data” • It is seen as a building block for making computers behave more intelligently • Two phases in the learning process: training & testing • Two kinds of learning • Unsupervised • No labels in the training data • Algorithms detect the patterns in the data and group observations with similar characteristics together • Supervised • We have training data with correct labels • Use the training data to prepare the algorithm • Then apply it to data without a correct label
  • 22. Some types of algorithms • Prediction • Predicting a variable from data • Classification • Assigning observations to pre-defined classes • Clustering • Splitting observations into groups based on similarity • Recommendation • Predicting what people might like & uncovering relationships between items
  • 23. Steps in Analytics Workload • Data collection • Pre-processing the data (cleaning & data munging) • Retrieve sample data from the actual population • Descriptive statistics on the sample data • Exploratory Data Analysis with Spark • Univariate analysis • Bivariate analysis • Missing value treatment • Outlier detection • Feature engineering • Apply machine learning models • Optimize and fine-tune the model parameters
  • 24. Sample Data • Types of labels: • Denial of service (DoS, attack) • Normal • Probe (attack) • R2L (attack) • U2R (attack)
  • 25. Unsupervised Learning - Clustering • Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. • Find areas dense with data (and also areas without data) • Anomaly: a point far from any cluster • Supervise with labels to improve and interpret the clusters
  • 26. Streaming K-means (K-means ++) • Assign points to nearest center, update centers, iterate • Goal: points close to nearest cluster center • Must choose k = number of clusters • ++ means smarter starting point
  • 27. Clustering – choosing parameters • Start by plotting a t-SNE plot, which shows the distribution of the data in 2 dimensions; it helps to identify whether the data can be clustered • Normalize the data before applying k-means, i.e. standardize the scores • Choose the value of k using the elbow method or PCA • Convert categorical variables to numeric using one-hot encoding • [Figures: t-SNE plot; elbow plot]
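The standardization formula on this slide is not preserved in the transcript. Assuming it is the usual z-score, (x - mean) / standard deviation applied per feature, a minimal plain-Scala sketch looks like this. `standardize` is a hypothetical helper, and the use of the population standard deviation is also an assumption.

```scala
// Column-wise z-score standardization (plain Scala, no Spark dependency).
object StandardizeSketch {
  // Each row is one observation; each array position is one feature.
  // Uses the population standard deviation (divide by n), an assumption of this sketch.
  // Constant columns (standard deviation 0) are mapped to 0.0 to avoid division by zero.
  def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.size.toDouble
    val dims = rows.head.length
    val means = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)
    val stds = Array.tabulate(dims) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - means(j), 2)).sum / n)
    }
    rows.map { r =>
      Array.tabulate(dims)(j => if (stds(j) == 0.0) 0.0 else (r(j) - means(j)) / stds(j))
    }
  }
}
```

Standardizing matters for k-means because the algorithm relies on Euclidean distance, so a feature with a large numeric range would otherwise dominate the cluster assignment.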
  • 28. Streaming k-means • Approach: • Start with k cluster centres • For every incoming batch of data, the centroids are updated • The clusters drift over time and, after a certain stage, stabilize • The model continuously learns new data patterns • Outliers are detected as anomalies • Pros: • Most useful when the data points have no labels associated with them • Simple to implement • Cons: • Not well suited to high-dimensional data • [Diagram: Network Data, Kafka, Streaming K-means]
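The per-batch centroid update can be sketched in plain Scala. The slide does not state the update rule, so this sketch assumes a decay-weighted running mean of the kind described in Spark MLlib's streaming k-means documentation; `Cluster`, `update`, `isAnomaly` and the distance threshold are hypothetical names and parameters introduced for illustration.

```scala
// Plain-Scala sketch of a streaming k-means update step (no Spark dependency).
object StreamingKMeansSketch {
  type Point = Array[Double]
  // weight = (discounted) number of points absorbed by this cluster so far.
  final case class Cluster(center: Point, weight: Double)

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the cluster whose centre is closest to p.
  def nearest(clusters: Seq[Cluster], p: Point): Int =
    clusters.indices.minBy(i => sqDist(clusters(i).center, p))

  // One update for one incoming batch.
  // decay = 1.0 keeps all history; decay = 0.0 forgets it and uses only the new batch.
  def update(clusters: Seq[Cluster], batch: Seq[Point], decay: Double): Seq[Cluster] = {
    val assigned = batch.groupBy(p => nearest(clusters, p))
    clusters.zipWithIndex.map { case (c, i) =>
      assigned.get(i) match {
        case None => c.copy(weight = c.weight * decay) // no new points: only discount history
        case Some(points) =>
          val m = points.size.toDouble
          val batchMean = c.center.indices.map(j => points.map(_(j)).sum / m).toArray
          val oldW = c.weight * decay
          // Weighted mean of the old centre and the mean of the newly assigned points.
          val newCenter = c.center.indices
            .map(j => (c.center(j) * oldW + batchMean(j) * m) / (oldW + m))
            .toArray
          Cluster(newCenter, oldW + m)
      }
    }
  }

  // Flag a point as an anomaly when it lies farther than `threshold` from every centre.
  def isAnomaly(clusters: Seq[Cluster], p: Point, threshold: Double): Boolean =
    math.sqrt(sqDist(clusters(nearest(clusters, p)).center, p)) > threshold
}
```

Calling `update` once per batch reproduces the behaviour described on the slide: centres drift while the data distribution changes and settle once it stops changing. Choosing the anomaly threshold is left open here; it depends on the data.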
  • 29. “Offline” vs “Online” algorithms • Offline Learning: • Build models on static data • Train algorithms on “batches” of data • Use the model to make predictions on the incoming data stream • Pros: • Easy to analyze • High accuracy • Batch algorithms are quite accessible • Cons: • Unable to identify dynamic patterns • Online Learning: • Build the model on a live stream of data • Training happens continuously on live data • Use the model to both predict and learn on streaming data • Pros: • Model evolves continuously • Identifies rapidly changing patterns in the data • Cons: • Streaming algorithms are not widely available • Active area of research
  • 30. Streaming SVM • In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis • Capable of reflecting changes in the dataset in real time • SVM is robust to noise • It uses a high-dimensional space to separate the dataset • Prediction rate can be increased by scaling the Spark cluster
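The slide does not say how the streaming SVM is trained. One common way to train a linear SVM incrementally is stochastic sub-gradient descent on the regularized hinge loss, updating the weights as each batch arrives. The plain-Scala sketch below illustrates that approach only; it is an assumption, not the deck's implementation, and `trainOnBatch`, `stepSize` and `regParam` are hypothetical names. The bias term is omitted for brevity.

```scala
// Plain-Scala sketch of an online linear SVM (no Spark dependency).
object OnlineSvmSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // One pass of stochastic sub-gradient descent on the regularized hinge loss
  // over a single batch of (features, label) pairs. Labels must be +1.0 or -1.0.
  def trainOnBatch(
      weights: Array[Double],
      batch: Seq[(Array[Double], Double)],
      stepSize: Double,
      regParam: Double): Array[Double] =
    batch.foldLeft(weights) { case (w, (x, y)) =>
      val margin = y * dot(w, x)
      w.indices.map { j =>
        val shrunk = w(j) * (1.0 - stepSize * regParam) // L2 regularization step
        // Only examples inside the margin (or misclassified) push the weights.
        if (margin < 1.0) shrunk + stepSize * y * x(j) else shrunk
      }.toArray
    }

  // Predict +1.0 or -1.0 from the sign of the decision function.
  def predict(weights: Array[Double], x: Array[Double]): Double =
    if (dot(weights, x) >= 0.0) 1.0 else -1.0
}
```

Feeding each incoming batch to `trainOnBatch` and reusing the returned weights lets the model follow changes in the data, while `predict` can be applied to the same stream in parallel across a cluster.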
  • 31. Example of a Real-time Analytics environment