SlideShare a Scribd company logo
1 of 20
Streaming Analytics
Ashish Gupta (LinkedIn)
Neera Agarwal
What is a Data Stream
● Unbounded Data
● Data arriving continuously at high rate
● Too large to first store and then process
● Need to be processed in one pass
● May display Temporal Locality - patterns may evolve over time
Contrast with Batch Processing
1. Process Bounded Files. Files by ingestion time. Last 15 minutes, last 1 hour,
last 1 day
2. Program can go back and forth in data. Do multipass processing.
3. Sessions and joins can span files.
4. Many machine learning algorithms need full batch of data to train.
5. Very high Latency, but very high throughput as well.
a. Wait for files to arrive. Ie wait for file window to close.
b. Processing the whole file(s) take time.
Streaming Applications
● Joining Clicks and Impressions
● Mobile applications - User activity
● Session based analysis
● Fraud detection
● Industrial IOT
● LinkedIn’s Streaming Standardization Platform
Lambda Architecture
New
Data
Streaming
System
Batch System
(Hadoop)
Application
Speed DB
Batch DB
Speed Layer
Batch Layer
Streaming
System
Streaming Systems Architecture
New
Data
Streaming
System
Hadoop
Store and Serve
SQL and
NoSql DB
Sources
Kafka
Real-time
Consumers
Process and Analyze
Consumers
Kafka
New
Data
New
Data
Save Log
Spark
Flink
Storm
Samza
Kafka Streams
Consumers
What is Streaming Analytics
“ Continuous processing on unbounded data”
“Software that can filter, aggregate, enrich, and analyze a high throughput of data
from multiple disparate live data sources and in any data format to identify simple
and complex patterns to visualize business in real-time, detect urgent situations,
and automate immediate actions.” - Forrester
Streaming Concepts
Time Window Order Correctness
Delayed data
Out of order data
Event Time
Processing Time
Fixed Window
Sliding Window
Sessions
Consistency
At least Once
Exactly Once
Checkpointing
Streaming Concepts - Time
New
Data
Streaming
System
Kafka
New
Data
New
Data
Event Time Ingestion Time
Processing
Time
Streaming Concepts - Windows
Fixed Window/
Tumbling Window
Sliding Window Session Window
Kafka
Streams
Processing Model Mini Batch Event level Event level Event level Event level
Guarantee Exactly Once Exactly Once At least once At least once At least once
State Management Yes Yes No Yes Yes
Latency Medium Low Low Low Low
Built in primitives Batch and
streaming
Batch and
streaming
Low Level API Low level
API
Streaming
only
Back Pressure Yes Yes No via Kafka via Kafka
Open Source Streaming Systems
Case Study: LinkedIn Standardization Platform
User edits
profile
Company
Geo
Title
Education
Seniority
….
Search
Feed
…..
Job
Recommendation
Standardizer
LinkedIn Profile
Pattern : External Lookup/Stream to Table Join
Decide based on size of the data, latency needs and QPS of external systems
Streaming
System Service
External
Cache
External
DB
Internal
Cache
Pattern: Stream to Stream Joining
● Joins are expensive
● If partitions of two streams not collocated, then expensive shuffle
● Broadcast join if one file is small
Pattern: Reprocessing
Streaming
System (Samza)
Hadoop
SQL and
NoSql DB
Real-time
Consumers
Consumers
Title, Geo, Company
Education …..
DB Log
Events
DB
Snapshot
Kafka
KafkaDatabus
Databus
Standardizer
with ML Models Kafka
Pattern: Reprocessing
Streaming
System (Samza)
Hadoop
SQL and
NoSql DB
Real-time
Consumers
Consumers
Title, Geo, Company
Education …..
DB Log
Events
DB
Snapshot
Kafka
KafkaDatabus
Databus
rewind 2 hours
Standardizer
with ML Models Kafka
Apache Kafka
● Highly scalable messaging system
● Distributed commit log
● Developed in LinkedIn back in 2010
● At LinkedIn - more than 1.4 trillion messages
per day across over 1400 brokers
● Distributed, partitioned, replicated
● Message retention - based on time and size
Some Kafka use cases
● Queuing/Messaging
● Metrics
● Auditing
● Logging
Producer
Producer
Consumer
Consumer
Consumer
Broker
Broker
Broker
Broker
Kafka Cluster
Send messages Fetch messages
Producer
References
● MillWheel: http://research.google.com/pubs/pub41378.html
● DataFlow:http://research.google.com/pubs/pub43864.html
● Samza: http://samza.apache.org/
● Spark Streaming Paper: Discretized streams
● Big Data Stream Mining (KDD’15)
● Models and Issues in Data Stream Systems
Contact Us
Ashish Gupta - ahgupta@linkedin.com
https://www.linkedin.com/in/guptash
Neera Agarwal - neera8work@gmail.com
https://www.linkedin.com/in/neera-agarwal-21b9473

More Related Content

What's hot

Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Spark Summit
 

What's hot (20)

Fast Data: A Customer’s Journey to Delivering a Compelling Real-Time Solution
Fast Data: A Customer’s Journey to Delivering a Compelling Real-Time SolutionFast Data: A Customer’s Journey to Delivering a Compelling Real-Time Solution
Fast Data: A Customer’s Journey to Delivering a Compelling Real-Time Solution
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
Stateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory SpeedStateful Stream Processing at In-Memory Speed
Stateful Stream Processing at In-Memory Speed
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12cProcessing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache Spark
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms compared
 
Monitoring and Troubleshooting a Real Time Pipeline
Monitoring and Troubleshooting a Real Time PipelineMonitoring and Troubleshooting a Real Time Pipeline
Monitoring and Troubleshooting a Real Time Pipeline
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 

Viewers also liked

Viewers also liked (20)

Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processing
 
The end of polling : why and how to transform a REST API into a Data Streamin...
The end of polling : why and how to transform a REST API into a Data Streamin...The end of polling : why and how to transform a REST API into a Data Streamin...
The end of polling : why and how to transform a REST API into a Data Streamin...
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
KDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialKDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics Tutorial
 
Real Time Analytics with Apache Cassandra - Cassandra Day Munich
Real Time Analytics with Apache Cassandra - Cassandra Day MunichReal Time Analytics with Apache Cassandra - Cassandra Day Munich
Real Time Analytics with Apache Cassandra - Cassandra Day Munich
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 
RBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at KingRBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at King
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
 
Real-time analytics as a service at King
Real-time analytics as a service at King Real-time analytics as a service at King
Real-time analytics as a service at King
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsData Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operations
 
Stream Analytics in the Enterprise
Stream Analytics in the EnterpriseStream Analytics in the Enterprise
Stream Analytics in the Enterprise
 
Reliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoTReliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoT
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream Processing
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 

Similar to Streaming Analytics

EQR Reporting: Rails + Amazon EC2
EQR Reporting:  Rails + Amazon EC2EQR Reporting:  Rails + Amazon EC2
EQR Reporting: Rails + Amazon EC2
jeperkins4
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Tim Vaillancourt
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
Renato Lucindo
 

Similar to Streaming Analytics (20)

[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale Systems
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
EQR Reporting: Rails + Amazon EC2
EQR Reporting:  Rails + Amazon EC2EQR Reporting:  Rails + Amazon EC2
EQR Reporting: Rails + Amazon EC2
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 

Recently uploaded

一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
cyebo
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 

Recently uploaded (20)

一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 

Streaming Analytics

  • 1. Streaming Analytics Ashish Gupta (LinkedIn) Neera Agarwal
  • 2. What is a Data Stream ● Unbounded Data ● Data arriving continuously at high rate ● Too large to first store and then process ● Need to be processed in one pass ● May display Temporal Locality - patterns may evolve over time
  • 3. Contrast with Batch Processing 1. Process Bounded Files. Files by ingestion time. Last 15 minutes, last 1 hour, last 1 day 2. Program can go back and forth in data. Do multipass processing. 3. Sessions and joins can span files. 4. Many machine learning algorithms need full batch of data to train. 5. Very high Latency, but very high throughput as well. a. Wait for files to arrive. Ie wait for file window to close. b. Processing the whole file(s) take time.
  • 4. Streaming Applications ● Joining Clicks and Impressions ● Mobile applications - User activity ● Session based analysis ● Fraud detection ● Industrial IOT ● LinkedIn’s Streaming Standardization Platform
  • 6. Streaming System Streaming Systems Architecture New Data Streaming System Hadoop Store and Serve SQL and NoSql DB Sources Kafka Real-time Consumers Process and Analyze Consumers Kafka New Data New Data Save Log Spark Flink Storm Samza Kafka Streams Consumers
  • 7. What is Streaming Analytics “ Continuous processing on unbounded data” “Software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple disparate live data sources and in any data format to identify simple and complex patterns to visualize business in real-time, detect urgent situations, and automate immediate actions.” - Forrester
  • 8. Streaming Concepts Time Window Order Correctness Delayed data Out of order data Event Time Processing Time Fixed Window Sliding Window Sessions Consistency At least Once Exactly Once Checkpointing
  • 9. Streaming Concepts - Time New Data Streaming System Kafka New Data New Data Event Time Ingestion Time Processing Time
  • 10. Streaming Concepts - Windows Fixed Window/ Tumbling Window Sliding Window Session Window
  • 11. Kafka Streams Processing Model Mini Batch Event level Event level Event level Event level Guarantee Exactly Once Exactly Once At least once At least once At least once State Management Yes Yes No Yes Yes Latency Medium Low Low Low Low Built in primitives Batch and streaming Batch and streaming Low Level API Low level API Streaming only Back Pressure Yes Yes No via Kafka via Kafka Open Source Streaming Systems
  • 12. Case Study: LinkedIn Standardization Platform User edits profile Company Geo Title Education Seniority …. Search Feed ….. Job Recommendation Standardizer LinkedIn Profile
  • 13. Pattern : External Lookup/Stream to Table Join Decide based on size of the data, latency needs and QPS of external systems Streaming System Service External Cache External DB Internal Cache
  • 14. Pattern: Stream to Stream Joining ● Joins are expensive ● If partitions of two streams not collocated, then expensive shuffle ● Broadcast join if one file is small
  • 15. Pattern: Reprocessing Streaming System (Samza) Hadoop SQL and NoSql DB Real-time Consumers Consumers Title, Geo, Company Education ….. DB Log Events DB Snapshot Kafka KafkaDatabus Databus Standardizer with ML Models Kafka
  • 16. Pattern: Reprocessing Streaming System (Samza) Hadoop SQL and NoSql DB Real-time Consumers Consumers Title, Geo, Company Education ….. DB Log Events DB Snapshot Kafka KafkaDatabus Databus rewind 2 hours Standardizer with ML Models Kafka
  • 17. Apache Kafka ● Highly scalable messaging system ● Distributed commit log ● Developed in LinkedIn back in 2010 ● At LinkedIn - more than 1.4 trillion messages per day across over 1400 brokers ● Distributed, partitioned, replicated ● Message retention - based on time and size
  • 18. Some Kafka use cases ● Queuing/Messaging ● Metrics ● Auditing ● Logging Producer Producer Consumer Consumer Consumer Broker Broker Broker Broker Kafka Cluster Send messages Fetch messages Producer
  • 19. References ● MillWheel: http://research.google.com/pubs/pub41378.html ● DataFlow:http://research.google.com/pubs/pub43864.html ● Samza: http://samza.apache.org/ ● Spark Streaming Paper: Discretized streams ● Big Data Stream Mining (KDD’15) ● Models and Issues in Data Stream Systems
  • 20. Contact Us Ashish Gupta - ahgupta@linkedin.com https://www.linkedin.com/in/guptash Neera Agarwal - neera8work@gmail.com https://www.linkedin.com/in/neera-agarwal-21b9473