SlideShare a Scribd company logo
1 of 17
Download to read offline
Streaming	Architecture	for	
Predictive	Modeling
Keira	Zhou
Data	Engineer	@	Capital	One	Labs
#whoami
• Data	Engineer	@	Capital	One	Labs
• Previous:	
• Fellow	@	Insight	Data	Engineering
• BS	+	MS	in	Systems	Engineering	@UVA
• Github Repo:	
• https://github.com/keiraqz/SparkModeling
1
Batch	vs.	Streaming
• Batch
• Run	a	process	in	a	scheduled	way
• Good	for	computation	on	huge	amount	of	historical	data
• Complex	analysis,	hard	to	compute	in	real	time
• Streaming
• Continuous
• “Life	as	it	happens”:	process	data	as	it	comes	in
• Traffic	information
• Heart	beat	monitoring
• Capital	One:	double	swipe
2
Task	and	Dataset
• Task:	Predict	Phishing	website
• e.g.	Your	password	for	x	website	is	expiring,	update	at:	http://abc.com/update
• Predict	if	a	website	URL	is	a	phishing	website
• Phishing	Websites	Data	Set
• From	UCI:	https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#
• Extracted	features	+	Labeled	data	points
3
Feature	Selection
• Description	provided	in	the	dataset
• Some	examples
• Using	URL	Shortening	Services	“TinyURL”:	
• bit.ly/19DXSk4
• Age	of	Domain:	
• minimum	age	of	the	legitimate	domain	is	6	months
• Adding	Prefix	or	Suffix	Separated	by	(-)	to	the	Domain:
• http://www.confirme-paypal.com/
4
Train	and	Predict
Pre-trained	Model Online	Model	Updating Online	Prediction
Method	1 YES NO YES
Method 2 NO YES YES
Method	3 YES YES YES
5
Demo:	Method	2	&	Method	3
Online	updating	with	and	without	a	pre-trained	model
6
Backend
7
Train	on	Historical	Data
• Classification	problem
• Logistic	Regression	with	Stochastic	Gradient	Descent
8
Code	Snippet	#1
9
• Spark	2.0	feature:	Save	model	to	file
• Load	model	file
Predict	&	Update
• Streaming	Logistic	Regression	Model	with	Stochastic	Gradient	Descent
10
Build	the	Streaming	Pipeline
11
Code	Snippet	#2
12
• Kafka	->	Spark	Streaming:	Spark	Streaming	Kafka	connector
• Dependency:	spark-streaming-kafka-0-8_2.11-2.0.1.jar
• Use	the	right	version
• Make	sure	you	add	the	jar	and	its	dependencies	to	your	project
Parallelism	in	Spark
• Parallelism	in	Data	Receiving
• For	example,	split	Kafka	input	DStream based	on	topics	
• One	Kafka	Input	stream	for	one	topic	to	receive	data	in	parallel
• numStreams =	5	
• kafkaStreams =	[KafkaUtils.createStream(...)	for _	in range	(numStreams)]
• Parallelism	in	Data	Processing
• conf =	SparkConf().setMaster("local[2]")
• Meaning	- 2	threads:	“minimal”	parallelism
• Spark	Cluster:	YARN,	Mesos,	Standalone
13
Code	Snippet	#3
14
• Spark	Streaming	->	Redis:	Python	Redis connector	(pip	install	redis)
Summary
15
Data	Science Data	Engineering
• Find	the	right	features
• Get	labeled	data
• Manual	labeling
• Tuning	Spark
• #	of	drivers
• #	of	executors
• Memory
• level	of	parallelism
• Spark	Programming
• Spark	connector	or	
Python	connector
• Use	MLlib
Questions?
16

More Related Content

What's hot

Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
 Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr... Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
Databricks
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Spark Summit
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Databricks
 

What's hot (20)

Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
 Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr... Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
Using Spark-Solr at Scale: Productionizing Spark for Search with Apache Solr...
 
Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016
 
H2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFH2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SF
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
 
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
Productionizing H2O Models with Apache Spark with Jakub Hava and Michal Maloh...
 
AI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analyticsAI from your data lake: Using Solr for analytics
AI from your data lake: Using Solr for analytics
 
Building an ML Tool to predict Article Quality Scores using Delta & MLFlow
Building an ML Tool to predict Article Quality Scores using Delta & MLFlowBuilding an ML Tool to predict Article Quality Scores using Delta & MLFlow
Building an ML Tool to predict Article Quality Scores using Delta & MLFlow
 
H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
 
Intro to H2O Machine Learning in Python - Galvanize Seattle
Intro to H2O Machine Learning in Python - Galvanize SeattleIntro to H2O Machine Learning in Python - Galvanize Seattle
Intro to H2O Machine Learning in Python - Galvanize Seattle
 
Spark Summit East Keynote by Anjul Bhambhri
Spark Summit East Keynote by Anjul BhambhriSpark Summit East Keynote by Anjul Bhambhri
Spark Summit East Keynote by Anjul Bhambhri
 
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
H2O World - Survey of Available Machine Learning Frameworks - Brendan HergerH2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics tools
 
Open Source DataViz with Apache Superset
Open Source DataViz with Apache SupersetOpen Source DataViz with Apache Superset
Open Source DataViz with Apache Superset
 

Viewers also liked

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 

Viewers also liked (10)

Enterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for SearchEnterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for Search
 
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
 
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 

Similar to Online Model Updating with Spark Streaming

Evolution of integration and microservices patterns with service mesh
Evolution of integration and microservices patterns with service meshEvolution of integration and microservices patterns with service mesh
Evolution of integration and microservices patterns with service mesh
Christian Posta
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
Christopher Whitaker
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
sfbiganalytics
 

Similar to Online Model Updating with Spark Streaming (20)

Evolution of integration and microservices patterns with service mesh
Evolution of integration and microservices patterns with service meshEvolution of integration and microservices patterns with service mesh
Evolution of integration and microservices patterns with service mesh
 
Microservices and Integration: what's next with Istio service mesh
Microservices and Integration: what's next with Istio service meshMicroservices and Integration: what's next with Istio service mesh
Microservices and Integration: what's next with Istio service mesh
 
Evolution of integration and microservices patterns with service mesh
Evolution of integration and microservices patterns with service meshEvolution of integration and microservices patterns with service mesh
Evolution of integration and microservices patterns with service mesh
 
AzureSynapse.pptx
AzureSynapse.pptxAzureSynapse.pptx
AzureSynapse.pptx
 
Tracking and business intelligence
Tracking and business intelligenceTracking and business intelligence
Tracking and business intelligence
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWS
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Azure saturday pn 2018
Azure saturday pn 2018Azure saturday pn 2018
Azure saturday pn 2018
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
Data visualization in JavaScript
Data visualization in JavaScriptData visualization in JavaScript
Data visualization in JavaScript
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
 
Asynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per second
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
The databases in SSDT: A work with project and best practices
The databases in SSDT: A work with project and best practicesThe databases in SSDT: A work with project and best practices
The databases in SSDT: A work with project and best practices
 
The Path Through SharePoint Migrations
The Path Through SharePoint MigrationsThe Path Through SharePoint Migrations
The Path Through SharePoint Migrations
 
CCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysisCCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysis
 

Recently uploaded

Recently uploaded (20)

AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptx
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 

Online Model Updating with Spark Streaming