SlideShare a Scribd company logo
1 of 31
Download to read offline
Using GraphX/Pregel on Browsing
History to Discover Purchase Intent
Zhang, Lisa
Rubicon Project Buyer Cloud
Problem
• Identify possible new customers for
our advertisers using intent data,
one of which is browsing history
travel-site-101.com, spark-summit.org
Challenges
Sites are numerous
and ever-changing
Need to build one
model per advertiser
Positive training cases
are sparse
Models run frequently:
every few hours
Offline Evaluation Metrics
• AUC: area under ROC
curve
• Precision at top 5% of
score: model used to
identify top users only
• Baseline: Previous
solution prior to Spark
Linear Dimensionality Reduction
SVDINPUT GBT OUTPUT
per advertiser
Dimension Reduction Classification
Evaluation
SVD: Top Sites
Home Improvement Advertiser
deal-site-101.com
chat-site-001.com
ecommerce-site-001.com
chat-site-002.com
invitation-site-001.com
classified-site-001.com
Telecom Advertiser
developer-forum-001.com
chat-site-001.com
invitation-site-001.com
deal-site-101.com
college-site-001.com
chat-site-002.com
The Issue with SVDs
• Dominated by the same signal across all
advertisers
• Identify online buyers, but not those
specific to each advertiser
• Not appropriate for our use case
SVD per Advertiser?
SVDINPUT GBT OUTPUT
per advertiser
Dimension Reduction Classification
Non-linear Approaches?
Too Complex:

Cannot run frequently,
we become slow to learn
about new sites
Too Simple:

Possibly same
problem as SVD
Speed
Complexity
Can We Simplify?
Intuition:

Given a known positive training case,
target other users that have similar site
history as the current user.
One natural way is to treat sites as a graph.
Sites as Graphs
• Easy to interpret
• Easy to
visualize
• Graph algos
well studied
Spark GraphX
• Spark’s API for parallel graph computations
• Comes with some common graph
algorithms
• API for developing new graph algorithms:
e.g. via pregel
Pregel API
• Pass messages from vertices to other, typically
adjacent, vertices: “Think like a vertex”
• Define an algorithm by stating:
how to send messages

how to merge multiple messages

how to update a vertex with message
repeat
Propagation Based Approach
• Pass positive
(converter)
information
across edges
• Give credit to
“similar” sites
Example Scenario
travel-site-101.com
book-my-travel-103.com
canoe-travel-102.com
1 converter / 40,000 visitors
0 converter / 48,000 visitors
0 converter / 41,000 visitors
Sending Messages
ω = 1/40,000
Δω = ω * edge_weight
Δω = ω * edge_weight
canoe-travel-102.com book-my-travel-103.com
travel-site-101.com
Receiving Messages
Δω1
hawaii-999.com
…
Δω2
Δωn
ωnew = ωold + λ • Σ Δωi
canoe-travel-102.com
travel-site-101.com
Weights After One Iteration
book-my-travel-103.com
canoe-travel-102.com
2.5 x 10^(-5)
1.2 x 10^(-5)
0.8 x 10^(-5)
travel-site-101.com
Simplified Code
type MT = Double; type ED = Double; type VD = Double

val lambda = …; val maxIterations = …

val initialMsg = 0.0



def updateVertex(id: VertexId, w: VD, delta_w: MT): VD =

w + lambda * delta_w

def sendMessage(edge: EdgeTriplet[VD, ED]): Iterator[(VertexId, MT)] = {

Iterator((edge.srcId, edge.attr * edge.dstAttr),

(edge.dstId, edge.attr * edge.srcAttr))

}

def mergeMsgs(w1: MT, w2: MT): MT = x + y



val graph: Graph[VD, ED] = …

graph.pregel(initialMessage, maxIterations, EdgeDirection.out)(

updateVertex, sendMessage, mergeMessage)
Model Output & Application
• Model output is a
mapping of sites to
final scores
• To apply the model,
aggregate scores of
sites visited by user
SITE SCORE
travel-site-101.com 0.5
canoe-travel-102.com 0.4
sport-team-101.com 0.1
… …
Other Factors
• Edge Weights: Cosine Similarity, Jaccard Index,
Conditional Probability
• Edge/Vertex Removal: Remove sites and edges on
the long-tail
• Hyper parameter Tuning: lambda, numIterations
and others through testing (there is no convergence)
Evaluation
Propagation: Top Sites
Home Improvement Advrt.
label-maker-101.com
laptop-bags-101.com
renovations-101.com
fitness-equipment-101.com
renovations-102.com
buy-realestate-101.com
Telecom Advertiser
canada-movies-101.ca
canadian-news-101.ca
canadian-jobs-101.ca
canadian-teacher-rating-101.ca
watch-tv-online.com
phone-system-review-101.com
Canadian
Telecom
Renovations
Challenges (from earlier)
Sites are numerous
and ever-changing
Need to build one
model per advertiser
Positive training cases
are sparse
Models run frequently:
every few hours
Resolutions
Graph built just in
time for training
Need to build one
model per advertiser
Positive training cases
are sparse
Models run frequently:
every few hours
Resolutions
Graph built just in
time for training
Graph built once;
propagation runs per
advertiser
Positive training cases
are sparse
Models run frequently:
every few hours
Resolutions
Graph built just in
time for training
Graph built once;
propagation runs per
advertiser
Propagation resolves
sparsity: intuitive and
interpretable
Models run frequently:
every few hours
Resolutions
Graph built just in
time for training
Graph built once;
propagation runs per
advertiser
Propagation resolves
sparsity: intuitive and
interpretable
Evaluating users fast;
does not require GraphX
General Spark Learnings
• Many small jobs > one large job: We split big jobs into multiple smaller,
concurrent, jobs and increased throughput (more jobs could run
concurrently).
• Serialization: Don’t save SparkContext as a member variable, define Python
classes in a separate file, check if your object serializes/deserializes well!
• Use rdd.reduceByKey() and others over rdd.groupByKey().
• Be careful with rdd.coalesce() vs rdd.repartition(), rdd.partitionBy() can be
your friend in the right circumstances.
THANK YOU.
lzhang@rubiconproject.com

More Related Content

What's hot

Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Spark Summit
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Databricks
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 

What's hot (20)

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
 
Tailored for Spark
Tailored for SparkTailored for Spark
Tailored for Spark
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min Qiu
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 
MLeap: Release Spark ML Pipelines
MLeap: Release Spark ML PipelinesMLeap: Release Spark ML Pipelines
MLeap: Release Spark ML Pipelines
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflow
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
 
AI made easy with Flink AI Flow
AI made easy with Flink AI FlowAI made easy with Flink AI Flow
AI made easy with Flink AI Flow
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
 
MLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using SparkMLeap: Productionize Data Science Workflows Using Spark
MLeap: Productionize Data Science Workflows Using Spark
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&M
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer Nori
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 

Viewers also liked

Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Spark Summit
 
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Spark Summit
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Spark Summit
 
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Spark Summit
 
Spark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya BidaSpark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya Bida
Spark Summit
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Spark Summit
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Spark Summit
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Spark Summit
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Spark Summit
 

Viewers also liked (20)

Pregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph ProcessingPregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph Processing
 
Spark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren NathanSpark Summit Keynote by Suren Nathan
Spark Summit Keynote by Suren Nathan
 
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaTime Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy Ryza
 
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by S...
 
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
Monte Carlo Simulations in Ad-Lift Measurement Using Spark by Prasad Chalasan...
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu VatsBuilding a Recommendation Engine Using Diverse Features by Divyanshu Vats
Building a Recommendation Engine Using Diverse Features by Divyanshu Vats
 
Data Scientist Workbench 入門
Data Scientist Workbench 入門Data Scientist Workbench 入門
Data Scientist Workbench 入門
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui MengGeneralized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
Generalized Linear Models in Spark MLlib and SparkR by Xiangrui Meng
 
Spark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya BidaSpark Tuning for Enterprise System Administrators By Anya Bida
Spark Tuning for Enterprise System Administrators By Anya Bida
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
Spark Summit Keynote by Shaun Connolly
Spark Summit Keynote by Shaun ConnollySpark Summit Keynote by Shaun Connolly
Spark Summit Keynote by Shaun Connolly
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida Ha
 
GraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkGraphX and Pregel - Apache Spark
GraphX and Pregel - Apache Spark
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 

Similar to Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Zhang

Pragmatic deep learning for image labelling
Pragmatic deep learning for image labellingPragmatic deep learning for image labelling
Pragmatic deep learning for image labelling
Pierre Gutierrez
 
In-Memory Data Grids Essentials. Oracle Coherence
In-Memory Data Grids Essentials. Oracle CoherenceIn-Memory Data Grids Essentials. Oracle Coherence
In-Memory Data Grids Essentials. Oracle Coherence
Vladimir Kondrashchenko
 
AWS Cost Optimization
AWS Cost OptimizationAWS Cost Optimization
AWS Cost Optimization
Miles Ward
 

Similar to Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Zhang (20)

AWS Activate Webinar - Growing on AWS
AWS Activate Webinar - Growing on AWSAWS Activate Webinar - Growing on AWS
AWS Activate Webinar - Growing on AWS
 
Ed Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual AnalysisEd Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual Analysis
 
Making sense of the Graph Revolution
Making sense of the Graph RevolutionMaking sense of the Graph Revolution
Making sense of the Graph Revolution
 
A Swarm of Ads
A Swarm of AdsA Swarm of Ads
A Swarm of Ads
 
Vpon - 廣告效果導向為基礎的行動廣告系統
Vpon - 廣告效果導向為基礎的行動廣告系統Vpon - 廣告效果導向為基礎的行動廣告系統
Vpon - 廣告效果導向為基礎的行動廣告系統
 
presentation slides
presentation slidespresentation slides
presentation slides
 
From Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender systemFrom Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender system
 
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
Pragmatic deep learning for image labelling
Pragmatic deep learning for image labellingPragmatic deep learning for image labelling
Pragmatic deep learning for image labelling
 
A Framework to Measure and Maximize Cloud ROI
A Framework to Measure and Maximize Cloud ROIA Framework to Measure and Maximize Cloud ROI
A Framework to Measure and Maximize Cloud ROI
 
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
 
In-Memory Data Grids Essentials. Oracle Coherence
In-Memory Data Grids Essentials. Oracle CoherenceIn-Memory Data Grids Essentials. Oracle Coherence
In-Memory Data Grids Essentials. Oracle Coherence
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
 
Response prediction for display advertising - WSDM 2014
Response prediction for display advertising - WSDM 2014Response prediction for display advertising - WSDM 2014
Response prediction for display advertising - WSDM 2014
 
21 Experiments to Increase Throughput
21 Experiments to Increase Throughput21 Experiments to Increase Throughput
21 Experiments to Increase Throughput
 
AWS Cost Optimization
AWS Cost OptimizationAWS Cost Optimization
AWS Cost Optimization
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 

Recently uploaded (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 

Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Zhang

  • 1. Using GraphX/Pregel on Browsing History to Discover Purchase Intent Zhang, Lisa Rubicon Project Buyer Cloud
  • 2. Problem • Identify possible new customers for our advertisers using intent data, one of which is browsing history travel-site-101.com, spark-summit.org
  • 3. Challenges Sites are numerous and ever-changing Need to build one model per advertiser Positive training cases are sparse Models run frequently: every few hours
  • 4. Offline Evaluation Metrics • AUC: area under ROC curve • Precision at top 5% of score: model used to identify top users only • Baseline: Previous solution prior to Spark
  • 5. Linear Dimensionality Reduction SVDINPUT GBT OUTPUT per advertiser Dimension Reduction Classification
  • 7. SVD: Top Sites Home Improvement Advertiser deal-site-101.com chat-site-001.com ecommerce-site-001.com chat-site-002.com invitation-site-001.com classified-site-001.com Telecom Advertiser developer-forum-001.com chat-site-001.com invitation-site-001.com deal-site-101.com college-site-001.com chat-site-002.com
  • 8. The Issue with SVDs • Dominated by the same signal across all advertisers • Identify online buyers, but not those specific to each advertiser • Not appropriate for our use case
  • 9. SVD per Advertiser? SVDINPUT GBT OUTPUT per advertiser Dimension Reduction Classification
  • 10. Non-linear Approaches? Too Complex:
 Cannot run frequently, we become slow to learn about new sites Too Simple:
 Possibly same problem as SVD Speed Complexity
  • 11. Can We Simplify? Intuition:
 Given a known positive training case, target other users that have similar site history as the current user. One natural way is to treat sites as a graph.
  • 12. Sites as Graphs • Easy to interpret • Easy to visualize • Graph algos well studied
  • 13. Spark GraphX • Spark’s API for parallel graph computations • Comes with some common graph algorithms • API for developing new graph algorithms: e.g. via pregel
  • 14. Pregel API • Pass messages from vertices to other, typically adjacent, vertices: “Think like a vertex” • Define an algorithm by stating: how to send messages
 how to merge multiple messages
 how to update a vertex with message repeat
  • 15. Propagation Based Approach • Pass positive (converter) information across edges • Give credit to “similar” sites
  • 16. Example Scenario travel-site-101.com book-my-travel-103.com canoe-travel-102.com 1 converter / 40,000 visitors 0 converter / 48,000 visitors 0 converter / 41,000 visitors
  • 17. Sending Messages ω = 1/40,000 Δω = ω * edge_weight Δω = ω * edge_weight canoe-travel-102.com book-my-travel-103.com travel-site-101.com
  • 18. Receiving Messages Δω1 hawaii-999.com … Δω2 Δωn ωnew = ωold + λ • Σ Δωi canoe-travel-102.com travel-site-101.com
  • 19. Weights After One Iteration book-my-travel-103.com canoe-travel-102.com 2.5 x 10^(-5) 1.2 x 10^(-5) 0.8 x 10^(-5) travel-site-101.com
  • 20. Simplified Code type MT = Double; type ED = Double; type VD = Double
 val lambda = …; val maxIterations = …
 val initialMsg = 0.0
 
 def updateVertex(id: VertexId, w: VD, delta_w: MT): VD =
 w + lambda * delta_w
 def sendMessage(edge: EdgeTriplet[VD, ED]): Iterator[(VertexId, MT)] = {
 Iterator((edge.srcId, edge.attr * edge.dstAttr),
 (edge.dstId, edge.attr * edge.srcAttr))
 }
 def mergeMsgs(w1: MT, w2: MT): MT = x + y
 
 val graph: Graph[VD, ED] = …
 graph.pregel(initialMessage, maxIterations, EdgeDirection.out)(
 updateVertex, sendMessage, mergeMessage)
  • 21. Model Output & Application • Model output is a mapping of sites to final scores • To apply the model, aggregate scores of sites visited by user SITE SCORE travel-site-101.com 0.5 canoe-travel-102.com 0.4 sport-team-101.com 0.1 … …
  • 22. Other Factors • Edge Weights: Cosine Similarity, Jaccard Index, Conditional Probability • Edge/Vertex Removal: Remove sites and edges on the long-tail • Hyper parameter Tuning: lambda, numIterations and others through testing (there is no convergence)
  • 24. Propagation: Top Sites Home Improvement Advrt. label-maker-101.com laptop-bags-101.com renovations-101.com fitness-equipment-101.com renovations-102.com buy-realestate-101.com Telecom Advertiser canada-movies-101.ca canadian-news-101.ca canadian-jobs-101.ca canadian-teacher-rating-101.ca watch-tv-online.com phone-system-review-101.com Canadian Telecom Renovations
  • 25. Challenges (from earlier) Sites are numerous and ever-changing Need to build one model per advertiser Positive training cases are sparse Models run frequently: every few hours
  • 26. Resolutions Graph built just in time for training Need to build one model per advertiser Positive training cases are sparse Models run frequently: every few hours
  • 27. Resolutions Graph built just in time for training Graph built once; propagation runs per advertiser Positive training cases are sparse Models run frequently: every few hours
  • 28. Resolutions Graph built just in time for training Graph built once; propagation runs per advertiser Propagation resolves sparsity: intuitive and interpretable Models run frequently: every few hours
  • 29. Resolutions Graph built just in time for training Graph built once; propagation runs per advertiser Propagation resolves sparsity: intuitive and interpretable Evaluating users fast; does not require GraphX
  • 30. General Spark Learnings • Many small jobs > one large job: We split big jobs into multiple smaller, concurrent, jobs and increased throughput (more jobs could run concurrently). • Serialization: Don’t save SparkContext as a member variable, define Python classes in a separate file, check if your object serializes/deserializes well! • Use rdd.reduceByKey() and others over rdd.groupByKey(). • Be careful with rdd.coalesce() vs rdd.repartition(), rdd.partitionBy() can be your friend in the right circumstances.