Identifying Returning Users
Simply Business
•  Largest UK business insurance provider
•  More than 370,000 policy holders
•  Using BML, tech and data to disrupt
the business insurance market
Data ’n’ Analytics
•  4 Data Engineers
•  3 Business Intelligence Developers
•  1 Data Scientist
•  1 Director of Data Science
•  And hiring! :-)
Our Infrastructure
[Architecture diagram. Components shown:]
•  Event Collector (REST Endpoint)
•  Kinesis
•  Event Enrichment (Spark Streaming)
•  Kinesis
•  Downstream NRT (Spark Streaming)
•  MongoDB
•  Redshift
Returning Visitors
The Challenge
Accurately identify users over time as they interact with us
Why? Knowledge
To understand how our visitors behave:
•  They have long buying cycles
•  They use multiple devices
•  They engage with us through multiple channels
Why? Cost Analysis
To know how much it costs to acquire new customers:
•  They visit us from different marketing channels
•  Each channel has a different cost
•  We remunerate our partners depending on that
Why? Personalization
To adapt their visit while they interact with us:
•  We can trigger offers depending on the channel they come from
•  We can prefill fields if they’re recognized
•  We can change the look and feel of the website
Cookies?
They work great when you want to identify people coming back to your
website.
Challenges:
•  Different devices
•  Cleared cookies
•  Private sessions
•  What about telephone calls…?
Use More IDs!
We can use other IDs provided by visitors to link sessions:
•  Login details
•  Email addresses
•  Telephone numbers
•  Credit card numbers
•  Fingerprints
•  …
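As a minimal sketch of what collecting several IDs from one event might look like, the helper below returns every identifier present as an (id_type, value) pair. The field names (cookie_id, email, phone) and the normalization rules are assumptions for illustration and do not come from the talk.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical event fields that carry an identifier (assumed names).
ID_FIELDS = ("cookie_id", "email", "phone")


def ids(event: Dict[str, Optional[str]]) -> Set[Tuple[str, str]]:
    """Return the (id_type, value) pairs present on an event."""
    found = set()
    for field in ID_FIELDS:
        value = event.get(field)
        if not value:
            continue  # field missing or empty
        value = value.strip()
        if field == "email":
            value = value.lower()  # treat emails case-insensitively
        elif field == "phone":
            value = "".join(ch for ch in value if ch.isdigit())
        if value:
            found.add((field, value))
    return found


event = {"event": "details_submitted", "cookie_id": "111", "email": " A@a.com "}
print(ids(event))
```

Keeping the ID type next to the value is one way to stop a cookie "111" from being confused with a phone number "111".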
Streaming Solution
How many visitors do we have here?
Example
Timestamp  Event              Cookie ID  Email
1          page_view          111
2          details_submitted  111        a@a.com
3          page_view          555
4          page_view          888
5          policy_sold        555        a@a.com
6          details_submitted  888        b@b.com
How many visitors do we have here? Only 2
Example: Streaming
Visitor state after processing each event of the example table:
After event 1
    Visitor A: 111
After event 2
    Visitor A: 111, a@a.com
After event 3
    Visitor A: 111, a@a.com
    Visitor B: 555
After event 4
    Visitor A: 111, a@a.com
    Visitor B: 555
    Visitor C: 888
After event 5
    Visitor A: 111, 555, a@a.com
    Visitor C: 888
    Visitors A and B were merged!
After event 6
    Visitor A: 111, 555, a@a.com
    Visitor C: 888, b@b.com
Streaming Algorithm
for event in incoming_events:
    prev_visitors = [get_visitor(id) for id in ids(event)]
    if len(prev_visitors) == 0:
        # First time that we see that visitor
        create_visitor(event)
    elif len(prev_visitors) == 1:
        # Visitor coming back using a known ID
        update_visitor(prev_visitors[0], event)
    else:
        # Event IDs were thought to belong to different visitors
        merge_and_update_visitor(prev_visitors, event)
Store mapping: event_ids [1..N] ---> [1] visitor
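The pseudocode above can be made runnable with a small in-memory sketch. The two dictionaries stand in for the visitor store, and the choice to keep the oldest visitor on a merge is an assumption, not something the talk specifies. Unknown IDs are skipped and visitors are de-duplicated before counting, which the pseudocode leaves implicit.

```python
from typing import Dict, List, Set

# In-memory stand-ins for the visitor store (assumed layout).
id_to_visitor: Dict[str, int] = {}      # event ID -> visitor key
visitor_ids: Dict[int, Set[str]] = {}   # visitor key -> all IDs seen
_next_key = 0


def process(event_ids: List[str]) -> int:
    """Assign the event to a visitor, merging visitors if needed."""
    global _next_key
    known = sorted({id_to_visitor[i] for i in event_ids if i in id_to_visitor})
    if not known:
        # First time that we see that visitor
        target = _next_key
        _next_key += 1
        visitor_ids[target] = set()
    else:
        # Keep the oldest visitor and fold any others into it
        target = known[0]
        for other in known[1:]:
            visitor_ids[target] |= visitor_ids.pop(other)
    visitor_ids[target] |= set(event_ids)
    for i in visitor_ids[target]:
        id_to_visitor[i] = target  # many IDs map to exactly one visitor
    return target


events = [["111"], ["111", "a@a.com"], ["555"], ["888"],
          ["555", "a@a.com"], ["888", "b@b.com"]]
for e in events:
    process(e)
print(sorted(sorted(s) for s in visitor_ids.values()))
# -> [['111', '555', 'a@a.com'], ['888', 'b@b.com']]
```

Run on the six example events, the sketch ends with two visitors, matching the walkthrough.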
Batch Solution
Batch
We also need to solve this problem in batch mode because every now
and then we reprocess all our data.
We cannot do it with Spark Streaming because:
•  Kinesis only stores up to 7 days of data
•  It would take too long
•  We risk overwhelming our MongoDB
Batch Algorithm
•  Find the connected components in the graph
•  Already implemented in GraphX!
•  Vertices: visitor IDs
•  Edges: pairs of IDs occurring in the same event
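To illustrate the idea on a single machine, the sketch below finds connected components with a union-find structure in plain Python. This is a swapped-in technique for illustration only; it is not GraphX and not the distributed job described in the talk. The function name and the input shape (one list of IDs per event) are assumptions.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set


def connected_components(events: Iterable[List[str]]) -> List[Set[str]]:
    """Group IDs that are linked, directly or indirectly, by shared events."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    for ids_in_event in events:
        for i in ids_in_event:
            find(i)  # register the vertex, even if it has no edges
        for a, b in combinations(ids_in_event, 2):
            union(a, b)  # edge: two IDs occurring in the same event

    groups: Dict[str, Set[str]] = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


events = [["111"], ["111", "a@a.com"], ["555"], ["888"],
          ["555", "a@a.com"], ["888", "b@b.com"]]
components = connected_components(events)
print(len(components))
```

Each resulting set corresponds to one visitor.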
Example: Batch
Graph built from the example events:
111 --- a@a.com --- 555
888 --- b@b.com
GraphX: Issues
•  There is no Python API for GraphX, and GraphFrames performance
is worse than GraphX’s
•  GraphX requires vertices to have numeric IDs, so you have to:
    •  Create a mapping between the real IDs and the numeric IDs
    •  Perform the connected components calculation
    •  Map the numeric IDs back to the real IDs
•  Documentation is not as good as that of other Spark projects
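The ID mapping round trip can be sketched in plain Python as follows. The helper names index_ids and map_back are hypothetical, and the labels dictionary is a hand-written stand-in for the output of the graph job, assumed here to label each vertex with the smallest vertex number in its component.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def index_ids(edges: List[Edge]) -> Tuple[Dict[str, int], List[Tuple[int, int]]]:
    """Give every distinct real ID an integer and rewrite the edges."""
    to_num: Dict[str, int] = {}
    for a, b in edges:
        for real_id in (a, b):
            to_num.setdefault(real_id, len(to_num))
    return to_num, [(to_num[a], to_num[b]) for a, b in edges]


def map_back(labels: Dict[int, int], to_num: Dict[str, int]) -> Dict[int, List[str]]:
    """Group real IDs by the component label of their numeric vertex."""
    to_real = {n: r for r, n in to_num.items()}
    groups: Dict[int, List[str]] = {}
    for vertex, component in sorted(labels.items()):
        groups.setdefault(component, []).append(to_real[vertex])
    return groups


edges = [("111", "a@a.com"), ("555", "a@a.com"), ("888", "b@b.com")]
to_num, numeric_edges = index_ids(edges)
# Stand-in for the graph job's output: vertex -> component label (assumed).
labels = {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
print(map_back(labels, to_num))
# -> {0: ['111', 'a@a.com', '555'], 3: ['888', 'b@b.com']}
```

On a cluster the mapping itself has to be built in a distributed way; something like Spark's RDD.zipWithUniqueId could play the role of index_ids, though that is a suggestion rather than what the talk describes.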
Sum Up
             NRT: Spark Streaming  Batch: GraphX
Latency      Seconds               Minutes/Hours
Throughput   Low                   High
Good for     Online                Analysis
Development  Your own              Algorithm available
Beware
The fact that two events share an ID does not necessarily mean that they
were generated by the same person. Watch out for:
•  Clashing IDs
•  Fake IDs: test@test.com, 07123456789, etc.
•  Shared devices: couples, public computers, etc.
•  People using your system on behalf of someone else
•  Bugs
Analyze your data first!
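One possible way to start that analysis is to flag IDs that should probably not be used for linking. The sketch below is only an illustration: the blocklist reuses the two fake IDs quoted above, while the threshold MAX_LINKED_IDS and the function name are assumptions that would need tuning against real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

KNOWN_FAKE_IDS = {"test@test.com", "07123456789"}  # examples quoted above
MAX_LINKED_IDS = 3  # assumed threshold; tune after looking at real data


def suspicious_ids(events: Iterable[List[str]]) -> Set[str]:
    """Flag IDs that are known fakes or linked to unusually many other IDs."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for ids_in_event in events:
        for i in ids_in_event:
            neighbours[i].update(j for j in ids_in_event if j != i)
    flagged = {i for i in neighbours if i in KNOWN_FAKE_IDS}
    flagged |= {i for i, n in neighbours.items() if len(n) > MAX_LINKED_IDS}
    return flagged


sample = [["c1", "test@test.com"], ["c2", "shared@example.com"],
          ["c3", "shared@example.com"], ["c4", "shared@example.com"],
          ["c5", "shared@example.com"], ["c6", "a@a.com"]]
print(sorted(suspicious_ids(sample)))
# -> ['shared@example.com', 'test@test.com']
```

Flagged IDs could then be excluded before building the graph or before merging visitors in the streaming path.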
Questions?
@dani_sola
dani.sola@simplybusiness.co.uk
