CXENSE 2017 | DEXA www.cxense.com
Large-Scale User Similarity Modeling
Arne Sund, Head of Data Science at Cxense
A global SaaS company
• Norwegian tech company with a global presence
• ~60 engineers incl. 5 data scientists
• Media and Publishing vertical
• Delivering solutions for:
  • Insight into online user patterns
  • Superior content recommendations
  • Segmentation of users using ML / AI
  • Create campaigns for targeting on own sites
Challenges facing publishers online
• Declining ad revenue
• The Duopoly
• How to attract digital-only subscribers
• How to keep existing subscribers
How to engage users online and keep them on our site longer?
Powering the journey from casual visitor to subscriber
• Awareness
• Consideration
• Subscription
• Loyalty
• Churn prevention
[Diagram labels: Anonymous User, Known User, Revenue]
User Similarity Modeling
• To tailor content recommendations and offers/ads
• Small set of known users as input
• Find similar anonymous users
User Similarity Modeling
[Diagram labels: All unique users; Original segment (truth sample); Lookalike segment (predicted)]
Defining user similarity
Users Are What They Read
• Represent users as vectors of consumed content
• Hit count vector of words and phrases
• High-dimensional vector space
• Computed using multiple algorithms
Pageview events
User 4sk9yk1sb7v8yxas visited adressa.no at 18:02
Augmented with additional details
• Device: Type, Brand, Browser, OS
• Location: Country, Region, City
• Referrer: URL, Type
• Engagement: Active time, scroll depth
• ...
Content profiles
NLP
(Sample article, translated from Norwegian)
Evjen sold to AZ Alkmaar. Winger sensation Håkon Evjen is leaving Bodø/Glimt after the season to turn professional with Dutch club AZ Alkmaar, Glimt confirms on its website. Evjen has signed a 4.5-year contract with his new club, where he starts on 1 January. "Right now it feels good, and I am happy. I think this can be very exciting ..."
Group            Item
classification   sports
pageclass        article
person           håkon evjen
person           fredrik midtsjø
entity           eliteserien
keyword          bodø/glimt
location         nederland
...              ...
Represent users as vectors
• Load all pageview events for a month
• Map URLs in pageview events to the content profile for that URL
• Create a vector where each unique word is a column
• Store hit count of each word in the right column for each user
User              sports  håkon evjen  nederland  eliteserien  ...
4sk9yk1sb7v8yxas  40      2            1          28           ...
...               ...     ...          ...        ...          ...
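The steps above can be sketched as follows. This is a minimal illustration, not the implementation used at Cxense: the `(user_id, url)` pageview format, the `content_profiles` dict, the helper name `build_user_term_matrix`, and the sample data are all assumptions made for the example.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix


def build_user_term_matrix(pageviews, content_profiles):
    """Return (matrix, user_ids, vocabulary) with one row per user.

    pageviews: iterable of (user_id, url) pairs (hypothetical format).
    content_profiles: dict mapping url -> list of profile terms.
    """
    counts = {}  # user_id -> Counter of term hits
    for user_id, url in pageviews:
        terms = content_profiles.get(url, [])  # unknown URLs add no hits
        counts.setdefault(user_id, Counter()).update(terms)

    user_ids = sorted(counts)
    vocabulary = sorted({term for c in counts.values() for term in c})
    column = {term: j for j, term in enumerate(vocabulary)}

    # Collect (row, column, hit count) triplets for the sparse matrix.
    rows, cols, data = [], [], []
    for i, user_id in enumerate(user_ids):
        for term, hits in counts[user_id].items():
            rows.append(i)
            cols.append(column[term])
            data.append(hits)

    matrix = csr_matrix(
        (data, (rows, cols)),
        shape=(len(user_ids), len(vocabulary)),
        dtype=np.int32,
    )
    return matrix, user_ids, vocabulary


# Made-up content profiles and pageviews.
profiles = {
    "https://example.com/a": ["sports", "håkon evjen", "eliteserien"],
    "https://example.com/b": ["sports", "nederland"],
}
views = [
    ("user-1", "https://example.com/a"),
    ("user-1", "https://example.com/b"),
    ("user-2", "https://example.com/b"),
]
matrix, user_ids, vocabulary = build_user_term_matrix(views, profiles)
```

A sparse format is used because each user only ever hits a tiny share of the vocabulary, so almost every cell is zero.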
Computing user similarity
How to compare each user to a group of users
• Compute centroid (average vector) for the group
• Compute similarity for each user to the centroid
• Process batches of users
Scale quickly becomes a concern
• Millions of unique users are common
• A lot of possible words and phrases
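A minimal sketch of the centroid and batching steps, assuming the users are rows of a SciPy CSR matrix. The helper names `group_centroid` and `iter_batches`, the batch size, and the toy counts are hypothetical; the similarity measure itself is left out here.

```python
import numpy as np
from scipy.sparse import csr_matrix


def group_centroid(matrix, group_rows):
    """Average the rows of a sparse user-term matrix for one group of users."""
    return np.asarray(matrix[group_rows].mean(axis=0)).ravel()


def iter_batches(matrix, batch_size):
    """Yield (start_row, sub_matrix) so only one batch is handled at a time."""
    for start in range(0, matrix.shape[0], batch_size):
        yield start, matrix[start:start + batch_size]


# Made-up hit counts: 5 users, 3 terms.
user_terms = csr_matrix(np.array([
    [4, 0, 2],
    [2, 0, 0],
    [0, 3, 1],
    [0, 5, 0],
    [1, 1, 1],
]))
centroid = group_centroid(user_terms, [0, 1])  # users 0 and 1 form the group
batch_starts = [start for start, _ in iter_batches(user_terms, batch_size=2)]
```

Comparing every user against a single centroid keeps the cost linear in the number of users, instead of comparing every user with every member of the group.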
75 000 000 x 8 700 000 (users x words and phrases)
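To put the figure above in perspective, here is a back-of-the-envelope calculation of what a dense matrix of that shape would need. It assumes the two numbers are rows and columns, and that each cell holds a 16-bit integer hit count; both are assumptions for illustration.

```python
n_rows = 75_000_000
n_cols = 8_700_000

cells = n_rows * n_cols
bytes_per_cell = 2  # assumption: 16-bit integer hit counts
dense_bytes = cells * bytes_per_cell
dense_petabytes = dense_bytes / 1e15

print(f"{cells:.3e} cells, roughly {dense_petabytes:.1f} PB if stored densely")
```

At that size a dense representation is out of the question, which is why sparse storage and a smaller set of columns matter.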
Pearson’s Chi-Squared Test
• A measure of independence
• Find important words and phrases
• Reduce number of columns
• Easy to use in a Scikit-Learn Pipeline
sklearn.feature_selection.SelectKBest(score_func=chi2, …)
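A small, hedged example of chi-squared feature selection with scikit-learn. The toy counts, the binary labels (1 = user belongs to the known segment), and `k=2` are assumptions for illustration; the slides do not say how labels or `k` were chosen.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2

# Made-up user-term hit counts: rows are users, columns are terms.
X = csr_matrix(np.array([
    [9, 0, 1, 2],
    [8, 1, 0, 2],
    [7, 0, 1, 2],
    [0, 9, 1, 2],
    [1, 8, 0, 2],
    [0, 7, 1, 2],
]))
# Assumed labels: 1 = user is in the known segment, 0 = user is not.
y = np.array([1, 1, 1, 0, 0, 0])

# Keep the k columns whose counts depend most strongly on the label.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)
```

Columns 0 and 1 differ sharply between the two classes, while columns 2 and 3 are spread evenly, so only the first two survive the selection.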
Cosine similarity
• Values between -1 and 1 (between 0 and 1 for non-negative hit-count vectors)
• Closer to 1 means more similar
• Independent of vector magnitude
• Reduces the effect of how much content a user has consumed
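Cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below, with made-up vectors and a hypothetical helper name, shows why a heavy reader and a light reader with the same interests get the same score.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0


centroid = np.array([3.0, 0.0, 1.0])
light_reader = np.array([3, 0, 1])
heavy_reader = light_reader * 10  # same interests, ten times the pageviews
other_reader = np.array([0, 5, 0])  # reads only content the group never hits

sim_light = cosine_similarity(light_reader, centroid)
sim_heavy = cosine_similarity(heavy_reader, centroid)
sim_other = cosine_similarity(other_reader, centroid)
```

Scaling a vector changes its norm by the same factor as its dot product, so the ratio, and therefore the score, stays the same.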
Ranking based on similarity score
• Ranking of anonymous users
• Pick users with highest similarity as the lookalikes
• Customers choose a fraction [%] of the total unique users
• Store results as a new segment
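The ranking step can be sketched as picking the top fraction of users by score. The function name `pick_lookalikes`, the rounding-down rule for the segment size, and the sample scores are assumptions for illustration.

```python
import numpy as np


def pick_lookalikes(user_ids, scores, fraction):
    """Return the IDs of the top `fraction` of users, best score first."""
    scores = np.asarray(scores, dtype=float)
    n_pick = int(len(scores) * fraction)  # assumption: round down
    if n_pick == 0:
        return []
    # argpartition finds the n_pick best users without sorting everyone.
    top = np.argpartition(-scores, n_pick - 1)[:n_pick]
    # Sort only the selected users, highest score first.
    top = top[np.argsort(-scores[top], kind="stable")]
    return [user_ids[i] for i in top]


ids = ["a", "b", "c", "d", "e"]
scores = [0.10, 0.95, 0.40, 0.80, 0.05]
segment = pick_lookalikes(ids, scores, fraction=0.4)
```

With five users and a fraction of 0.4, the new segment holds the two highest-scoring users.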
Dataset Size & Scaling
• Billions of pageview events
• Hundreds of millions of unique users
• Millions of unique URLs
And it keeps growing!
Optimize, run, optimize again
• Parallelize on every layer: threads, processes, jobs
• Keep the memory usage under control
• Use gRPC for data transfer whenever possible
• Stream big API responses directly to disk
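As one illustration of the last point, a response body can be copied to disk in fixed-size chunks so it never sits in memory as a whole. This sketch uses only the standard library and an in-memory stand-in for the response; the function name `stream_to_disk` and the chunk size are assumptions, and the slides do not say which HTTP client was used.

```python
import io
import os
import shutil
import tempfile


def stream_to_disk(response, path, chunk_size=1024 * 1024):
    """Copy a file-like response body to `path` one chunk at a time."""
    with open(path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(path)


# Stand-in for a large HTTP response body; any object with .read() works,
# for example the object returned by urllib.request.urlopen().
fake_response = io.BytesIO(b"x" * 3_000_000)
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "response.bin")
    written = stream_to_disk(fake_response, target)
```

Peak memory is bounded by the chunk size rather than by the size of the response.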
Optimizing SciPy Methods
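import numpy as np
import scipy as sp
import scipy.sparse

# Append each batch's raw CSR arrays instead of stacking whole matrices.
for b in range(n_batches):
    ...
    indices = np.hstack((indices, new_matrix.indices.astype(np.int32)))
    # Shift row pointers by the values stored so far; drop the batch's leading 0.
    indptr = np.hstack((indptr, (new_matrix.indptr.astype(np.int64) +
                                 len(values))[1:]))
    values = np.hstack((values, new_matrix.data.astype(np.int16)))
# Build the final matrix once from the concatenated arrays.
matrix = sp.sparse.csr_matrix((values, indices, indptr),
                              shape=(len(url_sets), vocab_length))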
Creating a sparse matrix is easy with SciPy, until you discover that its approach to stacking matrices is inefficient.
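A self-contained sketch of the same idea, checked against `scipy.sparse.vstack`. It is a variation rather than the slide's exact code: it collects the per-batch arrays in lists and concatenates them once at the end instead of calling `np.hstack` in every iteration. The function name `stack_csr_batches` and the random test batches are assumptions.

```python
import numpy as np
import scipy.sparse as sparse


def stack_csr_batches(batches, n_cols):
    """Vertically stack CSR matrices by concatenating their raw arrays once."""
    indices_parts, values_parts = [], []
    indptr_parts = [np.zeros(1, dtype=np.int64)]
    offset = 0  # number of stored values in the batches seen so far
    for batch in batches:
        indices_parts.append(batch.indices.astype(np.int32))
        values_parts.append(batch.data.astype(np.int16))
        # Shift the row pointers by the offset and drop the leading 0.
        indptr_parts.append(batch.indptr[1:].astype(np.int64) + offset)
        offset += len(batch.data)
    n_rows = sum(batch.shape[0] for batch in batches)
    return sparse.csr_matrix(
        (
            np.concatenate(values_parts),
            np.concatenate(indices_parts),
            np.concatenate(indptr_parts),
        ),
        shape=(n_rows, n_cols),
    )


# Three made-up 4 x 6 batches of small integer counts.
rng = np.random.default_rng(0)
batches = [
    sparse.csr_matrix(rng.integers(0, 3, size=(4, 6)).astype(np.int16))
    for _ in range(3)
]
stacked = stack_csr_batches(batches, n_cols=6)
reference = sparse.vstack(batches, format="csr")
```

Because CSR stores rows contiguously, stacking rows only requires appending the data and index arrays and offsetting the row pointers; no per-element work is needed.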
Questions?
Feel free to reach out via LinkedIn or the Meetup forum!
… and by the way: American company Piano bids 351 million for the media company Cxense
