SlideShare a Scribd company logo
1 of 37
Download to read offline
Shlomi Babluki, Principal Data Scientist
October, 2018
Interaction Based Feature Extraction
How to Convert User Activity into Valuable Features
User Activity
Business Proprietary & Confidential | 2
LikeClick
Share Play
Buy
Every Website Every Mobile App
Ranking
Engagement
App Store
Category
Keywords
Traffic Metrics
Traffic Sources
Audience
Industry
Content
Provides Digital Insights for 190+ countries
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
3
SimilarWeb PRO
4
Website Demographics
5
Millions of user in almost every country.
Our Data
6
International Panel
Learning Set
Direct measurement data (like Google Analytics) for ~50,000 Websites.
Gender Distribution - Standard Solution
● For each website in our panel:
○ Count the number of males
○ Count the number of females
○ Calculate the gender distribution
● Use the learning set to improve the estimation
Our panel is completely anonymous!
Website B Gender Distribution
77% - 88% Females
Website B:
● Panel: 90 Users
● Learning set: N/A
Website A - 80 Females and 20 Males
Our Idea - Example
A
B
Website A:
● Panel: 100 Users
● Learning set: 80% Females / 20% Males
8
Our Idea
Estimate website gender distribution
based on its user engagement with the other sites in
the learning set.
9
● An indicator matrix of websites (S) and users (U):
P(i, j) = 1 if user j visited website i
P(i, j) = 0 Otherwise
Our Panel Matrix (P)
10
Convert Our Panel Into An Interaction Matrix
● |S| = Millions, |U| = Millions
dim(P) = |S| x |U| → P is a very large sparse matrix!
- Principal Component Analysis
- Matrix Factorization (ALS Model)
- Word2Vec
- ...
‫״‬The Curse of Dimensionality‫״‬
We need to reduce the dimension of the panel matrix (P):
Feature Extraction / Dimension Reduction Algorithms:
dim(P) = |S| x |U|
dim(F) = |S| x K
Business Proprietary & Confidential | 11
The standard algorithms didn’t
solve our problem
The standard algorithms reduce the dimension without taking into account our problem.
Dimension Reduction - Conclusion
Business Proprietary & Confidential | 12
An algorithm that reduces the dimension in a way that is optimized to solve our problem.
The Problem
The Solution
Interaction Based Feature Extraction
13
Interaction Based Feature Extraction - Step 1/3
● Convert our learning set (L) into a matrix (D1):
○ Split the gender percentiles into K (=10) “buckets”:
[0.0-0.1, 0.1-0.2, 0.2-0.3, ... 0.9-1.0]
○ Map each value from the learning set into an indicator vector:
Website A, 0.73 → [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Website B, 0.26 → [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
...
dim(D1) = |L| x K
14
Interaction Based Feature Extraction - Step 1/3
0.0 0.0 0.0 0.0 ...
0.0 0.0 1.0 0.0 ...
... ... ... ... ...
Learning Set Vector (K)
LearningSetWebsites
D1
15
● Multiply the D2 matrix by D1:
D = D2 * D1
dim(D) = (|U| x |L|) * (|L| x K) → |U|x K
Interaction Based Feature Extraction - Step 2/3
● Create another matrix (D2) of users (U) and only the learning set websites (L):
D2(i, j) = 1 if user i visited learning set website j
D2(i, j) = 0 Otherwise
dim(D2) = |U| x |L|
● Normalize each row (user) in the matrix D to 1.0
16
Interaction Based Feature Extraction - Step 2/3
0.0 0.0 1.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0
... ... ... ... ...
X
Learning Set Websites Learning Set Vector (K)
Users
LearningSetWebsites
D2 D1
Normalize Each Row
D =
17
Interaction Based Feature Extraction - Step 2/3
● D Matrix - Example:
User 1: [0.0, 0.0, 0.0, 0.0, 0.2, 0.3, 0.1 0.1, 0.2, 0.1]
User 2: [0.1, 0.0, 0.2, 0.1, 0.5, 0.0, 0.1, 0.0, 0.0, 0.0]
Business Proprietary & Confidential | 18
● Multiply the panel matrix (P) by D:
F = P * D
dim(F) = (|S| x |U|) * (|U| x K) → |S|x K
Interaction Based Feature Extraction - Step 3/3
● Normalize each row (website) in the matrix F to 1.0
19
Interaction Based Feature Extraction - Step 3/3
0.0 0.0 1.0 0.0 ...
1.0 1.0 0.0 0.0 ...
... ... ... ... ...
X X
Users Learning Set Websites Learning Set Vector (K)
Users
Websites
LearningSetWebsites
P D2 D1
Normalize Each RowNormalize Each Row
F =
20
Interaction Based Feature Extraction - Features
21
F1 F3 F4 F5 F6 F7 F8 F9 F10F2
Train a Regressor
Male Traffic ShareAllWebsites
F
Features (K=10)
LearningSetWebsites
X Y
22
Train
Predict
Random Forest Regression - Results
23
Expanding The Algorithm
24
● 27,000 Movies
● 138,000 Users
● ~20M Ratings (Explicit Feedback: 1-5)
● Multiple Genres per Movie
Can We Expand It To Other Domains?
MovieLens Dataset:
25
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5,
4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
● Action
● Adventure
● Animation
● Children
● Comedy
● Crime
● Documentary
● Drama
● Fantasy
Predicting Movie Genres
MovieLens Movie Genres (18):
● Film-Noir
● Horror
● Musical
● Mystery
● Romance
● Sci-Fi
● Thriller
● War
● Western
26
Predicting Movie Genres
27
movies.csv
ratings.csv
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
...
userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
...
International Panel
Learning Set
Predicting Movie Genres - Genre Features
3.0 0.0 1.0 4.0 ...
... ... ... ... ...
2.0 0.0 0.0 5.0 ...
... ... ... ... ...
X X
Users Train Set Movies Genre Vector (18)
Users
Movies
TrainSetMovies
P D2 D1
Normalize Each RowNormalize Each Row
F =
1.0 1.0 0.0 0.0 0.0 ...
0.0 0.0 1.0 1.0 0.0 ...
... ... ... ... ... ...
28
Predicting Movie Genres - Genre Features
29
Genre: Adventure
● Total Accuracy: 86.9% (6559)
● True Positive Rate: 70.0% (682)
● True Negative Rate: 88.9% (5877)
Predicting Movie Genres - Results
30
Genre: Animation
● Total Accuracy: 95.5% (6611)
● True Positive Rate: 93.4% (273)
● True Negative Rate: 95.6% (6338)
* Movies with more than 20 users / votes
Overall Accuracy: ~83%
● High Accuracy
● Very Low Dimension
● Scalable
● Simple & Explainable (not a “black box” solution)
Business Proprietary & Confidential | 31
Interaction Based Feature Extraction
Questions?
Interaction Based Feature Extraction - SQL Implementation
0.0 0.0 1.0 0.0 ...
1.0 1.0 0.0 0.0 ...
... ... ... ... ...
X X
Users Learning Set Websites Learning Set Vector (K)
Users
Websites
LearningSetWebsites
P D2 D1
Normalize Each RowNormalize Each Row
F =
Business Proprietary & Confidential | 33
0.0 0.0 0.0 0.0 ...
0.0 0.0 1.0 0.0 ...
... ... ... ... ...
0.0 0.0 1.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0
... ... ... ... ...
Table D1: <ls_site, k1, k2... k10>
Table D2: <user, ls_site>
Table P: <site, user>
Interaction Based Feature Extraction - SQL Implementation
Business Proprietary & Confidential | 34
SELECT
site
AVG(f1) as f1,
...
AVG(f10) as f10
FROM (
SELECT
site,
user,
D.k1 / (D.k1 + ... + D.k10) as f1,
...
D.k10 / (D.k1 + ... + D.k10) as f10
FROM (
SELECT P.site, D.user, D.ls_site, D.k1... D.k10
FROM P
JOIN (
SELECT D2.user, D1.ls_site, D1.k1... D1.k10
FROM D1
JOIN D2
ON D1.ls_site = D2.ls_site
) AS D
ON P.user = D.user AND P.site <> D.ls_site
) as F1
GROUP BY site, user
) as F
GROUP BY site
Interaction Based Feature Extraction - SQL Implementation
Business Proprietary & Confidential | 35
SELECT
site
AVG(f1) as f1,
...
AVG(f10) as f10
FROM (
SELECT
site,
user,
D.k1 / (D.k1 + ... + D.k10) as f1,
...
D.k10 / (D.k1 + ... + D.k10) as f10
FROM (
SELECT P.site, D.user, D.ls_site, D.k1... D.k10
FROM P
JOIN (
SELECT D2.user, D1.ls_site, D1.k1... D1.k10
FROM D1
JOIN D2
ON D1.ls_site = D2.ls_site
) AS D
ON P.user = D.user AND P.site <> D.ls_site
) as F1
GROUP BY site, user
) as F
GROUP BY site
Interaction Based Feature Extraction - SQL Implementation
Business Proprietary & Confidential | 36
Avoid Data Leakage
Thank You!
shlomib@similarweb.com
@shlomibabluki

More Related Content

What's hot

Image enhancement using alpha rooting based hybrid technique
Image enhancement using alpha rooting based hybrid techniqueImage enhancement using alpha rooting based hybrid technique
Image enhancement using alpha rooting based hybrid technique
Rahul Yadav
 
Contrast limited adaptive histogram equalization
Contrast limited adaptive histogram equalizationContrast limited adaptive histogram equalization
Contrast limited adaptive histogram equalization
Er. Nancy
 

What's hot (19)

Histogram Equalization
Histogram EqualizationHistogram Equalization
Histogram Equalization
 
Image enhancement using alpha rooting based hybrid technique
Image enhancement using alpha rooting based hybrid techniqueImage enhancement using alpha rooting based hybrid technique
Image enhancement using alpha rooting based hybrid technique
 
Dital Image Processing (Lab 2+3+4)
Dital Image Processing (Lab 2+3+4)Dital Image Processing (Lab 2+3+4)
Dital Image Processing (Lab 2+3+4)
 
Presentation 1
Presentation 1Presentation 1
Presentation 1
 
Simple Matlab tutorial using matlab inbuilt commands
Simple Matlab tutorial using matlab inbuilt commandsSimple Matlab tutorial using matlab inbuilt commands
Simple Matlab tutorial using matlab inbuilt commands
 
Mathematical operations in image processing
Mathematical operations in image processingMathematical operations in image processing
Mathematical operations in image processing
 
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
 
Digital image processing using matlab: basic transformations, filters and ope...
Digital image processing using matlab: basic transformations, filters and ope...Digital image processing using matlab: basic transformations, filters and ope...
Digital image processing using matlab: basic transformations, filters and ope...
 
iPlayset: Tangible Interface that brings Toys to life (Eiman Kanjo)
iPlayset: Tangible Interface that brings Toys to life (Eiman Kanjo)iPlayset: Tangible Interface that brings Toys to life (Eiman Kanjo)
iPlayset: Tangible Interface that brings Toys to life (Eiman Kanjo)
 
Digital Image Processing (Lab 05)
Digital Image Processing (Lab 05)Digital Image Processing (Lab 05)
Digital Image Processing (Lab 05)
 
Enhancement in Digital Image Processing
Enhancement in Digital Image ProcessingEnhancement in Digital Image Processing
Enhancement in Digital Image Processing
 
Math behind the kernels
Math behind the kernelsMath behind the kernels
Math behind the kernels
 
Contrast limited adaptive histogram equalization
Contrast limited adaptive histogram equalizationContrast limited adaptive histogram equalization
Contrast limited adaptive histogram equalization
 
Wong weisenbeck
Wong weisenbeckWong weisenbeck
Wong weisenbeck
 
Model-Based User Interface Optimization: Part V: DISCUSSION - At SICSA Summer...
Model-Based User Interface Optimization: Part V: DISCUSSION - At SICSA Summer...Model-Based User Interface Optimization: Part V: DISCUSSION - At SICSA Summer...
Model-Based User Interface Optimization: Part V: DISCUSSION - At SICSA Summer...
 
Digital Image Processing (Lab 06)
Digital Image Processing (Lab 06)Digital Image Processing (Lab 06)
Digital Image Processing (Lab 06)
 
Image colorization
Image colorizationImage colorization
Image colorization
 
1536 graphics &amp; graphical programming
1536 graphics &amp; graphical  programming1536 graphics &amp; graphical  programming
1536 graphics &amp; graphical programming
 
Developer’s Intro to Azure Machine Learning
Developer’s Intro to Azure Machine LearningDeveloper’s Intro to Azure Machine Learning
Developer’s Intro to Azure Machine Learning
 

Similar to Interaction-Based Feature Extraction: How to Convert Your Users’ Activity into Valuable Features with Shlomi Babluki

Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1
Suvadip Shome
 
Excel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for AuditorsExcel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for Auditors
Jim Kaplan CIA CFE
 

Similar to Interaction-Based Feature Extraction: How to Convert Your Users’ Activity into Valuable Features with Shlomi Babluki (20)

Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
 
Recommending job ads to people
Recommending job ads to peopleRecommending job ads to people
Recommending job ads to people
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Machine Learning Overview
Machine Learning OverviewMachine Learning Overview
Machine Learning Overview
 
Engine90 crawford-decision-making (1)
Engine90 crawford-decision-making (1)Engine90 crawford-decision-making (1)
Engine90 crawford-decision-making (1)
 
WWW 2021report public
WWW 2021report publicWWW 2021report public
WWW 2021report public
 
3D Multi Object GAN
3D Multi Object GAN3D Multi Object GAN
3D Multi Object GAN
 
Making BIG DATA smaller
Making BIG DATA smallerMaking BIG DATA smaller
Making BIG DATA smaller
 
Architectural Patterns in Building Modular Domain Models
Architectural Patterns in Building Modular Domain ModelsArchitectural Patterns in Building Modular Domain Models
Architectural Patterns in Building Modular Domain Models
 
ch15.ppt
ch15.pptch15.ppt
ch15.ppt
 
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
 
Patterns & Practices for Cloud-based Microservices
Patterns & Practices for Cloud-based MicroservicesPatterns & Practices for Cloud-based Microservices
Patterns & Practices for Cloud-based Microservices
 
Successfully Implement Responsive Design Behavior with Adobe Experience Manager
Successfully Implement Responsive Design Behavior with Adobe Experience ManagerSuccessfully Implement Responsive Design Behavior with Adobe Experience Manager
Successfully Implement Responsive Design Behavior with Adobe Experience Manager
 
PASS Summit 2010 Keynote David DeWitt
PASS Summit 2010 Keynote David DeWittPASS Summit 2010 Keynote David DeWitt
PASS Summit 2010 Keynote David DeWitt
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
Learning to Personalize
Learning to PersonalizeLearning to Personalize
Learning to Personalize
 
Yandex School of Data Analysis conference, Machine Learning and Very Large Da...
Yandex School of Data Analysis conference, Machine Learning and Very Large Da...Yandex School of Data Analysis conference, Machine Learning and Very Large Da...
Yandex School of Data Analysis conference, Machine Learning and Very Large Da...
 
Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1
 
Excel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for AuditorsExcel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for Auditors
 
Don't Call It a Comeback: Attribute Grammars for Big Data Visualization
Don't Call It a Comeback: Attribute Grammars for Big Data VisualizationDon't Call It a Comeback: Attribute Grammars for Big Data Visualization
Don't Call It a Comeback: Attribute Grammars for Big Data Visualization
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Recently uploaded (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

Interaction-Based Feature Extraction: How to Convert Your Users’ Activity into Valuable Features with Shlomi Babluki

  • 1. Shlomi Babluki, Principal Data Scientist October, 2018 Interaction Based Feature Extraction How to Convert User Activity into Valuable Features
  • 2. User Activity Business Proprietary & Confidential | 2 LikeClick Share Play Buy
  • 3. Every Website Every Mobile App Ranking Engagement App Store Category Keywords Traffic Metrics Traffic Sources Audience Industry Content Provides Digital Insights for 190+ countries ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 3
  • 6. Millions of user in almost every country. Our Data 6 International Panel Learning Set Direct measurement data (like Google Analytics) for ~50,000 Websites.
  • 7. Gender Distribution - Standard Solution ● For each website in our panel: ○ Count the number of males ○ Count the number of females ○ Calculate the gender distribution ● Use the learning set to improve the estimation Our panel is completely anonymous!
  • 8. Website B Gender Distribution 77% - 88% Females Website B: ● Panel: 90 Users ● Learning set: N/A Website A - 80 Females and 20 Males Our Idea - Example A B Website A: ● Panel: 100 Users ● Learning set: 80% Females / 20% Males 8
  • 9. Our Idea Estimate website gender distribution based on its user engagement with the other sites in the learning set. 9
  • 10. ● An indicator matrix of websites (S) and users (U): P(i, j) = 1 if user j visited website i P(i, j) = 0 Otherwise Our Panel Matrix (P) 10 Convert Our Panel Into An Interaction Matrix ● |S| = Millions, |U| = Millions dim(P) = |S| x |U| → P is a very large sparse matrix!
  • 11. - Principal Component Analysis - Matrix Factorization (ALS Model) - Word2Vec - ... ‫״‬The Curse of Dimensionality‫״‬ We need to reduce the dimension of the panel matrix (P): Feature Extraction / Dimension Reduction Algorithms: dim(P) = |S| x |U| dim(F) = |S| x K Business Proprietary & Confidential | 11 The standard algorithms didn’t solve our problem
  • 12. The standard algorithms reduce the dimension without taking into account our problem. Dimension Reduction - Conclusion Business Proprietary & Confidential | 12 An algorithm that reduces the dimension in a way that is optimized to solve our problem. The Problem The Solution
  • 13. Interaction Based Feature Extraction 13
  • 14. Interaction Based Feature Extraction - Step 1/3 ● Convert our learning set (L) into a matrix (D1): ○ Split the gender percentiles into K (=10) “buckets”: [0.0-0.1, 0.1-0.2, 0.2-0.3, ... 0.9-1.0] ○ Map each value from the learning set into an indicator vector: Website A, 0.73 → [0, 0, 0, 0, 0, 0, 0, 1, 0, 0] Website B, 0.26 → [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] ... dim(D1) = |L| x K 14
  • 15. Interaction Based Feature Extraction - Step 1/3 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 ... ... ... ... ... ... Learning Set Vector (K) LearningSetWebsites D1 15
  • 16. ● Multiply the D2 matrix by D1: D = D2 * D1 dim(D) = (|U| x |L|) * (|L| x K) → |U|x K Interaction Based Feature Extraction - Step 2/3 ● Create another matrix (D2) of users (U) and only the learning set websites (L): D2(i, j) = 1 if user i visited learning set website j D2(i, j) = 0 Otherwise dim(D2) = |U| x |L| ● Normalize each row (user) in the matrix D to 1.0 16
  • 17. Interaction Based Feature Extraction - Step 2/3 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 ... ... ... ... ... X Learning Set Websites Learning Set Vector (K) Users LearningSetWebsites D2 D1 Normalize Each Row D = 17
  • 18. Interaction Based Feature Extraction - Step 2/3 ● D Matrix - Example: User 1: [0.0, 0.0, 0.0, 0.0, 0.2, 0.3, 0.1 0.1, 0.2, 0.1] User 2: [0.1, 0.0, 0.2, 0.1, 0.5, 0.0, 0.1, 0.0, 0.0, 0.0] Business Proprietary & Confidential | 18
  • 19. ● Multiply the panel matrix (P) by D: F = P * D dim(F) = (|S| x |U|) * (|U| x K) → |S|x K Interaction Based Feature Extraction - Step 3/3 ● Normalize each row (website) in the matrix F to 1.0 19
  • 20. Interaction Based Feature Extraction - Step 3/3 0.0 0.0 1.0 0.0 ... 1.0 1.0 0.0 0.0 ... ... ... ... ... ... X X Users Learning Set Websites Learning Set Vector (K) Users Websites LearningSetWebsites P D2 D1 Normalize Each RowNormalize Each Row F = 20
  • 21. Interaction Based Feature Extraction - Features 21 F1 F3 F4 F5 F6 F7 F8 F9 F10F2
  • 22. Train a Regressor Male Traffic ShareAllWebsites F Features (K=10) LearningSetWebsites X Y 22 Train Predict
  • 23. Random Forest Regression - Results 23
  • 25. ● 27,000 Movies ● 138,000 Users ● ~20M Ratings (Explicit Feedback: 1-5) ● Multiple Genres per Movie Can We Expand It To Other Domains? MovieLens Dataset: 25 F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
  • 26. ● Action ● Adventure ● Animation ● Children ● Comedy ● Crime ● Documentary ● Drama ● Fantasy Predicting Movie Genres MovieLens Movie Genres (18): ● Film-Noir ● Horror ● Musical ● Mystery ● Romance ● Sci-Fi ● Thriller ● War ● Western 26
  • 27. Predicting Movie Genres 27 movies.csv ratings.csv movieId,title,genres 1,Toy Story (1995),Adventure|Animation|Children 2,Jumanji (1995),Adventure|Children|Fantasy 3,Grumpier Old Men (1995),Comedy|Romance ... userId,movieId,rating,timestamp 1,2,3.5,1112486027 1,29,3.5,1112484676 1,32,3.5,1112484819 ... International Panel Learning Set
  • 28. Predicting Movie Genres - Genre Features 3.0 0.0 1.0 4.0 ... ... ... ... ... ... 2.0 0.0 0.0 5.0 ... ... ... ... ... ... X X Users Train Set Movies Genre Vector (18) Users Movies TrainSetMovies P D2 D1 Normalize Each RowNormalize Each Row F = 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 1.0 0.0 ... ... ... ... ... ... ... 28
  • 29. Predicting Movie Genres - Genre Features 29
  • 30. Genre: Adventure ● Total Accuracy: 86.9% (6559) ● True Positive Rate: 70.0% (682) ● True Negative Rate: 88.9% (5877) Predicting Movie Genres - Results 30 Genre: Animation ● Total Accuracy: 95.5% (6611) ● True Positive Rate: 93.4% (273) ● True Negative Rate: 95.6% (6338) * Movies with more than 20 users / votes Overall Accuracy: ~83%
  • 31. ● High Accuracy ● Very Low Dimension ● Scalable ● Simple & Explainable (not a “black box” solution) Business Proprietary & Confidential | 31 Interaction Based Feature Extraction
  • 33. Interaction Based Feature Extraction - SQL Implementation 0.0 0.0 1.0 0.0 ... 1.0 1.0 0.0 0.0 ... ... ... ... ... ... X X Users Learning Set Websites Learning Set Vector (K) Users Websites LearningSetWebsites P D2 D1 Normalize Each RowNormalize Each Row F = Business Proprietary & Confidential | 33 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 ... ... ... ... ... ... 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 ... ... ... ... ...
  • 34. Table D1: <ls_site, k1, k2... k10> Table D2: <user, ls_site> Table P: <site, user> Interaction Based Feature Extraction - SQL Implementation Business Proprietary & Confidential | 34
  • 35. SELECT site AVG(f1) as f1, ... AVG(f10) as f10 FROM ( SELECT site, user, D.k1 / (D.k1 + ... + D.k10) as f1, ... D.k10 / (D.k1 + ... + D.k10) as f10 FROM ( SELECT P.site, D.user, D.ls_site, D.k1... D.k10 FROM P JOIN ( SELECT D2.user, D1.ls_site, D1.k1... D1.k10 FROM D1 JOIN D2 ON D1.ls_site = D2.ls_site ) AS D ON P.user = D.user AND P.site <> D.ls_site ) as F1 GROUP BY site, user ) as F GROUP BY site Interaction Based Feature Extraction - SQL Implementation Business Proprietary & Confidential | 35
  • 36. SELECT site AVG(f1) as f1, ... AVG(f10) as f10 FROM ( SELECT site, user, D.k1 / (D.k1 + ... + D.k10) as f1, ... D.k10 / (D.k1 + ... + D.k10) as f10 FROM ( SELECT P.site, D.user, D.ls_site, D.k1... D.k10 FROM P JOIN ( SELECT D2.user, D1.ls_site, D1.k1... D1.k10 FROM D1 JOIN D2 ON D1.ls_site = D2.ls_site ) AS D ON P.user = D.user AND P.site <> D.ls_site ) as F1 GROUP BY site, user ) as F GROUP BY site Interaction Based Feature Extraction - SQL Implementation Business Proprietary & Confidential | 36 Avoid Data Leakage