A Cognitive Psychologist's Approach to Data Mining
How I Beat Netflix Cinematch
Maggie Xiong
April 22, 2014
Parallel Frameworks
Cognitive psychology & data mining
Case Study
The Netflix Prize Project
General Outline
Abstraction and Generalization
Categorization
Prototype
Exemplar
Decision boundary
Theory-based categories
Semantic space / LSA
Connectionism
Abstraction
Linguistic ideas (Bransford & Franks, 1971)
“The ants in the kitchen ate the sweet jelly which was on the table.”
“The ants in the kitchen ate the sweet jelly.”
“The ants in the kitchen ate the jelly.”
“The ants were in the kitchen.”
Participants were more confident in “recognizing” fuller sentences.
Prototype (Posner & Keele, 1968)
Participants studied instances generated from distortions of prototypes.
They showed the same accuracy and response time for never-seen
prototypes and memorized instances in a later test.
Categorization
Category structure (Collins & Quillian, 1969)
Economy of organization
Participants take longer to respond to statements across category levels.
Typicality
Exemplar
(Jacoby & Brooks, 1984)
Decision Boundary
Theory-based Categories
Decision boundary (Ashby & Gott, 1988)
Theory-based categories (Murphy & Medin, 1985)
Categories organized around theories about the world.
clean vs unclean foods; apples and prime numbers
Semantic Space
Latent Semantic Analysis
Shepard, 1987
Probability of generalization decays exponentially with
distance.
Osgood, 1957
Factor analysis
Evaluative, potency, activity
Dumais et al., 1988
SVD, cosine similarity
Landauer & Dumais, 1997
Connectionism
Selfridge, 1958
Pandemonium
Rumelhart, McClelland, & PDP Research Group, 1986
Parallel Distributed Processing, 2 Vol Set
Connectionism
Rumelhart & Todd, 1993
Common Ground
Prototype - K-Means
Exemplar - K-Nearest Neighbor
Theory-based categories - Collaborative filtering, decision-tree
Decision boundary - Support Vector Machine
Semantic space / LSA
Connectionism - artificial neural net
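To make the prototype/exemplar pairing concrete, here is a minimal sketch. It is an illustration, not code from the talk. The toy points, the labels "A" and "B", and the function names are all hypothetical. The prototype classifier compares a query to one mean point per category (K-means would learn such centroids without labels; here the labels are given). The exemplar classifier compares the query to every stored instance, as 1-nearest-neighbor does.

```python
import math

# Hypothetical labelled instances in a 2-D feature space.
training = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (6.0, 6.0)],  # includes one atypical member
    "B": [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)],
}

def centroid(points):
    """Mean point of a category: the 'prototype'."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def prototype_classify(query):
    """Assign the category whose prototype (centroid) is closest."""
    return min(training, key=lambda label: math.dist(query, centroid(training[label])))

def exemplar_classify(query):
    """Assign the category of the single closest stored instance (1-NN)."""
    return min(
        ((math.dist(query, point), label)
         for label, points in training.items()
         for point in points)
    )[1]

query = (5.5, 5.5)
print(prototype_classify(query), exemplar_classify(query))
```

With these toy points the two strategies disagree on the query: the atypical member of "A" sits right next to it, while the centroid of "A" is far away.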
How Cognitive Psychologists
Analyze Data
Task completion rate:
Main effect of coffee
avg(10,8,10,23,18,15) - avg(12,13,10,14,15,12)
Main effect of time-of-day
avg(14,15,12,23,18,15) - avg(12,13,10,10,8,10)
Interaction
[avg(23,18,15) - avg(14,15,12)]
- [avg(10,8,10) - avg(12,13,10)]
           1 Cup       3 Cups
Morning    12,13,10    10,8,10
Evening    14,15,12    23,18,15
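The main effects and the interaction for this 2 x 2 design can be checked with a few lines of Python. This is a minimal sketch using the slide's numbers; the dictionary layout and the helper name `pooled` are illustrative choices.

```python
from statistics import mean

# Task completion rates from the 2 x 2 table (time of day x cups of coffee).
rates = {
    ("morning", 1): [12, 13, 10],
    ("morning", 3): [10, 8, 10],
    ("evening", 1): [14, 15, 12],
    ("evening", 3): [23, 18, 15],
}

def pooled(time=None, cups=None):
    """Pool all observations whose cell matches the given factor levels."""
    return [
        x
        for (t, c), xs in rates.items()
        if (time is None or t == time) and (cups is None or c == cups)
        for x in xs
    ]

# Main effects: difference between marginal means.
coffee_effect = mean(pooled(cups=3)) - mean(pooled(cups=1))
time_effect = mean(pooled(time="evening")) - mean(pooled(time="morning"))

# Interaction: coffee effect in the evening minus coffee effect in the morning.
interaction = (
    mean(rates[("evening", 3)]) - mean(rates[("evening", 1)])
) - (
    mean(rates[("morning", 3)]) - mean(rates[("morning", 1)])
)

print(round(coffee_effect, 2), round(time_effect, 2), round(interaction, 2))
```

The interaction is larger than either main effect here: more coffee helps in the evening and hurts slightly in the morning.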
Graph It
Main effects and interaction
[Figure: Rate (y-axis) vs. Cups of Coffee (x-axis), plotted separately for Evening and Morning]
Training set
17770 movies, 500K users, 100M ratings
user_id, movie_id, rating, date_of_rating
movie_id, title, year
Probe set (1.4M ratings)
Qualifying set (2.8M ratings)
user_id, movie_id, date_of_rating
RMSE
sqrt( sum((X - X.pred)^2) / N )
Cinematch: 0.9514
The Netflix Prize Problem, 2006/10/02
0.8563 => $1M
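The RMSE criterion (the square root of the mean squared difference between actual and predicted ratings) can be sketched as follows. The example ratings are made up for illustration and are not Netflix data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(x - p) ** 2 for x, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Hypothetical ratings and predictions.
ratings = [4, 1, 5, 3]
predictions = [3.5, 2.0, 4.0, 3.0]
print(rmse(ratings, predictions))
```

Because errors are squared before averaging, one prediction that is off by two stars costs as much as four predictions that are each off by one.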
Standard Deviation and RMSE
The Netflix Problem, Interpreted
Overall average movie rating: 3.620*
Main effect of movie:
Miss Congeniality: avg(u1,u2,u3...)
Mission Impossible: avg(u1,u2,u3...)
Main effect of user:
Alex: avg(m1,m2,m3…)
Brian: avg(m1,m2,m3…)
Interaction:
Alex - Miss Congeniality, Mission Impossible, ...
Brian - Miss Congeniality, Mission Impossible, ...
RMSE, Appreciated
Overall standard deviation: 1.0822*
“Trivial approach” (main effect of movie):
1.0540
Main effects of movie and user: 0.9889*
R.pred = M.avg + U.dev
Cinematch: 0.9514
...
...
Prize: 0.8563
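The main-effects predictor R.pred = M.avg + U.dev can be sketched as below. This assumes U.dev is a user's average deviation from the movie averages; the toy ratings, the user "cara", and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy ratings: (user, movie, rating on a 1-5 scale).
ratings = [
    ("alex", "Mission Impossible", 4),
    ("alex", "Coyote Ugly", 1),
    ("brian", "Mission Impossible", 5),
    ("brian", "Coyote Ugly", 4),
    ("brian", "Miss Congeniality", 5),
    ("cara", "Miss Congeniality", 4),
]

def fit_baseline(ratings):
    """Return (movie averages, per-user mean deviation from those averages)."""
    by_movie = defaultdict(list)
    for _, movie, rating in ratings:
        by_movie[movie].append(rating)
    movie_avg = {movie: mean(values) for movie, values in by_movie.items()}

    deviations = defaultdict(list)
    for user, movie, rating in ratings:
        deviations[user].append(rating - movie_avg[movie])
    user_dev = {user: mean(values) for user, values in deviations.items()}
    return movie_avg, user_dev

def predict_baseline(movie_avg, user_dev, user, movie):
    """R.pred = M.avg + U.dev; unseen users get a deviation of zero."""
    return movie_avg[movie] + user_dev.get(user, 0.0)

movie_avg, user_dev = fit_baseline(ratings)
print(predict_baseline(movie_avg, user_dev, "alex", "Miss Congeniality"))
```

In this toy set Alex rates one star below the movie averages on average, so his predicted rating is the movie's mean shifted down by one.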
The Arithmetic Approach
R = M.avg + U.dev + interaction
interaction = R - (M.avg + U.dev)
R.pred = M.avg + U.dev + w.avg(interaction * sim(M.p, M))
Alex                 R    M.avg   dev    interaction
Mission Impossible   4    4.3     -0.3   4 - [4.3 + (-1.4)] = 1.1
Coyote Ugly          1    3.5     -2.5   1 - [3.5 + (-1.4)] = -1.1
Miss Congeniality    ?    4.5
Alex U.dev = ((4 - 4.3) + (1 - 3.5)) / 2 = -1.4
sim(Miss Congeniality, Coyote Ugly) = 0.8
sim(Miss Congeniality, Mission Impossible) = 0.2
? = 4.5 + (-1.4) + (-1.1*0.8 + 1.1 * 0.2) / (|0.8| + |0.2|) = 2.44
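The worked example for Alex can be reproduced in code. This is a sketch of the arithmetic on this slide only; the function and variable names are illustrative, and the similarity values are the ones given above.

```python
from statistics import mean

def predict_rating(target_avg, user_dev, neighbors):
    """M.avg + U.dev + similarity-weighted average of the user's interactions.

    neighbors: (interaction, similarity) pairs for movies the user has rated.
    """
    weight_total = sum(abs(sim) for _, sim in neighbors)
    if weight_total == 0:
        return target_avg + user_dev  # no usable neighbors: fall back to main effects
    weighted = sum(inter * sim for inter, sim in neighbors) / weight_total
    return target_avg + user_dev + weighted

# Alex's known ratings: title -> (rating, movie average).
rated = {"Mission Impossible": (4, 4.3), "Coyote Ugly": (1, 3.5)}
sims_to_target = {"Mission Impossible": 0.2, "Coyote Ugly": 0.8}

user_dev = mean([r - m for r, m in rated.values()])                   # -1.4
interactions = {t: r - (m + user_dev) for t, (r, m) in rated.items()}  # 1.1 and -1.1

prediction = predict_rating(
    4.5,  # average rating of Miss Congeniality
    user_dev,
    [(interactions[t], sims_to_target[t]) for t in rated],
)
print(round(prediction, 2))  # 2.44
```

The prediction lands below the main-effects estimate of 3.1 because the movie most similar to the target, Coyote Ugly, is one Alex liked less than his baseline predicted.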
Similarity Measures
Romesburg, 1984
Shape difference
vs.
Size displacement
Euclidean distance
Cosine similarity
Correlation coefficient
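The contrast between shape difference and size displacement can be seen by comparing two profiles that have the same shape but are offset by a constant. This is a minimal sketch with made-up profiles; correlation is computed as the cosine similarity of the mean-centered vectors.

```python
import math

def euclidean_distance(a, b):
    """Sensitive to size displacement: grows with any offset between profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle between the raw vectors; ignores overall scale but not offsets."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def correlation(a, b):
    """Pearson r: cosine similarity after centering each profile on its mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

# Same shape, shifted up by one rating point (hypothetical profiles).
profile_a = [1, 2, 3, 4]
profile_b = [2, 3, 4, 5]

print(euclidean_distance(profile_a, profile_b))
print(cosine_similarity(profile_a, profile_b))
print(correlation(profile_a, profile_b))
```

Correlation treats the two profiles as identical, Euclidean distance reports the displacement, and cosine similarity falls in between.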
Movie Similarity
Similarity measures
Co-occurrence count
How often a person rented both movies.
Correlation
A function of the difference in ratings when a person rented both
movies.
Correlation weighted by probability (significance)
Mean Euclidean distance of movie x user interactions
interaction = R - (M.avg + U.dev)
Weighted by similarities in
movie release times, rental frequencies, mean ratings
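Two of the measures listed above, the co-occurrence count and the mean Euclidean distance of interactions, might be computed along these lines. This is one possible reading, not the author's implementation. The interaction values are invented, and the distance is taken here as the root mean squared difference over users who rated both movies, which is an assumption.

```python
import math

# Hypothetical interactions (R - (M.avg + U.dev)), keyed by movie and then user.
interactions = {
    "Mission Impossible": {"alex": 1.1, "brian": 0.2, "cara": -0.5},
    "Coyote Ugly": {"alex": -1.1, "brian": 0.4},
    "Miss Congeniality": {"brian": 0.3, "cara": -0.4},
}

def shared_users(movie_a, movie_b):
    """Users who have an interaction value for both movies."""
    return sorted(interactions[movie_a].keys() & interactions[movie_b].keys())

def co_occurrence(movie_a, movie_b):
    """How many users rated both movies."""
    return len(shared_users(movie_a, movie_b))

def mean_euclidean_distance(movie_a, movie_b):
    """Root mean squared difference of interactions over shared users."""
    users = shared_users(movie_a, movie_b)
    if not users:
        return None  # no overlap, distance undefined
    total = sum(
        (interactions[movie_a][u] - interactions[movie_b][u]) ** 2 for u in users
    )
    return math.sqrt(total / len(users))

print(co_occurrence("Mission Impossible", "Miss Congeniality"))
print(mean_euclidean_distance("Mission Impossible", "Miss Congeniality"))
print(mean_euclidean_distance("Mission Impossible", "Coyote Ugly"))
```

Dividing by the number of shared users keeps the distance comparable between movie pairs with very different overlap. A distance like this would still need to be converted to a similarity before being used as a weight.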
User Clusters
Differentiate movie mean rating and similarity
R.pred = M.cluster_avg + U.cluster_dev
+ w.avg(interaction * sim_cluster(M,M.p))
By experience (number of movies rated)
[2,80], [81,180], [181,240], [241,400], [401,3000]
By gender
Inferred from preference for different movie clusters
By cluster analysis
PCA, Kmeans
Blend It
Generate different sets of predictions using
different movie similarity and user cluster
strategies
Use linear regression to combine the sets of
predictions into one final prediction
Weak learners are good too, as long as they
provide unique information.
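The blending step can be sketched as an ordinary least-squares fit of the actual ratings on the prediction sets. This is a minimal pure-Python sketch, not the author's code: the two prediction sets are invented and the helper names are hypothetical. In practice the weights would be fitted on held-out ratings such as the probe set, not on the ratings used to build the individual models.

```python
import math

def solve_linear_system(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: prediction sets are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def fit_blend(prediction_sets, actual):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(preds) for preds in zip(*prediction_sets)]
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * y for r, y in zip(rows, actual)) for i in range(m)]
    return solve_linear_system(xtx, xty)

def blend(weights, predictions):
    """Combine one rating's predictions into a single final prediction."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], predictions))

def rmse(actual, predicted):
    return math.sqrt(sum((x - p) ** 2 for x, p in zip(actual, predicted)) / len(actual))

# Hypothetical ratings and two imperfect prediction sets.
actual = [4, 1, 5, 3, 2]
model_a = [3.5, 2.0, 4.5, 3.0, 2.5]
model_b = [4.2, 1.5, 4.0, 3.5, 1.0]

weights = fit_blend([model_a, model_b], actual)
blended = [blend(weights, preds) for preds in zip(model_a, model_b)]
print(rmse(actual, model_a), rmse(actual, model_b), rmse(actual, blended))
```

On the data used for fitting, the blend can never do worse than either input, since each input alone corresponds to one particular choice of weights. The gain on new data depends on how much unique information each set contributes.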
RMSE, 2008/04/01
Overall standard deviation: 1.0822*
“Trivial approach” (main effect of movie):
1.0540
Main effects of movie and user: 0.9889*
R.pred = M.avg + U.dev
Cinematch: 0.9514
...
Naga FX: 0.9063
...
Prize: 0.8563
Cognitive theories and data mining methods
Prototype - K-Means
Exemplar - K-Nearest Neighbor
Theory-based categories - Collaborative filtering, decision-tree
Decision boundary - Support Vector Machine
Semantic space / LSA
Connectionism - artificial neural net
Abstraction and generalization
It’s all about similarity.
Tversky, 1977
Murphy & Medin, 1985
Looking Back
