Multidimensional Relevance Modeling via Psychometrics & Crowdsourcing: ACM SIGIR 2014 Presentation


Presentation at ACM SIGIR 2014 on July 9, 2014. Joint work with Yinglong Zhang, Jin Zhang, and Jacek Gwizdka.


  1. Matt Lease • School of Information @mattlease University of Texas at Austin. Joint work with Yinglong Zhang, Jin Zhang, Jacek Gwizdka. Multidimensional Relevance Modeling via Psychometrics & Crowdsourcing. slides:
  2. Saracevic’s ’97 Salton Award address: “…the human-centered side was often highly critical of the systems side for ignoring users... [when] results have implications for systems design & practice. Unfortunately… beyond suggestions, concrete design solutions were not delivered.” “…the systems side by and large ignores the user side and user studies… the stance is ‘tell us what to do and we will.’ But nobody is telling...” “Thus, there are not many interactions…”
  3. Primary Research Question • What is relevance? – What factors constitute it? Can we quantify their relative importance? How do they interact? • Old IR question, many studies, little agreement • Potential impacts? – Further understanding of cognitive relevance – Guide IR engineering toward inferring key factors – Foster multi-dimensional evaluation of IR systems
  4. Secondary Research Question • How can we measure/ensure quality of subjective relevance judgments? – How can we distinguish valid subjectivity vs. human error in judging disagreements (traditional or online)? • Potential impacts – Help explain/reduce judging disagreements – Enable evaluation w.r.t. distribution of opinions – Encourage other subjective data collection in HCOMP
  5. Psychology to the Rescue! • A Guide to Behavioral Experiments on Mechanical Turk – W. Mason and S. Suri (2010). SSRN online. • Crowdsourcing for Human Subjects Research – L. Schmidt (CrowdConf 2010) • Crowdsourcing Content Analysis for Behavioral Research: Insights from Mechanical Turk – Conley & Tosti-Kharas (2010). Academy of Management • Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? – M. Buhrmester et al. (2011). Perspectives… 6(1):3-5. – see also: Amazon Mechanical Turk Guide for Social Scientists
  6. [Image slide; only recoverable text: “August 12, 2012”]
  7. Contributions • Describe a simple, reliable, scalable method for collecting diverse (subjective), multi-dimensional relevance judgments from online participants – Online survey techniques from psychometrics – Data available online • Describe a rigorous, positivist, data-driven framework for inferring & modeling multi-dimensional relevance – Structural equation modeling (SEM) from psychometrics – Run the experiment & let the data speak for itself! – Implemented in standard R libraries available online
  8. An example model of multi-dimensional relevance
  9. Experimental Design • Define some search tasks • Pick some documents to be judged • Hypothesize some relevance dimensions • Ask participants to answer some questions • Analyze data via Structural Equation Modeling (SEM) – Use Exploratory Factor Analysis (EFA) to assess question-factor relationships, then prune “bad” questions – Use Confirmatory Factor Analysis (CFA) to assess correlations, test significance, & compare models – Cousin to graphical models in statistics/AI
  10. Collecting multi-dimensional relevance judgments • Participant picks one of several pre-defined topics – You want to plan a one week vacation in China • Participant assigned a Web page to judge – We wrote a query for each topic, submitted it to a popular search engine, and did stratified sampling of results (see the sketch below) • Participant answers a set of Likert-scale questions – I think the information in this page is incorrect – It’s difficult to understand the information in this page – …
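As a concrete illustration of the document-selection step, here is a minimal R sketch of stratified sampling over a ranked result list. The strata boundaries, per-stratum sample size, and column names are illustrative assumptions, not the paper's actual settings.

```r
# Hypothetical ranked result list; in the study this came from a real search engine.
set.seed(42)
results <- data.frame(rank = 1:100,
                      url  = paste0("http://example.com/doc", 1:100))

# Split ranks into strata (top / middle / tail of the ranking; cutoffs assumed).
results$stratum <- cut(results$rank, breaks = c(0, 10, 50, 100),
                       labels = c("top", "mid", "tail"))

# Draw an equal number of pages to judge from each stratum.
judged <- do.call(rbind, lapply(split(results, results$stratum),
                                function(s) s[sample(nrow(s), 5), ]))
```

Sampling across rank strata rather than taking only top-ranked pages gives judges a mix of likely-relevant and likely-nonrelevant documents.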
  11. What Questions might we ask? • What factors do you think impact relevance… • We hypothesize the same 5 factors as Xu & Chen ’06 – Topicality, reliability, novelty, understandability, & scope – Choose the same factors to make the revised mechanics & any difference in findings maximally clear • Assume factors are incomplete & imperfect – Positivist approach: do these factors explain observed data better than alternatives: uni-dimensional relevance or another set of factors?
  12. How do we ask the questions? • Ask 3+ questions per hypothesized dimension – Ask repeated, similar questions, & change polarity – Randomize question order (don’t group questions) – Over-generate questions to allow for later pruning – Exclude participants failing self-consistency checks (see the sketch below) • Usual stuff – Use clear, familiar, non-leading wording – Balance Likert response scale – Pre-test survey in-house, then pilot study online
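The polarity and self-consistency points can be made concrete with a small R sketch. The item names, the 5-point scale, and the disagreement threshold are all illustrative assumptions.

```r
# Hypothetical responses: one row per participant, two items probing the same
# construct (understandability) with opposite polarity.
responses <- data.frame(
  worker      = c("w1", "w2", "w3"),
  q_easy      = c(5, 4, 1),   # "This page is easy to understand"
  q_difficult = c(1, 2, 1)    # same construct, reversed polarity
)

# Reverse-score the negatively worded item so both items point the same way
# (6 - x maps 1..5 to 5..1 on a 5-point scale).
responses$q_difficult_r <- 6 - responses$q_difficult

# Exclude participants whose paired answers disagree too much (failed
# self-consistency check); the 2-point threshold is an assumed rule of thumb.
consistent <- abs(responses$q_easy - responses$q_difficult_r) <= 2
responses  <- responses[consistent, ]   # w3 (answers 1 and 1) is dropped here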
  13. Structural Equation Modeling (SEM) • Based on Sewall Wright’s path analysis (1921) – A factor model is parameterized by factor loadings, covariances, & residual error terms • Graphical representation: path diagram – Observed variables in boxes – Latent variables in ovals – Directed edges denote causal relationships – Residual error terms implicitly assumed
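In equations (implied by, though not written on, the slide), the standard linear factor model behind such a path diagram is:

```latex
% x: observed question scores, \xi: latent factors, \Lambda: factor loadings,
% \delta: the residual error terms "implicitly assumed" above.
\[
  \mathbf{x} = \Lambda \boldsymbol{\xi} + \boldsymbol{\delta},
  \qquad
  \operatorname{Cov}(\mathbf{x}) = \Sigma = \Lambda \Phi \Lambda^{\top} + \Theta_{\delta}
\]
% \Phi collects the factor covariances and \Theta_\delta the residual
% covariances; fitting the model means choosing parameters so the implied
% \Sigma matches the sample covariance of the observed answers.
```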
  14. Exploratory Factor Analysis (EFA) – 1 of 2 • Is the sample large enough for EFA? – Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy – Bartlett’s Test of Sphericity • Principal Axis Factoring (PAF) to find eigenvalues – Assume some large, constant # of latent factors – Assume each factor has a connecting edge to each question – Estimate factor model parameters by least squares or maximum likelihood (ML) • Promax (oblique) rotation to allow correlated factors • Prune factors via Parallel Analysis – Create random data with same # of observations & questions – Create correlation matrix and find eigenvalues
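A minimal R sketch of the adequacy checks and the PAF step, using the psych package (a standard choice for EFA in R; whether it is the exact library used in the paper is an assumption). `responses` is assumed to be a numeric data frame with one column per surviving survey question.

```r
library(psych)

R <- cor(responses)                        # question-question correlation matrix
KMO(R)                                     # Kaiser-Meyer-Olkin sampling adequacy
cortest.bartlett(R, n = nrow(responses))   # Bartlett's test of sphericity

# Principal Axis Factoring with an oblique (promax) rotation, deliberately
# over-estimating the number of factors as the slide suggests (8 is arbitrary).
efa <- fa(responses, nfactors = 8, fm = "pa", rotate = "promax")
efa$e.values                               # eigenvalues, inspected before pruning
```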
  15. Exploratory Factor Analysis (EFA) – 2 of 2 • Perform Parallel Analysis – Create random data w/ same # of observations & questions – Create correlation matrix and find eigenvalues • Create Scree Plot of Eigenvalues • Re-run EFA for reduced factors • Compute Pearson correlations • Discard questions with: – Weak factor loading – Strong cross-factor loading – Lack of logical interpretation • Kenny’s Rule: need >= 2 questions per factor for EFA
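Continuing the sketch, psych's fa.parallel() performs the parallel analysis and draws the scree plot in one call. The loading thresholds below (0.40 primary, 0.30 cross) are common rules of thumb, assumed here rather than taken from the paper.

```r
library(psych)

pa <- fa.parallel(responses, fm = "pa", fa = "fa")  # parallel analysis + scree plot
k  <- pa$nfact                                      # suggested factor count

# Re-run EFA with the reduced factor count, then flag questions to discard
# (assumes k >= 2 so a second-highest loading exists per question).
efa   <- fa(responses, nfactors = k, fm = "pa", rotate = "promax")
L     <- unclass(efa$loadings)
weak  <- apply(abs(L), 1, max) < 0.40               # weak primary loading
cross <- apply(abs(L), 1,
               function(r) sort(r, decreasing = TRUE)[2]) > 0.30  # cross-loading
keep  <- !(weak | cross)     # logical interpretability still checked by hand
responses <- responses[, keep]
```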
  16. Question-Factor Loadings (Weights)
  17. CFA: Assess and Compare Models • First-order baseline model uses a single latent factor to explain observed data • Posited hierarchical factor model uses 5 relevance dimensions (both sketched below)
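In lavaan model syntax (lavaan is a standard R package for SEM; its use here is an assumption), the two competing structures might look like this, with hypothetical question names t1…s3 standing in for the actual survey items.

```r
library(lavaan)

# Baseline: one latent relevance factor explains every question.
one_factor <- '
  relevance =~ t1 + t2 + t3 + r1 + r2 + r3 + n1 + n2 + n3 +
               u1 + u2 + u3 + s1 + s2 + s3
'

# Posited model: five first-order dimensions under a second-order
# relevance factor (a hierarchical factor model).
hierarchical <- '
  topicality        =~ t1 + t2 + t3
  reliability       =~ r1 + r2 + r3
  novelty           =~ n1 + n2 + n3
  understandability =~ u1 + u2 + u3
  scope             =~ s1 + s2 + s3
  relevance =~ topicality + reliability + novelty + understandability + scope
'
```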
  18. Confirmatory Factor Analysis (CFA) • Null model assumes observations independent – Covariance between questions fixed at 0; all means and variances left free • Comparison stats – Non-Normed Fit Index (NNFI) – Comparative Fit Index (CFI) – Root Mean Squared Error of Approximation (RMSEA) – Standardized Root Mean-Square Residual (SRMR)
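Fitting both models and extracting exactly these four statistics is straightforward in lavaan; a sketch, continuing the hypothetical model specifications above:

```r
library(lavaan)

fit1 <- cfa(one_factor,   data = responses)   # single-factor baseline
fit5 <- cfa(hierarchical, data = responses)   # 5-dimension hierarchical model

fitMeasures(fit1, c("nnfi", "cfi", "rmsea", "srmr"))
fitMeasures(fit5, c("nnfi", "cfi", "rmsea", "srmr"))
anova(fit1, fit5)   # likelihood-ratio comparison (valid for nested models)
```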
  19. Our model of multi-dimensional relevance
  20. Future Directions • More data-driven positivist research into factors – Different user groups, search scenarios, devices, etc. – Need more data to support normative claims • Train/test operational systems for varying factors – Identify/extend detected features for each dimension – Personalize search results for individual preferences • Improve judging agreement by making task more natural and/or assessing impact of latent factors? • Intra-subject vs. inter-subject aggregation? – Other methods for ensuring subjective data quality?
  21. Thank You! Slides: