Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Keynote: Bias in Search and Recommender Systems

247 views

Published on

Ricardo Baeza-Yates's slides from TechSEO Boost 2019

Published in: Marketing
  • Be the first to comment

Keynote: Bias in Search and Recommender Systems

  1. 1. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost #TechSEOBoost | @CatalystSEM THANK YOU TO THIS YEAR’S SPONSORS Keynote: Bias in Search and Recommender Systems Ricardo Baeza-Yates, NTENT
  2. 2. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Ricardo Baeza-Yates CTO, NTENT Biases in Search & Recommender Systems
  3. 3. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost NTENT ntent.com Marketing Engineering Operations International Applied Research ntent.com
  4. 4. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Prologue
  5. 5. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost A Bit of History Data Volume Complexity IR DB Two different points of view for data
  6. 6. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Data Understanding Data Query Unstructured Structured Explicit Information Retrieval (Relational) Databases Implicit Recommender Systems Unknown Data Mining
  7. 7. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost What is Bias? • Statistical: significant systematic deviation from a prior (unknown) distribution; • Cultural: interpretations and judgments phenomena acquired through our life; • Cognitive: systematic pattern of deviation from norm or rationality in judgment;
  8. 8. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost So (Observational) Human Data has Bias • Gender • Racial • Sexual • Age • Religious • Social • Linguistic • Geographic • Political • Educational • Economic • Technological ▪ Gathering process ▪ Sampling process ▪ Validity (e.g. temporal) ▪ Completeness ▪ Noise, spam Many people extrapolate results of a sample to the whole population (e.g., social media analysis) In addition there is bias when measuring bias as well as bias towards measuring it! Attempt of an unbiased (personal) view on bias in Search & RS Cultural Biases Statistical Biases Cognitive Biases Self-selection
  9. 9. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Impact in Search and Recommender Systems • Most web systems are optimized by using implicit user feedback • However, user data is partly biased to the choices that these systems make • Clicks can only be done on things that are shown to us • As those systems are usually based in ML, they learn to reinforce their own biases, yielding self-fulfilled prophecies and/or sub-optimal solutions • For example, personalization and filter bubbles for users • but also echo chambers for (recommender) systems • Moreover, sometimes these systems compete among themselves, learning also biases of other systems rather than real user behavior • Even more, an improvement in one system might be just a degradation in another system that uses a different (even inversely correlated) optimization function • For example, user experience vs. monetization
  10. 10. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost A Non-Technical Question Algorithm Biased Data Neutral? Fair? Same Bias Garbage In Garbage Out
  11. 11. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost What is being fair?
  12. 12. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost A Non-Technical Question Algorithm Biased Data Neutral? Fair? Same Bias Not Always! Debias the input Tune the algorithm Debias the output Bias awareness!
  13. 13. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost January 2017 ACM US Statement on Algorithm Transparency and Accountability 1. Awareness 2. Access and redress 3. Accountability 4. Explanation 5. Data Provenance 6. Auditability 7. Validation and Testing Systems do not need to be perfect, they just need to be better than us
  14. 14. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Biases Everywhere!
  15. 15. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Data bias Biases on Search & RS: Web Case Study Web Spam
  16. 16. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost [Baeza-Yates, Castillo & López, Cybermetrics, 2005] Number of linked domains Exports(thousandsofUS$) Economic Bias in Links
  17. 17. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost 17 [Baeza-Yates & Castillo, WWW2006] Economic Bias in Links
  18. 18. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost 18 Minimal effortShameCultural Bias in Websites [Baeza-Yates, Castillo, Efthimiadis, TOIT 2007]
  19. 19. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Language Bias
  20. 20. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost [Bolukbasi at al, NIPS 2016] • Word embedding’s in w2vNEWS Yes, about 60 to 70% at work although at college is the inverse Gender Bias in Content Most journalists are men?
  21. 21. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost [E. Graells-Garrido et al,. ACM Hypertext’15] Systemic bias? Equal opportunity? Gender Bias in Content Wikipedia Partial information
  22. 22. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Data bias Activity bias Bias on Usage Actions People We are all in the long tail! [Goel et al., WSDM 2010]
  23. 23. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Popularity Bias in Recommender Systems Items Users Popular items Rest of items (long tail) • Take care to recommended items that are not too popular • Metrics • Novelty enhancement • Problem solved! …really? 𝑛𝑜𝑣 𝑖 = 1 − # ratings of 𝑖 # users Items #interactions More novel Less novel 𝑎 𝑏 [Vargas & Castells, RecSys 2011]
  24. 24. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost A Self-fulfilling Prophecy? Popular items (short head) Rest of items (long tail) Observed user-item interaction Unobserved preference Items Users Ratings are missing not at random (MNAR)
  25. 25. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Test data (relevant items) Training data Unobserved preference Items Users Popular items (short head) Rest of items (long tail) avg P@𝑘 ∼ + 𝑘 Judgments are missing not at random (MNAR) Worse yet: user-system reinforcement loop (more later) A Self-fulfilling Prophecy? [Marlin et al., RecSys 2010] [Steck, RecSys 2010, 2011] [Fleder & Hossanagar, Management Sciences 2009]
  26. 26. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost 1.E-06 1.E-05 1.E-04 1.E-03 0 0.2 0.4 0.6 0.8 1 A Problem for IR Evaluation Methodology! 30 TREC collections Items / documents #ratings/judgments (infraction) [Bellogín, Castells & Cantador, IRJ 2017] To how many queries is a document relevant? 25% queries can be answered with less than 1% of the URLs! [Baeza-Yates, Boldi, Chierichetti, WWW 2015]
  27. 27. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Get Rid of the Popularity Bias! • In the rating split [Bellogín, Castells & Cantador, IRJ 2017] • In the metrics • Stratified recall [Steck, RecSys 2011] • Importance propensity scoring [Yang et al., RecSys 2018] • In the algorithms [Steck, RecSys 2011] [Lobato et al., ICML 2014] [Jannach et al., UMUAI 2015] [Cañamares & Castells, SIGIR 2018, best paper award] Test data (relevant items) Training data Unobserved preference Items Items #ratings Flat test Popularity strata #ratings Time Temporal split
  28. 28. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Activity Bias also Affects Content [Baeza-Yates & Saez-Trumper, ACM Hypertext 2015] Most users are passive (i.e., more than 90% are lurkers) Then, which percentage of active users produce 50% of the content?
  29. 29. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost October 2015
  30. 30. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost [Baeza-Yates & Saez-Trumper, ACM Hypertext 2015] Which percentage of active users produce 50% of the content? Wisdom of crowds is a partial illusion Activity Bias: The Wisdom of a Few
  31. 31. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Content Never Seen [Baeza-Yates & Saez-Trumper, ACM Hypertext 2015] The Digital Desert
  32. 32. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Data bias Activity bias Sampling (size) bias Algorithmic bias Search or Recommender System Bias on the Web
  33. 33. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost [E. Graells-Garrido & M. Lalmas, ACM Hypertext’14] Geographical Bias in Recommender Systems
  34. 34. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost • If we want to estimate the frequency of queries that appear with probability at least p with a certain relative error, ∊, we can use the standard binomial error formula which works well for p near ½ but not for p near 0 • Better is the Agresti-Coull technique (also called Take 2) which gives: where Z is the inverse of the standard normal distribution, 1 − 𝛼 is the confidence interval and • If p = 0.1, 1 − 𝛼 is 80% and ∊ is 10%, the standard formula gives n = 900, while with A-C we get n = 2342. [Brown, Cai & DasGupta, Statistical Science, 2001] [Baeza-Yates, SIGIR 2015, Industry track] Sample Size?
  35. 35. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost • Standard technique: • A good sample should cover well all the items distribution, but this does not work with very skewed distributions. 10 0 10 1 10 2 10 3 Rank 10 0 10 1 10 2 10 3 10 4 10 5 Frequency 50M 10M 100K 1K 10 0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 Rank 10 0 10 1 10 2 10 3 10 4 10 5 Frequency 50M 10M 100K 1K [Zaragoza et al, CIKM 2010] Sampling Queries
  36. 36. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost 36 Stratified Sampling Example [Baeza-Yates, SIGIR 2015, Industry track]
  37. 37. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Data bias Activity bias Sampling bias Interaction bias (Self) selection bias Bias in the User Interaction Search or Recommender System
  38. 38. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Position bias Ranking bias Presentation or exposure bias Social bias Interaction bias Bias in the Interaction Amazon.com
  39. 39. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Position bias Presentation bias Social bias Interaction bias Ranking bias Click bias Scrolling bias Mouse movement bias Data and algorithmic bias Self-selection bias Dependencies: A Cascade of Biases!
  40. 40. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Ranking Bias in Web Search [Mediative Study, 2014]
  41. 41. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Ranking Bias: Click Bias in Web Search • Ranking & next page bias Navigational queries
  42. 42. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost CTR (log) 1 11 21 Rank Learning to Rank with bias [Joachims et al., WSDM 2017, best paper] + many other papers Fair rankings [Zehlike et al., CIKM 2017] Clicks as implicit positive user feedback Debiasing Search Clicks and Other Biases [Dupret & Piwowarski, SIGIR 2008] [Chapelle & Zhang, WWW 2009] [Dupret & Liao, WSDM 2010, best paper] Debias the input Tune the algorithm Debias the output
  43. 43. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Data bias Activity bias Sampling bias Interaction bias (Self) selection bias Second-order bias Vicious Cycle of Bias Search or Recommender System
  44. 44. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost [Baeza-Yates, Pereira & Ziviani, WWW 2008] Person Web content is redundant (> 20%) Query Ranking bias in new content Redundancy grows (35%) Search results New page Second Order Bias in Web Content [Fortunato, Flammini, Menczer & Vespignani. PNAS 2006]
  45. 45. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Data bias Activity bias Sampling bias Interaction bias (Self) selection bias Vicious Cycle of Bias Search or Recommender System Feedback loop bias
  46. 46. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Bias due to Personalization • The effect of self-selection bias • Avoid the rich get richer and poor get poorer syndrome • Avoid the echo chamber by empowering the tail Cold start problem solution: Explore & Exploit Partial solutions: • Diversity • Novelty • Serendipity • My dark side Wikipedia [Eli Pariser, Penguin 2011]
  47. 47. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Users’ Eco Chambers in Recommender Systems ▪ Filter bubbles ▪ Degenerate feedback loops (e.g., YouTube autoplay) [Jiang et al., AAAI 2019]
  48. 48. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Eco Chamber of the Recommender System • Short-term greedy optimization, partial knowledge of the world • Long-term revenue optimization is not achieved • Views from new users should balance the exploration for new items • Disparate impact: unfair ecommerce/information markets [Baeza-Yates & Ribaudo, to appear]
  49. 49. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Epilogue
  50. 50. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Recap Bias Type Statistical Cultural Cognitive Algorithmic  ? ? Presentation  Position    Data   Sampling    Activity  Self-selection   Interaction   Social   Second order   
  51. 51. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost [Silberzahn et al., COS, Univ. of Virginia, 2015] Professional Bias? ➔ 61 analysts, 29 teams: 20 yes and 9 no
  52. 52. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost If Systems Reflects Its Designers: What we can/should do? ▪ Data ▪ Analyze for known and unknown biases, debias when possible/needed ▪ Recollect more data for difficult/sparse regions of the problem ▪ Delete attributes associated directly/indirectly with harmful bias ▪ Interaction ▪ Make sure that the user is aware of the biases all the time ▪ Give more control to the user ▪ Design and Implementation • Let experts/colleagues/users contest every step of the process ▪ Evaluation • Do not fool yourself!
  53. 53. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Final Take-Home Message ▪ Systems are a mirror of us, the good, the bad and the ugly ▪ The Web amplifies everything, but always leaves traces ▪ We need to be aware of our own biases! ▪ We must be aware of the biases and contrarrest them to stop the vicious bias cycle ▪ Plenty of open (research) problems! Big Data of People is huge….. ….. but it is tiny compared to the future Big Data of the Internet of Things (IoT) No activity bias!
  54. 54. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Thank You! Any Questions? – rbaeza@acm.org | http://www.baeza.cl/ | http://fairness-measures.org Biased Questions? ASIST 2012 Book of the Year Award (Biased Ad)
  55. 55. Ricardo Baeza-Yates © | @polarbearby | #TechSEOBoost Thanks for Viewing the Slideshare! – Watch the Recording: https://youtube.com/session-example Or Contact us today to discover how Catalyst can deliver unparalleled SEO results for your business. https://www.catalystdigital.com/

×