
Machine Learning for Q&A Sites: The Quora Example

Talk I gave at the Question Answering Workshop at WWW2016 Conference in Montreal, Canada

Published in: Technology

  1. 1. Machine Learning for Q&A Sites: The Quora Example Xavier Amatriain (@xamat) 04/11/2016
  2. 2. “To share and grow the world’s knowledge” • Millions of questions & answers • Millions of users • Thousands of topics • ...
  3. 3. Demand, Quality, Relevance
  4. 4. Data
  5. 5. Machine Learning Applications for Q&A Sites
  6. 6. Answer Ranking
  7. 7. Goal • Given a question and n answers, come up with the ideal ranking of those n answers
  8. 8. What is a good Quora answer? • truthful • reusable • provides explanation • well formatted • ...
  9. 9. How are those dimensions translated into features? • Features that relate to the text quality itself • Interaction features (upvotes/downvotes, clicks, comments…) • User features (e.g. expertise in topic)
  10. 10. Feed Ranking
  11. 11. • Goal: Present most interesting stories for a user at a given time • Interesting = topical relevance + social relevance + timeliness • Stories = questions + answers • ML: Personalized learning-to-rank approach • Relevance-ordered vs time-ordered = big gains in engagement • Challenges: • potentially many candidate stories • real-time ranking • optimize for relevance
  12. 12. Feed dataset: impression logs recording user actions such as click, upvote, downvote, expand, share, answer, pass, and follow
  13. 13. What is relevance? ● Value of showing a story to a user, e.g. a weighted sum of actions: $v = \sum_a v_a \, \mathbb{1}\{y_a = 1\}$ ● Goal: predict this value for new stories. Two possible approaches: ○ predict the value directly: $v_{\text{pred}} = f(x)$ ■ pros: single regression model ■ cons: can be ambiguous, coupled ○ predict probabilities for each action, then compute the expected value: $v_{\text{pred}} = E[V \mid x] = \sum_a v_a \, p(a \mid x)$ ■ pros: better use of supervised signal, decouples action models from action values ■ cons: more costly, one classifier per action
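A minimal sketch of the second approach (expected value from per-action probabilities) might look like the following. The action names, the weights in `ACTION_VALUES`, and the probabilities are made-up illustrative values, not Quora's; in a real system each p(a | x) would come from a trained per-action classifier.

```python
# Hypothetical action values v_a (illustrative only, not Quora's actual weights).
ACTION_VALUES = {"click": 1.0, "upvote": 5.0, "share": 10.0, "downvote": -8.0}


def expected_value(action_probs, action_values=ACTION_VALUES):
    """Return v_pred = sum_a v_a * p(a | x) for one (user, story) pair."""
    return sum(action_values[a] * p for a, p in action_probs.items())


def rank_stories(candidates):
    """Sort (story_id, action_probs) pairs by expected value, highest first."""
    return sorted(candidates, key=lambda c: expected_value(c[1]), reverse=True)


# Stand-ins for the outputs of one classifier per action.
candidates = [
    ("story_a", {"click": 0.30, "upvote": 0.02, "share": 0.00, "downvote": 0.01}),
    ("story_b", {"click": 0.10, "upvote": 0.08, "share": 0.01, "downvote": 0.00}),
]
ranked = [story_id for story_id, _ in rank_stories(candidates)]
```

Because the action values live in a separate table, they can be retuned without retraining the action models, which is the decoupling the slide lists as an advantage.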
  14. 14. Feature engineering ● Essential for getting good rankings ● Better if updated in real-time (more reactive) ● Main sets of features: ○ user (e.g. age, country, recent activity) ○ story (e.g. popularity, trendiness, quality) ○ interactions between the two (e.g. topic or author affinity)
  15. 15. Models ● Linear ○ simple, fast to train ○ manual, non-linear transforms for richer representation (buckets, ngrams) ● Decision trees ○ learn non-linear representations ● Tree ensembles ○ Random forests ○ Gradient boosted decision trees ● In-house C++ training code, third-party libraries for prototyping new models
  16. 16. Ask2Answer
  17. 17. A2A ● Given a question and a viewer, rank all other users by how “well-suited” they are ○ “Well-suited” = likelihood of the viewer sending a request + likelihood of the candidate adding a good answer ● A2A = extension of CTR prediction ○ We care not only about the viewer’s probability of sending a request, but also about the recipient’s probability of writing a good answer
  18. 18. A2A ● Example labels: ○ Binary label: 0 if no request was sent or no answer was added; 1 if a request was sent and yielded an answer with a goodness score above some threshold ○ Continuous label: $w_1 \cdot \text{had\_request} + w_2 \cdot \text{had\_answer} + w_3 \cdot \text{answer\_goodness} + \cdots$
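The two example labels could be sketched as below. The threshold and the weights `w1`, `w2`, `w3` are placeholder values chosen for illustration, and treating an answer that falls below the threshold as label 0 is an assumption, since the slide does not cover that case.

```python
def binary_label(request_sent, answer_goodness, threshold=0.5):
    """Return 1 if a request was sent and yielded an answer above threshold, else 0.

    answer_goodness is None when no answer was added. An answer at or below
    the threshold is labelled 0 (an assumption; the slide leaves this open).
    """
    if not request_sent or answer_goodness is None:
        return 0
    return 1 if answer_goodness > threshold else 0


def continuous_label(request_sent, answer_goodness, w1=0.2, w2=0.3, w3=0.5):
    """Return w1*had_request + w2*had_answer + w3*answer_goodness.

    The weights are placeholders, not values from the talk.
    """
    had_request = 1.0 if request_sent else 0.0
    had_answer = 1.0 if answer_goodness is not None else 0.0
    goodness = answer_goodness if answer_goodness is not None else 0.0
    return w1 * had_request + w2 * had_answer + w3 * goodness
```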
  19. 19. A2A ● Features ○ Based on what the viewer or candidate has done in the past. ○ Historical features that encapsulate the relationship of the viewer to the candidate. ○ In addition to historical features, other features can be devised (e.g. a binary feature saying whether the viewer follows the candidate) ● Many more features are possible. Feature engineering is a crucial component of any ML system.
  20. 20. Topics & Users Recommendations
  21. 21. Goal: Recommend new topics for the user to follow ● Based on ○ Other topics followed ○ Users followed ○ User interactions ○ Topic-related features ○ ...
  22. 22. Goal: Recommend new users to follow ● Based on: ○ Other users followed ○ Topics followed ○ User interactions ○ User-related features ○ ...
  23. 23. Related Questions
  24. 24. ● Given interest in question A (source) what other questions will be interesting? ● Not only about similarity, but also “interestingness” ● Features such as: ○ Textual ○ Co-visit ○ Topics ○ … ● Important for logged-out use case
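As an illustration of the co-visit feature, the sketch below counts how often two questions are viewed in the same session and uses the counts to pick candidates for a source question. The session data and the function names are hypothetical; the talk does not say how co-visits are computed, and a real system would combine this signal with the textual and topic features.

```python
from collections import Counter
from itertools import combinations


def covisit_counts(sessions):
    """Count how often each unordered pair of questions is viewed in the same session."""
    counts = Counter()
    for viewed in sessions:
        # Deduplicate within a session so repeat views count once.
        for a, b in combinations(sorted(set(viewed)), 2):
            counts[(a, b)] += 1
    return counts


def related_by_covisit(source, counts, top_k=3):
    """Return up to top_k questions most often co-visited with `source`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == source:
            scores[b] += n
        elif b == source:
            scores[a] += n
    return [q for q, _ in scores.most_common(top_k)]


# Hypothetical view sessions, each a list of question ids.
sessions = [["q1", "q2", "q3"], ["q1", "q2"], ["q2", "q4"], ["q1", "q2", "q1"]]
counts = covisit_counts(sessions)
related = related_by_covisit("q1", counts)
```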
  25. 25. Duplicate Questions
  26. 26. ● Important issue for Q&A sites ○ Want to make sure we don’t disperse knowledge about the same question across several duplicates ● Solution: binary classifier trained with labelled data ● Features ○ Textual vector space models ○ Usage-based features ○ ...
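One textual vector-space feature for such a classifier could be the cosine similarity between the term vectors of two questions, sketched here. Using raw term counts rather than TF-IDF or learned embeddings is a simplifying assumption, and the function names are illustrative. The resulting score would be one input to the binary classifier, alongside usage-based features.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Return lowercase bag-of-words term counts for a question."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = term_vector(text_a), term_vector(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty question has no direction, so define its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```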
  27. 27. User Trust
  28. 28. Goal: Infer user’s trustworthiness in relation to a given topic ● We take into account: ○ Answers written on topic ○ Upvotes/downvotes received ○ Endorsements ○ ... ● Trust/expertise propagates through the network ● Must be taken into account by other algorithms
  29. 29. Trending Topics
  30. 30. Goal: Highlight current events that are interesting for the user ● We take into account: ○ Global “Trendiness” ○ Social “Trendiness” ○ User’s interest ○ ... ● Trending topics are a great discovery mechanism
  31. 31. Moderation
  32. 32. ● Very important for Quora to maintain content quality ● Purely manual approaches do not scale ● Hard to get algorithms 100% right ● ML algorithms detect content/user issues ○ Output of the algorithms feeds manually curated moderation queues
  33. 33. Content Creation Prediction
  34. 34. ● Quora’s algorithms optimize not only for the probability of reading ● Important to predict the probability of a user answering a question ● Parts of our system rely completely on that prediction ○ E.g. A2A (ask to answer) suggestions
  35. 35. Models
  36. 36. ● Logistic Regression ● Elastic Nets ● Gradient Boosted Decision Trees ● Random Forests ● (Deep) Neural Networks ● LambdaMART ● Matrix Factorization ● LDA ● ...
  37. 37. Experimentation
  38. 38. ⚫ Extensive A/B testing, data-driven decision-making ⚫ Separate, orthogonal “layers” for different parts of the system ⚫ Experiment framework showing comparisons for various metrics
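The slides do not describe how the orthogonal layers are implemented. One common way to get that property, assumed here purely for illustration, is to hash the user id together with the layer name, so that a user's bucket in one layer says nothing about their bucket in another. The layer names and bucket names below are hypothetical.

```python
import hashlib


def assign_bucket(user_id, layer, experiment_buckets):
    """Deterministically map a user to one bucket within a layer.

    Hashing the layer name together with the user id gives each layer its own
    split of the user base, so assignments in different layers are effectively
    independent of each other.
    """
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).hexdigest()
    return experiment_buckets[int(digest, 16) % len(experiment_buckets)]


# The same user gets a stable assignment in each (hypothetical) layer.
feed_bucket = assign_bucket(12345, "feed_ranking", ["control", "treatment"])
a2a_bucket = assign_bucket(12345, "a2a", ["control", "variant_a", "variant_b"])
```

Because the assignment is a pure function of the user id and layer name, no assignment table has to be stored, and a user sees a consistent experience across sessions.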
  39. 39. Conclusions
  40. 40. • Q&A sites have not only Big, but also “rich” data • Algorithms need to understand and optimize complex aspects such as quality, interestingness, or user expertise • ML is one of the keys to success • Many interesting problems, and many unsolved challenges
  41. 41. Questions?
