What Questions Are Worth Answering?


Intent analysis and its role in content strategy at Ask.com.

Published in: Technology, Education

  1. What Questions Are Worth Answering?
     Ehren Reilly, Sr. Product Manager, Content, Ask.com
     Sentiment Analysis Innovation Summit, San Francisco, CA, April 25, 2013
  2. Overview
       • Our challenge: What queries deserve an editorial answer?
       • Our approach to cost-effectively figuring this out
       • Advantages of our approach
       • Details of our approach
  3. Our Challenge
       • Ask.com snapshot:
       • Q&A service combining the power of search, quality editorial content and content from the web
       • Top 10 US internet site (according to comScore)
       • 100 million unique users globally; 70 million unique users in the US
       • Search → Q&A
       • When you come to Ask.com and ask a question, we give you the answer to your question.
       • Ask.com editors create answers to questions that are asked frequently, based on search query data
       • Problem: Not every query is suitable for evergreen, static editorial answers
  4. Our Challenge
     What type of information does the query deserve?
       • Entities & services (people, things, websites, products, media, resources)
           o General web search, shopping search, Wikipedia data, tools, applications
       • Dynamic data and frequently changing facts (e.g., the weather)
           o Data partnerships
       • Static, evergreen information, suitable for editorial expert answers
           o Writers, editors, crowd labor, etc. $$
       • Extremely detailed/technical answer which needs a long article by a true expert
  5. Our Challenge
     What type of information does the query want?
       • Not answer requests: Facebook login; Barack Obama; Tickets Seattle to Miami; Olympus Has Fallen; Sonicare HX6711/02; Selena Gomez photos; Philippines map; German Shepherds; Chichen Itza timelapse; Salary calculator
       • Wants dynamic data answer: What time is it in Bangkok; Dollars to pounds; SF Giants score; Weather in Cleveland; What’s my IP address; Kim Kardashian pregnant?; NBA assists leader; Where is Justin Bieber?; Oldest living person; Gay marriage states
       • Wants evergreen expert answer: When was the Ming Dynasty; Tom Cruise baby name; How long to bake chicken; What is Renaissance art; Highest alcohol beer; Head gasket repair cost; Abraham Lincoln’s wife; Parachute material; Most reliable dishwashers; How to remove hair dye
  6. Our Challenge
     Editorial Answerability Spectrum
     [Diagram labels: Navigational, Dynamic Facts, Entities, Shopping, Evergreen Facts]
  7. How do we pick out these answerable evergreen fact queries?
       • Let the editors do it themselves?
           o Valuable editorial time wasted considering obvious stuff
           o For crowd editorial labor, conflict of interest → “OK” bias
       • Crowd labor vetting?
           o Hard to communicate task
           o Still very costly
       • Template-based filters?
           o Coverage is too low
           o Lots of work to develop these
       • Machine learning?
           o Very fuzzy problem
           o Target set is a small segment of huge search space
           o Hard to achieve high accuracy
  8. Our Hybrid Approach
       1. Filter out the obvious stuff (e.g., “Facebook.com”, “What time is it”, “What does ‘looking a gift horse in the mouth’ mean?”)
       2. Dedicated classifiers to filter out specific types of non-suitable queries
           • Duplicates & near-duplicates
           • Navigational
           • Adult / profane / creepy
           • Temporal / dynamic / timely
           • Shopping / product search
           • Wiki / entity exact match
       3. Build machine learning “answerability” model for the tricky remaining cases
       4. Where the model returns low confidence, send those queries to crowd labor for classification
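The four steps on slide 8 can be read as a cascade in which each stage only sees what the previous stage let through. The sketch below is one possible way to wire that up, not the Ask.com implementation: the regex patterns, the `route` function, the `toy_model` stand-in and the 0.3 / 0.7 confidence thresholds are all assumptions made for illustration. Queries that templates identify as obviously answerable could be short-circuited in the same way.

```python
import re
from typing import Callable, List, Optional

# Hypothetical filter patterns, for illustration only; real filters would be far broader.
NAVIGATIONAL = re.compile(r"\.(com|org|net)\b|\blogin\b", re.IGNORECASE)
DYNAMIC = re.compile(r"\b(what time is it|weather|score)\b", re.IGNORECASE)


def filter_navigational(query: str) -> Optional[str]:
    """Return a rejection label for navigational queries, else None."""
    return "navigational" if NAVIGATIONAL.search(query) else None


def filter_dynamic(query: str) -> Optional[str]:
    """Return a rejection label for dynamic/timely queries, else None."""
    return "dynamic" if DYNAMIC.search(query) else None


# Steps 1 and 2: cheap filters run first, in order.
FILTERS: List[Callable[[str], Optional[str]]] = [filter_navigational, filter_dynamic]


def route(query: str, model: Callable[[str], float],
          low: float = 0.3, high: float = 0.7) -> str:
    """Route a query through filters, then the model, then (if unsure) the crowd.

    `model` is any callable returning an estimated probability that the query
    is answerable. The thresholds are assumed values, not from the talk.
    """
    for check in FILTERS:
        label = check(query)
        if label is not None:
            return "rejected:" + label
    score = model(query)           # step 3: answerability model
    if score >= high:
        return "editorial"
    if score <= low:
        return "rejected:model"
    return "crowd_review"          # step 4: low confidence goes to humans


def toy_model(query: str) -> float:
    """Stand-in for a trained classifier, for demonstration only."""
    return 0.9 if query.lower().startswith("how to") else 0.5
```

For example, `route("Facebook.com", toy_model)` is rejected by the navigational filter before the model is ever called, while `route("Oldest living person", toy_model)` falls between the two thresholds and is sent to crowd review.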
  9. Advantages to This Approach
     [Diagram built up over slides 9 to 13; labels: Evergreen Facts, Don’t Send to Editorial, Requires Human Review]
  14. Advantages to This Approach
        • Filtering and partial automation first makes human review much less costly
            o Tasks requiring human scoring reduced by 97%
        • Domain of ML model is narrower than entire query mix, which improves accuracy
        • Making the model better over time
            o Human rating data becomes training data for algorithm
            o Gradually, algorithm gets better, you need fewer human ratings
  15. Human Rater Biases
        • Two very different tasks:
            o Look for attribute X, which occurs in 1% of data.
            o Look for attribute X, which occurs in 50% of data.
        • The harder you have to look for instances of X, the more things start to look like X.
            o Your sensitivity increases. You get trigger-happy.
  16. Human Rater Biases
        • Thought experiment: “Listen for any naughty words or phrases”
            o Corpus 1: Nationally televised sports color commentary
            o Corpus 2: Gangster rap music
        • Some words sound bad in the nationally televised sports context, but wouldn’t in the gangster rap context.
        • Cognitive psychologists call this the Contrast Effect.
  17. Human Rater Biases
        • We gave two sets of crowdsource workers (same agency, same pay rate) the same data, mixed in with two different surrounding data sets
            o Group A: Raw query file
            o Group B: Filtered with heuristics and templates first
        • Of the queries that Group A thought were answerable, Group B thought only 64% were answerable
        • Queries where the two groups disagreed were overwhelmingly false positives by Group A, rather than false negatives from Group B:
            o how you spell a word
            o how much does a book of stamps cost
            o is randy fenoli married
            o when does the alabama football game start
            o where to donate old magazines
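The 64% figure on slide 17 is a conditional agreement rate: among the queries Group A marked answerable, the share that Group B also marked answerable. A minimal sketch of that calculation follows. The function name and the toy labels are invented for illustration and are not the study's data.

```python
from typing import Dict


def agreement_on_positives(group_a: Dict[str, bool], group_b: Dict[str, bool]) -> float:
    """Share of Group A's 'answerable' queries that Group B also marked answerable.

    Only queries rated by both groups are counted.
    """
    a_positive = [q for q, answerable in group_a.items() if answerable and q in group_b]
    if not a_positive:
        raise ValueError("Group A marked no shared queries as answerable")
    also_positive = sum(1 for q in a_positive if group_b[q])
    return also_positive / len(a_positive)


# Toy labels, made up for this example (True = rated answerable).
ratings_a = {
    "how you spell a word": True,
    "is randy fenoli married": True,
    "how long to bake chicken": True,
    "what is renaissance art": True,
}
ratings_b = {
    "how you spell a word": False,
    "is randy fenoli married": False,
    "how long to bake chicken": True,
    "what is renaissance art": True,
}

rate = agreement_on_positives(ratings_a, ratings_b)
```

With these toy labels, Group B agrees on two of Group A's four positives, so `rate` is 0.5.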
  18. What Crowdsource Writers Will and Won’t Do for You
        • Don’t rely on crowdsource workers to self-select which tasks are viable
            o “Only answer the answerable queries” (and we only pay you for what you answer)
            o Writers biased towards everything being answerable
        • Exception: If the task is too big, they are happy to flag those
            o How to repair a transmission
            o History of China
            o US senators all time
            o How does organic chemistry work
  19. Easy Filters: Dynamic
  20. Easy Filters: Dumb
  21. What to Include in Training Data
        • Some question patterns are almost universally answerable questions
            o Who invented [NP]?
            o Where was [person] born?
            o How to [cooking verb] a [food item]
            o What does […] mean?
        • We grab these queries using template filters, and don’t need ML
        • Should we include these in our training data?
        • This is an empirical question. Does the algorithm perform better or worse if the “easy” data is included in the training data?
        • In this specific case, the model is more accurate when trained without “easy” data
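The template filters on slide 21 could be approximated with a few regular expressions, and the same matcher can split a query set into "easy" and "hard" parts so a model can be trained with or without the easy data. This is a rough sketch under stated assumptions: the regexes, the short cooking-verb list and the function names are hypothetical, and the slide's [NP], [person] and [food item] slots are loosened to "any text".

```python
import re
from typing import List, Tuple

# Hypothetical regexes for the four patterns on the slide.
# The cooking-verb list is a small illustrative placeholder.
EASY_TEMPLATES = [
    re.compile(r"^who invented .+", re.IGNORECASE),
    re.compile(r"^where was .+ born\??$", re.IGNORECASE),
    re.compile(r"^how to (bake|boil|fry|grill|roast) (a |an )?.+", re.IGNORECASE),
    re.compile(r"^what does .+ mean\??$", re.IGNORECASE),
]


def is_easy(query: str) -> bool:
    """True if the query matches any of the 'almost always answerable' templates."""
    q = query.strip()
    return any(template.match(q) for template in EASY_TEMPLATES)


def split_easy_hard(queries: List[str]) -> Tuple[List[str], List[str]]:
    """Split queries into (easy, hard) so the easy ones can be left out of training."""
    easy = [q for q in queries if is_easy(q)]
    hard = [q for q in queries if not is_easy(q)]
    return easy, hard
```

To test the empirical question from the slide, one would train once on `easy + hard` and once on `hard` alone, then compare accuracy on a held-out set of hard queries, since those are the only ones the model sees in production.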
  22. Conclusions
        • If you have a firehose of data, don’t just:
            o Send it to crowdsourcers
            o Try to build a ML model
        • Instead, figure out what the “easy” cases are, and deal with those separately, using common sense rules
        • Put your crowdsourcing and machine learning efforts on just the hard part of the problem
  23. Thank you
      Ehren Reilly
      ehren.reilly@ask.com
      @ehrenreilly
