This document discusses Ask.com's challenge of determining which search queries deserve editorial answers. It presents Ask.com's hybrid approach, which first filters out queries that are obviously not suitable for editorial answers, then uses dedicated classifiers and machine learning to filter further, with low-confidence queries sent for human review. This reduces the workload for human reviewers by 97% compared to no filtering. The approach improves the machine learning model's accuracy by narrowing its domain and allows the model to improve gradually by using human ratings as training data. Human rater biases are also discussed, showing how pre-filtering data can improve the reliability of human reviews.
What Questions Are Worth Answering?
1. What Questions are Worth Answering?
Ehren Reilly
Sr. Product Manager, Content,
Ask.com
Sentiment Analysis Innovation
Summit
San Francisco, CA
April 25, 2013
2. Overview
• Our challenge: What queries deserve an editorial
answer?
• Our approach to cost-effectively figuring this out
• Advantages of our approach
• Details of our approach
3. Our Challenge
• Ask.com Snapshot:
• Q&A service combining the power of search, quality
editorial content and content from the web
• Top 10 US internet site (according to comScore)
• 100 million unique users globally; 70 million unique
users in the US
• Search Q&A
• When you come to Ask.com and ask a question, we give
you the answer to your question.
• Ask.com editors create answers to questions that are asked
frequently, based on search query data
• Problem: Not every query is suitable for evergreen, static
editorial answers
4. Our Challenge
What type of information does the query deserve?
• Entities & services (people, things, websites, products, media,
resources)
o General web search, shopping search, Wikipedia data,
tools, applications
• Dynamic data and frequently-changing facts (e.g., the
weather)
o Data partnerships
• Static, evergreen information, suitable for editorial expert
answers
o Writers, editors, crowd labor, etc. $$
• Extremely detailed/technical answers that need a long article
by a true expert
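As a purely illustrative sketch (the key names and structure below are assumptions for readability, not Ask.com internals), the taxonomy above can be read as a mapping from query type to fulfillment source:

```python
# Illustrative mapping from the query types above to how they are fulfilled,
# following the slide; the dictionary keys are made up for readability.
FULFILLMENT = {
    "entities_and_services": ["general web search", "shopping search",
                              "Wikipedia data", "tools", "applications"],
    "dynamic_data":          ["data partnerships"],
    "evergreen_editorial":   ["writers", "editors", "crowd labor"],  # the costly ($$) path
    "deep_technical":        ["long-form articles by true experts"],
}
```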
5. Our Challenge
What type of information does the query want?
• Not answer requests:
o Facebook login
o Barack Obama
o Tickets Seattle to Miami
o Olympus Has Fallen
o Sonicare HX6711/02
o Selena Gomez photos
o Philippines map
o German Shepherds
o Chichen Itza timelapse
o Salary calculator
• Wants dynamic data answer:
o What time is it in Bangkok
o Dollars to pounds
o SF Giants score
o Weather in Cleveland
o What’s my IP address
o Kim Kardashian pregnant?
o NBA assists leader
o Where is Justin Bieber?
o Oldest living person
o Gay marriage states
• Wants evergreen expert answer:
o When was the Ming Dynasty
o Tom Cruise baby name
o How long to bake chicken
o What is Renaissance art
o Highest alcohol beer
o Head gasket repair cost
o Abraham Lincoln’s wife
o Parachute material
o Most reliable dishwashers
o How to remove hair dye
7. How do we pick out these answerable
evergreen fact queries?
• Let the editors do it themselves?
o Valuable editorial time wasted considering obvious stuff
o For crowd editorial labor, conflict of interest creates an “everything is OK to answer” bias
• Crowd labor vetting?
o Hard to communicate task
o Still very costly
• Template-based filters?
o Coverage is too low
o Lots of work to develop these
• Machine learning?
o Very fuzzy problem
o Target set is a small segment of huge search space
o Hard to achieve high accuracy
8. Our Hybrid Approach
1. Filter out the obvious stuff (e.g., “Facebook.com”, “What time is
it”, “What does ‘looking a gift horse in the mouth’ mean?”)
2. Dedicated classifiers to filter out specific types of non-suitable
queries
• Duplicates & near-duplicates
• Navigational
• Adult / profane / creepy
• Temporal / dynamic / timely
• Shopping / product search
• Wiki / entity exact match
3. Build machine learning “answerability” model for the tricky
remaining cases
4. Where the model returns low confidence, send those queries to
crowd labor for classification
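A minimal sketch of how such a pipeline could be wired together is shown below. This is not Ask.com's implementation; the thresholds, filter interfaces, and names are assumptions made for illustration only.

```python
# Hypothetical sketch of the hybrid filtering pipeline described above.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Verdict:
    label: str    # "answerable", "not_answerable", or "needs_human_review"
    reason: str

# Assumed confidence band: above HIGH or below LOW we trust the model,
# in between we fall back to human review.
LOW, HIGH = 0.3, 0.8

def route_query(
    query: str,
    template_filters: List[Tuple[str, str, Callable[[str], bool]]],   # (name, label, matcher)
    dedicated_classifiers: List[Tuple[str, Callable[[str], bool]]],   # (name, is_non_suitable)
    answerability_model: Callable[[str], float],                      # returns P(answerable)
) -> Verdict:
    # 1. Cheap templates/heuristics knock out the obvious cases first.
    for name, label, matches in template_filters:
        if matches(query):
            return Verdict(label, f"template:{name}")

    # 2. Dedicated classifiers for duplicates, navigational, adult,
    #    temporal/dynamic, shopping, and exact entity-match queries.
    for name, is_non_suitable in dedicated_classifiers:
        if is_non_suitable(query):
            return Verdict("not_answerable", f"classifier:{name}")

    # 3. ML "answerability" model handles the tricky remaining cases.
    p = answerability_model(query)
    if p >= HIGH:
        return Verdict("answerable", f"model:p={p:.2f}")
    if p <= LOW:
        return Verdict("not_answerable", f"model:p={p:.2f}")

    # 4. Low confidence: send these queries to crowd labor for classification.
    return Verdict("needs_human_review", f"model:p={p:.2f}")
```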
13. Advantages to This Approach
(Diagram: queries are divided into “Evergreen Facts”, “Don’t Send to Editorial”, and a remainder that “Requires Human Review”.)
14. Advantages to This Approach
• Filtering and partial automation first makes human review
much less costly
o Tasks requiring human scoring reduced by 97%
• Domain of ML model is narrower than entire query mix, which
improves accuracy
• Making the model better over time
o Human rating data becomes training data for algorithm
o Gradually, the algorithm gets better and you need fewer human
ratings
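One way to picture that feedback loop is a simple active-learning cycle; the sketch below is an assumption about how such a loop might look, not Ask.com's actual system, and the confidence band is illustrative.

```python
# Hypothetical active-learning loop: human ratings on low-confidence
# queries become training data, so fewer ratings are needed over time.
from typing import Callable, Iterable, List, Tuple

def improvement_cycle(
    predict: Callable[[str], float],                               # current model: P(answerable)
    rate: Callable[[str], str],                                    # crowd labor: returns a label
    retrain: Callable[[List[Tuple[str, str]]], Callable[[str], float]],
    queries: Iterable[str],
    band: Tuple[float, float] = (0.3, 0.8),                        # assumed low-confidence band
) -> Callable[[str], float]:
    low, high = band
    # Only queries the model is unsure about go to human raters.
    rated = [(q, rate(q)) for q in queries if low < predict(q) < high]
    # Human ratings become training data; over iterations the model improves
    # and the low-confidence band (and thus the human workload) shrinks.
    return retrain(rated) if rated else predict
```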
15. Human Rater Biases
• Two very different tasks:
o Look for attribute X, which occurs in 1% of data.
o Look for attribute X, which occurs in 50% of data.
• The harder you have to look for instances of X, the more things
start to look like X.
o Your sensitivity increases. You get trigger-happy.
16. Human Rater Biases
Thought experiment:
“Listen for any naughty words or phrases”
Corpus 1: Nationally televised sports color commentary
Corpus 2: Gangster rap music
• Some words sound bad in the nationally televised sports
context, but wouldn’t in the gangster rap context.
• Cognitive psychologists call this the Contrast Effect.
17. Human Rater Biases
• We gave two sets of crowdsource workers (same agency,
same pay rate) the same data, mixed in with two different
surrounding data sets
o Group A: Raw query file
o Group B: Filtered with heuristics and templates first
• Of the queries that Group A thought were answerable, Group
B thought only 64% were answerable
• Queries where the two groups disagreed were
overwhelmingly false positives by Group A, rather than false
negatives from Group B:
• how you spell a word
• how much does a book of stamps cost
• is randy fenoli married
• when does the alabama football game start
• where to donate old magazines
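For illustration only (the helper and data layout below are assumptions, not from the talk), the 64% figure can be read as a conditional agreement rate computed over the probe queries both groups rated:

```python
# Hypothetical helper: given each group's ratings on the same probe queries,
# compute the share of Group A's "answerable" calls that Group B agreed with.
def conditional_agreement(ratings_a: dict, ratings_b: dict) -> float:
    a_yes = {q for q, label in ratings_a.items() if label == "answerable"}
    b_yes = {q for q, label in ratings_b.items() if label == "answerable"}
    # Reported here as roughly 0.64 for Group B given Group A.
    return len(a_yes & b_yes) / len(a_yes) if a_yes else 0.0
```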
18. What Crowdsource Writers Will and Won’t
Do for You
• Don’t rely on crowdsource workers to self-select which
tasks are viable
o “Only answer the answerable queries” (and we only pay
you for what you answer)
o Writers biased towards everything being answerable
• Exception: If the task is too big, they are happy to flag those
o How to repair a transmission
o History of China
o US senators all time
o How does organic chemistry work
21. What to Include in Training Data
• Some question patterns are almost universally answerable
questions
o Who invented [NP]?
o Where was [person] born?
o How to [cooking verb] a [food item]
o What does […] mean?
• We grab these queries using template filters, and don’t need ML
• Should we include these in our training data?
• This is an empirical question. Does the algorithm perform better
or worse if the “easy” data is included in the training data?
• In this specific case, the model is more accurate when trained
without “easy” data
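As a hedged illustration of such template filters (the regexes and helper names below are assumptions, not Ask.com's real patterns), a handful of regular expressions can capture the “easy” templates and keep those queries out of the ML training set:

```python
import re

# Hypothetical regex templates for question patterns that are almost
# universally answerable; real filters would be far more extensive.
EASY_TEMPLATES = [
    re.compile(r"^who invented\b", re.I),                     # Who invented [NP]?
    re.compile(r"^where was .+ born\b", re.I),                # Where was [person] born?
    re.compile(r"^how to (bake|roast|grill|boil)\b", re.I),   # How to [cooking verb] a [food item]
    re.compile(r"^what does .+ mean\b", re.I),                # What does [...] mean?
]

def is_easy(query: str) -> bool:
    """True if the query matches an 'almost universally answerable' template."""
    return any(t.search(query) for t in EASY_TEMPLATES)

def split_training_data(labeled_queries):
    """Drop template-matched 'easy' queries from the ML training set, which
    (per the slide) made the model more accurate in this specific case."""
    return [(q, y) for q, y in labeled_queries if not is_easy(q)]
```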
22. Conclusions
• If you have a firehose of data, don’t just:
o Send it to crowdsourcers
o Try to build an ML model
• Instead, figure out what the “easy” cases are, and deal with
those separately, using common sense rules
• Put your crowdsourcing and machine learning efforts on just
the hard part of the problem