Predicting User Satisfaction with Intelligent Assistants

There is a rapid growth in the use of voice-controlled intelligent
personal assistants on mobile devices, such as Microsoft’s Cortana,
Google Now, and Apple’s Siri.
They significantly change the way users interact with search systems,
not only because of the use of voice control and touch gestures,
but also due to the dialogue-style nature of the interactions and their
ability to preserve context across different queries. Predicting success
and failure of such search dialogues is a new problem, and
an important one for evaluating and further improving intelligent
assistants. While clicks in web search have been extensively used
to infer user satisfaction, their significance in search dialogues is
lower due to the partial replacement of clicks with voice control,
direct and voice answers, and touch gestures.
In this paper, we propose an automatic method to predict user
satisfaction with intelligent assistants that exploits all the interaction
signals, including voice commands and physical touch gestures
on the device.
First, we conduct an extensive user study to measure user satisfaction
with intelligent assistants, and simultaneously record all
user interactions. Second, we show that the dialogue style of interaction
makes it necessary to evaluate the user experience at the
overall task level as opposed to the query level. Third, we train a
model to predict user satisfaction, and find that interaction signals
that capture the user reading patterns have a high impact: when including
all available interaction signals, we are able to improve the
prediction accuracy of user satisfaction from 71% to 81% over a
baseline that utilizes only click and query features.

  1. Predicting User Satisfaction with Intelligent Assistants. Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, Tasos Anastasakos. Eindhoven University of Technology, Pennsylvania State University, Microsoft. SIGIR’16, Pisa, Italy
  2. From Queries to Dialogues Q1: how is the weather in Chicago Q2: how is it this weekend Q3: find me hotels Q4: which one of these is the cheapest Q5: which one of these has at least 4 stars Q6: find me directions from the Chicago airport to number one User’s dialogue with Cortana: Task is “Finding a hotel in Chicago”
  3. From Queries to Dialogues Q1: find me a pharmacy nearby Q2: which of these is highly rated Q3: show more information about number 2 Q4: how long will it take me to get there Q5: Thanks User’s dialogue with Cortana: Task is “Finding a pharmacy”
  4. From Queries to Dialogues. User: “show restaurants near me” Cortana: “Here are ten restaurants near you” User: “show the best ones” Cortana: “Here are ten restaurants near you that have good reviews” User: “show directions to the second one” Cortana: “Getting you directions to the Mayuri Indian Cuisine”
  5. Main Research Question: How can we automatically predict user satisfaction with search dialogues on intelligent assistants using click, touch, and voice interactions?
  6. Single Task Search Dialogue. User: “Do I need to have a jacket tomorrow?” Cortana: “You could probably go without one. The forecast shows …”
  7. Multi-Task Search Dialogues. User: “show restaurants near me” Cortana: “Here are ten restaurants near you” User: “show the best ones” Cortana: “Here are ten restaurants near you that have good reviews” User: “show directions to the second one” Cortana: “Getting you directions to the Mayuri Indian Cuisine”
  8. How to define user satisfaction with search dialogues?
  9. User: “show restaurants near me” Cortana: “Here are ten restaurants near you” User: “show the best ones” Cortana: “Here are ten restaurants near you that have good reviews” User: “show directions to the second one” Cortana: “Getting you directions to the Mayuri Indian Cuisine” No clicks???
  10. User: “show restaurants near me” Cortana: “Here are ten restaurants near you” (SAT?) User: “show the best ones” Cortana: “Here are ten restaurants near you that have good reviews” (SAT?) User: “show directions to the second one” Cortana: “Getting you directions to the Mayuri Indian Cuisine” (SAT?) Overall SAT?
  11. User Frustration. Dialogue with an intelligent assistant; task is “Planning a weekend”. Q1: what's the weather like in San Francisco Q2: what's the weather like in Mountain View Q3: can you find me a hotel close to Mountain View Q4: can you show me the cheapest ones Q5: show me the third one Q6: show me the directions from SFO to this hotel Q7: go back to first hotel (misrecognition) Q8: show me hotels in Mountain View Q9: show me cheap hotels in Mountain View Q10: show me more about the third one. The user restarts the search (Q8-Q10) and is eventually satisfied.
  12. What interaction signals can we track during search dialogues?
  13. Tracking User Interaction: Click Signals • Number of queries in a dialogue • Number of clicks in a dialogue • Number of SAT clicks (> 30 sec. dwell time) in a dialogue • Number of DSAT clicks (< 15 sec. dwell time) in a dialogue • Time (seconds) until the first click in a dialogue
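The click signals listed on slide 13 can be aggregated per dialogue. A minimal sketch, where the 30-second and 15-second dwell thresholds come from the slide and everything else (record fields, function names) is a hypothetical illustration rather than the paper's code:

```python
from dataclasses import dataclass

# Hypothetical click record; the field names are assumptions, not the paper's schema.
@dataclass
class Click:
    time: float   # seconds since the start of the dialogue
    dwell: float  # seconds spent on the clicked result

def click_features(num_queries: int, clicks: list[Click]) -> dict:
    """Aggregate the click signals from slide 13 over one dialogue."""
    return {
        "n_queries": num_queries,
        "n_clicks": len(clicks),
        # SAT click: dwell time longer than 30 s; DSAT click: shorter than 15 s.
        "n_sat_clicks": sum(1 for c in clicks if c.dwell > 30),
        "n_dsat_clicks": sum(1 for c in clicks if c.dwell < 15),
        # Time until the first click, or None for a dialogue with no clicks at all.
        "time_to_first_click": min((c.time for c in clicks), default=None),
    }

clicks = [Click(time=12.0, dwell=45.0), Click(time=70.0, dwell=5.0)]
feats = click_features(num_queries=4, clicks=clicks)
```

Note that `default=None` keeps the feature well defined for the no-click dialogues that slides 9-10 highlight.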
  14. Tracking User Interaction: Acoustic Signals. Phonetic similarity between consecutive requests
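Slide 14's phonetic similarity between consecutive requests can be sketched with any phonetic encoding. The transcript does not say which encoding the paper used; the simplified Soundex below is a stand-in, combined with a sequence-similarity ratio over the encoded word lists:

```python
import difflib

def soundex(word: str) -> str:
    """Simplified Soundex code for one word (omits the h/w special cases)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits = [codes.get(ch, "") for ch in word]
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:  # skip vowels, collapse adjacent duplicate codes
            out.append(d)
        prev = d
    return (word[0].upper() + "".join(out) + "000")[:4]

def phonetic_similarity(a: str, b: str) -> float:
    """Similarity of the Soundex-encoded word sequences of two requests."""
    enc = lambda s: [soundex(w) for w in s.split()]
    return difflib.SequenceMatcher(None, enc(a), enc(b)).ratio()
```

A high similarity between consecutive requests is a plausible reformulation or misrecognition signal, as in the "User Frustration" dialogue on slide 11.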
  15. Tracking User Interaction
  16. Tracking User Interaction. Attributed viewing time example: 3 seconds at 33% of ViewPort + 6 seconds at 66% of ViewPort + 2 seconds at 20% of ViewPort gives 1s + 4s + 0.4s = 5.4s
  17. Tracking User Interaction: Touch Signals • Number of swipes • Number of up-swipes • Number of down-swipes • Total distance swiped (pixels) • Number of swipes normalized by time • Total distance divided by number of swipes • Total swiped distance divided by time • Number of swipe direction changes • Duration (seconds) for which a SERP answer is shown on screen (even partially) • Fraction of visible pixels belonging to a SERP answer • Attributed time (seconds) to viewing a particular element (answer) on the SERP • Attributed time (seconds) per unit height (pixels) associated with a particular element on the SERP • Attributed time (milliseconds) per unit area (square pixels) associated with a particular element on the SERP
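The arithmetic on slide 16 (3 s at 33% of the viewport + 6 s at 66% + 2 s at 20% = 1 s + 4 s + 0.4 s = 5.4 s) suggests weighting each dwell interval by the fraction of the viewport the element occupies at that scroll position. A sketch under that assumption; the pixel geometry below is made up purely to reproduce the slide's fractions:

```python
def visible_fraction(elem_top, elem_bottom, vp_top, vp_height):
    """Fraction of the viewport occupied by the element at one scroll position."""
    vp_bottom = vp_top + vp_height
    overlap = max(0.0, min(elem_bottom, vp_bottom) - max(elem_top, vp_top))
    return overlap / vp_height

def attributed_time(elem_top, elem_bottom, vp_height, dwells):
    """Total viewing time attributed to one element: sum of dwell * visible fraction
    over (viewport_top, dwell_seconds) scroll stops."""
    return sum(d * visible_fraction(elem_top, elem_bottom, top, vp_height)
               for top, d in dwells)

# Hypothetical SERP answer spanning pixels 600-1000, viewport 600 px tall.
# Scroll stops: 3 s with 1/3 of the viewport covered, 6 s with 2/3, 2 s with 1/5.
total = attributed_time(600, 1000, 600, [(200, 3), (400, 6), (880, 2)])
```

Dividing `total` by the element's height or area then gives the per-unit-height and per-unit-area variants from slide 17.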
  18. How to collect data?
  19. User Study Participants. 60 participants, aged 25.53 ± 5.42 years. Gender: 75% male, 25% female. Language: 55% English, 45% other. Education: 82% Computer Science, 8% Electrical Engineering, 2% Mathematics, 8% other.
  20. You are planning a vacation. Pick a place. Check if the weather is good enough for the period you are planning the vacation. Find a hotel that suits you. Find the driving directions to this place.
  21. You are planning a vacation. Pick a place. Check if the weather is good enough for the period you are planning the vacation. Find a hotel that suits you. Find the driving directions to this place.
  22. Questionnaire • Were you able to complete the task? o Yes/No • How satisfied are you with your experience in this task? o If the task has sub-tasks, participants indicate their graded satisfaction, e.g. o a. How satisfied are you with your experience in finding a hotel? o b. How satisfied are you with your experience in finding directions? • How well did Cortana recognize what you said? o 5-point Likert scale • Did you put in a lot of effort to complete the task? o 5-point Likert scale
  23. Questionnaire • Were you able to complete the task? o Yes/No • How satisfied are you with your experience in this task? o If the task has sub-tasks, participants indicate their graded satisfaction, e.g. o a. How satisfied are you with your experience in finding a hotel? o b. How satisfied are you with your experience in finding directions? • How well did Cortana recognize what you said? o 5-point Likert scale • Did you put in a lot of effort to complete the task? o 5-point Likert scale. 8 tasks: 1 simple, 4 with 2 subtasks, 3 with 3 subtasks; ~30 minutes
  24. Search Dialog Dataset • Total number of queries is 2,040 • Number of unique queries is 1,969 • The average query length is 7.07
  25. Search Dialog Dataset • Total number of queries is 2,040 • Number of unique queries is 1,969 • The average query length is 7.07 • The simple task generated 130 queries • Tasks with 2 context switches generated 685 queries • Tasks with 3 context switches generated 1,355 queries
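Statistics like those on slides 24-25 (total queries, unique queries, average query length) are straightforward to compute from a query log. A sketch, assuming query length is counted in words, since the slide does not state the unit:

```python
def dialogue_stats(queries: list[str]) -> dict:
    """Basic corpus statistics over a list of query strings."""
    return {
        "total_queries": len(queries),
        "unique_queries": len(set(queries)),
        "avg_query_length": sum(len(q.split()) for q in queries) / len(queries),
    }

stats = dialogue_stats([
    "find me a pharmacy nearby",
    "which of these is highly rated",
    "find me a pharmacy nearby",  # a repeated query counts once among uniques
])
```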
  26. How can we predict user satisfaction with search dialogues using interaction signals?
  27. Q1: what do you have medicine for the stomach ache Q2: stomach ache medicine over the counter General Web SERP User’s dialogue about the ‘stomach ache’
  28. Q1: what do you have medicine for the stomach ache Q2: stomach ache medicine over the counter Q3: show me the nearest pharmacy Q4: more information on the second one General Web SERP Structured SERP User’s dialogue about the ‘stomach ache’
  29. General Web and Structured SERP
  30. General Web and Structured SERP
  31. Aggregating Touch Interactions I( ) 1.
  32. Aggregating Touch Interactions I( ) I( , ) 1. 2.
  33. Aggregating Touch Interactions I( ) I( ),I( )I( , ) 1. 2. 3.
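The icons inside the I(...) notation on slides 31-33 were lost in this transcript. One plausible reading, consistent with the three "Interaction Model" rows in the results table that follows, is three ways of pooling per-query interaction features: (1) over the whole dialogue, (2) per SERP type (general web vs. structured), and (3) both. The sketch below is an assumption throughout; it uses mean-pooling for illustration, and the paper's actual aggregation operator is not shown here:

```python
def pool(records):
    """Mean-pool a list of (serp_type, feature-dict) records into one feature dict."""
    keys = {k for _, f in records for k in f}
    return {k: sum(f.get(k, 0.0) for _, f in records) / len(records) for k in keys}

def model1(records):
    # Scheme 1: one pooled vector over the whole dialogue.
    return {"all": pool(records)}

def model2(records):
    # Scheme 2: one pooled vector per SERP type (general web vs. structured).
    return {t: pool([r for r in records if r[0] == t]) for t in {t for t, _ in records}}

def model3(records):
    # Scheme 3: the per-type vectors plus the combined one.
    return {**model2(records), "all": pool(records)}

# Hypothetical per-query touch features for a two-query dialogue.
records = [("web", {"swipes": 2.0, "swipe_px": 300.0}),
           ("structured", {"swipes": 4.0, "swipe_px": 100.0})]
```

Separating the two SERP types matters because, per slide 38, the same scrolling behavior can signal satisfaction on one page type and dissatisfaction on the other.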
  34. Quality of Interaction Model
      Method                Accuracy (%)      Average F1 (%)
      Baseline              70.62             61.38
      Interaction Model 1   78.78* (+11.55)   83.59* (+35.90)
      Interaction Model 2   80.21* (+13.58)   83.31* (+35.44)
      Interaction Model 3   80.81* (+14.43)   79.08* (+28.83)
      * Statistically significant improvement (p < 0.05)
  35. Which interaction signals have the highest impact on predicting user satisfaction with search dialogues?
  36. Predicting User Satisfaction • F1: When the SERP for a query is ordered by a measure of relevance as determined by the system, additional exploration is unlikely to reflect user satisfaction; it is more likely an indication that the best-provided results (i.e. the SERP top) are insufficient to address the user intent
  37. Predicting User Satisfaction • F1: When the SERP for a query is ordered by a measure of relevance as determined by the system, additional exploration is unlikely to reflect user satisfaction; it is more likely an indication that the best-provided results (i.e. the SERP top) are insufficient to address the user intent • F2: In the converse case of F1, when users find content that satisfies their intent, their likelihood of scrolling is reduced, and they dwell for an extended period on the top viewport
  38. Predicting User Satisfaction • F1: When the SERP for a query is ordered by a measure of relevance as determined by the system, additional exploration is unlikely to reflect user satisfaction; it is more likely an indication that the best-provided results (i.e. the SERP top) are insufficient to address the user intent • F2: In the converse case of F1, when users find content that satisfies their intent, their likelihood of scrolling is reduced, and they dwell for an extended period on the top viewport • F3: When users are involved in a complex task, they are dissatisfied when redirected to a general web SERP. Unlike F2, the absence of scrolling on this landing page is an indication of dissatisfaction
  39. Conclusion
      How can we define user satisfaction with search dialogues?
      • User satisfaction with search dialogues is defined in a generalized form: it is an aggregation of satisfaction with all of the dialogue’s tasks, not satisfaction with each of the dialogue’s queries separately
      How can we predict user satisfaction with search dialogues using interaction signals?
      • We showed that features derived from voice, and especially from touch, interactions add a significant gain in accuracy over the baseline
      Which interaction signals have the highest impact on predicting user satisfaction with search dialogues?
      • Our analysis showed a strong negative correlation between user satisfaction and swipe actions
  40. • User satisfaction with search dialogues is defined in a generalized form: it is an aggregation of satisfaction with all of the dialogue’s tasks, not satisfaction with each of the dialogue’s queries separately • We showed that features derived from voice, and especially from touch, interactions add a significant gain in accuracy over the baseline • Our analysis showed a strong negative correlation between user satisfaction and swipe actions. Thank you! Questions?
