Text versus Speech: A Comparison of Tagging Input Modalities for Camera Phones

Speech and typed text are two common input modalities for mobile phones. However, little research has compared them in their ability to support annotation and retrieval of digital pictures on mobile devices. In this paper, we report the results of a month-long field study in which participants took pictures with their camera phones and had the choice of adding annotations using speech, typed text, or both. Subsequently, the same subjects participated in a controlled experiment where they were asked to retrieve images based on annotations, and annotations based on images, in order to study how effectively each modality supports users' recall of the previously captured pictures. Results demonstrate that each modality has advantages and shortcomings for the production of tags and the retrieval of pictures. Several guidelines are suggested for the design of tagging applications for portable devices.

Text versus Speech: A Comparison of Tagging Input Modalities for Camera Phones

  1. Text vs. Speech: A Comparison of Tagging Input Modalities for Camera Phones. Mauro Cherubini, Xavier Anguera, Nuria Oliver, and Rodrigo de Oliveira (Research & Development)
  2. People do not want to tag their pictures.
  3. Research question: assuming that users are willing to input at least one tag, which input modality better supports the production and retrieval of pictures?
  4. Hypothesis 1: speech is preferred to text as an annotation mechanism on mobile phones (objective measure). Support: Mitchard and Winkles (2002).
  5. Hypothesis 1-bis: speech annotations are preferred by users even if this means spending more time on the task (subjective measure). Support: Perakakis and Potamianos (2008).
  6. Hypothesis 2: the longer the tag, the larger the advantage of voice over text for annotating pictures on mobile phones. Support: Hauptmann and Rudnicky (1990).
  7. Hypothesis 3: retrieving pictures on mobile phones with speech is not faster than with text (objective measure). Support: Mills et al. (2000).
  8. The user study: a field study (4 weeks) followed by a controlled experiment with four tasks (T1, T2, T3, T4). Three experimental conditions: (a) speech only, (b) text only, (c) speech and text.
  9. MAMI, the mobile annotation application used in the study.
  10. Features of MAMI: processing is done entirely on the mobile phone; speech is not transcribed; to compare the waveforms of the audio tags, MAMI uses the Dynamic Time Warping (DTW) algorithm (a DTW sketch follows the transcript).
  11. Task 1, remember the tag: stimulus, a picture taken during the field trial; retrieval, the tag that was associated with it.
  12. Task 2, remember the context: stimulus, a tag (the slide shows an example: TASK 2, PICTURE 1, "three little bushes", Garden, Tree, Stairs); retrieval, the context of the corresponding picture.
  13. Task 3, remember the picture: stimulus, a tag presented in the opposite modality (audio tags were converted into textual tags and vice versa); retrieval, the corresponding picture.
  14. Task 4, remember the sequence: participants were assigned six pictures (three among the oldest and three among the newest from the field trial) and had to retrieve the order in which they were taken.
  15. Metrics: time to completion, false positives, and retrieval errors (a sketch of how these could be computed follows the transcript).
  16. Results: H1.
  17. Results, H1-bis: all participants in the BOTH group felt that tagging with text was more effective than tagging with voice. Voice: 3.33 [0.81], Text: 4.34 [0.81] (mean [SD]; 1 = completely agree, 5 = completely disagree).
  18. Results: H2.
  19. Results: H3.
  20. Results: H3, continued.
  21. Take-away 1, speech is not a given: the advantage of audio as an input modality for tagging pictures on mobile phones is not a given. Why? (1) retrieval precision; (2) privacy.
  22. Take-away 2, input mistakes: users address text input mistakes immediately; by contrast, mistakes in audio recordings are less frequently addressed.
  23. Take-away 3, memory: speech does not help users memorize the tags.
  24. Implication 1: allow multiple modalities. (Slide image © Pixar, 2008.)
  25. Implication 2: enable audio inspection.
  26. Implication 3: enable modality synesthesia. (Slide image © Disney, 1940.)
  27. End. Thanks! Contact: martigan@gmail.com, mauro@tid.es. Web: http://www.i-cherubini.it/mauro/blog/, http://research.tid.es/multimedia/
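
A minimal sketch of Dynamic Time Warping (DTW) applied to two audio tags, as referenced on slide 10. The slides only state that MAMI compares the waveforms of the audio tags with DTW; the MFCC-like feature representation, the Euclidean frame distance, and the dtw_distance function below are illustrative assumptions, not MAMI's actual implementation.

    import numpy as np

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        """DTW alignment cost between two feature sequences a (n x d) and b (m x d)."""
        n, m = len(a), len(b)
        # cost[i, j] = minimal cumulative distance aligning a[:i] with b[:j]
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance (Euclidean)
                cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of a
                                     cost[i, j - 1],      # skip a frame of b
                                     cost[i - 1, j - 1])  # align the two frames
        return float(cost[n, m])

    # Example: two hypothetical audio tags as 13-dimensional MFCC-like frame sequences.
    query_tag = np.random.rand(40, 13)
    stored_tag = np.random.rand(55, 13)
    print(dtw_distance(query_tag, stored_tag))  # smaller cost = more similar tags

Because the comparison operates directly on acoustic feature sequences, no speech transcription is needed, which matches the "speech is not transcribed" point on slide 10.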
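
A minimal sketch of how the three metrics on slide 15 could be computed from per-trial logs. The Trial record and its fields (timestamps, retrieved picture ids, target id) are hypothetical; the slides do not describe the study's actual logging format.

    from dataclasses import dataclass

    @dataclass
    class Trial:
        start_s: float   # hypothetical trial start time, in seconds
        end_s: float     # hypothetical trial end time, in seconds
        retrieved: set   # picture ids the participant retrieved during the trial
        target: str      # the picture id the participant was asked to find

    def time_to_completion(t: Trial) -> float:
        return t.end_s - t.start_s

    def false_positives(t: Trial) -> int:
        # pictures retrieved that are not the requested one
        return len(t.retrieved - {t.target})

    def retrieval_error(t: Trial) -> bool:
        # the trial counts as an error if the target was never retrieved
        return t.target not in t.retrieved

    trial = Trial(start_s=0.0, end_s=12.4, retrieved={"p01", "p07"}, target="p07")
    print(time_to_completion(trial), false_positives(trial), retrieval_error(trial))  # 12.4 1 False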
