Research on Tagging Photos with Text vs. Speech Input

Research & Development

Text vs. Speech
A Comparison of Tagging Input Modalities
for Camera Phones

Mauro Cherubini, Xavier Anguera,
Nuria Oliver, and Rodrigo de Oliveira

people do not want to tag
their pictures
intro → hypotheses → methodology → results → implications

research question:

Assuming that users are willing to
input at least one tag, which input
modality can help the production and
retrieval of the pictures?


hypothesis 1

Speech is preferred to text as an
annotation mechanism on mobile
phones (objective measure)

Support:
- Mitchard and Winkles (2002)


hypothesis 1-bis

Speech annotations are preferred by
users even if this means spending more
time on the task (subjective measure)

Support:
- Perakakis and Potamianos (2008)


hypothesis 2

The longer the tag, the larger the
advantage of voice over text for
annotating pictures on mobile phones

Support:
- Hauptmann and Rudnicky (1990)


hypothesis 3

Retrieving pictures on mobile phones
with speech is not faster than with text
(objective measure)

Support:
- Mills et al. (2000)


the user study
field study (4 weeks)
controlled experiment

T1 - T2 - T3 - T4

3 experimental conditions:
a. Speech only
b. Text only
c. Speech and Text


MAMI


features of MAMI

•  processing is done entirely on the mobile
phone
•  speech is not transcribed
•  to compare the waveforms of the audio tags,
MAMI uses the Dynamic Time Warping (DTW)
algorithm
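
The slides do not show how MAMI implements this comparison, so the following is only a minimal sketch of textbook Dynamic Time Warping, not MAMI's actual code. It assumes each audio tag has already been reduced to a one-dimensional sequence of numbers (a real system would more likely compare per-frame feature vectors). The names dtw_distance and best_match are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW distance between two 1-D numeric sequences.

    Assumption: absolute difference is used as the local cost; the
    slides do not say which features or cost function MAMI uses.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(query, stored_tags):
    """Return the key of the stored tag whose sequence is closest to the query."""
    return min(stored_tags, key=lambda name: dtw_distance(query, stored_tags[name]))
```

Because DTW stretches or compresses the time axis, the same tag spoken at a different speed can still align with low cost, which is what makes matching possible without transcribing the speech.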


task 1: remember the tag
stimulus
retrieval

Pictures taken during the field trial


task 2: remember the context
stimulus
retrieval

TASK 2
PICTURE 1

three little bushes
Garden
Tree
Stairs


task 3: remember the picture
stimulus
retrieval

Text
Audio tags were converted into
textual tags and vice versa


task 4: remember the sequence
assignment
retrieval

TASK 4

Three pictures among
the oldest and three
pictures among the
newest.


metrics

•  time to completion
•  false positives
•  retrieval errors


results H1


results H1-bis
All participants in the BOTH group felt that tagging
with text was more effective than tagging with voice.

Voice: 3.33 [0.81], Text: 4.34 [0.81] (Mean [SD])
1 = completely agree; 5 = completely disagree


results H2


results H3


take away 1:
speech is not a given

the advantage of audio as an input modality for tagging
pictures on mobile phones is not a given

why?
1. retrieval precision
2. privacy


take away 2:
input mistakes
text input mistakes are corrected immediately.
mistakes in audio recordings, by contrast, are corrected
less frequently


take away 3:
memory

speech does not help users memorize the tags


implication 2:
enable audio inspection


end
thanks

martigan@gmail.com
mauro@tid.es

http://www.i-cherubini.it/mauro/blog/
http://research.tid.es/multimedia/
