Research & Development




   Text vs. Speech
A Comparison of Tagging Input Modalities 
         for Camera Phones



      Mauro Cherubini, Xavier Anguera, 
     Nuria Oliver, and Rodrigo de Oliveira
people do not want to tag
             their pictures
intro → hypotheses → methodology → results → implications
research question:

 Assuming that users are willing to
 input at least one tag, which input
modality can help the production and
      retrieval of the pictures?


hypothesis 1

   Speech is preferred to text as an
   annotation mechanism on mobile
     phones (objective measure)

Support: 
- Mitchard and Winkles (2002)

hypothesis 1-bis

  Speech annotations are preferred by
users even if this means spending more
 time on the task (subjective measure)

 Support: 
 - Perakakis and Potamianos (2008)

hypothesis 2

  The longer the tag, the larger the
  advantage of voice over text for
annotating pictures on mobile phones

Support: 
- Hauptmann and Rudnicky (1990)

hypothesis 3

 Retrieving pictures on mobile phones
with speech is not faster than with text
          (objective measure)

 Support: 
 - Mills et al. (2000)

the user study
   field study (4 weeks)
   controlled experiment: T1 - T2 - T3 - T4

  3 experimental conditions:
         a. Speech only
           b. Text only
      c. Speech and Text

MAMI




features of MAMI
                         

    •  processing is done entirely on the mobile
       phone
    •  speech is not transcribed
    •  to compare the waveforms of the audio tags,
       MAMI uses the Dynamic Time Warping
       algorithm
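
Dynamic Time Warping can be illustrated with a minimal Python sketch. The slides do not say which features or local distance MAMI uses, so the plain number sequences, the absolute-difference cost, and the names `dtw_distance` and `best_match` are assumptions made for illustration only; this is not MAMI's implementation.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences.

    Assumption: each element is a single number and the local cost is the
    absolute difference. A real system would more likely compare feature
    vectors extracted from the audio.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(query, stored_tags):
    """Return the name of the stored tag closest to the query under DTW."""
    return min(stored_tags, key=lambda name: dtw_distance(query, stored_tags[name]))
```

Because the alignment may repeat elements, a slower or faster rendition of the same pattern can still match at low cost, which is why no transcription step is needed to compare two spoken tags.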


task 1: remember the tag
            stimulus
                    retrieval




Pictures taken during the field trial


task 2: remember the context
          stimulus
                      retrieval

      TASK 2
      PICTURE 1

      three little bushes
      Garden
      Tree
      Stairs




task 3: remember the picture
          stimulus
                      retrieval




                      Text
  Audio tags were converted into
    textual tags and vice versa

task 4: remember the sequence
        assignment
                      retrieval

     TASK 4

     Three pictures among
     the oldest and three 
     pictures among the 
     newest.




metrics

     •  time to completion
     •  false positives
     •  retrieval errors
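
The three metrics could be tallied from per-trial logs along the following lines. The slides do not define the metrics precisely, so the `Trial` record, its fields, and the definitions used here (a false positive is a returned picture that is not the target; a retrieval error is a trial in which the target was never returned) are assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """Hypothetical log record for one retrieval attempt."""
    seconds: float   # time to completion
    target: str      # picture the participant was asked to find
    returned: list   # pictures returned during the attempt


def summarize(trials):
    """Aggregate the three metrics over a list of trials."""
    false_positives = sum(
        sum(1 for picture in t.returned if picture != t.target) for t in trials
    )
    retrieval_errors = sum(1 for t in trials if t.target not in t.returned)
    return {
        "mean_seconds": mean(t.seconds for t in trials),
        "false_positives": false_positives,
        "retrieval_errors": retrieval_errors,
    }
```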


results H1




results H1-bis
 All participants in the BOTH group felt that tagging
 with text was more effective than tagging with voice.

   Voice: 3.33 [0.81], Text: 4.34 [0.81] (Mean [SD])
    1 = completely agree; 5 = completely disagree




results H2




results H3




results H3 - continued
take away 1: 
       speech is not a given

the advantage of audio as an input modality for tagging
       pictures on mobile phones is not a given


                           why?
                  1. retrieval precision
                        2. privacy

take away 2: 
              input mistakes
     we address text input mistakes immediately;
 by contrast, mistakes in audio recordings are less
                frequently addressed




take away 3: 
                  memory

      speech does not help users memorize the tags




implication 1:
   allow multiple modalities




                       © Pixar, 2008


implication 2:
    enable audio inspection




implication 3: 
enable modality synesthesia




                       © Disney, 1940
Research & Development




              end
              thanks

        martigan@gmail.com
          mauro@tid.es


http://www.i-cherubini.it/mauro/blog/
  http://research.tid.es/multimedia/
