1. W4A 2012, Lyon, April 17, 2012
Enhancing Learning Accessibility through Fully Automatic Captioning
Maria Federico, Servizio Accoglienza Studenti Disabili, Università di Modena e Reggio Emilia
Marco Furini, Dipartimento di Comunicazione ed Economia, Università di Modena e Reggio Emilia
2. The traditional learning scenario
Traditional solutions:
- Sign interpreters
- Stenographers
- Student note takers
- Respeaking
[Diagram: video and audio streams delivered to students in the classroom and remotely]
Audience: able-bodied students and disabled students (hearing-impaired, dyslexic, motion-impaired)
3. An accessible learning scenario
Automatic speech transcription
[Diagram: OUR SYSTEM adds a textual transcript to the video and audio streams, delivered to students in the classroom and remotely]
Audience: able-bodied students and disabled students (hearing-impaired, dyslexic, motion-impaired)
4. System Architecture
Architecture for the automatic production of video lesson captions, based on:
- Automatic speech recognition (ASR) technologies
- A novel caption alignment mechanism that:
  - Introduces unique audio markups into the audio stream before transcription by an ASR
  - Transforms the plain transcript produced by the ASR into a timecoded transcript
5. Markup Insertion
- Identification of silence periods (i.e., when the speaker does not speak)
- Periodic insertion of a unique markup into the silence periods
- It is important to find reasonable values for the silence length and the minimum distance between two consecutive markups, so that the transcript contains no truncated words yet carries enough timing information
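The two steps above can be sketched as follows. This is a minimal illustration, not the system's implementation: the energy threshold and the `silence_len` / `min_markup_dist` defaults are assumptions, not the values tuned in the study.

```python
def find_silences(samples, rate, threshold=0.01, silence_len=0.5):
    """Return (start, end) times, in seconds, of runs of low-energy
    samples lasting at least silence_len seconds."""
    min_run = int(silence_len * rate)
    silences, run_start = [], None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                silences.append((run_start / rate, i / rate))
            run_start = None
    if run_start is not None and len(samples) - run_start >= min_run:
        silences.append((run_start / rate, len(samples) / rate))
    return silences

def place_markups(silences, min_markup_dist=5.0):
    """Pick one markup timestamp per silence period, skipping silences
    closer than min_markup_dist seconds to the previous markup."""
    markups, last = [], -float("inf")
    for start, end in silences:
        t = (start + end) / 2.0  # insert in the middle of the pause
        if t - last >= min_markup_dist:
            markups.append(t)
            last = t
    return markups
```

Inserting in the middle of a pause is one way to avoid truncating words; raising `min_markup_dist` yields fewer markups and therefore coarser timing information, which is exactly the trade-off discussed on the later slides.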
6. Speech2text
- Transcription of the audio stream, coupled with the unique markups, into plain text (including the textual form of each markup)
- Any existing automatic speech recognition technology can be used
- In the system prototype we used Dragon NaturallySpeaking:
  - Support for the Italian language
  - Speech-to-text transcription from digital audio files
  - Easy access to the product
  - High accuracy (99% for dictation)
7. Caption Alignment
Inputs:
- Plain transcript produced by Speech2text
- Timing information about where the markups were inserted by the Markup Insertion Module
Output: transcript with timestamps
8. Caption Alignment
Existing solutions:
1. Alignment of a manual transcript with the video
2. Running the ASR twice
Both require a high-computational environment.
Our solution is:
- Automatic: based on audio analysis
- Efficient: the ASR runs just once
- Technology transparent: any ASR can be used
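The alignment itself reduces to pairing markup occurrences in the transcript with their known insertion times. A minimal sketch, assuming the markup's textual form appears verbatim in the ASR output (the word "BEEPMARK" and the function name are hypothetical placeholders):

```python
def align(plain_transcript, markup_word, markup_times):
    """Split the ASR transcript at markup occurrences and attach the
    known insertion times as caption boundaries.

    Returns a list of (start, end, text) captions; the end of the
    final caption is unknown (None) because no markup follows it.
    """
    parts = [p.strip() for p in plain_transcript.split(markup_word)]
    captions, start = [], 0.0
    for text, end in zip(parts, markup_times):
        if text:  # skip empty segments (e.g. markup at the very start)
            captions.append((start, end, text))
        start = end
    # text after the last markup runs to the end of the lesson
    if parts[-1] and len(parts) == len(markup_times) + 1:
        captions.append((start, None, parts[-1]))
    return captions
```

Because the markups were placed by the Markup Insertion Module, their timestamps are known exactly, so no second ASR pass and no acoustic re-alignment is needed; any ASR whose output preserves the markup word can feed this step.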
9. Experimental study
Different Computer Science and Linguistics professors of the Communication Sciences degree of the University of Modena and Reggio Emilia taught in front of a live audience, in order:
- To tune the parameters used to locate the positions where audio markups are inserted
- To find the most appropriate hardware (microphone) and software (ASR) products for the recording scenario
- To investigate the transcription accuracy
10. Transcription accuracy
[Chart: transcription accuracy vs. minimum markup distance (sec)]
The higher the values of the silence length and the minimum markup distance, the better the accuracy; but these parameters also affect the length of the produced captions.
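How "accuracy" is computed is not spelled out on the slide; a common convention (an assumption here) is word accuracy = 1 - WER, with the word error rate obtained from a word-level edit distance between the reference transcript and the ASR output:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Under this metric, a truncated word caused by a badly placed markup counts as a substitution, which is why the silence-length and markup-distance parameters show up directly in the accuracy curves.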
11. Caption length
Desktop threshold = 375 characters (Arial font family, 16 pt)
The higher the values of the silence length and the minimum markup distance, the longer the captions.
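One way to keep captions under such a display threshold is to split an over-long caption at word boundaries. The sketch below is illustrative only: the 375-character desktop value comes from the slide, while the function itself is an assumption about how the splitting could be done.

```python
def split_caption(text, threshold=375):
    """Greedily split caption text into word-boundary chunks of at
    most threshold characters (a single word longer than the
    threshold still becomes its own oversized chunk)."""
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > threshold and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A greedy split loses the exact pause-based timing of the over-long caption, which is the trade-off the slide points at: longer silences and markup distances improve accuracy but push captions past the display threshold.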
13. Conclusions
Our system is:
- Automatic
- Efficient
- Technology transparent
[Diagram: OUR SYSTEM adds a textual transcript to the video and audio streams, delivered to students in the classroom and remotely: able-bodied students and disabled students (hearing-impaired, dyslexic, motion-impaired)]
14. Contacts
Supported by
Servizio Accoglienza Studenti Disabili
University of Modena and Reggio Emilia
Further information:
Maria Federico, Ph.D.
maria.federico@unimore.it
Editor's Notes
Ideas after the presentation: TODO: add a slide comparing our approach with the subtitling of TV programs done via respeaking. In the first slide we could stress that sign interpreters and stenographers are paid third parties, and so are respeakers; this is also the approach followed for television subtitles, for example. Then, in the second slide (the one on the SENTO scenario), say that we caption automatically, with no third parties involved, which makes our solution economical and efficient.
For years, universities have faced a number of challenges in making classroom lectures accessible to students who are deaf or hard of hearing. Traditional methods—sign interpreters, stenographers or student note takers—are often costly, difficult to procure or inconsistent. http://italy.nuance.com/naturallyspeaking/products/dns_livesub.html
The project aims at designing and developing a platform for the automatic captioning of audiovisual teaching material, and for the management and delivery of multimedia material to disabled students.
We analyze audio, for instance we identify silences