The document discusses using data augmentation techniques to improve phoneme transcription of speech from aphasic patients. The researchers augmented data from an aphasic speech dataset (PSST) using techniques like pitch shifting, time stretching, and adding noise. They also augmented data from a manually transcribed non-aphasic speech dataset (TIMIT) using room impulse responses. Training models on both augmented aphasic and non-aphasic data led to improved performance over the baseline, with the best model achieving a 9.8% relative improvement in phone error rate. Pitch shifting and room impulse responses were among the most effective augmentation techniques.
PSST Challenge (LREC)
1. Speech data augmentation for improving
phoneme transcriptions of aphasic speech using
wav2vec 2.0 for the PSST Challenge
Birger Moëll, Jim O’Regan, Shivam Mehta, Ambika Kirkland, Harm Lameris,
Joakim Gustafson, Jonas Beskow
2. Experiments
2022-05-30 2
• Data augmentation:
– Augmenting the original data
– Matching out-of-domain speech (TIMIT)
• (Phonetic) language models
• Voice conversion
• Synthetic voices
3. Data augmentations
• Approach
– We propose data augmentation and training on non-aphasic datasets to increase the robustness and accuracy of phoneme transcription models for aphasia.
• Steps:
– Data augmentation using pitch shift, Gaussian noise, time stretch, voice conversion, and room impulse response
– Joint training on non-aphasic datasets (Common Voice, TIMIT) and the aphasic dataset (PSST)
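The waveform-level augmentations listed above can be sketched with plain NumPy. This is a minimal illustration, not the deck's actual pipeline (which would typically use a library such as audiomentations or librosa); the SNR and rate values are arbitrary choices:

```python
import numpy as np

def add_gaussian_noise(x, snr_db=20.0, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

def naive_speed_change(x, rate=1.25):
    """Resample-based speed change. Note: this shifts pitch as well;
    a phase vocoder would stretch time while preserving pitch."""
    n_out = int(round(len(x) / rate))
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

# Toy 220 Hz tone standing in for a PSST utterance (16 kHz assumed).
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)

noisy = add_gaussian_noise(clean, snr_db=20.0)
stretched = naive_speed_change(clean, rate=1.25)
```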
4. Datasets
● PSST Dataset
○ Boston Naming Test – Short Form (BNT-SF) and the Verb Naming Test (VNT)
● TIMIT Dataset
○ Acoustic-Phonetic Speech Corpus
● Common Voice
○ Volunteer-based open-source dataset for ML
Key findings
• Augmenting with a manually transcribed dataset (TIMIT) was successful.
• Augmenting with an automatically phonetized dataset (Common Voice) was unsuccessful.
• Acoustically aligning the TIMIT data to PSST using a Room Impulse Response (RIR) improved performance.
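The RIR alignment idea amounts to convolving clean speech with a room impulse response so it sounds like it was recorded in the target acoustic environment. A sketch, not the authors' implementation; a synthetic decaying-noise RIR stands in here for a measured one:

```python
import numpy as np

def apply_rir(x, rir):
    """Convolve speech with a room impulse response, trim to the
    original length, and peak-normalize."""
    wet = np.convolve(x, rir)[: len(x)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

rng = np.random.default_rng(0)
sr = 16000
speech = rng.standard_normal(sr)  # stand-in for a TIMIT utterance

# Synthetic RIR: direct path plus an exponentially decaying noise tail
# approximating room reverberation (~0.3 s).
rir_len = int(0.3 * sr)
rir = rng.standard_normal(rir_len) * np.exp(-np.linspace(0, 8, rir_len))
rir[0] = 1.0  # direct path

reverberant = apply_rir(speech, rir)
```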
5. Models
We compared base and large wav2vec 2.0 models.
• The larger model performed better.
• Training with the base model was still useful for faster experimentation.
6. Voice conversion
Voice conversion was used to augment data by neural voice cloning.
Experiments with voice cloning proved unsuccessful, likely because of PSST data quality issues.
7. Pitch shift
Pitch shift was the most successful PSST augmentation, improving results compared to the baseline.
The pitch was both lowered and raised randomly.
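Random lowering and raising of pitch can be sketched as below. This naive resampling-based shift also changes duration; production implementations (e.g. librosa-style pitch shifting) add a compensating time stretch to keep duration constant. The ±4 semitone range is an assumption, not the deck's setting:

```python
import numpy as np

def naive_pitch_shift(x, semitones):
    """Resample-based pitch shift by a given number of semitones.
    Caveat: duration changes along with pitch."""
    rate = 2.0 ** (semitones / 12.0)
    n_out = int(round(len(x) / rate))
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)

# Randomly lower or raise the pitch, as on the slide.
shift = rng.uniform(-4, 4)
shifted = naive_pitch_shift(tone, shift)
```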
12. Augmentations on augmentations
We experimented with adding Pitch Shift + Time Stretch + Gaussian Noise + Room Impulse Response to TIMIT.
Because that’s how we roll.
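Stacking the four augmentations amounts to a Compose-style chain where each transform fires with some probability. The stand-in transforms and the 0.5 probabilities below are illustrative only, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe(transform, p):
    """Wrap a transform so it is applied with probability p."""
    def apply(x):
        return transform(x) if rng.random() < p else x
    return apply

# Stand-ins for the four TIMIT augmentations; each lambda would wrap the
# real pitch-shift / time-stretch / noise / RIR operation.
pitch_shift = maybe(lambda x: np.interp(
    np.linspace(0, len(x) - 1, int(len(x) / 1.05)),
    np.arange(len(x)), x), p=0.5)
time_stretch = maybe(lambda x: np.interp(
    np.linspace(0, len(x) - 1, int(len(x) * 1.1)),
    np.arange(len(x)), x), p=0.5)
add_noise = maybe(lambda x: x + 0.005 * rng.standard_normal(len(x)), p=0.5)
reverberate = maybe(lambda x: np.convolve(
    x, np.array([1.0, 0.4, 0.2]))[: len(x)], p=0.5)

def augment(x):
    """Apply the chain in a fixed order, each step firing at random."""
    for f in (pitch_shift, time_stretch, add_noise, reverberate):
        x = f(x)
    return x

out = augment(np.sin(np.linspace(0, 100, 16000)))
```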
15. Results
• Data augmentations improve model performance.
• Increasing the size of the model decreases FER and PER.
• Manually-transcribed speech from non-aphasic speakers (TIMIT) improves
performance
– when Room Impulse Response is used to augment the data.
• The best performing model combines aphasic and non-aphasic data
– 21.0% PER
– 9.2% FER
– relative improvement of 9.8%
• Data augmentation, larger model size, and additional non-aphasic data
sources can be helpful
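As a quick arithmetic check on the reported numbers: if the best model's 21.0% PER is a 9.8% relative improvement (assuming that figure refers to PER), the implied baseline is about 23.3% PER:

```python
# Relative improvement = (baseline - new) / baseline, solved for baseline.
best_per = 21.0          # best model's phone error rate, in percent
rel_improvement = 0.098  # assuming the 9.8% figure is the relative PER gain

implied_baseline = best_per / (1 - rel_improvement)  # ~23.3% PER
recovered = (implied_baseline - best_per) / implied_baseline
```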