Build your own speech to text dataset in 30 days
1. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
REPORTS FROM THE BATTLEFRONT, OR: BUILD YOUR
OWN SPEECH TO TEXT DATASET FROM SCRATCH IN
30 DAYS
Dmytro Naumov (dima@m-ailabs.bayern)
March 2018
Munich Artificial Intelligence Laboratories GmbH (M-AILABS)
3. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
M-AILABS IN A NUTSHELL
• Founded April 1st*
• Designs & develops ML/AI-based software engines
• 100% owner-financed and profitable
• Currently 4 FTE (3x Machine Learning/Neural Networks; 1 Administrative)
*: Yes, it was in fact founded on April 1st, but the actual registration was on April 3rd because … April 1st 2017 was a Saturday and the registration office doesn’t work on weekends…
4. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
ARTIFICIAL INTELLIGENCE AT THE HEART OF EUROPE…
Our mission is to enable European organizations to take advantage of AI & ML without having to give up control or know-how
5. VISION: APPLIED AI/ML
We use state-of-the-art technologies to develop solutions for, e.g.:
• Speech Recognition
• Speech Synthesis
• Document Classification
• Object Recognition
• Text Classification
• Specialized OCR
• Signal Processing
• Specialized Translation
6. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
TECHNOLOGIES WE USE (ALL OPEN SOURCE)
• Classical ML algorithms: scikit-learn, NumPy, SciPy, …
• Classical neural network algorithms: scikit-learn, …
• Deep learning-based NN algorithms: see below
• NLP in all its incarnations: NLTK, spaCy, Gensim, …
Deep Learning
• Torch & PyTorch
• Caffe2
• TensorFlow (less and less)
• Keras 2 (on top of TensorFlow)
10. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
MACHINE LEARNING PROCESS
[Process diagram]
• Steps: Collect Data ➠ Establish Ground Truth ➠ Select DataSet ➠ Design & Train Models ➠ Validate Accuracy ➠ Optimize ➠ Operate
• Artifacts shown: raw images (e.g. >1,000,000); categorized data (real & simulated); tagged data (e.g. 100,000 images); trained model (days, weeks, …)
• Roles shown: Data Labeler, Data Scientist, Algorithm / Optimization Developer
11. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
DATA COLLECTION & LABELING (ESTABLISHING GROUND TRUTH)
• Thousands of data samples
• Each one needs to be labeled
• Each label represents a category
• Result: training material to teach the machine our own knowledge
12. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
TASK: DEVELOP A SPEECH-RECOGNITION & SPEECH
SYNTHESIS SOLUTION FOR GERMAN (AND SPANISH, AND …)
13. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
“A KINGDOM FOR DATA…”
• Read papers, papers, papers, and more papers…
• Write an SR system based on DeepSpeech (Baidu)
• Train on LJSpeech data (English)
• … works great! Now on to German…
• Problems, problems, problems
• No public German voice data available
• Universities / colleges have data, but…
• … even for testing they want € € €
• … you are not allowed to use it for commercial purposes
• … and there are other issues as well …
“I have an idea: Let’s create our own dataset…”
14. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
RAW DATA
• How much data do we need?
• Hundreds of hours per language
• How do we generate it?
• Hire people to read books sentence by sentence
• … or… check out LibriVox and split the recordings sentence by sentence
15. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
WOW, MORE DIFFICULT THAN THOUGHT… WHERE IS THE
SENTENCE?
Inhalt
Das grüne Haus Die goldne Spinne
Mariechen und die Sonne
Als es nicht regnen wollte
…
Das grüne Haus
Ja, es ist ein grünes Haus, in dem ich wohne; und alle
Märchen und Geschichten, die in diesem Buche stehn, sind darin
geschrieben worden. Es ist nicht etwa grün angestrichen wie ein
Gartenzaun oder eine …
16. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
APPROACH 1: USE A HUMAN TO ESTABLISH GROUND-TRUTH FOR
EACH AUDIO SEQUENCE MANUALLY
• Take the original audio
• Using ‘sox’, split the audio files at pauses
• Generate a list of audio files
• Have a person listen to each audio file and assign the original text to it (CSV file)
• Results:
• Quality: excellent
• Speed: … well… it is slow, repetitive and boring work
• It took a very long time: at least as long as the audio itself, but normally 2x - 4x the audio length
• Not really scalable…
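The pause-based splitting step can be illustrated with a minimal sketch. This is not how ‘sox’ works internally; it is a simple frame-energy heuristic, and the function name, the threshold and the minimum pause length are assumptions chosen for illustration. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
from typing import List, Sequence, Tuple


def split_on_pauses(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,
    silence_threshold: float = 0.01,  # assumed RMS level below which a frame is "silent"
    min_pause_ms: int = 300,          # assumed minimum pause length that ends a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of segments separated by pauses."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)

    # Mark each frame as silent or not, based on its RMS energy.
    silent = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        silent.append(rms < silence_threshold)

    segments = []
    seg_start = None  # first frame of the segment currently being collected
    silent_run = 0    # number of consecutive silent frames seen so far
    for idx, is_silent in enumerate(silent):
        if is_silent:
            silent_run += 1
            # A long enough pause closes the current segment.
            if seg_start is not None and silent_run >= min_pause_frames:
                end_frame = idx - silent_run + 1
                segments.append((seg_start * frame_len, end_frame * frame_len))
                seg_start = None
        else:
            if seg_start is None:
                seg_start = idx
            silent_run = 0

    # Close a segment that runs to the end of the audio, trimming trailing silence.
    if seg_start is not None:
        end_frame = len(silent) - silent_run
        segments.append((seg_start * frame_len, min(end_frame * frame_len, len(samples))))
    return segments
```

Each returned pair can then be written out as its own audio file and listed in the CSV file for the labeler. Short hesitations below the minimum pause length stay inside a segment.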
17. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
APPROACH 2: USE OWN TOOLS + GOOGLE/APPLE SPEECH API
• Use an in-house script to split the audio files at pauses (analyze the whole audio and decide what is a pause and what is not)
• Generate a list of audio files
• Use Google and Apple Speech APIs to transcribe these files
• Use the in-house tool ‘trqa’ to perform ‘transcription QA’ in order to find the “original text”
• Results: extremely QA-heavy, but still doable (‘trqa’ partially automates the recognition, so only around 20-30% needs real QA)
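The idea of mapping a noisy machine transcription back to the “original text” can be sketched as follows. ‘trqa’ itself is in-house and its method is not described here, so this is only an assumed approach: a fuzzy search with Python's standard difflib over word windows of the book text. The function names and the `slack` parameter are hypothetical.

```python
import difflib
import re
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """Lower-case and keep only word characters, so punctuation does not count."""
    return re.findall(r"\w+", text.lower())


def best_match(original: str, transcript: str, slack: int = 3) -> Tuple[str, float]:
    """Find the span of `original` that best matches a noisy `transcript`.

    Returns the span (with its original punctuation) and a similarity in [0, 1].
    """
    orig_tokens = re.findall(r"\S+", original)  # tokens keep their punctuation
    orig_words = [" ".join(normalize(tok)) for tok in orig_tokens]
    hyp_words = normalize(transcript)
    hyp = " ".join(hyp_words)
    n = len(hyp_words)

    best_span, best_score = "", 0.0
    for start in range(len(orig_tokens)):
        # Try window lengths close to the transcript length (assumed slack).
        for length in range(max(1, n - slack), n + slack + 1):
            end = start + length
            if end > len(orig_tokens):
                break
            candidate = " ".join(w for w in orig_words[start:end] if w)
            score = difflib.SequenceMatcher(None, candidate, hyp).ratio()
            if score > best_score:
                best_span, best_score = " ".join(orig_tokens[start:end]), score
    return best_span, best_score
```

A low score is a natural signal to route the clip to a human for manual QA, while high-scoring matches can be accepted automatically. The brute-force search is quadratic and only suitable for short texts; a real tool would restrict the search to the text near the previous match.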
18. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
WE ARE ACTUALLY INTERESTED IN THE END OF THE SENTENCE, NOT THE BEGINNING
Inhalt
Das grüne Haus Die goldne Spinne
Mariechen und die Sonne
Als es nicht regnen wollte
…
Das grüne Haus
Ja, es ist ein grünes Haus, in dem ich wohne; und alle
Märchen und Geschichten, die in diesem Buche stehn, sind darin
geschrieben worden. Es ist nicht etwa grün angestrichen wie ein
Gartenzaun oder eine …
19. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
APPROACH 3: SPLIT FILES BASED ON TEXT-SENTENCES AND
SEARCH FOR THE ENDS OF THOSE TEXTS…
• Split the original text into manageable “sentences”
• Make sure that each “sentence” does not result in audio longer than 15 seconds
• Generate audio from the text (TTS)
• Compare the generated audio (spectrograms, etc.) with the beginning of the original recording
• … thus finding the end of the original text
• Cut the text and the respective audio from the beginning of the recording
• Rinse, clean, repeat…
• Results:
• Quite good and fast results
• Requires manually preparing the original text (adding/removing some tags)
• Requires removing “intro” and “outro” from the audio files
• Reliability depends very much on audio quality
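The first two steps of Approach 3 (splitting the text into “sentences” that stay under 15 seconds of audio) can be sketched like this. The speaking rate `CHARS_PER_SECOND` is an assumed constant that would need tuning per speaker, and the function names are hypothetical. The TTS generation and the spectrogram comparison are not covered here.

```python
import re
from typing import List

CHARS_PER_SECOND = 14.0  # assumed average reading speed, tune per speaker


def split_text(text: str, max_seconds: float = 15.0) -> List[str]:
    """Split text into chunks whose estimated audio length stays under max_seconds."""
    max_chars = int(max_seconds * CHARS_PER_SECOND)
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    # Boundaries from coarse to fine: sentence end, clause end, any space.
    boundaries = [r"(?<=[.!?])\s+", r"(?<=[,;:])\s+", r"\s+"]

    def pack(piece: str, level: int) -> List[str]:
        # Short enough, or nothing finer to split on: keep the piece whole.
        if len(piece) <= max_chars or level == len(boundaries):
            return [piece]
        out: List[str] = []
        current = ""
        for part in re.split(boundaries[level], piece):
            for sub in pack(part, level + 1):
                # Greedily merge neighbours while the chunk stays short enough.
                candidate = f"{current} {sub}".strip()
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    if current:
                        out.append(current)
                    current = sub
        if current:
            out.append(current)
        return out

    return pack(text, 0)
```

Splitting prefers sentence ends, falls back to commas and semicolons, and only breaks between arbitrary words as a last resort. A character count is a crude proxy for duration, so a safety margin below 15 seconds is advisable.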
20. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
BUT WAIT, WHY DOESN’T MY DNN CONVERGE WELL? I HAVE TONS
OF TRAINING DATA, …
• Just aligning texts to audio is not enough
• We also need to do language-specific post-processing, e.g. for numbers, abbreviations, inflections, etc.
• Problem: some languages are more difficult than others
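As a rough illustration of language-specific post-processing, the sketch below expands a few German abbreviations and spells out the numbers 0-99. The abbreviation table is a tiny assumed sample and the function names are hypothetical. Inflection is deliberately not handled ("1 Haus" would need "ein", not "eins"), which is exactly the kind of difficulty mentioned above.

```python
import re

# Tiny, assumed sample of abbreviations; a real list would be much longer.
ABBREVIATIONS = {"z. B.": "zum Beispiel", "usw.": "und so weiter", "Dr.": "Doktor"}

UNITS = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
         "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def number_to_words(n: int) -> str:
    """Spell out 0-99 in German (cardinal form, no inflection)."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # German puts the unit first: 21 -> "einundzwanzig".
    prefix = "ein" if unit == 1 else UNITS[unit]
    return f"{prefix}und{TENS[tens]}"


def normalize(text: str) -> str:
    """Expand known abbreviations and spell out small whole numbers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)

    def spell(match: "re.Match[str]") -> str:
        n = int(match.group())
        # Larger numbers (years, amounts) are left untouched in this sketch.
        return number_to_words(n) if n < 100 else match.group()

    return re.sub(r"\b\d+\b", spell, text)
```

Decimals, ordinals, dates and currency amounts each need their own rules, and those rules differ from language to language.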
22. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
LEARNINGS, TONS OF LEARNINGS…
• Google/Apple transcriptions are not really good (crowd-sourced)
• Be careful which tools you use
• When developing tools for data labelers, concentrate mostly on UX and then on UX and lastly on UX (for the data labeler)
• Make sure you fulfill UX-related wishes of the data labelers
• Check your original data
• E.g.: audio and text files must match 100%
• Standardize on sample rate, amplification, channels, audio file formats (but document it, so you can replicate it during inference if needed)
• Do NOT throw away data (e.g. punctuation in text) - strip it during training or preprocessing for a specific training task
• Data preparation & labeling is hard work - do not underestimate it…
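Two of the checks above (audio and text must match, and the audio format must be standardized and documented) lend themselves to a small audit script. This is a minimal sketch using Python's standard wave module; the `TARGET` values, the directory layout (one .wav file per transcript id) and the function name are assumptions for illustration.

```python
import wave
from pathlib import Path
from typing import Dict, List

# Assumed, documented target format for every clip in the dataset.
TARGET = {"sample_rate": 16000, "channels": 1, "sample_width": 2}


def audit_dataset(audio_dir: Path, transcripts: Dict[str, str]) -> List[str]:
    """Report clips without text, text without clips, and off-format audio."""
    problems: List[str] = []
    wav_paths = sorted(audio_dir.glob("*.wav"))
    wav_ids = {path.stem for path in wav_paths}

    # Audio and text must match 100%.
    for clip_id in sorted(set(transcripts) - wav_ids):
        problems.append(f"{clip_id}: transcript without audio")
    for clip_id in sorted(wav_ids - set(transcripts)):
        problems.append(f"{clip_id}: audio without transcript")

    # Every file must follow the documented format.
    for path in wav_paths:
        with wave.open(str(path), "rb") as wav:
            actual = {
                "sample_rate": wav.getframerate(),
                "channels": wav.getnchannels(),
                "sample_width": wav.getsampwidth(),
            }
        for key, expected in TARGET.items():
            if actual[key] != expected:
                problems.append(f"{path.stem}: {key} is {actual[key]}, expected {expected}")
    return problems
```

Running such an audit before every training run is cheap, and keeping `TARGET` in version control doubles as the documentation needed to replicate the format at inference time.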
26. MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
THE LICENSE
Copyright (c) 2017-2018 MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES GmbH
Redistribution and use in any form, including any commercial use, with or without modification are permitted - bar the exceptions listed below -
provided that the following conditions are met:
1. Redistributions of source data must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this downloaded data, source-code or
binary-code without specific prior written permission.
3. ANY USE BY ANY UNIVERSITY, COLLEGE, RESEARCH INSTITUTE OR SIMILAR HIGHER EDUCATION INSTITUTION IN EUROPE, INCLUDING BY MEMBERS OF SUCH
INSTITUTIONS (including but not limited to the students, tutors and teachers at those institutions), REQUIRES A SEPARATE (free-of-charge) LICENSE AND IS NOT COVERED
BY THIS LICENSE AGREEMENT. PLEASE CONTACT US FOR DETAILS AT info@m-ailabs.bayern.
THIS DATA IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE and/or DATA, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.