Build your own speech to text dataset in 30 days

MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES
REPORTS FROM THE BATTLEFRONT, OR: BUILD YOUR
OWN SPEECH TO TEXT DATASET FROM SCRATCH IN
30 DAYS
Dmytro Naumov (dima@m-ailabs.bayern)
March 2018
Munich Artificial Intelligence Laboratories GmbH (M-AILABS)

LET’S START WITH SOME ADVERTISING…

M-AILABS IN A NUTSHELL
• Founded April 1st, 2017*
• Designs & develops ML/AI-based software engines
• 100% owner-financed and profitable
• Currently 4 FTE (3x Machine Learning/Neural Networks; 1 Administrative)
*: Yes, it was in fact founded on April 1st, but the actual registration was on April 3rd because April 1st, 2017 was a Saturday and the registration office doesn’t work on weekends…

ARTIFICIAL INTELLIGENCE AT THE HEART OF EUROPE…
Our mission is to enable European organizations to take advantage of AI & ML without having to give up control or know-how

VISION: APPLIED AI/ML
We use state-of-the-art technologies to develop solutions for, e.g.:
• Speech Recognition
• Speech Synthesis
• Document Classification
• Object Recognition
• Text Classification
• Specialized OCR
• Signal Processing
• Specialized Translation

TECHNOLOGIES WE USE (ALL OPEN SOURCE)
• Classical ML algorithms: scikit-learn, NumPy, SciPy, …
• Classical neural network algorithms: scikit-learn, …
• Deep-learning-based NN algorithms: see below
• NLP in all its incarnations: NLTK, spaCy, Gensim, …
Deep Learning:
• Torch & PyTorch
• Caffe2
• TensorFlow (less and less)
• Keras 2 (on top of TensorFlow)

A SELECTION OF OUR DEEP LEARNING-HARDWARE

OK, BACK TO THE TOPIC: SOME BACKGROUND
REFRESHER…

MACHINE LEARNING PROCESS
[Figure: the machine learning process as a pipeline]
Stages: Collect Data ➠ Establish Ground Truth ➠ Select DataSet ➠ Design & Train Models ➠ Validate Accuracy ➠ Optimize ➠ Operate
Roles: Data Labeler, Data Scientist, Algorithm / Optimization Developer
Example artifacts: raw images (>1,000,000); categorized data (real & simulated); tagged data (e.g. 100,000 images); trained model (days, weeks, …)

DATA COLLECTION & LABELING (ESTABLISHING GROUND TRUTH)
• Thousands of data samples
• Each needs to be labeled
• Each label represents a category
• Result: training material to teach the machine our own knowledge

TASK: DEVELOP A SPEECH-RECOGNITION & SPEECH
SYNTHESIS SOLUTION FOR GERMAN (AND SPANISH, AND …)

“A KINGDOM FOR DATA…”
• Read papers, papers, papers, and more papers…
• Write an SR system based on DeepSpeech (Baidu)
• Train on LJSpeech data (English)
• … works great! Now to German…
• Problems, problems, problems
• No public German voice data available
• Universities / colleges have data, but…
• … even for testing they want € € €
• … you are not allowed to use it for commercial purposes
• … and there are other issues as well …
“I have an idea: Let’s create our own dataset…”

RAW DATA
• How much data do we need?
• Hundreds of hours per language
• How do we generate them?
• Hire people to read books sentence by sentence
• … or… check out LibriVox and split the recordings sentence by sentence

WOW, MORE DIFFICULT THAN THOUGHT… WHERE IS THE
SENTENCE?
Inhalt
Das grüne Haus Die goldne Spinne
Mariechen und die Sonne
Als es nicht regnen wollte
…
Das grüne Haus
Ja, es ist ein grünes Haus, in dem ich wohne; und alle
Märchen und Geschichten, die in diesem Buche stehn, sind darin
geschrieben worden. Es ist nicht etwa grün angestrichen wie ein
Gartenzaun oder eine …

APPROACH 1: USE A HUMAN TO ESTABLISH GROUND-TRUTH FOR
EACH AUDIO SEQUENCE MANUALLY
• Take the original audio
• Using ‘sox’, split the audio files at pauses
• Generate a list of audio files
• Have a person listen to each audio file and assign the original text to it (CSV file)
• Results:
• Quality: excellent
• Speed: … well… it is slow, repetitive and boring work
• It took very long: at least as long as the audio itself, but normally 2x - 4x the audio length
• Not really scalable…
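
The splitting step can be sketched in plain Python to show what "split at pauses" means in practice. This is not the 'sox' invocation used for the talk; it is a minimal illustration that assumes samples are already decoded to floats in [-1, 1], and the function name, the amplitude threshold and the minimum pause length are all hypothetical values chosen for the example.

```python
def split_on_pauses(samples, rate, threshold=0.02, min_pause=0.3, frame=0.01):
    """Return (start, end) sample indices of the segments between pauses.

    samples   : sequence of floats in [-1, 1] (assumed already decoded)
    rate      : sample rate in Hz
    threshold : peak amplitude below which a frame counts as silent (assumption)
    min_pause : seconds of continuous silence that end a segment (assumption)
    frame     : analysis frame length in seconds
    """
    frame_len = max(1, int(rate * frame))
    min_quiet = max(1, round(min_pause / frame))
    segments = []
    start = None  # first sample of the segment currently being collected
    end = 0       # one past the last loud sample seen so far
    quiet = 0     # number of consecutive silent frames
    for pos in range(0, len(samples), frame_len):
        chunk = samples[pos:pos + frame_len]
        if max(abs(s) for s in chunk) >= threshold:
            if start is None:
                start = pos
            end = pos + len(chunk)
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_quiet:
                # The pause is long enough: close the current segment.
                segments.append((start, end))
                start = None
    if start is not None:
        segments.append((start, end))
    return segments
```

Each returned pair can then be written out as its own audio file and listed in the CSV file that the human labeler fills in.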

APPROACH 2: USE OWN TOOLS + GOOGLE/APPLE SPEECH API
• Use an in-house developed script to split the audio files at pauses (analyze the whole audio and decide what is a pause and what is not)
• Generate a list of audio files
• Use Google and Apple Speech APIs to transcribe these files
• Use in-house developed ‘trqa’ to perform ‘transcription-QA’ in order to find the “original text”
• Results: extremely QA-heavy, but still doable (‘trqa’ partially automates the recognition, thus requiring only around 20-30% real QA)
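
The idea of mapping a machine transcription back to the "original text" can be sketched with fuzzy matching. 'trqa' itself is an in-house tool and its internals are not described here, so the following is only an assumed approach: `best_match`, `normalize` and the 0.6 acceptance threshold are hypothetical, and the similarity measure is Python's standard `difflib.SequenceMatcher` applied to word lists.

```python
import difflib
import re


def normalize(text):
    """Lower-case the text and keep only its words (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())


def best_match(transcript, candidates, min_ratio=0.6):
    """Pick the candidate sentence that best matches an ASR transcript.

    Returns (sentence, score). The sentence is None when no candidate reaches
    min_ratio, which is the case that would go to a human for real QA.
    """
    hypothesis = normalize(transcript)
    scored = [
        (difflib.SequenceMatcher(None, hypothesis, normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, sentence = max(scored)
    return (sentence if score >= min_ratio else None), score
```

With a scheme like this, clips whose transcription scores high are accepted automatically and only the low-scoring remainder needs manual review.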

WE ARE ACTUALLY INTERESTED IN THE END OF THE
SENTENCE, NOT THE BEGINNING
Inhalt
Das grüne Haus Die goldne Spinne
Mariechen und die Sonne
Als es nicht regnen wollte
…
Das grüne Haus
Ja, es ist ein grünes Haus, in dem ich wohne; und alle
Märchen und Geschichten, die in diesem Buche stehn, sind darin
geschrieben worden. Es ist nicht etwa grün angestrichen wie ein
Gartenzaun oder eine …

APPROACH 3: SPLIT FILES BASED ON TEXT-SENTENCES AND
SEARCH FOR THE ENDS OF THOSE TEXTS…
• Split the original text into manageable “sentences”
• Make sure that each “sentence” does not result in audio longer than 15 seconds
• Generate audio from the text (TTS)
• Compare the generated audio (spectrograms, etc.) with the beginning of the original recording
• … thus finding the end of the original text
• Cut the text and the respective audio from the beginning of the recording
• Rinse, clean, repeat…
• Results:
• Quite good and fast results
• Requires manually preparing the original text (adding/removing some tags)
• Requires removing “intro” and “outro” from the audio files
• Reliability depends very much on audio quality
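
The first two steps (split the text, keep each piece under 15 seconds of audio) can be sketched as follows. The duration is only estimated from the character count; the speaking rate of 15 characters per second is an assumption made for this example, as are the function name and the choice of `.`, `!`, `?` and `;` as sentence terminators.

```python
import re


def split_for_alignment(text, max_seconds=15.0, chars_per_second=15.0):
    """Split text into "sentences" whose estimated audio fits in max_seconds.

    The text is first cut after sentence-ending punctuation. Any piece that is
    still too long is cut again at word boundaries.
    """
    max_chars = int(max_seconds * chars_per_second)
    chunks = []
    for sentence in re.split(r"(?<=[.!?;])\s+", text.strip()):
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > max_chars:
                # Adding this word would exceed the budget: start a new chunk.
                chunks.append(current)
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Each chunk would then be synthesized with TTS and compared against the start of the remaining recording, as described in the steps above.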

BUT WAIT, WHY DOESN’T MY DNN CONVERGE WELL? I HAVE TONS
OF TRAINING DATA, …
• Just aligning texts to audio is not enough
• We also need to do language-specific post-processing, e.g. for numbers, abbreviations, inflections, etc.
• Problem: some languages are more difficult than others

POST-PROCESSING (NUMBERS -> TEXT)
13
  de: dreizehn (three-ten)
  en: thirteen
  tr: On-üç (ten-three)
27
  de: Sieben-und-zwanzig (seven-and-twenty)
  en: twenty-seven
  tr: Yirmi-yedi (twenty-seven)
127
  de: Hundert-sieben-und-zwanzig (hundred-seven-and-twenty)
  en: hundred-twenty-seven
  tr: Yüz-yirmi-yedi (hundred-twenty-seven)
1127
  de: Tausend-ein-hundert-sieben-und-zwanzig (thousand-one-hundred-seven-and-twenty)
  en: Eleven-hundred-twenty-seven / Thousand-one-hundred-twenty-seven
  tr: Bin-yüz-yirmi-yedi (thousand-hundred-twenty-seven)
111127
  de: hundert-elf-tausend-ein-hundert-sieben-und-zwanzig (hundred-eleven-thousand-one-hundred-seven-and-twenty)
  en: Hundred-eleven-thousand-one-hundred-twenty-seven
  tr: Yüz-on-bir-bin-yüz-yirmi-yedi (hundred-ten-one-thousand-hundred-twenty-seven)
1981
  de: Neun-zehn-hundert-ein-und-achtzig (year between 1100-1999) / Tausend-neun-hundert-ein-und-achtzig (non-year)
  en: Nineteen-hundred-eighty-one (US: sometimes up until 9999)
  tr: Bin-dokuz-yüz-seksen-bir (thousand-nine-hundred-eighty-one)
9.
  de: “Am 9. Juni” = “Am neunten Juni”; “9. Juni: …” = “Neunter Juni: …”; “9. Haus” = “Neuntes Haus”
  en: ninth
  tr: Dokuz-uncu
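
The German column can be reproduced with a small number-to-words routine, which shows why the post-processing is language-specific: units come before tens, and years are read differently from plain numbers. This is a minimal sketch, not the post-processor used for the dataset. `number_to_german` and its `as_year` flag are hypothetical names, the hyphens mirror the notation used on this slide rather than German orthography, only 0 to 999,999 is covered, and ordinals such as “9.” are left out because their ending depends on the surrounding sentence.

```python
ONES = ["null", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
        "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
        "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def _below_100(n):
    """Word parts for 1..99; units precede tens ("sieben-und-zwanzig")."""
    if n < 20:
        return [ONES[n]]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return [TENS[tens]]
    return [ONES[unit], "und", TENS[tens]]


def _below_1000(n):
    """Word parts for 1..999."""
    hundreds, rest = divmod(n, 100)
    parts = [ONES[hundreds], "hundert"] if hundreds else []
    if rest:
        parts += _below_100(rest)
    return parts


def number_to_german(n, as_year=False):
    """Spell out 0..999999 in German, joined with hyphens as on the slide."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "null"
    if as_year and 1100 <= n <= 1999:
        # Years are read as "<century> hundert <rest>".
        century, rest = divmod(n, 100)
        parts = _below_100(century) + ["hundert"]
        parts += _below_100(rest) if rest else []
    else:
        thousands, rest = divmod(n, 1000)
        parts = _below_1000(thousands) + ["tausend"] if thousands else []
        if rest:
            parts += _below_1000(rest)
    # The slide drops a leading "ein" before "hundert" / "tausend".
    if len(parts) > 1 and parts[0] == "ein" and parts[1] in ("hundert", "tausend"):
        parts = parts[1:]
    # A final, free-standing "ein" is spoken as "eins".
    if parts[-1] == "ein":
        parts[-1] = "eins"
    return "-".join(parts)
```

For example, `number_to_german(1981)` gives the non-year reading and `number_to_german(1981, as_year=True)` gives the year reading, so the caller still has to decide from context which one applies.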

LEARNINGS, TONS OF LEARNINGS…
• Google/Apple transcriptions are not really good (crowd-sourced)
• Be careful which tools you use
• When developing tools for data labelers, concentrate mostly on UX and then on UX and lastly on UX (for the data labeler)
• Make sure you fulfill UX-related wishes of the data labelers
• Check your original data
• E.g.: audio and text files must match 100%
• Standardize on sample rate, amplification, channels, audio file formats (but document it, so you can replicate during inference if needed)
• Do NOT throw away data (e.g. punctuation in the text); strip it during training or preprocessing for a specific training task
• Data preparation & labeling is hard work; do not underestimate it…
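
Two of these checks (audio and text files must match, and the audio format must be standardized) are easy to automate. The sketch below uses only Python's standard library. The directory layout (one `.wav` and one `.txt` per clip, sharing a file name) and the target format of 16 kHz, mono, 16-bit are assumptions for illustration, not the settings used for the dataset.

```python
import wave
from pathlib import Path


def unpaired_files(directory, audio_ext=".wav", text_ext=".txt"):
    """Return the clip names that have audio without text, or text without audio."""
    root = Path(directory)
    audio = {p.stem for p in root.glob(f"*{audio_ext}")}
    texts = {p.stem for p in root.glob(f"*{text_ext}")}
    return sorted(audio ^ texts)


def nonstandard_audio(directory, rate=16000, channels=1, sample_width=2):
    """Return (file name, (rate, channels, sample width)) for deviating WAV files."""
    deviating = []
    for path in sorted(Path(directory).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            actual = (wav.getframerate(), wav.getnchannels(), wav.getsampwidth())
        if actual != (rate, channels, sample_width):
            deviating.append((path.name, actual))
    return deviating
```

Running both functions before every training run is a cheap way to catch mismatches early; the chosen target format should be written down next to the data so that inference can apply the same settings.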

RESULTS…

TRAINING DATA (HOURS:MINS) - BY LANGUAGE (V1.0)
Locale       de_DE   en_UK   en_US   es_ES   it_IT   uk_UK   ru_RU  fr_FR*
Female      150:56   45:34   63:43   10:37    8:23   10:28   16:04  108:10
Male         39:44    0:00   38:23   72:24   31:45   62:32   30:43   93:32
Mixed        46:42    0:00    0:00   25:33   87:53    0:00    0:00   15:33
TOTAL       237:22   45:34  102:07  108:34  127:40   73:00   46:47  217:15
Quantity       +++       +     +++     +++     +++      ++       +     +++
Quality        +++     +++     +++     +++     +++     +++     +++      ++
*: French is still “work-in-progress”
Grand Total: 741:04 (~82GB) - excluding fr_FR
Additionally planned: pl_PL, nl_NL, tr_TR, pt_PT, pt_BR, ar_…, plus some dialects (de_AT, de_CH, de_DE_SX, de_DE_BY, …)
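
The grand total follows from the TOTAL row once the HOURS:MINS values are converted to minutes. A short sketch of that arithmetic, using the per-language totals exactly as listed (the helper names are made up for this example):

```python
def to_minutes(hhmm):
    """Convert an "HOURS:MINS" string to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes):
    """Convert a number of minutes back to "HOURS:MINS"."""
    return f"{minutes // 60}:{minutes % 60:02d}"


# TOTAL row of the table, without fr_FR (still work-in-progress).
totals = {
    "de_DE": "237:22", "en_UK": "45:34", "en_US": "102:07", "es_ES": "108:34",
    "it_IT": "127:40", "uk_UK": "73:00", "ru_RU": "46:47",
}

grand_total = to_hhmm(sum(to_minutes(value) for value in totals.values()))
print(grand_total)  # 741:04
```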

SO, WHAT? WHAT DOES THIS MEAN FOR YOU?

THE LICENSE
Copyright (c) 2017-2018 MUNICH ARTIFICIAL INTELLIGENCE LABORATORIES GmbH
Redistribution and use in any form, including any commercial use, with or without modification are permitted - bar the exceptions listed below -
provided that the following conditions are met:
1. Redistributions of source data must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this downloaded data, source-code or
binary-code without specific prior written permission.
3. ANY USE BY ANY UNIVERSITY, COLLEGE, RESEARCH INSTITUTE OR SIMILAR HIGHER EDUCATION INSTITUTION IN EUROPE, INCLUDING BY MEMBERS OF SUCH
INSTITUTIONS (including but not limited to the students, tutors and teachers at those institutions), REQUIRES A SEPARATE (free-of-charge) LICENSE AND IS NOT COVERED
BY THIS LICENSE AGREEMENT. PLEASE CONTACT US FOR DETAILS AT info@m-ailabs.bayern.
THIS DATA IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE
and/or DATA, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

WHERE TO GET IT
Available as of today at:
http://data.m-ailabs.eu/speech

… mia san do, um zu helfen (Bavarian for “we are here to help”)
info@m-ailabs.bayern
