1. www.tu-chemnitz.de
Chemnitz ∙ 16. December 2020 ∙ Md Tajul Islam ∙ Miriam Kreher
SEMINAR WEB ENGINEERING (WS 2020/2021)
DATA AUGMENTATION
Data Augmentation
Md Tajul Islam
Miriam Kreher
Mentor: Mahda Noura
2. www.tu-chemnitz.de
Contents
● What is Data Augmentation?
● Why do we need it?
● How does data augmentation work?
● Data Augmentation on Pictures
● Where to use Data Augmentation?
● Data Augmentation in NLP
● Demonstration of an NLP problem
● Pros and cons
● Key takeaways
3. www.tu-chemnitz.de
What is Data Augmentation?
- Expand the size of a training set by creating modified data from the existing data.
4. www.tu-chemnitz.de
Purpose of Data Augmentation
- Enlarge the dataset
- Prevent overfitting
- Improve model performance
5. www.tu-chemnitz.de
What to Augment!
- Audio
- Texts
- Images
- Any other types
6. www.tu-chemnitz.de
Data Augmentation techniques: For Images
- Geometric transformations
- Color space transformations
- Mixing images
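The three image technique families can be illustrated with a minimal sketch. It assumes a grayscale image stored as a nested list of pixel values between 0 and 255; the function names are hypothetical and not taken from any library, and a real project would more likely use an image library such as imgaug.

```python
from typing import List

Image = List[List[int]]  # grayscale image: rows of pixel values in 0..255


def flip_horizontal(img: Image) -> Image:
    """Geometric transformation: mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Geometric transformation: rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def adjust_brightness(img: Image, delta: int) -> Image:
    """Color space transformation: shift every pixel, clamped to 0..255."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]


def mix_images(a: Image, b: Image, alpha: float = 0.5) -> Image:
    """Mixing images: pixel-wise blend of two equally sized images."""
    return [
        [round(alpha * pa + (1 - alpha) * pb) for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# Hypothetical 2x2 example image.
image = [[0, 50], [100, 250]]
flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
brighter = adjust_brightness(image, 20)
mixed = mix_images(image, flipped)
```

Each call returns a new image and leaves the original untouched, so one source image can yield several training examples.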
7. www.tu-chemnitz.de
Data Augmentation techniques: For Text
Original: In my presentation focus topic is Data Augmentation.
Synonym Replacement: In my presentation theme topic is Data Augmentation
Random Insertion: In my presentation focus topic is Data Augmentation techniques
Random Swap: In my presentation topic focus is Data Augmentation
Random Deletion: In my presentation focus topic Data Augmentation
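The four text operations (synonym replacement, random insertion, random swap, random deletion) can be sketched in plain Python. This is only an illustration: the tiny SYNONYMS dictionary is a hypothetical stand-in for a real thesaurus such as WordNet, and the function names are assumptions rather than a library API.

```python
import random
from typing import Dict, List

# Hypothetical toy thesaurus; a real setup would use a resource such as WordNet.
SYNONYMS: Dict[str, List[str]] = {"focus": ["theme"], "topic": ["subject"]}


def synonym_replacement(words: List[str], rng: random.Random) -> List[str]:
    """Replace one randomly chosen word that has an entry in SYNONYMS."""
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    out = list(words)
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words: List[str], rng: random.Random) -> List[str]:
    """Insert a synonym of a random word at a random position."""
    candidates = [w for w in words if w in SYNONYMS]
    out = list(words)
    if candidates:
        new_word = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), new_word)
    return out


def random_swap(words: List[str], rng: random.Random) -> List[str]:
    """Swap the words at two distinct random positions."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words: List[str], rng: random.Random, p: float = 0.1) -> List[str]:
    """Drop each word with probability p, but always keep at least one word."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
sentence = "In my presentation focus topic is Data Augmentation".split()
replaced = synonym_replacement(sentence, rng)
inserted = random_insertion(sentence, rng)
swapped = random_swap(sentence, rng)
deleted = random_deletion(sentence, rng, p=0.2)
```

Passing a seeded random.Random instance keeps the augmentation reproducible, which makes it easier to compare runs with and without augmentation.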
8. www.tu-chemnitz.de
Data Augmentation techniques: For Audio
- Noise Injection
- Shifting
- Changing the speed of the tape
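The three audio techniques can be sketched on a signal stored as a plain list of samples. The function names and parameter values are assumptions for illustration only; the speed change uses simple linear-interpolation resampling, which also changes the pitch, whereas dedicated audio libraries offer pitch-preserving variants.

```python
import math
import random
from typing import List


def inject_noise(signal: List[float], rng: random.Random, level: float = 0.01) -> List[float]:
    """Noise injection: add Gaussian noise with standard deviation `level`."""
    return [s + rng.gauss(0.0, level) for s in signal]


def shift(signal: List[float], offset: int) -> List[float]:
    """Shifting: move the signal by `offset` samples and pad with silence."""
    n = len(signal)
    if offset >= 0:
        return ([0.0] * offset + signal)[:n]
    k = -offset
    return (signal + [0.0] * k)[k:]


def change_speed(signal: List[float], rate: float) -> List[float]:
    """Speed change: resample by linear interpolation (rate > 1 is faster)."""
    if rate <= 0 or not signal:
        raise ValueError("rate must be positive and signal non-empty")
    n_out = max(1, int(len(signal) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


# Hypothetical test tone: 0.1 s of a 440 Hz sine sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
rng = random.Random(0)
noisy = inject_noise(tone, rng)
delayed = shift(tone, 100)
advanced = shift(tone, -100)
faster = change_speed(tone, 2.0)
slower = change_speed(tone, 0.5)
```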
9. www.tu-chemnitz.de
Workflow of Data Augmentation
data set → training set → Data Augmentation → enlarged training set
data set → validation set
enlarged training set, validation set → performance test and evaluation
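A minimal sketch of this workflow, assuming that the split happens first and that only the training set is augmented, so the validation set keeps measuring performance on unmodified data. The toy dataset and the augment function are hypothetical placeholders.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (text, label)


def split(dataset: List[Example], validation_fraction: float,
          rng: random.Random) -> Tuple[List[Example], List[Example]]:
    """Shuffle the data set and split it into training and validation sets."""
    items = list(dataset)
    rng.shuffle(items)
    n_val = int(len(items) * validation_fraction)
    return items[n_val:], items[:n_val]


def augment(example: Example) -> Example:
    """Hypothetical label-preserving transformation: swap first and last word."""
    text, label = example
    words = text.split()
    if len(words) > 1:
        words[0], words[-1] = words[-1], words[0]
    return " ".join(words), label


# Hypothetical toy data set of ten labelled texts.
dataset = [(f"sample review number {i}", i % 2) for i in range(10)]
rng = random.Random(42)

train, validation = split(dataset, 0.2, rng)

# Only the training set is enlarged; the validation set stays untouched.
enlarged_train = train + [augment(ex) for ex in train]
```

Augmenting before the split would risk placing modified copies of validation examples in the training set, which can make the evaluation look better than it is.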
10. www.tu-chemnitz.de
When do I use data augmentation?
- small data set
- complex problem
- transformation of data would be effective and easy
- You know how to find the correct method for your data
- You have a strategy for dealing with overfitting
11. www.tu-chemnitz.de
Data Augmentation Applications
- Natural Language Processing
- Human-computer interaction, understanding human emotion
- Scientific research, medical fields
- Image classification
- Speech recognition, sound classification
- Text classification, text generation
12. www.tu-chemnitz.de
Natural Language Processing: some tasks
- Translation tasks
- Speech emotion recognition
- Speech recognition
- Text classification (demo)
- Text-to-speech
- Syntax and semantic analysis
- Text generation, dialogue management
13. www.tu-chemnitz.de
NLP and Data augmentation in speech emotion recognition
C. Etienne and B. Schmauch, "Speech Emotion Recognition with Data Augmentation and Layer-wise Learning Rate Adjustment"
Problem: class imbalance and a small dataset
Solution: data augmentation by rescaling of spectrograms
Results: improvement in the accuracy of predictions (unweighted accuracy rises from 57.7 to 61.7)

                                            Baseline                  Best model
Augmentation during training                -          -       +      +
Oversampling (×2) of happiness and anger    -          +       +      +
Frequency range (kHz)                       4          4       4      8
Weighted accuracy                           66.4       63.5    64.2   64.5
Unweighted accuracy                         57.7       59.8    60.9   61.7

10-fold cross-validation scores depending on the techniques applied (for each experiment we present the results corresponding to its best run).
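The two metrics in the table can be computed as sketched below, assuming the usual convention in speech emotion recognition: weighted accuracy is the overall share of correct predictions, and unweighted accuracy is the mean of the per-class recalls. The oversample helper and the example labels are hypothetical and only illustrate the "×2" duplication of minority classes.

```python
from collections import Counter, defaultdict
from typing import List, Sequence, Set, Tuple


def weighted_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Overall accuracy: every sample counts equally."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean per-class recall: every class counts equally."""
    total = Counter(y_true)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def oversample(examples: List[str], labels: List[str], classes: Set[str],
               factor: int = 2) -> Tuple[List[str], List[str]]:
    """Repeat every example of the given classes `factor` times."""
    out_x, out_y = [], []
    for x, y in zip(examples, labels):
        copies = factor if y in classes else 1
        out_x.extend([x] * copies)
        out_y.extend([y] * copies)
    return out_x, out_y


# Hypothetical imbalanced example: a model that always predicts "neutral".
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 10
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)

xs, ys = oversample(["a", "b", "c"], ["neutral", "anger", "happiness"],
                    {"anger", "happiness"})
```

In this example the weighted accuracy is high although one class is never recognized, which is why the unweighted accuracy is the more informative number for imbalanced data.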
14. www.tu-chemnitz.de
NLP and Data augmentation in translation tasks
T. Potapczyk, P. Przybysz, M. Chochowski, A. Szumaczuk: "Samsung's System for the IWSLT 2019 End-to-End Speech Translation Task"
Problem: translate English TED lectures to German
Solution: data augmentation by applying audio effects
Results: improvement in BLEU score

Model                                      tst2010
ASR Transformer for SLT baseline           12.76
+ Model averaging                          12.99
+ Dense FF in Decoder                      13.20
+ Spectrogram masking                      13.74
+ Data augmentation (speed x1)             14.85
+ 8 head attention                         15.31
+ Data augmentation (speed + echo x1)      15.56

BLEU scores for models trained on LibriSpeech data. Each subsequent model includes all the previous techniques in the table.
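Two of the techniques named in the table, an echo effect and spectrogram masking, can be sketched as follows. The slides do not give the parameters used in the paper, so the delay, decay, and mask widths here are assumptions, and the function names are hypothetical.

```python
import random
from typing import List


def add_echo(signal: List[float], delay: int, decay: float = 0.5) -> List[float]:
    """Echo: add a delayed, attenuated copy of the signal to itself."""
    out = list(signal)
    for i in range(delay, len(signal)):
        out[i] += decay * signal[i - delay]
    return out


def mask_spectrogram(spec: List[List[float]], rng: random.Random,
                     max_freq_width: int = 2,
                     max_time_width: int = 2) -> List[List[float]]:
    """Zero out one random band of frequency rows and one band of time columns."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [list(row) for row in spec]
    f = rng.randint(0, min(max_freq_width, n_freq))   # height of frequency mask
    f0 = rng.randint(0, n_freq - f)                   # first masked row
    t = rng.randint(0, min(max_time_width, n_time))   # width of time mask
    t0 = rng.randint(0, n_time - t)                   # first masked column
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time
    for row in out:
        row[t0:t0 + t] = [0.0] * t
    return out


# Hypothetical inputs: a unit impulse and a 4x4 spectrogram (rows = frequency bins).
impulse = [1.0, 0.0, 0.0, 0.0]
echoed = add_echo(impulse, delay=2, decay=0.5)

spectrogram = [[float(r * 4 + c + 1) for c in range(4)] for r in range(4)]
masked = mask_spectrogram(spectrogram, random.Random(0))
```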
15. www.tu-chemnitz.de
NLP and Data augmentation in classification tasks
J. Wei and K. Zou, "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks"
Problem: text classification performance depends on the quality and quantity of the data
Solution: data augmentation by applying multiple transformations to the text
Results: the model's performance improves if the right amount of data augmentation is chosen
Figure: Performance on benchmark text classification tasks with and without EDA, for various dataset sizes used for training. [1]
[1] J. Wei and K. Zou, "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks", in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, 2019, p. 6384.
16. www.tu-chemnitz.de
Demo: NLP text classification and Data Augmentation
Dataset of movie reviews (positive / negative)
- Create a model with scikit-learn
- Process the data
- Split the data into test and training set
- Data Augmentation with nlpaug
- Apply classifier: RandomForestClassifier
- Evaluate the model
https://github.com/mimmimkr/nlp_dataaug
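The demo steps can be sketched with scikit-learn as follows. This is not the code from the repository: the twelve toy reviews, the TF-IDF features, and the word-swapping augment function (a stand-in for nlpaug) are assumptions chosen to keep the example small and self-contained.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the movie review dataset.
reviews = [
    "a wonderful and funny film", "great acting and a great story",
    "i loved every minute of it", "brilliant direction and a fine cast",
    "an enjoyable and clever comedy", "a moving and beautiful picture",
    "a boring and silly film", "terrible acting and a weak story",
    "i hated every minute of it", "poor direction and a dull cast",
    "an annoying and stupid comedy", "a tedious and ugly picture",
]
labels = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative

# Split the data into training and test set.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=4, random_state=0, stratify=labels
)


def augment(text: str) -> str:
    """Placeholder augmentation (swap the first two words), not nlpaug."""
    words = text.split()
    if len(words) > 1:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)


# Enlarge only the training set with augmented copies.
X_train_aug = X_train + [augment(t) for t in X_train]
y_train_aug = y_train + y_train

# Apply the classifier and evaluate the model on the untouched test set.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_train_aug, y_train_aug)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
```

With so little data the accuracy value itself is not meaningful; the sketch only shows where the augmentation step sits between the split and the classifier.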
17. www.tu-chemnitz.de
Demo: Data augmentation in nlpaug
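import nlpaug
import nlpaug.augmenter.word as naw

def write_vars_to_file(augtype):
    # "augtype" replaces "type", which would shadow the Python builtin
    for i in range(len(textdata)):
        …
        with open(path, "w") as out:
            # assumption: textdata[i] holds the text of the i-th review
            out.write(doaugment(textdata[i], augtype))

def doaugment(text, augtype):
    ...
    if augtype == 'spell':
        aug = naw.SpellingAug()
    augsentences = []
    for sentence in sentences:
        # depending on the nlpaug version, augment() may return a list
        # instead of a string
        augsentence = aug.augment(sentence)
        augsentences.append(augsentence)
    …
    return ' '.join(augsentences)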
18. www.tu-chemnitz.de
Demo: performing Data Augmentation
19. www.tu-chemnitz.de
Demo: Result of Augmentation with synonym replacement
Original review:
a sci fi/comedy starring jack nicholson , pierce brosnan , annette benning , glenn close , martin short and other stars.
a warner bros picture
the martians have landed in this hillarous tim burton movie.
before entering the cinema , i was initially a little bit nervous about what this film would be like .
many people were saying that this film was silly rubbish , and there was no point to it all .
how wrong they were .
i left this film feeling much happier than i was before i entered the cinema .
the story is about martians attacking earth .
using ray guns ( hooray ! )
they generally cause havoc around the u . s and other countries.

Augmented review:
a sci fi / funniness star knave nicholson, president pierce brosnan, annette benning, john herschel glenn jr. close, dino paul crocetti short and other stars.
a charles dudley warner bros picture
the martians hold landed in this hillarous tim burton motion picture.
before entering the movie theater, i was ab initio a little bit nervous about what this plastic film would comprise similar.
many people were saying that this film be silly rubbish, and at that place was no point to it all.
how wrong they were.
i left this motion picture show feeling much happier than i was before single entered the cinema.
the chronicle is about martians attacking worldly concern.
using shaft of light gun (hooray! )
they generally cause havoc around the u. due south and other state.
20. www.tu-chemnitz.de
Demo: Results and Discussion
Figure: accuracy of the predictions of the text classifier without augmentation
- Performed text classification with an accuracy of 85%
- Augmented over 2000 text files
- Improved the model with augmentation
21. www.tu-chemnitz.de
Demo: BONUS - image augmentation with imgaug
Augmentation by: cropping, scaling, artistic filters, weather, blur, rotation, flip...
22. www.tu-chemnitz.de
Demo: BONUS - image augmentation with imgaug
120 images, 180 seconds
23. www.tu-chemnitz.de
Data augmentation: pros and cons
+                                        -
easy implementation, low effort          time consuming (sometimes)
tailored approach, no model changes      trial and error
performance boost: easy to measure       relational gaps between data and augmented data
can help with overfitting                can lead to overfitting if not done right
24. www.tu-chemnitz.de
Data augmentation: key takeaway
small data set → Data Augmentation → enlarged data set → model
25. www.tu-chemnitz.de
Data augmentation: References
U. Malik, "Text Classification with Python and Scikit-Learn," Stack Abuse. [Online]. Available: https://stackabuse.com/text-classification-with-python-and-scikit-learn/. [Accessed: 03-Dec-2020].
T. Tran, T. Pham, G. Carneiro, L. Palmer and I. Reid, "A Bayesian Data Augmentation Approach for Learning Deep Models," in 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California, 2017.
J. Wei and K. Zou, "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, 2019, pp. 6382-6388.
C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, "CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation," arXiv.org, 11-Sep-2018. [Online]. Available: https://arxiv.org/abs/1802.05630. [Accessed: 15-Dec-2020].
T. Potapczyk, P. Przybysz, M. Chochowski, and A. Szumaczuk, "Samsung's System for the IWSLT 2019 End-to-End Speech Translation Task," Zenodo, 02-Nov-2019. [Online]. Available: https://zenodo.org/record/3525498. [Accessed: 15-Dec-2020].
E. Ma, "Data Augmentation in NLP," Medium, 04-Jun-2019. [Online]. Available: https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28. [Accessed: 15-Dec-2020].
V. Iosifidis and E. Ntoutsi, "Dealing with Bias via Data Augmentation in Supervised Learning Scenarios," Semanticscholar.org, 2020. [Online]. Available: http://ceur-ws.org/Vol-2103/paper_5.pdf. [Accessed: 04-Dec-2020].
26. www.tu-chemnitz.de
Thank you all