Improving neural network models for Russian NLP with synonyms
1. improving neural network models for natural language processing in russian with synonyms
Ruslan Galinsky, Anton Alekseev and Sergey Nikolenko
AINL 2016
St. Petersburg, November 11, 2016
Steklov Institute of Mathematics at St. Petersburg
Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg
2. neural networks for nlp
• Recent success on numerous tasks, from text classification to syntactic parsing.
• Training ANNs and tuning their parameters is hard.
• For many tasks there is not enough data to apply advanced neural approaches.
3. data augmentation
• Widely applied in computer vision:
shifting, cropping, resizing images, adding noise.
• Not aimed at denoising; the goal is to build a bigger, better dataset for the task.
• Can be used for different NLP tasks.
PyData 2015 London, Python for Image Understanding: Deep Learning with Convolutional Neural Nets.
4. data augmentation for language
• One cannot simply insert, swap, or remove random words.
• Paraphrases are expensive and hard to collect.
• Modify word order?
• Augment with synonyms?
• Augment with words of "possibly unimportant" parts of speech?
5. quick glance at our study
• Character-level model for Russian language data.
• Text classification task: sentiment analysis.
• Augmentation with synonyms, adjectives and syntax-unaware
word shuffling.
6. why try character-level models for nlp
• The traditional (arguably) approach: various word-level models for different tasks.
Conceptual flaws:
• dealing with different forms of the same lexeme,
• inferring the meaning of unknown words,
• word dictionaries can be large,
• spell-checking may be required.
• Hopefully all these can be addressed at the subword level,
especially for morphology-rich languages, such as Russian.
• Character-level models work with text as a sequence of symbols.
7. model
• Xiang Zhang, Junbo Zhao, Yann LeCun.
Character-level Convolutional Networks for Text Classification.
2015.
• The paper suggested an NN architecture and augmentation with thesauri; the authors collected their own large datasets.
• Architecture key points:
• input: texts as padded sequences of one-hot-encoded symbols,
• 6 conv-layers, 3 max-pooling layers, 3 dense layers with 2 dropout
layers in between,
• output: one-hot-encoded label.
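The input encoding can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy alphabet and maximum length here are placeholder values, and real char-level ConvNets use a much larger alphabet and sequence length.

```python
import numpy as np

def one_hot_encode(text, alphabet, max_len):
    """Encode a text as a (max_len, |alphabet|) matrix of one-hot
    character vectors; shorter texts are padded with all-zero rows."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    mat = np.zeros((max_len, len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(text[:max_len]):
        if ch in index:              # out-of-alphabet symbols stay all-zero
            mat[pos, index[ch]] = 1.0
    return mat

# Toy example: alphabet {a, b}, sequence length 6.
x = one_hot_encode("abba", "ab", max_len=6)
```

The resulting matrix is what the first convolutional layer consumes; padding to a fixed length keeps the input shape constant across texts.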
8. our study
• Sentiment analysis for Russian language datasets with
char-level ConvNet.
• Check which augmentation techniques are applicable:
synonyms, adjectives, shuffling.
• Test the model on data from a different domain.
9. augmenting with synonyms
• Method:
1. Select all nouns and adjectives (e.g., with pymorphy).
2. For each word:
2.1 lemmatize the word,
2.2 look for replacement candidates in a pairs-of-synonyms dictionary
(empirical observation: the 'synonymy relation' should be symmetric),
2.3 choose the replacement by sampling a synonym from a multinomial
distribution over the candidates (the distribution reflects word occurrence
statistics),
2.4 "un-lemmatize" the chosen word.
3. Insert the replacement into the appropriate position in the text.
• Synonyms sources:
• Abramov, N. Dictionary of Russian Synonyms and Synonymous
Phrases. Moscow: Russkie Slovari. 1999.
• Alexandrova, Z. E. Dictionary of Russian Synonyms. Moscow:
Russkii Yazyk. 2001.
10. augmenting with word shuffling
• We tried a simple bag-of-words-like approach: random
shuffling.
humpty dumpty sat on the wall → wall the on dumpty sat humpty
• Proposal for future research: syntax parsing, swapping certain parse
subtrees.
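A minimal sketch of the shuffling augmentation; whitespace tokenization is an assumption made for illustration:

```python
import random

def shuffle_augment(text, rng):
    """Produce a bag-of-words-style augmented copy of a text
    by randomly reordering its tokens."""
    tokens = text.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

print(shuffle_augment("humpty dumpty sat on the wall", random.Random(0)))
```

The augmented copy keeps exactly the original tokens, only their order changes.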
11. augmenting with adjectives
• Suggested procedure:
1. Count all word bigrams with POS patterns "adjective-noun" and "noun-adjective".
2. For every noun 𝑤 that does not have an associated adjective in a
given text:
2.1 sample whether to add an adjective, based on prior statistics of any
adjective's occurrence next to 𝑤,
2.2 if an adjective should be added, sample which one from a
frequency-based multinomial distribution,
2.3 decide whether it should be added before or after the noun 𝑤,
sampling from the corresponding distribution,
2.4 insert the chosen adjective into the appropriate place in the text.
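The procedure above can be sketched as follows. The bigram statistics are hypothetical toy data standing in for counts gathered from a POS-tagged corpus.

```python
import random
from collections import Counter

# Hypothetical bigram statistics: how often each adjective appears
# next to a given noun, on which side, and the noun's total count.
ADJ_FOR_NOUN = {"hotel": Counter({"nice": 6, "cheap": 3})}
SIDE_FOR_NOUN = {"hotel": Counter({"before": 8, "after": 1})}
NOUN_COUNT = {"hotel": 20}

def maybe_add_adjectives(tokens, nouns, rng):
    """For each bare noun, sample whether to insert an adjective,
    which one, and on which side, from frequency-based distributions."""
    out = []
    for tok in tokens:
        if tok in nouns and ADJ_FOR_NOUN.get(tok):
            adjs = ADJ_FOR_NOUN[tok]
            # Prior probability that this noun co-occurs with any adjective.
            p_add = sum(adjs.values()) / NOUN_COUNT.get(tok, 1)
            if rng.random() < p_add:
                adj = rng.choices(list(adjs), weights=list(adjs.values()))[0]
                sides = SIDE_FOR_NOUN[tok]
                side = rng.choices(list(sides), weights=list(sides.values()))[0]
                out.extend([adj, tok] if side == "before" else [tok, adj])
                continue
        out.append(tok)
    return out

print(maybe_add_adjectives(["a", "hotel"], {"hotel"}, random.Random(1)))
```

A full implementation would also skip nouns that already have an adjacent adjective in the text, as step 2 requires.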
13. evaluation: accuracy, test set
Figure 1: A comparison of test set accuracies of all models in the study (accuracy, 0.7 to 1.0, over training epochs 5 to 50, for the basic dataset, reshuffled, augmented with synonyms, and augmented with adjectives).
14. evaluation: test set vs different domain
Table 2: Experimental results

Dataset                          Best accuracy
                                 Test set   TripAdvisor set
Basic dataset                    0.8457     0.7163
Basic with reshuffled words      0.8445     0.7160
Augmented with adjectives        0.7241     0.5430
Augmented with synonyms          0.8700     0.7020
15. summary
• Several approaches to data augmentation in the context of
character-level models were suggested.
• Simple augmentation with synonyms showed significant
improvements for a sentiment analysis task.
• Not every augmentation technique is beneficial; each should be
tested carefully.
• Adding new words to reviews makes sense even if they slightly
violate grammatical rules.