Improving neural network models for Russian NLP with synonyms
1. improving neural network models for natural language processing in russian with synonyms
Ruslan Galinsky, Anton Alekseev and Sergey Nikolenko
AINL 2016
St. Petersburg, November 11, 2016
Steklov Institute of Mathematics at St. Petersburg
Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg
2. neural networks for nlp
• Recent success on numerous tasks, from text classification to syntactic parsing.
• Training ANNs and tuning their parameters is hard.
• For many tasks there is not enough data to apply advanced neural approaches.
3. data augmentation
• Widely applied in computer vision:
shifting, cropping, resizing images, adding noise.
• Not aimed at denoising; the goal is to build a bigger, better dataset for the task.
• Can be used for different NLP tasks.
PyData 2015 London, Python for Image Understanding: Deep Learning with Convolutional Neural Nets.
4. data augmentation for language
• One cannot simply insert, swap, or remove random words.
• Paraphrases are expensive and hard to collect.
• Modify word order?
• Augment with synonyms?
• Augment with words of "possibly unimportant" parts of speech?
5. quick glance at our study
• Character-level model for Russian language data.
• Text classification task: sentiment analysis.
• Augmentation with synonyms, adjectives and syntax-unaware
word shuffling.
6. why try character-level models for nlp
• The traditional (arguably) approach: various word-level models for different tasks.
Conceptual flaws:
• dealing with different forms of the same lexeme,
• inferring the meaning of unknown words,
• word dictionaries can be large,
• spell-checking may be required.
• Hopefully all these can be addressed at the subword level,
especially for morphology-rich languages, such as Russian.
• Character-level models work with text as a sequence of symbols.
7. model
• Xiang Zhang, Junbo Zhao, Yann LeCun.
Character-level Convolutional Networks for Text Classification.
2015.
• The paper suggested an NN architecture and augmentation with thesauri; the authors collected their own large datasets.
• Architecture key points:
• input: texts as padded sequences of one-hot-encoded symbols,
• 6 conv-layers, 3 max-pooling layers, 3 dense layers with 2 dropout
layers in between,
• output: one-hot-encoded label.
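The input encoding can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy alphabet and maximum length here are placeholder values, and real char-level ConvNets use a much larger alphabet and sequence length.

```python
import numpy as np

def one_hot_encode(text, alphabet, max_len):
    """Encode a text as a (max_len, |alphabet|) matrix of one-hot
    character vectors; shorter texts are padded with all-zero rows."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    mat = np.zeros((max_len, len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(text[:max_len]):
        if ch in index:              # out-of-alphabet symbols stay all-zero
            mat[pos, index[ch]] = 1.0
    return mat

# Toy example: alphabet {a, b}, sequence length 6.
x = one_hot_encode("abba", "ab", max_len=6)
```

The resulting matrix is what the first convolutional layer consumes; padding to a fixed length keeps the input shape constant across texts.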
8. our study
• Sentiment analysis for Russian language datasets with
char-level ConvNet.
• Check which augmentation techniques are applicable:
synonyms, adjectives, shuffling.
• Test the model on data from a different domain.
9. augmenting with synonyms
• Method:
1. Select all nouns and adjectives (e.g., with pymorphy).
2. For each word:
2.1 lemmatize the word,
2.2 look for replacement candidates in a pairs-of-synonyms dictionary
(empirical observation: the 'synonymy relation' should be symmetric),
2.3 choose the replacement by sampling a synonym from a multinomial
distribution over the candidates (the distribution reflects word occurrence
statistics),
2.4 "un-lemmatize" the chosen word.
3. Insert the replacement into the appropriate position in the text.
• Synonyms sources:
• Abramov, N. Dictionary of Russian Synonyms and Synonymous
Phrases. Moscow: Russkie Slovari. 1999.
• Alexandrova, Z. E. Dictionary of Russian Synonyms. Moscow:
Russkii Yazyk. 2001.
10. augmenting with word shuffling
• We tried a simple bag-of-words-like approach: random
shuffling.
humpty dumpty sat on the wall → wall the on dumpty sat humpty
• Proposal for future research: syntax parsing, swapping certain parse
subtrees.
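A minimal sketch of the shuffling augmentation; whitespace tokenization is an assumption made for illustration:

```python
import random

def shuffle_augment(text, rng):
    """Produce a bag-of-words-style augmented copy of a text
    by randomly reordering its tokens."""
    tokens = text.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

print(shuffle_augment("humpty dumpty sat on the wall", random.Random(0)))
```

The augmented copy keeps exactly the original tokens, only their order changes.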
11. augmenting with adjectives
• Suggested procedure:
1. Count all word bigrams with POS patterns "adjective-noun" and "noun-adjective".
2. For every noun 𝑤 that does not have an associated adjective in a
given text:
2.1 sample whether to add an adjective, based on prior statistics of any
adjective's occurrence next to 𝑤,
2.2 if an adjective should be added, sample which one from a
frequency-based multinomial distribution,
2.3 decide whether it should be added before or after the noun 𝑤,
sampling from the corresponding distribution,
2.4 insert the chosen adjective into the appropriate place in the text.
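The procedure above can be sketched as follows. The bigram statistics are hypothetical toy data standing in for counts gathered from a POS-tagged corpus.

```python
import random
from collections import Counter

# Hypothetical bigram statistics: how often each adjective appears
# next to a given noun, on which side, and the noun's total count.
ADJ_FOR_NOUN = {"hotel": Counter({"nice": 6, "cheap": 3})}
SIDE_FOR_NOUN = {"hotel": Counter({"before": 8, "after": 1})}
NOUN_COUNT = {"hotel": 20}

def maybe_add_adjectives(tokens, nouns, rng):
    """For each bare noun, sample whether to insert an adjective,
    which one, and on which side, from frequency-based distributions."""
    out = []
    for tok in tokens:
        if tok in nouns and ADJ_FOR_NOUN.get(tok):
            adjs = ADJ_FOR_NOUN[tok]
            # Prior probability that this noun co-occurs with any adjective.
            p_add = sum(adjs.values()) / NOUN_COUNT.get(tok, 1)
            if rng.random() < p_add:
                adj = rng.choices(list(adjs), weights=list(adjs.values()))[0]
                sides = SIDE_FOR_NOUN[tok]
                side = rng.choices(list(sides), weights=list(sides.values()))[0]
                out.extend([adj, tok] if side == "before" else [tok, adj])
                continue
        out.append(tok)
    return out

print(maybe_add_adjectives(["a", "hotel"], {"hotel"}, random.Random(1)))
```

A full implementation would also skip nouns that already have an adjacent adjective in the text, as step 2 requires.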
13. evaluation: accuracy, test set
Figure 1: A comparison of test set accuracies of all models in the study (accuracy, 0.7 to 1.0, over training epochs 5 to 50, for the basic dataset, reshuffled, augmented with synonyms, and augmented with adjectives).
14. evaluation: test set vs different domain
Table 2: Experimental results

Dataset                          Best accuracy
                                 Test set   TripAdvisor set
Basic dataset                    0.8457     0.7163
Basic with reshuffled words      0.8445     0.7160
Augmented with adjectives        0.7241     0.5430
Augmented with synonyms          0.8700     0.7020
15. summary
• Several approaches to data augmentation in the context of
character-level models were suggested.
• Simple augmentation with synonyms showed significant
improvements for a sentiment analysis task.
• Not every augmentation technique is beneficial; each should be
tested carefully.
• Adding new words to reviews makes sense even if they slightly
violate grammatical rules.