2. Why analyze sentiment in Serbian?
● Great industrial need
  – Automated market research
  – Ad websites
  – Customer satisfaction
● NLP tools for Serbian are underdeveloped
  – Need for tools and resources
  – Almost no tools accessible through an API
3. Serbian language
● Belongs to the Indo-European language family
● Slavic language
● Highly inflectional
● 3 pronunciation types
● 3 dialect groups
● Phonemic spelling ("write as you speak")
● Latin and Cyrillic writing systems
5. Tokenization and preprocessing
● Tokenization: the process of breaking a stream of text up into words
● Stop-word filtering
● Negation handling
  – Add an NE_ prefix to words that follow a negation
  – Applies to all words up to the next punctuation mark
● Irregular verbs
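
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the stop-word list, the negation list, the token pattern, and the choice to drop the negation word itself are all assumptions made for the example.

```python
import re

# Hypothetical, tiny word lists for illustration only.
STOP_WORDS = {"i", "je", "su", "u", "na", "da", "se", "ali"}
NEGATIONS = {"ne", "ni", "nije", "nisam", "nisu", "neću", "nemam"}
PUNCTUATION = set(".,!?;:")

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[.,!?;:]")

def preprocess(text):
    """Tokenize, drop stop words, and mark negated words with NE_."""
    result = []
    negating = False
    for tok in TOKEN_RE.findall(text.lower()):
        if tok in PUNCTUATION:
            negating = False      # punctuation closes the negation scope
            continue
        if tok in NEGATIONS:
            negating = True       # assumption: the negation word is dropped
            continue
        if tok in STOP_WORDS:
            continue
        result.append("NE_" + tok if negating else tok)
    return result

print(preprocess("Film nije dobar, ali glumci su odlični."))
# → ['film', 'NE_dobar', 'glumci', 'odlični']
```

Listing irregular negated verb forms such as "nisam" or "neću" in the negation set is one simple way to cover verbs that fuse with the negation instead of following a separate "ne".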
6. Stemming
● Process of reducing inflected words to their stem, base or root form
● Kešelj and Šipka (2008)
● Hand-crafted rule-based stemmer
● ~300 rules
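
A rule-based stemmer of this kind can be sketched as longest-suffix stripping. The suffix list and the minimum stem length below are hypothetical and chosen only for illustration; they are not the ~300 rules of the stemmer described on this slide.

```python
# Hypothetical suffix rules for illustration; NOT the actual rule set.
SUFFIXES = ["ovima", "ama", "ima", "om", "em", "e", "a", "u", "i", "o"]
MIN_STEM_LEN = 3  # assumed guard so that short words are left intact

# Try longer suffixes first so "ama" wins over "a".
ORDERED_SUFFIXES = sorted(SUFFIXES, key=len, reverse=True)

def stem(word):
    """Strip the longest matching suffix that leaves a long enough stem."""
    word = word.lower()
    for suffix in ORDERED_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word

print(stem("knjigama"))   # → knjig
print(stem("gradovima"))  # → grad
print(stem("oko"))        # → oko (too short to strip)
```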
7. Sentiment analysis
● Aim: build a binary sentiment analyzer
● General Serbian language
● No annotated corpus for Serbian
● Annotation work (~1000 short texts)
● Supervised machine learning
8. Naive Bayes
● Algorithm that learns fast
● Bag-of-words approach
● Assumption of conditional independence
● Laplace smoothing
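
The points above can be illustrated with a small multinomial Naive Bayes classifier over bag-of-words token lists, using add-one (Laplace) smoothing. This is a sketch under assumptions: the class name, the `alpha` parameter, the toy training data, and the choice to skip words unseen in training are illustrative and not taken from the original system.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log P(label) + sum of log P(word | label); words are treated
            # as conditionally independent given the label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # assumption: ignore words unseen in training
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for the example.
docs = [["dobar", "film"], ["odličan", "film"], ["loš", "film"], ["NE_dobar", "film"]]
labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes()
model.fit(docs, labels)
print(model.predict(["dobar"]))     # → pos
print(model.predict(["NE_dobar"]))  # → neg
```

Without smoothing, a word that never occurred with one label would give that label zero probability; adding `alpha` to every count avoids this.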
9. Implementation
● Web API with presentation layer
● JSON communication
● Secured page for annotating
● Built using PHP and MySQL
● Web & Android
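
As a rough illustration of JSON communication with such an API, the sketch below decodes a request, classifies the text, and encodes a reply. The original system is built in PHP; Python is used here only for illustration, and the function name `handle_request` and the field names `text` and `sentiment` are hypothetical, not the actual API.

```python
import json

def handle_request(body, classify):
    """Decode a JSON request body, classify its text, return a JSON reply.

    `classify` is any function mapping a string to a label; the field
    names "text" and "sentiment" are assumed for this example.
    """
    request = json.loads(body)
    label = classify(request["text"])
    # ensure_ascii=False keeps Serbian letters such as č and š readable.
    return json.dumps({"text": request["text"], "sentiment": label},
                      ensure_ascii=False)

# Usage with a stand-in classifier that always answers "positive".
reply = handle_request('{"text": "Odličan film"}', lambda text: "positive")
print(reply)
```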
10. Results
● Stemmer
  – 90% correct on news articles
  – Smallest and most precise stemmer
  – Problems: small words, irregular inflections, voice changes
● Sentiment analyzer
  – 80% correct
  – Problems: irony, ambiguity, small training data