Sentiment Analysis in Python - Waseda Data Science Week 2019

Slides from the Sentiment Analysis in Python workshop, held as part of the Data Science Week at Waseda, January 2019. The accompanying Jupyter Notebook code can be found at http://www.robfahey.co.uk/blog/social-media-data-workshop-waseda/

Published in: Education

  1. 1. Workshop: Sentiment Analysis with Python Rob Fahey robfahey@fuji.waseda.jp @robfahey Data Science Week at Waseda, January 2019
  2. 2. Sentiment Analysis. Also called “Tone Analysis” (Grimmer & Stewart 2013) or “Opinion Mining” (Dave, Lawrence & Pennock 2003). Whatever you call it, the question it aims to answer is always the same: “How does it make you feel?”
  3. 3. THE OBJECTIVE • In the Internet age, humans create and publish billions of pieces of content (text, movies, images etc.) every single day. • Many of those data express a sentiment about a subject of some kind. • By selecting data related to a subject (a person, a country, a brand, etc.), we can measure public sentiment in a very detailed way. • We can even see how sentiment changes minute-by-minute, or day-by-day – giving us unprecedented insights into political trends, marketing campaigns or financial market movements.
  4. 4. THE CHALLENGE • Sentiment Analysis is easy for humans, but hard for computers. • Humans: can process complex texts, images or videos with an understanding of cultural and social contexts, allowing us to quickly and naturally judge the sentiment or emotion being expressed. • Computers: can count things really, really fast. • Sentiment Analysis methodologies all try to overcome the weaknesses of computers (no context, no understanding) by using their strengths (counting very fast!).
  5. 5. TWO APPROACHES UNSUPERVISED METHODS (No Training Data Required) • Dictionary / Lexicon Methods • Word Embeddings SUPERVISED METHODS (Requires Training Data) • Classification Algorithms • Aggregate Algorithms
  6. 6. HOW A MACHINE LEARNS • To carry out “Machine Learning”, the machine needs something to learn from. • In dictionary approaches, you teach the computer a lexicon – a set of words that are associated with different sentiments (for example, “great” = +1, “awful” = -1). • This approach can be improved (or at least complicated) by using techniques like word embeddings, which try to estimate the sentiment of unknown words by seeing how frequently they occur in proximity to known words; • Or by trying to consider the grammatical context in which a word appears.
  7. 7. HOW A MACHINE LEARNS (2) • In supervised approaches, the computer instead learns from a set of sample data which you have categorized by hand, using human coding. • There are lots of different algorithms and approaches for supervised learning, but they all have this in common – you need to create training data first. • The algorithms try to learn the patterns which are associated with each sentiment. Negative: “This movie was terrible - why would Brad Pitt agree to star in this rubbish? It’s not like he needs the money.” Positive: “Just had a great time at the cinema, what a fantastic movie! I don’t want to ruin the ending but it’s a crazy surprise. Well worth the money.”
  8. 8. PREPARING YOUR DATA: WORD SEGMENTATION • The first challenge is how to divide sentences in your data into words. • In English or other European languages, this is fairly easy – These / languages / have / spaces / between / the / words. • It’s not quite that simple, though – a process called stemming is often used to reduce every word to its simplest form by removing plurals, tenses etc. • Otherwise the computer won’t know that “dog” and “dogs”, or “go” and “going”, express the same concept!
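A toy illustration of the stemming idea on this slide, assuming a hypothetical `naive_stem` helper that strips only a few common English suffixes. Real projects normally rely on an established stemmer (the Porter stemmer is a common choice); this sketch only shows why "dogs" and "going" end up counted together with "dog" and "go".

```python
# Toy suffix-stripping stemmer (hypothetical helper, NOT the Porter algorithm).
SUFFIXES = ("ing", "es", "s")  # checked in this order


def naive_stem(word):
    """Lower-case a word and strip at most one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least two characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def tokenize(sentence):
    """Split on whitespace, drop basic punctuation, and stem each token."""
    return [naive_stem(token.strip(".,!?")) for token in sentence.split()]


print(tokenize("Dogs keep going"))
```

A rule this crude will mangle many words (for example, it turns "movies" into "movi"), which is one reason established stemmers use longer rule sets.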
  9. 9. PREPARING YOUR DATA: WORD SEGMENTATION (IN OTHER LANGUAGES) • In other languages like Japanese, word segmentation is more challenging. • 日本語の文書は言葉と言葉の間にスペースがないから、形態素解析をしないといけない。 (“Japanese text has no spaces between words, so morphological analysis is necessary.”) Where do the words begin and end in that sentence? • Thankfully there is software to help with this process in many languages. • Japanese: MeCab, ChaSen, Janome (Python package) • Chinese (and Arabic): Stanford Word Segmenter • Korean: Open-Korean-Text (looks good, but I haven’t tried it)
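To make the problem concrete, here is a deliberately simple greedy longest-match segmenter run on the first part of the example sentence. The tiny `VOCAB` set and the `segment` function are assumptions invented for this sketch; tools such as MeCab rely on large dictionaries and statistical models rather than this greedy rule.

```python
# Greedy longest-match segmentation with a hand-made vocabulary (illustrative only).
VOCAB = {"日本語", "の", "文書", "は", "言葉", "と", "間", "に", "スペース", "が", "ない"}
MAX_LEN = max(len(word) for word in VOCAB)


def segment(text):
    """Split text into the longest vocabulary words, scanning left to right."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in VOCAB or size == 1:
                words.append(candidate)
                i += size
                break
    return words


print(" / ".join(segment("日本語の文書は言葉と言葉の間にスペースがない")))
```

Without the vocabulary the function would fall back to single characters, which is why segmentation quality depends so heavily on the dictionary behind the tool.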
  10. 10. DICTIONARY APPROACHES • To use a dictionary approach, you need to start by acquiring a dictionary (or “lexicon”) which you’ll use to calculate sentiment. • There are many of these available for the English language and other major languages. In minority languages, however, these resources might not be available – or might be of very dubious quality. • Your dictionary needs to be appropriate to your text. Using a dictionary full of Twitter slang on newspaper texts will yield bad results – and vice versa.
  11. 11. A SIMPLE EXAMPLE “Just had a great time at the cinema, what a fantastic movie! I don’t want to ruin the ending but it’s a crazy surprise. Well worth the money.” “This movie was terrible - why would Brad Pitt agree to star in this rubbish? It’s not like he needs the money.”
  12. 12. A SIMPLE EXAMPLE…? This movie has a fantastic cast, an interesting concept and amazing special effects – but the end result is utterly boring.
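A minimal sketch of the dictionary approach applied to the three reviews above. The `LEXICON` weights and the `lexicon_score` function are made up for illustration; they are not taken from any published lexicon or from the workshop notebook.

```python
import re

# Hypothetical mini-lexicon: positive words score +1, negative words -1.
LEXICON = {
    "great": 1, "fantastic": 1, "interesting": 1, "amazing": 1, "worth": 1,
    "terrible": -1, "rubbish": -1, "boring": -1, "ruin": -1, "crazy": -1,
}


def lexicon_score(text):
    """Sum the lexicon weights of every word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


positive = ("Just had a great time at the cinema, what a fantastic movie! "
            "I don't want to ruin the ending but it's a crazy surprise. "
            "Well worth the money.")
negative = ("This movie was terrible - why would Brad Pitt agree to star in "
            "this rubbish? It's not like he needs the money.")
mixed = ("This movie has a fantastic cast, an interesting concept and amazing "
         "special effects - but the end result is utterly boring.")

for label, text in [("positive", positive), ("negative", negative), ("mixed", mixed)]:
    print(label, lexicon_score(text))
```

With these assumed weights the third review comes out clearly positive even though its verdict is negative, and "ruin" and "crazy" drag the first review down. Simple word counting cannot see that the final clause overrides the praise.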
  13. 13. DICTIONARY APPROACHES PLEASE OPEN JUPYTER LAB!
  14. 14. THE BAG OF WORDS • You may have noticed something about the examples we looked at – the order of the words doesn’t matter. • This is actually true of (almost) every sentiment analysis approach (and text mining approaches in general). • It’s counter-intuitive, but computers are much better at treating texts as a “bag of words” than they are at understanding grammar, word order etc.
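A quick way to see what the bag-of-words assumption discards, using only the standard library: the word counts of a sentence and of a shuffled copy are identical. The example sentence is adapted from the positive review used earlier.

```python
import random
from collections import Counter

sentence = "what a fantastic movie well worth the money"
tokens = sentence.split()

shuffled = tokens[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

# A bag of words keeps only the counts, so both orderings look identical.
print(Counter(tokens) == Counter(shuffled))
```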
  15. 15. VECTOR REPRESENTATIONS • Often, after dividing the sentence into words, we represent it using a vector of word frequencies. An entire corpus of documents can be represented in a single matrix: the term-document matrix (TDM).

      Document                  I  Like  To  Eat  Sushi  You  Burgers  She  Doesn’t
      I like to eat sushi       1  1     1   1    1      0    0        0    0
      You like to eat burgers   0  1     1   1    0      1    1        0    0
      She doesn’t like sushi    0  1     0   0    1      0    0        1    1
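The matrix on this slide can be rebuilt with a few lines of plain Python. This is a sketch: `build_matrix` is a hypothetical helper, and in practice a library vectorizer would normally do this job.

```python
docs = [
    "I like to eat sushi",
    "You like to eat burgers",
    "She doesn't like sushi",
]


def build_matrix(documents):
    """Return (vocabulary, rows) with one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:  # keep first-seen order
                vocabulary.append(token)
    rows = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, rows


vocabulary, rows = build_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one document and each column one term, so the matrix grows with both the corpus and the vocabulary.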
  16. 16. FEATURE SELECTION • A term-document matrix could easily get VERY big – overwhelming a computer’s memory and taking a very long time to process. We often need to focus somehow on the most relevant terms in the vocabulary. How? • Stopwords: Very commonly used words are of little value in distinguishing documents, so we can remove them. • Document Frequency: Ignoring words which appear in too many or too few documents allows us to focus only on words useful to our research. • TF-IDF: Less useful for short documents (e.g. Twitter), but “Term Frequency / Inverse Document Frequency” points out words that are especially good at distinguishing differences between texts.
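The three filters above can be sketched on a toy corpus. The stopword list is a made-up assumption, and the weighting shown (raw term count times log(N / document frequency)) is only one of several common TF-IDF variants.

```python
import math

docs = [
    "i like to eat sushi",
    "you like to eat burgers",
    "she does not like sushi",
]
STOPWORDS = {"i", "you", "she", "to", "does", "not"}  # hypothetical stopword list

# Step 1: remove stopwords.
tokenized = [[w for w in doc.split() if w not in STOPWORDS] for doc in docs]
n_docs = len(tokenized)

# Step 2: document frequency, i.e. in how many documents each term appears.
doc_freq = {}
for tokens in tokenized:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Drop terms that appear in every document; they cannot separate the texts.
vocabulary = sorted(t for t, df in doc_freq.items() if df < n_docs)


def tf_idf(tokens, term):
    """Step 3: raw term count times log(N / document frequency)."""
    return tokens.count(term) * math.log(n_docs / doc_freq[term])


print(vocabulary)
for tokens in tokenized:
    print({term: round(tf_idf(tokens, term), 3) for term in vocabulary})
```

Here "like" disappears because it occurs in every document, while "burgers" receives the highest weight because it occurs in only one.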
  17. 17. CLASSIFICATION ALGORITHMS • Classification algorithms are the most commonly used tool in machine learning – not just in text mining, but also in fields like voice recognition, computer vision or predicting behaviour. • They are essentially tools for pattern recognition – you show them a number of labelled examples of vector representations (in our case, term-document matrices) and they try to find the patterns which maximise the probability of a vector belonging to a certain label.
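As a sketch of the pattern-recognition idea, here is a small multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The training sentences, the `train` and `predict` functions, and the labels are all invented for this illustration; a real study would need far more hand-coded data and would typically use a library implementation.

```python
import math
from collections import Counter

# Hand-labelled toy training data (invented for this sketch).
TRAIN = [
    ("what a fantastic movie well worth the money", "positive"),
    ("great cast and a great ending", "positive"),
    ("this movie was terrible rubbish", "negative"),
    ("utterly boring and a waste of money", "negative"),
]


def train(examples):
    """Count words per label and how many documents each label has."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict("a fantastic ending", model))
print(predict("boring rubbish", model))
```

The classifier never sees a rule such as "fantastic is positive"; it only learns which words co-occur with which hand-assigned label.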
  18. 18. CHOOSING AN ALGORITHM • There are many kinds of classification algorithm – from simple statistical methods like Naïve Bayes, to evolutions of regression-based approaches like Support Vector Machines, to science-fiction sounding approaches like Random Forest (which constructs a “forest” of “decision trees” and uses them to vote on a classification) and Neural Networks (which were designed to emulate the decision-making behavior of neurons in the human brain). • How do you pick the best one for your research? • Simple answer: try them all and see what works best. Luckily,
  19. 19. CLASSIFICATION APPROACHES PLEASE GO BACK TO JUPYTER LAB!
  20. 20. AGGREGATE ALGORITHMS • There is one final group of sentiment analysis approaches which has been gaining in popularity in recent years. • Aggregate algorithms are similar to classification algorithms in many ways (they need training data and function on pattern recognition), but different in one crucial way – they do not classify individual documents, but instead aim to give an accurate measurement of the distribution of classes in the overall corpus.
  21. 21. AGGREGATE ALGORITHMS • This has some serious advantages! Aggregate algorithms tend to be able to give accurate results with a much smaller amount of training data, for example. • Aggregate algorithms are also really good at handling data with a lot of “off-topic” texts. • Classification algorithms have a statistical problem with this data – when the “off-topic” category is very common, there is a bias towards misclassifying a lot of texts as off-topic. • But… You can’t see classifications for individual texts, so they’re not appropriate for every kind of research.
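The slides do not name a specific aggregate algorithm, so the sketch below swaps in a simple stand-in: "adjusted classify and count", which corrects a classifier's raw class share using its error rates measured on labelled data. It illustrates the goal of estimating the distribution of classes rather than labelling each text, and it is not necessarily the method used in the workshop notebook. The function name and the numbers are assumptions.

```python
def adjusted_count(observed_positive_rate, true_positive_rate, false_positive_rate):
    """Correct a classifier's raw positive share for its known error rates.

    observed = p * tpr + (1 - p) * fpr, solved for the true share p.
    """
    if true_positive_rate == false_positive_rate:
        raise ValueError("classifier carries no signal (tpr equals fpr)")
    estimate = (observed_positive_rate - false_positive_rate) / (
        true_positive_rate - false_positive_rate
    )
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


# Hypothetical numbers: the classifier flags 40% of the corpus as positive,
# and on held-out labelled data it has tpr = 0.8 and fpr = 0.2.
print(round(adjusted_count(0.40, 0.80, 0.20), 3))
```

The corrected share (about one third) differs noticeably from the raw 40%, which shows how individual misclassifications can bias a corpus-level estimate.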
  22. 22. AGGREGATE APPROACHES PLEASE GO BACK TO JUPYTER LAB!
  23. 23. PITFALLS AND WARNINGS • Clean your Data! Data accessed from the internet often includes a lot of texts you didn’t actually mean to analyse – check carefully to make sure your data isn’t full of bots reposting garbage, or posts about a totally different topic. • Read your Data! Don’t just take the results of any algorithm to be accurate – even if it agrees with your hypothesis. At some point you’re going to need to dive in and read samples of the data you’ve collected, to confirm that you’re really observing
  24. 24. WRAPPING UP • This workshop can really only introduce a few of the most commonly used approaches in sentiment analysis. This is a rapidly changing field and new algorithms and approaches are being developed all the time. • There are some approaches which require a lot more technical skill than the ones we looked at today – for example, creating your own sentiment dictionary and analyser that’s perfectly appropriate for your corpus of texts is possible, but difficult unless you’re a skilled programmer. • The approaches we looked at today are very mainstream and commonly used in a lot of academic studies – I hope they’ll be
  25. 25. THANK YOU! • Questions, ideas or feedback? • Email: robfahey@fuji.waseda.jp • Twitter: @robfahey • Website: robfahey.co.uk
