Sentiment analysis:
Incremental learning to build domain-models
Raimon Bosch (@raimonbosch)
TALN, DTIC, UPF
Are we really in control of our decisions? If we are in a supermarket in front of two equally cheap products, which of them do we choose? Usually we choose the product that gives us the better feeling. But those sensations are built by advertising campaigns that can hide bad practices towards workers, clients and governments.

Aggregating opinions is a very powerful idea because it allows people to make more informed decisions about what they do, and to change the world a little with those small decisions.

Read the thesis here >> http://bit.ly/1nlyE68

Speaker notes
  • I want to start this presentation with a little bit of thinking. I want you to read this quote and think about it for a few seconds. Is this really true? If, for instance, we are in a supermarket and we have to choose between two products with similar prices, normally we buy from the brand that gives us the better feeling. And this feeling is connected with its advertising campaign and its power to create this good feeling. But is this good feeling real? Behind a nice and inspiring ad there could be thousands of equally important reasons not to buy this product: other values such as how well this company interacts with its workers, how well it interacts with its clients, or how many unresolved complaints it has. Sentiment analysis can give us access to this information. The aggregation of opinions is a way of giving people the power to make more informed decisions, because they can analyze what kind of opinions other users have about a brand and whether it is worth buying from them. I also see sentiment analysis as a way of creating real change: if we buy from brands with better social values, we will be able to evolve into a better society.
  • Bing Liu gives a very good definition of sentiment analysis. He defines it as a quintuple with 5 fields: an object or main topic; the object aspects to which the opinion refers; a set of opinion orientations, which can be positive or negative with a certain degree of intensity; and finally an opinion holder and a specific time.
  • So we can see some examples of these quintuples here. We could have an opinion about easyjet that considers the baggage too expensive, another one about a house-renting company that says they are horrible people, and maybe some positive opinions, like this one about jazztel that says there are no problems. We can see how each opinion is defined with a different degree of intensity.
  • But what can we do with these tuples of information? What if we aggregate all of them in one place? What if we have one place where, in seconds, we can know how a brand treats its clients? This idea is very powerful because it would be a way to force companies to be more human and respond to certain values if they want to survive. Informed citizens are smart citizens.
  • One of the main techniques is Pointwise Mutual Information. This method consists of using the World Wide Web as a database. Basically, if we want to know whether a phrase is positive or negative, we take the first N results for this phrase in a search engine and calculate how many co-occurrences we have in positive contexts and how many in negative contexts. Depending on that, we can guess the orientation of the phrase.
  • Another state-of-the-art technique is sentiment dictionaries: databases of words where each word has a positive and a negative score. We can use this information in our programs to build a sentiment score for any text. One of the main ones is SentiWordNet, which has a pos score, a neg score, a short gloss to explain the meaning, and the specific words affected by this meaning. Obviously the same word can have different meanings; this is why we use the hash and number at the end of each word.
  • But what do we do when texts come from different domains? The negative words in the domain of airlines are not the same as in the domain of politics. Can we build cross-domain solutions? Pan proposed a solution for that, basically dividing words into two groups: on the left we have domain-independent words and on the right domain-dependent words. As you will see, this organization of information creates little groups, such as never_buy with blurry and boring. With a system like that we can detect new domain-dependent opinion words by checking their co-occurrence with words on the left side.
  • And yes, sentiment analysis has been used to predict. In 2008 it was used during Obama's election process; it has also been used in Germany and to predict the stock market. It is possible to find indicators that anticipate the tendencies seen in polls, so Twitter allows us to see tendencies in real time. Sometimes, to find these tendencies, we need to work in sentiment spaces of a different dimensionality than positive/negative, such as calm/anger/joy/happiness/...
  • These slides explain what is not very developed yet in the state of the art of sentiment analysis. Basically, we saw that N-grams are heavily exploited to detect opinions, but there is no exploitation of combinations of N-grams as new units; finding correlations between similar N-grams is a very interesting line of investigation. Another thing not yet explored is treating opinions as a problem: what if we want to read a newspaper without the writer's opinion? What if we only want to read the facts, the data? Sentiment analysis has not been much exploited to remove opinions from texts, which I think would be interesting in some cases.
  • After reviewing the material we chose this architecture. Basically, we download some tweets from the Twitter API (about any topic that interests us), we merge this information with a dictionary through Hadoop so we get a score for each tweet depending on how many words of the dictionary appear inside it, and finally we can show this information in a Rails interface and compare different topics, create statistics, and so on. At the same time, we can improve the system by performing annotations and little corrections on tweets. Those corrections are reused to add new words to the dictionary and improve the tweet score, and the annotations can also be used to create Weka models that help produce the statistics we want to show in the interface.
  • We chose Ruby because of its simplicity: we do not need to compile, we do not need to deploy, and maintenance is simple. At the same time we can use Java when needed, with JRuby or the Hadoop Streaming library. Hadoop allows us to perform this grouping between tweets and the dictionary without wasting memory, so all this "GROUP BY" can be done on disk (writing sequentially). In a single-machine version we would need to keep all tweets and the dictionary in memory and check them there. What if we have 10 million tweets? Will they fit in memory?
  • We work with three hypotheses here. 1/ That it is possible to create groups of N-grams called sentigrams: groups that indicate whether a tweet is positive or negative and that refer to a specific aspect. 2/ That the system allows incremental learning and improves the tweet score in each iteration. 3/ That we can learn sentigrams as the number of iterations increases, and at a certain point we will be able to detect whether a tweet is positive or negative, and why.
  • Read text. As we can see there, we mark the aspects in black and the sentiwords in red. So we have that ryanair is a nightmare, and that it is ridiculous to pay extra for baggage. Those two sentigrams tell us that the message is basically negative.
  • After that we have to mark the opinion orientation independently of the score given by our system (which could be wrong). Here we have a positive message that says that these two airlines are always on time, so we mark it as good; the negative message we mark as negative.
  • The second hypothesis is about the idea of "incremental learning". This was needed because the original dictionary had an accuracy below 50%. To fix that we use a random-walk algorithm to rebalance the scores of the words.
  • Third hypothesis: automatization of sentigram detection. As we will see, this is a multiclass problem, because we have to choose between several strings. Working with text is not like working with numbers; it is different.
  • To solve this we transform the multiclass problem into a binary problem: we create 4 partial observations, one for each position of the text (first, second, third and fourth). We transform words into numbers through hash codes, and then we determine whether a word is an aspect, a sentiword, or not relevant by assigning one of three codes (0, 1, 2). This idea is similar to the Viterbi algorithm, which works with partial observations to guess the next state.
  • We are currently investigating other techniques such as dependency parsing: we want to see if providing a surface structure can help to classify those sentigrams. We are still working on it. So far the ML approach is giving us better results: 85% (94% if we focus individually on aspects or sentiwords).
  • The original dictionary was not adapted to our domain and we needed to perform a random walk, so we designed a screen to perform interactive corrections.
  • In this third iteration, which is not finished yet, we are working on sentigram identification through machine learning and dependency parsing. Our accuracy right now is 85%.
  • Read text.

    1. Sentiment analysis: Incremental learning to build domain-models Raimon Bosch (@raimonbosch) TALN, DTIC, UPF
    2. What is sentiment analysis? [Liu, 2010] proposes a quintuple (oj, fjk, ooijkl, hi, tj): turning unstructured text into structured data.
       oj: Object
       fjk: Object features (Aspect)
       ooijkl: Opinion orientations (positive/negative), (calm/anger/joy/happiness), intensity, ...
       hi: Opinion holder
       tj: Time frame
       (A Ruby sketch of this quintuple appears after the slides.)
    3. What is sentiment analysis? (oj, fjk, ooijkl, hi, tj) examples:
       ("easyjet", "baggage", "too expensive" => -5, "John", "01-07-2013")
       ("rentaz", "house rent", "horrible people" => -10, "John", "02-07-2013")
       ...
       ("jazztel", "internet", "no problems" => +4, "John", "03-07-2013")
    4. State-of-the-art - Twitter as a corpus [Pak and Paroubek, 2010]: Text-classification problem. Features for machine learning techniques: emoticons :), N-grams, negations, POS-tagging, syntax, Twitter-specific features.
    5. State-of-the-art - Pointwise Mutual Information [Su and Xiang, 2006]: We can estimate the probability that certain words in a phrase are positive or negative depending on their co-occurrences on the WWW. (A PMI sketch appears after the slides.)
    6. State-of-the-art - Sentiment dictionaries: SentiWordNet [Baccianella and Esuli, 2010]. Positive score and negative score for each meaning (#N), calculated with a random-walk algorithm. (A dictionary-scoring sketch appears after the slides.)
    7. State-of-the-art - Cross-domain models [Pan, 2010]: Bipartite graph.
    8. State-of-the-art - Twitter prediction [O’Connor, 2010]: Correlation between tweets and polls. Real-time information.
    9. Not developed in state-of-the-art: structured N-grams (most of the work is done with plain N-grams); buzz detection; aspect identification is not a main focus.
    10. Technology stack
    11. Technology stack - Simplicity: Ruby. - Integration with Java (JRuby, Hadoop Streaming). - Big Data ready: Hadoop. (A Hadoop Streaming sketch appears after the slides.)
    12. Hypothesis H1: We can create groups of N-grams that specifically influence one aspect in a negative or a positive orientation. This is what we call sentigrams. H2: By using incremental learning the system improves in each iteration. User interaction increases precision. H3: After a certain number of iterations is reached we can assign sentigrams to a tweet automatically.
    13. Hypothesis (H1) - Sentigrams We define as sentigram the relation between sentiwords and aspects that defines whether a tweet is positive or negative. - A sentigram is an evolution of the N-gram; it can be considered a structured N-gram. - Detect aspects and sentiwords inside a text.
    14. Hypothesis (H1) - Sentigrams - Mark opinion orientations: not only whether they are positive or negative, but also which aspect they are referring to.
    15. Hypothesis (H2) - Incremental learning By using incremental learning the system improves in each iteration, increasing precision. - The original SentiWordNet version was not very adapted to our domain. - We include new sentiwords from annotations in our dictionary with scores (pos_score: 0, neg_score: 0). - A random walk updates word scores until accuracy converges. (A rebalancing sketch appears after the slides.)
    16. Hypothesis (H3) - Automatization After a certain number of iterations is reached we can assign sentigrams to a tweet automatically, without manual intervention. - Multiclass problem!! Each tweet has several words to guess. Text-classification problem!!
    17. Hypothesis (H3) - ML - Convert a multiclass problem into a binary problem (i.e. "ryanair is a joke"):
        0,801829636,-545403680,1561023766,2119008529,1
        1,801829636,-545403680,1561023766,2119008529,0
        2,801829636,-545403680,1561023766,2119008529,0
        3,801829636,-545403680,1561023766,2119008529,2
        - Focus the problem by position (0..N): N partial observations from each tweet. - Numerical codes for words. Three classes available {0,1,2}. (A Ruby sketch of this encoding appears after the slides.)
    18. Hypothesis (H3) - Dependency parsing - Mate Tools:
        1 ryanair _ ryanair _ NN _ _ -1 2 _ SBJ _ _
        2 is _ be _ VBZ _ _ -1 0 _ ROOT _ _
        3 a _ a _ DT _ _ -1 4 _ NMOD _ _
        4 joke _ joke _ NN _ _ -1 2 _ PRD _ _
        - Still noisy. Work in progress. - ML approach: accuracy is 85% against our gold standard. Focusing only on aspects we can get 94% accuracy. (A parsing sketch appears after the slides.)
    19. Conclusions - The SentiWordNet version was not very adapted to our domain: accuracy 47%. Random walk necessary. - Design of an interface to perform interactive annotations. Semi-supervised approach. - With words from annotations, pos and neg scores are changed randomly until accuracy is optimized and convergence is reached. Accuracy 89%.
    20. Conclusions - Focus on aspect identification. Not only +/-: we detect what the user is complaining about. - Convert a multiclass problem into a binary problem. Divide & conquer!! - Machine learning & dependency parsing of tweets to detect patterns. Accuracy 85%.
    21. What's next? - Finish integration with dependency parsing. - Data visualization: comparison between several topics, positive aspects and negative aspects of each topic. - Train the system for several domains: airlines, politics, TV, telecommunications, etc.
    22. Thanks! Questions?
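
Code sketches

The sketches below illustrate, in Ruby (the project's own language), the techniques named in the slides. They are minimal readings of the slides, not the project's actual code; every name, record format and sample value that does not appear in the slides is an invented assumption.

Slide 2 defines [Liu, 2010]'s opinion quintuple and slide 3 gives examples. A minimal sketch of the quintuple as a Ruby Struct, loaded with those examples (the `Opinion` class name and the aggregation step are ours):

```ruby
# Liu's quintuple (oj, fjk, ooijkl, hi, tj) as a plain Struct.
Opinion = Struct.new(:object, :aspect, :orientation, :intensity, :holder, :time)

opinions = [
  Opinion.new("easyjet", "baggage",    "too expensive",   -5,  "John", "01-07-2013"),
  Opinion.new("rentaz",  "house rent", "horrible people", -10, "John", "02-07-2013"),
  Opinion.new("jazztel", "internet",   "no problems",      4,  "John", "03-07-2013")
]

# Aggregating intensities per object gives a rough per-brand score.
scores = opinions.group_by(&:object)
                 .transform_values { |ops| ops.sum(&:intensity) }
puts scores  # {"easyjet"=>-5, "rentaz"=>-10, "jazztel"=>4}
```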
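Slide 5 describes orientation by Pointwise Mutual Information over web co-occurrences. A sketch of the classic SO-PMI formulation (compare co-occurrence with a positive seed word against a negative one); the `hits` lambda and all counts are invented stand-ins for real search-engine hit counts:

```ruby
# so_pmi > 0 suggests a positive phrase, < 0 a negative one.
def so_pmi(phrase, hits)
  near_pos = hits.call("#{phrase} excellent").to_f
  near_neg = hits.call("#{phrase} poor").to_f
  Math.log2((near_pos * hits.call("poor")) /
            (near_neg * hits.call("excellent")))
end

counts = {
  "too expensive excellent" => 1_000,
  "too expensive poor"      => 9_000,
  "excellent"               => 5_000_000,
  "poor"                    => 4_000_000
}
hits = ->(q) { counts.fetch(q, 1) }
puts so_pmi("too expensive", hits)  # about -3.5: the phrase leans negative
```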
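Slide 6 describes SentiWordNet-style dictionaries with a positive and a negative score per meaning. A sketch of scoring a tweet against such a dictionary; the entries, their scores and the naive tokenizer are simplifications:

```ruby
# Each entry carries a pos and a neg score, as in SentiWordNet.
DICTIONARY = {
  "nightmare"  => { pos: 0.0,  neg: 0.875 },
  "ridiculous" => { pos: 0.0,  neg: 0.625 },
  "great"      => { pos: 0.75, neg: 0.0   }
}.freeze

# Sum (pos - neg) over the words we know; unknown words score 0.
def tweet_score(text)
  text.downcase.scan(/[a-z']+/).sum do |word|
    entry = DICTIONARY[word]
    entry ? entry[:pos] - entry[:neg] : 0.0
  end
end

puts tweet_score("Ryanair is a nightmare, ridiculous baggage fees")  # -1.5
```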
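Slide 11 and the notes describe joining tweets with the dictionary through Hadoop Streaming so the "GROUP BY" happens on disk. A sketch of a reduce-side join as two streaming scripts; the record formats ("word<TAB>score" for the dictionary, "tweet_id<TAB>text" for tweets) are assumptions:

```ruby
#!/usr/bin/env ruby
# map.rb -- tag each record and key it by word.
STDIN.each_line do |line|
  left, right = line.chomp.split("\t", 2)
  next if right.nil?
  if right =~ /\A-?\d+(\.\d+)?\z/                 # dictionary record
    puts "#{left}\tD\t#{right}"
  else                                            # tweet: one key per word
    right.downcase.scan(/[a-z']+/).each { |w| puts "#{w}\tT\t#{left}" }
  end
end
```

```ruby
#!/usr/bin/env ruby
# reduce.rb -- Streaming delivers lines grouped and sorted by word, so
# the join needs only one word's group in memory at a time. We emit one
# partial score per (tweet, word) pair; a second pass sums them per tweet.
score, tweets, prev = nil, [], nil

flush = lambda do
  tweets.each { |id| puts "#{id}\t#{score}" } unless score.nil?
  score, tweets = nil, []
end

STDIN.each_line do |line|
  word, tag, value = line.chomp.split("\t", 3)
  if word != prev
    flush.call
    prev = word
  end
  tag == "D" ? (score = value.to_f) : (tweets << value)
end
flush.call
```

This is what the notes mean by doing the "GROUP BY" sequentially on disk: neither script ever holds all 10 million tweets in memory at once.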
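Slides 15 and 19 say the pos and neg scores are changed randomly until accuracy converges. A hill-climbing sketch of that rebalancing loop: perturb one score at random and keep the change only if accuracy against the annotated tweets does not drop. The `accuracy` definition, step size and iteration budget are our guesses:

```ruby
# Fraction of annotated tweets whose score sign matches the label.
def accuracy(dictionary, annotated)
  correct = annotated.count do |text, label|
    score = text.downcase.scan(/[a-z']+/).sum do |w|
      e = dictionary[w]
      e ? e[:pos] - e[:neg] : 0.0
    end
    (score >= 0 ? :positive : :negative) == label
  end
  correct.to_f / annotated.size
end

# Random walk over the scores, keeping only non-worsening steps.
def rebalance!(dictionary, annotated, iterations: 10_000)
  best = accuracy(dictionary, annotated)
  iterations.times do
    word  = dictionary.keys.sample
    field = %i[pos neg].sample
    old   = dictionary[word][field]
    dictionary[word][field] = (old + (rand - 0.5) * 0.25).clamp(0.0, 1.0)
    acc = accuracy(dictionary, annotated)
    if acc >= best
      best = acc                        # keep the change
    else
      dictionary[word][field] = old     # revert
    end
  end
  best
end
```

New sentiwords from annotations would enter the dictionary at (pos: 0, neg: 0), as slide 15 says, and this loop drifts them toward useful scores.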
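Slide 17 turns the multiclass "which word plays which role" problem into per-position rows with word codes and a class in {0, 1, 2}. A sketch of that encoding for the slide's example "ryanair is a joke"; the Java-style String.hashCode is only a guess at how the slide's numeric codes were produced:

```ruby
# Stable 32-bit signed code per word (Java String.hashCode algorithm).
def word_code(str)
  h = str.each_char.reduce(0) { |acc, c| (acc * 31 + c.ord) & 0xFFFFFFFF }
  h >= 0x80000000 ? h - 0x100000000 : h
end

words  = %w[ryanair is a joke]
labels = [1, 0, 0, 2]                 # 1 = aspect, 2 = sentiword, 0 = other
codes  = words.map { |w| word_code(w) }

# One partial observation per position, as on slide 17:
words.each_index do |i|
  puts ([i] + codes + [labels[i]]).join(",")
end
# 0,<ryanair>,<is>,<a>,<joke>,1
# 1,<ryanair>,<is>,<a>,<joke>,0  ... and so on
```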
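Slide 18 shows Mate Tools output in CoNLL09 layout, where the predicted lemma, POS, head and relation sit in the "p-" columns. A sketch of parsing that output and walking the tree toward aspect/sentiword pairs (here, the SBJ and PRD children of the ROOT):

```ruby
Token = Struct.new(:id, :form, :lemma, :pos, :head, :deprel)

# Keep only the predicted columns we need (plemma, ppos, phead, pdeprel).
def parse_conll09(text)
  text.each_line.reject { |l| l.strip.empty? }.map do |line|
    c = line.split
    Token.new(c[0].to_i, c[1], c[3], c[5], c[9].to_i, c[11])
  end
end

sentence = parse_conll09(<<~CONLL)
  1 ryanair _ ryanair _ NN _ _ -1 2 _ SBJ _ _
  2 is _ be _ VBZ _ _ -1 0 _ ROOT _ _
  3 a _ a _ DT _ _ -1 4 _ NMOD _ _
  4 joke _ joke _ NN _ _ -1 2 _ PRD _ _
CONLL

root = sentence.find { |t| t.deprel == "ROOT" }
p sentence.select { |t| t.head == root.id }.map(&:form)
# => ["ryanair", "joke"]  -- candidate aspect and sentiword
```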
