Automatic Emotion Identification from Text

Video: https://www.youtube.com/watch?v=ZCToaDgxnAs

Abstract:
People's emotions can be gleaned from their text using machine learning techniques to build models that exploit large self-labeled emotion data from social media. Further, the self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.

Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need to develop algorithms and techniques for identifying people's emotions expressed in text. This has valuable implications for the study of suicide prevention, employee productivity, subjective well-being, customer relationship management, etc. However, emotion identification is quite challenging, partly for the following reasons: i) It is a multi-class classification problem that usually involves at least six basic emotions. Text describing an event or situation that causes an emotion can be devoid of explicit emotion-bearing words, so the distinction between different emotions can be very subtle, which makes it difficult to glean emotions purely from keywords. ii) Manual annotation of emotion data by human experts is very labor-intensive and error-prone. iii) Existing labeled emotion datasets are relatively small and fail to provide comprehensive coverage of emotion-triggering events and situations.


  1. Ohio Center of Excellence in Knowledge-Enabled Computing. Automatic Emotion Identification from Text. Wenbo Wang, Kno.e.sis Center. Advisor: Dr. Amit P. Sheth. Committee members: Dr. Keke Chen, Kevin Haas, Dr. T.K. Prasad, Dr. Ramakanth Kavuluru. Ph.D. Dissertation Defense.
  2. Sadness, Anger, Fear, Joy. "Your emotions are the slaves to your thoughts, and you are the slave to your emotions." -- Elizabeth Gilbert
  3. Stock Market. "S&P 500 dropped 1% …" -- Jon C. Ogg (credit)
  4. Employee Productivity. (credit, credit)
  5. Subjective Well-being: Happiness Index, ECG; Physical State, Emotional State. (credit, credit)
  6. (image-only slide)
  7. (image-only slide)
  8. Emotion Identification. • Emotion: "a strong feeling (such as love, anger, joy, hate, or fear)" -- Merriam-Webster Online Dictionary. • Emotion Identification: the task of automatically identifying and extracting the emotions expressed in a given text. • Examples: "I hate when my mom compares me to my friends" -> Anger; "When I see a cop, no matter where I am or what I'm doing, I always feel like every law I've broken is stamped all over my body" -> Fear.
  9. Proposed Questions. • How to glean people's emotions from their texts using machine learning techniques? • How to create large self-labeled emotion data from social media? • How to improve emotion identification in target domains (e.g., blog, diary) by leveraging large self-labeled emotion data from social media?
  10. 1. EMOTION CLASSIFICATION. Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P. Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical Informatics Insights, 2012. Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter 'Big Data' for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing (SocialCom 2012).
  11. Background - Classification. (Credit: nltk)
  12. Dataset Description. • Suicide notes: 15 fine-grained emotions; training: 4,633 sentences; testing: 2,086 sentences. • Twitter data: 7 emotions; training: ~250K tweets; testing: 250K tweets.
  13. Suicide Notes Dataset. Sentence example: "I loved you and was proud of you." Unigrams: i, love, you, and, be, proud, of, you, . Bigrams: i love, love you, you and, and be, be proud, proud of, of you, you . The combination of unigrams and bigrams performs the best among n-gram features.
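The unigram and bigram features above can be extracted with a short sketch (`ngram_features` is my naming, not the dissertation's):

```python
def ngram_features(tokens, n_max=2):
    """Collect all n-grams, from unigrams up to n_max-grams, as strings."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

# The slide's lemmatized example sentence:
tokens = ["i", "love", "you", "and", "be", "proud", "of", "you", "."]
feats = ngram_features(tokens)  # 9 unigrams + 8 bigrams = 17 features
```

Each resulting string (e.g. "love you") becomes one binary or count feature for the classifier.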
  14. Suicide Notes Dataset. Sentence example: "I loved you and was proud of you." LIWC knowledge: Posemo: 2 (love, proud); Negemo: 0; Anger: 0; Sad: 0. Adding knowledge-based features further increases the performance.
  15. Suicide Notes Dataset. Sentence example: "I loved you and was proud of you." POS count: Adjective: 1 (proud); Noun: 0; Pronoun: 3 (i, you); … Sentence tense: simple past tense: 2 (I loved, was proud). Adding sentence tenses and POS counts further increases the performance.
  16. Twitter Dataset – Supervised Classifier. Applying only adjectives performs poorly because emotions can be expressed implicitly in text.
  17. Twitter Dataset – Supervised Classifier. The combination of unigrams and bigrams performs the best among n-gram features.
  18. Twitter Dataset – Supervised Classifier. Knowledge features and syntactic features become less important on Twitter data.
  19. Challenge: The Lack of Training Data. • Emotion annotation is typically time-consuming, expensive and error-prone: multiple emotion categories; subtle and ambiguous emotion expressions; human judgement of emotion tends to be subjective and varied. • Most existing datasets are small, e.g., Blog: 1,890 sentences (Aman and Szpakowicz 2008); Experience: 1,000 sentences (Neviarouskaya et al. 2010); Diary: 700 sentences (Neviarouskaya et al. 2011).
  20. Why do We Need More Training Data? (I) (Figure 1 from Banko and Brill 2001: learning curves for confusion set disambiguation; test accuracy of memory-based, Winnow, perceptron and naive Bayes learners keeps rising as the training corpus grows from 0.1 to 1,000 million words.) "We may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development" -- (Banko and Brill 2001)
  21. Why do We Need More Training Data? (II) • Emotions arise in various situations, which leads to very diverse expressions conveying them. "I hate when my mom compares me to my friends" "When I see a cop, no matter where I am or what I'm doing, I always feel like every law I've broken is stamped all over my body" "I hate when I get the hiccups in class" "Omg I finally fit into one pair of my jeans from last year!!" "A dog barked at me!"
  22. The Use of Hashtags on Twitter. "I hate when my mom compares me to my friends #annoying" "When I see a cop, no matter where I am or what I'm doing, I always feel like every law I've broken is stamped all over my body #nervous" "I hate when I get the hiccups in class #embarrassing" "Omg I finally fit into one pair of my jeans from last year!! #excited" "A dog barked at me! #scared #weak"
  23. 2. SELF-LABELED DATA CREATION. Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter 'Big Data' for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing (SocialCom 2012).
  24. Emotion Hashtags. From existing psychology literature (Shaver et al. 1987), collected 7 sets of emotion words for 7 different emotions: joy, sadness, anger, love, fear, thankfulness, and surprise.
     Joy: excited, happy, elated, proud (36 hashtag words) -- 706,182 tweets
     Sadness: sorrow, unhappy, depressing, lonely (36) -- 616,471 tweets
     Anger: irritating, annoyed, frustrate, fury (23) -- 574,170 tweets
     Love: affection, lovin, loving, fondness (7) -- 301,759 tweets
     Fear: fear, panic, fright, worry, scare (22) -- 135,154 tweets
     Thankfulness: thankfulness, thankful (2) -- 131,340 tweets
     Surprise: surprised, astonished, unexpected (5) -- 23,906 tweets
     Total: 131 hashtag words, 2,488,982 tweets
  25. Removing Irrelevant Tweets. Discard a tweet if: hashtag count > 2; the emotion hashtag is not at the end; word count < 5; it has a URL or quotations. About 5 million tweets -> 2,488,982 tweets.
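A minimal sketch of these filtering heuristics (function name and exact tokenization are mine; the paper's precise rules may differ in detail):

```python
import re

def keep_tweet(text):
    """Apply the slide's four removal heuristics to one raw tweet."""
    tokens = text.split()
    hashtags = [t for t in tokens if t.startswith("#")]
    if len(hashtags) > 2:                          # too many hashtags
        return False
    if not tokens or not tokens[-1].startswith("#"):
        return False                               # emotion hashtag must be last
    if len(tokens) - len(hashtags) < 5:            # too few content words
        return False
    if re.search(r'https?://|"', text):            # URLs / quotations
        return False
    return True

print(keep_tweet("A dog barked at me today #scared"))  # True
print(keep_tweet("so #happy"))                         # False
```

Tweets that survive the filter keep their trailing emotion hashtag as the self-label, which is then stripped from the text before training.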
  26. Results with Increasing Training Data. (Chart: accuracy vs. number of training tweets, from 1,000 to 1,991,184, for LIBLINEAR and MNB.) Logistic Regression (LR): accuracy grows from 0.4341 with 1K training instances to 0.6557 with 2M, a percentage gain of 51.05%.
  27. Results with Increasing Training Data. (Same chart.) Multinomial Naive Bayes (MNB): accuracy grows from 0.4580 with 1K training instances to 0.6350 with 2M, a percentage gain of 38.65%.
  28. Detailed Results. For three popular emotions (76.2% of the tweets), the classifier achieves F-measures of over 64%.
  29. Detailed Results. For three less popular emotions (22.8% of the tweets), the precisions are relatively higher compared with the recalls, and the F-measures are over 43%.
  30. What Have We Learned? • We can automatically create training datasets for emotion identification by leveraging emotion hashtags on Twitter: a large amount of labeled data is collected with little effort and cost; it covers a variety of situations that elicit emotions; performance gains with increasing size of training data. • However, there is still a lack of labeled data in many other domains/data sources.
  31. New Challenge. Lots of labeled tweets; far less labeled data in many other domains. Can we use emotion-labeled tweets to help emotion identification in other domains?
  32. 3. DOMAIN ADAPTATION FOR EMOTION IDENTIFICATION. Wenbo Wang, Lu Chen, Keke Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Domain Adaptation for Emotion Identification via Data Selection. Technical paper (under review), 2015.
  33. Problem Definition. • Input: a large amount of emotion-labeled tweets; a small amount of labeled sentences from target domains (e.g., blogs, fairy tales). • Objective: select informative tweets, add them to the target domain training data, and train an adaptive classifier for the target domain.
  34. The Bootstrapping Framework. Self-labeled tweets; target domain labeled data. (credit1, credit2, credit3) • Train classifier c. • Apply c to tweets.
  35. The Bootstrapping Framework. Target domain labeled data. (credit1, credit2, credit3) Correctly classified vs. misclassified. • Train classifier c. • Apply c to tweets. • Identify informative tweets from the misclassified tweets. • Add them to the target domain training data. Why select from misclassified tweets?
  36. Informativeness Overview: Consistency, Diversity, Similarity.
  37. Consistency. • Fear: "Amazing night with my baby. Hope she liked our anniversary present. Alil early but whatever. :) hopefully tmmrw goes as planned." – Top supporting features for emotion fear. – Top supporting features for any emotion other than fear. – Use the margin to estimate consistency: 0.5094 - 0.5962 = -0.0868. Consistency measures how consistent a tweet's label is with its content.
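The margin can be computed as in this sketch (the score dictionary and function name are illustrative; the slide does not spell out the underlying scoring function):

```python
def consistency(scores, label):
    """Margin between the score supporting the tweet's own hashtag label
    and the best score supporting any other emotion. Negative values
    signal a label that is inconsistent with the tweet's content."""
    best_other = max(s for emo, s in scores.items() if emo != label)
    return scores[label] - best_other

# Feature-support scores from the slide's example tweet labeled Fear:
scores = {"fear": 0.5094, "joy": 0.5962}
margin = consistency(scores, "fear")  # -0.0868: label inconsistent with content
```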
  38. Diversity. • Sadness: "Searching for vinyl proved to be quite disappointing" – "disappoint" occurs 2 times in target domain training data; diversity 0.9048. • Sadness: "I'm about to lose everything I've ever wanted, my whole world, and it's all my fault.." – "lose" occurs 15 times; diversity 0.4724. (Plot: diversity as an exponential decay of a term's frequency in the target domain training data.) Diversity encourages the selection of source instances containing discriminative features that are infrequent or underrepresented in the target domain.
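The two plotted values are consistent with an exponential decay exp(-λ·tf) with λ = 0.05; the decay rate is my inference from those two numbers, not something the slide states:

```python
import math

def diversity(term_freq, rate=0.05):
    """Exponential decay of a discriminative feature's frequency in the
    target training data (rate inferred from the slide's two examples)."""
    return math.exp(-rate * term_freq)

print(round(diversity(2), 4))   # "disappoint", tf=2  -> 0.9048
print(round(diversity(15), 4))  # "lose", tf=15       -> 0.4724
```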
  39. Similarity Intuition. • Inspired by domain adaptation for machine translation studies that select source instances similar to test instances (Eck et al., 2004; Lu et al., 2007). • Given a target test sentence, Disgust: "im sick of look at a comput screen.", retrieve the most similar tweets: Anger: "im sick and tire of look like a fool"; Joy: "i have get usb fairi light around my comput screen ." Content similarity is not sufficient!
  40. Similarity Overview: content similarity, label similarity, uncertainty.
  41. Content Similarity. • Upweight important words using inverse document frequency (idf). – Source instance: … – Target test instance: … (the formulas appear as images on the slide).
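One common way to realize idf-upweighted content similarity is an idf-weighted cosine over token counts; the slide's exact formula is not in the extracted text, so treat this as an illustrative stand-in:

```python
import math
from collections import Counter

def idf_table(docs):
    """idf(w) = log(N / df(w)) over a tokenized corpus."""
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log(len(docs) / c) for w, c in df.items()}

def content_similarity(a, b, idf):
    """Cosine similarity between idf-weighted count vectors of two
    token lists; words shared by every document contribute nothing."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] * idf.get(w, 0.0) ** 2 for w in va)
    na = math.sqrt(sum((c * idf.get(w, 0.0)) ** 2 for w, c in va.items()))
    nb = math.sqrt(sum((c * idf.get(w, 0.0)) ** 2 for w, c in vb.items()))
    return dot / (na * nb) if na and nb else 0.0

# The stemmed examples from slide 39:
docs = ["im sick of look at a comput screen".split(),
        "im sick and tire of look like a fool".split(),
        "i have get usb fairi light around my comput screen".split()]
idf = idf_table(docs)
sim = content_similarity(docs[0], docs[1], idf)  # shared rarer words raise sim
```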
  42. Label Similarity. • Target test sentence, Disgust: "im sick of look at a comput screen." • Source tweet, Anger: "im sick and tire of look like a fool". • How likely is the test sentence to express anger? Apply the same margin formula used for the Consistency factor: top supporting features for emotion anger vs. top supporting features for any emotion other than anger; 0.5838 - 0.625 = -0.0412.
  43. Uncertainty. The more confident the classifier is, the more likely the prediction is correct, and the less focus we should give to that sentence. Sentence: "the second day i go in and i be so paranoid ."; label: Fear; predicted label: Sadness; classifier confidence: 0.2352; uncertainty: 0.7648. Sentence: "we are total awesome!"; label: Joy; predicted label: Joy; classifier confidence: 0.8683; uncertainty: 0.1317.
  44. Similarity Revisit. • Encourage the selection of source instances that share high content and label similarities with the target domain test instances that classifier c is most uncertain about.
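Putting the three ingredients together, a sketch of the combined Similarity factor; combining by product and aggregating by max over test instances are my assumptions from the slide's diagram, not the paper's stated formula:

```python
def similarity_factor(per_test_scores):
    """per_test_scores: one (content_sim, label_sim, uncertainty) tuple
    per target test instance. A source tweet scores high when it matches
    an *uncertain* test instance in both content and label."""
    return max(c * l * u for c, l, u in per_test_scores)

# Toy example: the second test instance is uncertain and well matched,
# so it dominates the score.
score = similarity_factor([(0.2, 0.1, 0.13), (0.8, 0.6, 0.76)])
```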
  45. Informativeness Revisit. • A tweet is informative when 1) its label is consistent with its content, AND 2) it contains a discriminative feature that is infrequent in the target training data, AND 3) it is similar to a target domain test instance whose label cannot be predicted by the classifier c with high confidence. Consistency, Diversity, Similarity: our proposed approach, CDS.
  46. Baseline approaches. • Source Only (SO): train classifiers using only Twitter (source) data. • Target Only (TO): train classifiers using only target domain training data. • Feature Injection (FI): first train a source classifier using only source data (Daume III, 2007). • Feature Augmentation (FA) (Daume III, 2007): source instances X -> XX0 (common, source, target); target instances X -> X0X (common, source, target). • Balance Weight (BW): assign larger weights to the target instances so that the weight sum of target instances equals that of source instances (Jiang and Zhai, 2007).
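The FA baseline's feature mapping can be sketched directly; the prefix names are mine:

```python
def augment(features, domain):
    """Daume III (2007) 'frustratingly easy' feature augmentation: every
    feature is duplicated into a shared copy and a domain-specific copy,
    so a source instance becomes (common, source, 0) and a target
    instance becomes (common, 0, target)."""
    assert domain in ("source", "target")
    out = {}
    for name, value in features.items():
        out["common:" + name] = value
        out[domain + ":" + name] = value
    return out

src = augment({"sick": 1, "tire": 1}, "source")
tgt = augment({"sick": 1}, "target")
# "common:sick" is shared across domains; "source:sick" and
# "target:sick" let the learner keep domain-specific weights separate.
```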
  48. Experimental settings. • Features: experimented with unigrams, bigrams, and unigrams+bigrams; applied unigrams in the end. • Logistic regression: fast, supports probability output (for uncertainty). • Five-fold cross validation: four folds for training, one fold for testing. • Add-0.5 smoothing.
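The add-0.5 smoothing mentioned above is Lidstone smoothing with α = 0.5; a sketch for a count-based class-conditional word probability (where exactly the smoothing is applied in the pipeline is my assumption):

```python
def smoothed_prob(count, total, vocab_size, alpha=0.5):
    """P(word | class) with add-alpha (Lidstone) smoothing, so unseen
    words keep nonzero probability mass."""
    return (count + alpha) / (total + alpha * vocab_size)

p_unseen = smoothed_prob(0, 100, 1000)   # > 0 despite a zero count
p_seen = smoothed_prob(10, 100, 1000)
```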
  49. Results on four target datasets*. Percentage gains: 8.01%, 24.07%, 36.53%, 3.62%, 16.45%. *: The numbers are different from those in the dissertation defense video because I fixed a bug after that; results improved slightly as a result.
  50. Different Instance Selection Strategies. • CDS: select tweets from misclassified tweets. • CD: the similarity factor removed from CDS. • CDS-ALL: select tweets from all source tweets. • CDS-CORR: select tweets from source tweets that can be correctly classified by c.
  51. Comparing instance selection strategies. Among all the strategies, CDS improves F1 the fastest.
  52. Comparing instance selection strategies. CDS-ALL achieves a similar performance to CDS but takes more iterations, because the input of CDS-ALL is a superset of that of CDS.
  53. Comparing instance selection strategies. CDS-CORR performs the worst because it selects tweets from correctly classified tweets, whose knowledge might already exist in the target domain.
  54. Summary. • People's emotions can be gleaned from their texts using machine learning techniques: the combination of n-grams (n=1,2), knowledge-based and syntactic features achieves the best performance; knowledge features and syntactic features become less important on large training data. • We can automatically create a large training dataset for emotion identification by leveraging emotion hashtags on Twitter: a large amount of labeled data is collected with little effort and cost; it covers a variety of situations that elicit emotions; performance gains with increasing size of training data. • This self-labeled emotion dataset can be used to improve emotion identification in text from other domains/data sources: domain adaptation via selecting tweets that are informative to the target domain; it is superior to select source instances that cannot be correctly classified; the informativeness of a source instance is measured by three factors: consistency, diversity and similarity.
  55. Publications. • Wenbo Wang, Lei Duan, Anirudh Koul, Amit P. Sheth. YouRank: Let User Engagement Rank Microblog Search Results. In the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM'14), 2014. • Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Cursing in English on Twitter. In ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW'14), 2014. • Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Lu Chen, Hemant Purohit, Gary Alan Smith, and Wenbo Wang. "Twitris: A system for collective social intelligence." In Encyclopedia of Social Network Analysis and Mining, pp. 2240-2253. Springer New York, 2014. • Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012 U.S. Republican Presidential Primaries. In Proceedings of the Fourth International Conference on Social Informatics (SocInfo'12), 2012. • Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter 'Big Data' for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing (SocialCom 2012), 2012. • Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit P. Sheth. Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012.
  56. Publications. • Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P. Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical Informatics Insights, 2012. • Ramakanth Kavuluru, Christopher Thomas, Amit Sheth, Victor Chan, Wenbo Wang, Alan Smith. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd ACM SIGHIT Intl Health Informatics Symposium, January 28-30, 2012. • Wenbo Wang, Christopher Thomas, Amit Sheth, Victor Chan. Pattern-Based Synonym and Antonym Extraction. 48th ACM Southeast Conference, ACMSE2010, Oxford, Mississippi, April 15-17, 2010. • Christopher J. Thomas, Wenbo Wang, Pankaj Mehra, Delroy Cameron, Pablo N. Mendes, and Amit P. Sheth. What Goes Around Comes Around -- Improving Linked Open Data through On-Demand Model Creation. In Proceedings of WebSci10: Extending the Frontiers of Society On-Line, April 26-27, 2010, Raleigh, NC, US. • Ashutosh Jadhav, Wenbo Wang, Raghava Mutharaju, Pramod Anantharam, Vinh Nguyen, Amit P. Sheth, Karthik Gomadam, Meenakshi Nagarajan, and Ajith Ranabahu. Twitris: Socially Influenced Browsing. Semantic Web Challenge 2009, demo at the 8th International Semantic Web Conference, Oct. 25-29, 2009, Washington, DC, USA.
  57. Patents & Proposal. • Wenbo Wang, Lei Duan. "Temporal User Engagement Features", U.S. Patent No. 20,150,120,753. 30 Apr. 2015. • Lu Chen, Wenbo Wang, Amit Sheth. "Topic-specific Sentiment Extraction", U.S. Patent No. 20,140,358,523. 4 Dec. 2014. • Context-Aware Harassment Detection on Social Media. NSF proposal.
  58. Special thanks to AFRL and NSF. (credit, credit) *Part of this material is based upon work supported by the National Science Foundation under Grant IIS-1111182 ``SoCS: Collaborative Research: Social Media Enhanced Organizational Sensemaking in Emergency Response.''
  59. Thank You! & Questions?
