Evaluations on Aspect Discovery – Results
45
The results showed the advantage of combining feature and aspect discovery
ov...
Influence of Parameters
46
Based on the experiments on three domains, the best results can be achieved when
distance upper...
Sample Output
47
3. Harnessing Public Opinion on Twitter to Predict Election Results
48
Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Use...
Overview
49
Figure: each tweet record (Tweet ID, orig. twe...) is annotated with the candidate it mentions (e.g., candidate: XXX), its opinion (e.g., positive), and the author's user category (e.g., right-leaning, high engagement, opinion prone).
Contributions
• We introduce a new method to predict the election results that:
‒ identifies which candidate is mentioned,...
Findings
51
Revealing the challenge of identifying the opinion of “silent majority”
Retweets may not necessarily reflect u...
4. Religion and Subjective Well-being
52
Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twit...
53
Figure: each user (user ID, tweets, network) is mapped to the user’s religious belief (e.g., Buddhism), a happiness_level ℎ 𝑎𝑣𝑔 𝑢𝑠𝑒𝑟, and a topic_preference 𝑝 𝑡𝑜𝑝𝑖𝑐 𝑢...
Contributions
• We provide a fresh perspective about happiness and religion,
complementing traditional survey-based studie...
Findings
• There is a significant difference among the seven groups (atheist,
Buddhist, Christian, Hindu, Jew, Muslim, and...
Conclusion
• This dissertation presents a unified framework that characterizes a
subjective experience, such as sentiment,...
Future Directions
57
1. Detecting different types of subjectivity in text
2. Beyond sentiment and opinion
3. Towards ...
Publications
• Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of Aspects a...
Media Coverage (1)
59
Washington Post Washington Times La Croix
MIT Technology Review Time
Media Coverage (2)
60
Fast Company RAPPLER BuzzFeed
The Times of India
Huffington Post
Media Coverage (3)
61
IN Gizmodo RNS
NDTV World Religion News
Acknowledgement
62
Prof. Amit Sheth
(Advisor)
Dr. Ingmar Weber
(QCRI)
Prof. T.K. Prasad Dr. Justin Martineau
(SRA)
Prof. Ke...
63
Acknowledgement
This dissertation is based upon work supported by the National
Science Foundation under Grant:
• IIS-11111...
Mining and Analyzing Subjective Experiences in User-generated Content


Dissertation Defense:

" Mining and Analyzing Subjective Experiences in User Generated Content "
By Lu Chen
Tuesday, April 9, 2016

Dissertation Committee: Dr. Amit Sheth, Advisor, Dr. T. K. Prasad, Dr. Keke Chen, Dr. Ingmar Weber, and Dr. Justin Martineau

Pictures: https://www.facebook.com/Kno.e.sis/photos/?tab=album&album_id=1225911137443732
Video: https://youtu.be/tzLEUB-hggQ
Lu's Home page: http://knoesis.wright.edu/researchers/luchen/

ABSTRACT

Web 2.0 and social media enable people to create, share and discover information instantly anywhere, anytime. A great amount of this information is subjective information -- the information about people's subjective experiences, ranging from feelings of what is happening in our daily lives to opinions on a wide variety of topics. Subjective information is useful to individuals, businesses, and government agencies to support decision making in areas such as product purchase, marketing strategy, and policy making. However, because much useful subjective information is buried in the ever-growing user generated data on social media platforms, it is still difficult to extract high-quality subjective information and make full use of it with current technologies.

Current subjectivity and sentiment analysis research has largely focused on classifying text polarity -- whether the expressed opinion regarding a specific topic in a given text is positive, negative, or neutral. This narrow definition does not take into account other types of subjective information, such as emotion, intent, and preference, which may prevent their exploitation from reaching its full potential. This dissertation extends the definition and introduces a unified framework for mining and analyzing diverse types of subjective information. We have identified four components of a subjective experience: an individual who holds it, a target that elicits it (e.g., a movie, or an event), a set of expressions that describe it (e.g., "excellent", "exciting"), and a classification or assessment that characterizes it (e.g., positive vs. negative). Accordingly, this dissertation makes contributions in developing novel and general techniques for the tasks of identifying and extracting these components.

We first explore the task of extracting sentiment expressions from social media posts. We propose an optimization-based approach that extracts a diverse set of sentiment-bearing expressions, including formal and slang words/phrases, for a given target from an unlabeled corpus. Instead of associating the overall sentiment with a given text, this method assesses the more fine-grained target-dependent polarity of each sentiment expression. Unlike pattern-based approaches, which often fail to capture the diversity of sentiment expressions due to the informal nature of language usage and writing style in social media posts, the proposed approach is capable of identifying sentiment phrase


Mining and Analyzing Subjective Experiences in User-generated Content

  1. 1. Lu Chen Kno.e.sis Center Ph.D. Dissertation Defense Advisor: Prof. Amit P. Sheth Committee members: Prof. T.K. Prasad Prof. Keke Chen Dr. Ingmar Weber (QCRI) Dr. Justin Martineau (SRA) Ohio Center of Excellence in Knowledge-Enabled Computing Mining and Analyzing Subjective Experiences in User Generated Content
  2. 2. Subjective Experience – What We Experience in Our Mind Hunger Love Happiness Surprise Embarrassment Like Dislike Confused Pain Tired Stressed Nervous Relaxed Warm Proud Confident Taste of ice cream Feeling about sky Perception of time Appreciation of music Opinion on climate change InterestSource: http://bit.ly/1DvofHX 2 Music preference Purchase intent
  3. 3. Subjective Information – The Information about People’s Subjective Experiences Source: http://bit.ly/1GDD9Mb Source: http://bit.ly/1KkJF2l Source: http://bit.ly/1IjjBSX Source: http://bit.ly/1KkK1Gc The traditional way of collecting subjective information: 3
  4. 4. User Generated Content • New opportunities arise as we now can obtain a wide variety of subjective information from user generated content. 4
  5. 5. The Demand of Subjective Information • Subjective information can be used to support better decision- making. 5 Source: http://twitris2.knoesis.org/debate Predicting election results Source: http://bit.ly/1gQg5Fl Monitoring social phenomena Source: http://bit.ly/1niFkU7 Targeted advertising Source: http://bit.ly/1l0ombo Making purchase decision Source: http://bit.ly/1VzYEZG
  6. 6. Different Types of Subjective Information Intent “would like to watch” Expectation “hope it’s good” would like to watch The Secret Life Of Pets. I hope it's good. "The Secret Life of Pets" was clever, adorable, funny and I already want to see it again. I don't think watching The Secret Life of Pets makes me childish. I laughed I cried and it was so touching for someone who has a pet like me. Finding Dory was much better than The Secret Life of Pets. Still not as good as Zootopia though. 6 The Secret Life of Pets soundtrack should be nominated for an Oscar Sentiment “clever, adorable, funny” Intent “want to see it again” Opinion “don’t think watching … makes me childish” Emotion “I laughed I cried and it was so touching” Preference “much better than” Preference “not as good as” Opinion “should be nominated for an Oscar”
  7. 7. Defining Subjective Information Formally, a subjective experience can be represented as a quadruple ⟨𝒉, 𝒔, 𝒆, 𝒄⟩: 𝒉 − a holder, an individual who holds the experience. 𝒔 − a stimulus (or target), an entity, event or situation that elicits the experience. 𝒆 − a set of expressions that are used to describe the experience, e.g., the sentiment words/phrases or the opinion claims. 𝒄 − a classification or assessment that categorizes or measures the experience, e.g., sentiment orientation (positive vs. negative), emotion type (joy, anger, sadness, surprise, etc.), or a score indicating the strength of sentiment. 7
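The quadruple above maps naturally onto a small data structure. A minimal sketch in Python; the class and field names are illustrative, not from the dissertation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubjectiveExperience:
    """A subjective experience as the quadruple <h, s, e, c>."""
    holder: str               # h: the individual holding the experience
    stimulus: str             # s: the entity/event/situation eliciting it
    expressions: List[str]    # e: expressions describing it
    classification: str       # c: the category or assessment

# One of the slide-8 examples encoded as a quadruple
exp = SubjectiveExperience(
    holder="author of the tweet",
    stimulus="The Secret Life of Pets",
    expressions=["clever", "adorable", "funny"],
    classification="positive",
)
```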
  8. 8. Different Types of Subjective Information 8
Type | 𝐇𝐨𝐥𝐝𝐞𝐫 𝒉 | 𝐒𝐭𝐢𝐦𝐮𝐥𝐮𝐬 𝐬 | 𝐄𝐱𝐩𝐫𝐞𝐬𝐬𝐢𝐨𝐧 𝒆 | 𝐂𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝒄
Sentiment | an individual who holds the sentiment | an entity | sentiment words/phrases | positive, negative, neutral
Opinion | an individual who holds the opinion | an entity | opinion claims (may not contain sentiment words) | positive, negative, neutral
Emotion | an individual who holds the emotion | an event or situation | emotion words/phrases, description of events/situations | anger, disgust, fear, happiness, sadness, surprise
Preference | an individual who holds the preference | a set of alternatives | words/phrases that indicate comparison or preference | depends on specific tasks
Intent | an individual who holds the intent | an action | words/phrases that show the presence of will, description of the act | depends on specific tasks
Expectation | an individual who holds the expectation | an entity | words/phrases that express beliefs about what someone or something will be | depends on specific tasks
  9. 9. 9 * The holders of these experiences are the authors of the messages.
Type | 𝐒𝐭𝐢𝐦𝐮𝐥𝐮𝐬 𝐬 | 𝐄𝐱𝐩𝐫𝐞𝐬𝐬𝐢𝐨𝐧 𝒆 | 𝐂𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝒄
Example: “would like to watch The Secret Life Of Pets. I hope it's good.”
‒ Intent | watch the movie | “would like to” | transactional
‒ Expectation | The Secret Life of Pets movie | “hope” | optimistic
Example: “"The Secret Life of Pets" was clever, adorable, funny and I already want to see it again.”
‒ Sentiment | The Secret Life of Pets movie | “clever”, “funny”, “adorable” | positive
‒ Intent | see the movie | “want to” | transactional
Example: “I don't think watching The Secret Life of Pets makes me childish. I laughed I cried and it was so touching for someone who has a pet like me.”
‒ Opinion | The Secret Life of Pets movie | “don’t think … makes me childish” | positive
‒ Emotion | The Secret Life of Pets movie | “laughed”, “cried”, “so touching” | funny, touching
Example: “Finding Dory was much better than The Secret Life of Pets. Still not as good as Zootopia though.”
‒ Preference | Finding Dory, The Secret Life of Pets | “much better than” | preferring Finding Dory
‒ Preference | Finding Dory, Zootopia | “not as good as” | preferring Zootopia
Example: “The Secret Life of Pets soundtrack should be nominated for an Oscar”
‒ Opinion | The Secret Life of Pets soundtrack | “should be nominated for an Oscar” | positive
  10. 10. 10 An overview of subjective information extraction. The box colored in orange indicates the scope of this dissertation.
  11. 11. Dissertation Focus 1. Extraction of Target- Specific Sentiment Expressions (ICWSM’12) 2. Discovery of Domain- Specific Features and Aspects (NAACL’16) Emotion Identification (SocialCom’12, BII’12, CSCW’14, ACL’14) 3. Application: Predicting Election Results (SocInfo’12) • Identifying and extracting subjective information from user generated content. 11 4. Application: Religiosity & Happiness (SocInfo’14) Sentiment Opinion Emotion Subjective Information 𝐒𝐭𝐢𝐦𝐮𝐥𝐢 𝐬 𝐄𝐱𝐩𝐫𝐞𝐬𝐬𝐢𝐨𝐧 𝒆 𝐂𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐜 Holder 𝒉
  12. 12. Thesis Statement • This dissertation presents a unified framework that characterizes a subjective experience, such as sentiment, opinion, or emotion, in terms of an individual holding it, a target eliciting it, a set of expressions describing it, and a classification or assessment measuring it; • it describes new algorithms that automatically identify and extract sentiment expressions and opinion targets from user generated content with minimal human supervision; • it shows how to use social media data to predict election results and investigate religion and subjective well-being, by classifying and assessing subjective information in user generated content. 12
  13. 13. Sentiment in User Generated Content Sources: Social media Data: posts, messages Targets: movies, persons, brands, etc. 13 E1. Lights out definitely lived up to the hype! Great movie! E2. I got my second Pikachu today this one was from 2k egg revitalised my love for Pokemon go... Did not last long 😆 stoopid game E3. Game of Thrones is a must watch. E4. I find myself grateful that Hillary Clinton is predictable and steady. Like her or don't, she's SAFE. E5. Saw the avengers last night. Mad overrated. Cheesy lines and horrible writing. Very predictable. E6. I saw The Avengers yesterday evening. It was long but it was very good! E7. Galaxy s7 edge battery life last so long it's almost unlimited battery life xD Target Lights out 75% 20% 5% Pokemon Go 69% 17% 14% Game of Thrones 83% 10% 7% Hillary Clinton 49% 35% 16% The Avengers 70% 24% 6% Galaxy S7 Edge 68% 16% 16% Sentiment Analysis Predictive Models business analytics, predicting financial performance, predicting election results …
  14. 14. 1. Extraction of Target- Specific Sentiment Expressions 14 Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit Sheth. Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012. Given a set of unlabeled social media posts, how to extract diverse forms of sentiment expressions with respect to a specific target?
  15. 15. Example E1. Lights out definitely lived up to the hype! Great movie! E2. I got my second Pikachu today this one was from 2k egg revitalised my love for Pokemon go... Did not last long 😆 stoopid game E3. Game of Thrones is a must watch. E4. I find myself grateful that Hillary Clinton is predictable and steady. Like her or don't, she's SAFE. E5. Saw the avengers last night. Mad overrated. Cheesy lines and horrible writing. Very predictable. E6. I saw The Avengers yesterday evening. It was long but it was very good! E7. Galaxy s7 edge battery life last so long it's almost unlimited battery life xD Instances Sentiment Expressions Classification E1 lived up to the hype, great positive E2 love, not last long, stoopid positive, negative E3 must watch positive E4 grateful, predictable, steady, safe positive E5 mad overrated, cheesy, horrible, very predictable negative E6 long, very good negative, positive E7 last so long, unlimited positive Sources: Social media Data: posts, messages e.g., tweets Targets: movies, persons, brands, etc. 15
  16. 16. Challenges • Sentiment expressions can be very diverse. ‒ Vary from single words (e.g., “good”, “predictable”) to multi-word phrases of different lengths (“lived up to the hype”, “must see”) ‒ Can be formal or slang expressions, including abbreviations and spelling variations (e.g., “gud”, “stoopid”). • The polarity of a sentiment expression is sensitive to its target. ‒ E.g., “long” in “long river”, “long battery life”, or “long time for downloading”. ‒ E.g., “predictable” regarding movies, or regarding stocks. 16
  17. 17. Contributions We propose a novel optimization-based approach that: • identifies a diverse and richer set of sentiment expressions, including both formal and slang words/phrases; • assesses the target-dependent polarity of each sentiment expression; and • does not require labeled data or hand-crafted patterns. 17
  18. 18. The Proposed Approach Extracting Candidate Expressions Identifying Inter-Expression Relations Assessing Target-dependent Polarity 18
  19. 19. Example: “The Avengers movie was bloody amazing! A little cheesy at times, but I liked it. Mmm looking good Robert Downey Jr and Captain America ;)” “on-target” subjective words: “bloody”, “amazing”, “cheesy”, “liked” Candidate expressions: “bloody”, “amazing”, “bloody amazing”, “cheesy”, “little cheesy”, “cheesy at times”, “little cheesy at times”, “liked” Method: • For each message, selecting the “on-target” subjective words, and extracting all the n-grams that contain at least one selected subjective word as candidates. • A subjective word is selected as “on-target” if (1) there is a dependency relation between the word and the target, or (2) the word is proximate to the target (e.g., within four words distance). 19 Extracting Candidate Expressions
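The candidate-extraction step above can be sketched in a few lines of Python. This sketch keeps only the proximity criterion from the slide (subjective word within four tokens of the target); the dependency-relation check is omitted, and the function and variable names are ours:

```python
def extract_candidates(tokens, target_idx, subjective_words, max_n=4, window=4):
    """Candidate sentiment expressions: all n-grams containing at least one
    'on-target' subjective word. Proximity-only sketch; the full method also
    accepts words that have a dependency relation with the target."""
    # Select "on-target" subjective words by proximity to the target token
    on_target = {i for i, tok in enumerate(tokens)
                 if tok in subjective_words and 0 < abs(i - target_idx) <= window}
    candidates = set()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            span = set(range(start, start + n))
            # Keep n-grams that contain an on-target word but not the target
            if target_idx not in span and span & on_target:
                candidates.add(" ".join(tokens[start:start + n]))
    return candidates

cands = extract_candidates(
    "the avengers movie was bloody amazing".split(),
    target_idx=1,                       # position of "avengers"
    subjective_words={"bloody", "amazing"},
)
```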
  20. 20. Identifying Inter-Expression Relations 1. I saw The Avengers yesterday evening. It was long but it was very good! 2. I do enjoy The Avengers, but it's both overrated and problematic. 3. Saw the avengers last night. Mad overrated. Cheesy lines and horrible writing. Very predictable. 4. The avengers was good but the plot was just simple minded and predictable. 5. The Avengers was good. I was not disappointed. 20
  21. 21. Assessing Target-dependent Polarity 21
  22. 22. An Optimization Model (1)
• For each candidate expression c_i:
‒ P-Probability Pr_P(c_i) – the probability that c_i indicates positive sentiment
‒ N-Probability Pr_N(c_i) – the probability that c_i indicates negative sentiment
‒ Pr_P(c_i) + Pr_N(c_i) = 1
• For each pair of candidate expressions c_i and c_j:
‒ Consistency probability – the probability that c_i and c_j have the same polarity:
Pr_cons(c_i, c_j) = Pr_P(c_i) Pr_P(c_j) + Pr_N(c_i) Pr_N(c_j)
‒ Inconsistency probability – the probability that c_i and c_j have different polarities:
Pr_incons(c_i, c_j) = Pr_P(c_i) Pr_N(c_j) + Pr_N(c_i) Pr_P(c_j)
22
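The two pairwise probabilities on this slide follow directly from the P-Probabilities (with N-Probability = 1 − P-Probability); a minimal sketch, with function names of our choosing:

```python
def pr_cons(p_i, p_j):
    """Probability that c_i and c_j have the same polarity:
    both positive or both negative."""
    return p_i * p_j + (1 - p_i) * (1 - p_j)

def pr_incons(p_i, p_j):
    """Probability that c_i and c_j have different polarities."""
    return p_i * (1 - p_j) + (1 - p_i) * p_j
```

For any pair the two quantities sum to 1, since the four polarity combinations are exhaustive.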
  23. 23. An Optimization Model (2)
• We want the consistency and inconsistency probabilities derived from the P-Probabilities and N-Probabilities of the candidates to be closest to their expectations suggested by the relation networks.
• Objective Function:
minimize Σ_{i=1..n−1} Σ_{j=i+1..n} [ w_ij^cons (1 − Pr_cons(c_i, c_j))² + w_ij^incons (1 − Pr_incons(c_i, c_j))² ]
where w_ij^cons and w_ij^incons are the weights of the edges (strength of the relations) between c_i and c_j in the consistency and inconsistency relation networks, n is the total number of candidate expressions, and
Pr_cons(c_i, c_j) = Pr_P(c_i) Pr_P(c_j) + Pr_N(c_i) Pr_N(c_j)
Pr_incons(c_i, c_j) = Pr_P(c_i) Pr_N(c_j) + Pr_N(c_i) Pr_P(c_j)
23
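This objective can be minimized with standard numerical methods. Below is a hedged sketch using projected gradient descent in NumPy; the choice of optimizer, learning rate, and iteration count are our assumptions, as the slide does not say how the model is solved:

```python
import numpy as np

def fit_polarities(p0, w_cons, w_incons, lr=0.05, iters=500):
    """Projected gradient descent on the slide's objective:
    minimize sum_{i<j} [ w_cons[i,j] * (1 - Pr_cons(c_i, c_j))**2
                       + w_incons[i,j] * (1 - Pr_incons(c_i, c_j))**2 ]
    p0: initial P-Probabilities; w_cons / w_incons: symmetric edge weights."""
    p = np.asarray(p0, dtype=float).copy()
    n = len(p)
    for _ in range(iters):
        grad = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                pc = p[i] * p[j] + (1 - p[i]) * (1 - p[j])   # Pr_cons
                pn = p[i] * (1 - p[j]) + (1 - p[i]) * p[j]   # Pr_incons
                grad[i] += (-2 * w_cons[i, j] * (1 - pc) * (2 * p[j] - 1)
                            - 2 * w_incons[i, j] * (1 - pn) * (1 - 2 * p[j]))
        p = np.clip(p - lr * grad, 0.0, 1.0)  # keep valid probabilities
    return p

# Toy example: c0 is seeded positive; c0~c1 consistent, c0~c2 inconsistent.
w_cons = np.zeros((3, 3)); w_cons[0, 1] = w_cons[1, 0] = 1.0
w_incons = np.zeros((3, 3)); w_incons[0, 2] = w_incons[2, 0] = 1.0
p = fit_polarities([1.0, 0.5, 0.5], w_cons, w_incons)
```

On this toy network the consistency edge pulls c1 toward positive and the inconsistency edge pushes c2 toward negative, which is the propagation behavior the relation networks are meant to induce.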
  24. 24. Experiments: Datasets Table: Description of four target-specific datasets from social media. 24 Tweet about movie New Star Trek movie is great! Highly recommend it! Tweet about person Scarlett Johansson rocking a suit better than most men. Forum post about epilepsy treatment I have an 11 month old who suffers from 0-8 seizures per day. We've tried 6 medications that have all failed and are now on The Ketogenic Diet. The diet has been amazing at reducing the frequency and intensity of his seizures. However, I want them GONE! I am wondering if infant chiropractic care or acupuncture is safe and effective in eliminating seizures. Does anyone have any experience with either of these? Forum post about cellular company I click on Mobile Sync to move all my contacts from my phone to the Sprint website. There are over 100 contacts in my phone, but it's only moving 59 of them? Help Facebook post about automobile company I have a 2006 Trailblazer that had a motor failure at 60,000 miles. GM refused to help in any way. Poor customer service to say the least. I guess they don't care about your car post warranty. With a driveway full of GM's its probably the last one I will buy.
  25. 25. Experiments on Tweets • Datasets: ‒ 168,005 tweets about movies ‒ 258,655 tweets about persons • Gold standard: 1500 tweets were randomly sampled from each domain. Human experts identified sentiment expressions and labeled each expression and tweet with target-specific sentiment. Table: Distributions of N- grams and Part-of-speech of the Sentiment Expressions in the Gold Standard Data Set. Table: Distribution of Sentiment Categories of the Tweets in the Gold Standard Data Set. 25
  26. 26. Methods COM -- Constrained Optimization Model • COM-const: Assign 0.5 to all the candidates as their initial P-Probabilities. • COM-gelex: Initialize the candidates’ polarities according to the subjectivity dictionary. (positive-1.0, negative-0.0, other-0.5) • MPQA, GI, SWN: For each extracted subjective word regarding the target, simply look up its polarity in MPQA, General Inquirer and SentiWordNet, respectively. • PROP: A propagation approach proposed by Qiu et al. (IJCAI’09) 26
  27. 27. Results 27 The results demonstrate the advantage of our optimization-based approach over lexicon-based or rule-based polarity assessment – our method extracts diverse sentiment expressions and captures their target-dependent polarity.
  28. 28. Results of Sentiment Expression Extraction with Various Corpora Sizes Our approach improves both precision and recall as the corpus size increases from 12,000 to 48,000, because it benefits from the additional relations extracted from larger corpora. 28
  29. 29. • Datasets: ‒ 100 forum posts about epilepsy treatment ‒ 162 forum posts about cellular company ‒ 200 Facebook posts about automobile company • Gold standard: human experts identified sentiment expressions from posts, and labeled each expression and post sentence with target- specific sentiment. 29 Experiments on Other Social Media Posts Table: Characteristics of sentiment expressions in the Gold Standard Data Set. Table: Distribution of Sentiment Categories of post sentences in the Gold Standard Data Set.
  30. 30. Results 30 Table: Quality of the extracted sentiment expressions. Figure: Sentence-level sentiment classification accuracy using different lexicons. The stable performance on all five datasets provides a strong indication that the proposed approach is not limited to a specific domain or a specific social media data source.
  31. 31. Sample Output (Movie Domain) 31
  32. 32. Aspect-based Opinion Mining It would be helpful to have an aspect-based opinion summarization for products. … Size picture quality motion-smoothing sound quality big screen perfect size fits big bedroom … full hd best picture blur reduction … smooth motion sensor tracing effects … loud white noise high pitched sound … 32
  33. 33. 2. Discovery of Domain- Specific Features and Aspects 33 Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of Aspects and Features from Reviews. Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2016. Given a set of plain product reviews, how to efficiently identify (both explicit and implicit) product features and group them into aspects?
  34. 34. Example Review Sentences 1. Phone is easy to use and has great features. Large screen is great. Great speed makes smooth viewing of tv programs or sports. 2. It has a big bright display, it's very fast and very lightweight for its size. 3. Good features for an inexpensive android, light, good signal, good sound, pretty quick for a 800MHz processor. 4. The phone runs extra fast and smooth, and has great price. Aspects {screen, display, bright} {size, large, big} {lightweight, light} {price, inexpensive} {speed, processor, fast, quick, smooth} {easy, use} {features} {signal} {sound} Feature: components and attributes of a product. • Explicit feature: mentioned as an opinion target • Implicit feature: implied by opinion words • Different feature expressions may be used to describe the same aspect of a product. Aspect: represented as a group of features 34
  35. 35. Related Work • Two-step approach: first identifying features, then clustering them • Feature Identification ‒ Only extracts features without grouping them. ‒ Implicit features have been largely ignored. ‒ Requires seed terms, hand-crafted rules/patterns, or other annotation efforts. • Feature Clustering/Aspect Discovery ‒ Assumes that features have been identified beforehand. ‒ Topic-model-based approaches o not fine-grained aspects (Zhang and Liu, 2014), not directly interpretable as aspects (Chen et al., 2013; Bancken et al., 2014), not good at dealing with aspect sparsity (Xu et al., 2014), etc. ‒ Clustering-based approaches (Su et al., 2008; Lu et al., 2009; Bancken et al., 2014) 35
  36. 36. Contributions We propose a new clustering-based approach that: • identifies both features and aspects simultaneously; • extracts both explicit and implicit features and groups them into aspects; and • does not require seed terms, hand-crafted patterns, or any other labeling efforts. 36
  37. 37. The Clustering Algorithm Notation:
• 𝑋 = {x_1, …, x_n} is a set of candidate features, which are extracted from reviews of a given product.
o Candidates of explicit features: nouns and noun phrases
o Candidates of implicit features: adjectives and verbs
• 𝑘 is the number of aspects.
• 𝑠 is the number of most frequent candidates that will be grouped first to generate the seed clusters. (1) Seed clusters built from frequent terms are high quality, since frequent terms are more likely the actual features of customers' interests. (2) Clustering only the most frequent candidates first also speeds up the process.
• 𝛿 is the upper bound of the distance between two mergeable clusters.
Domain-specific similarity measure: determines how similar the members of two clusters are regarding the particular domain/product.
Merging constraints: further ensure that terms from different aspects are not merged.
37
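The merging loop behind this algorithm can be illustrated with a toy greedy agglomerative sketch. Seed-cluster generation from the s most frequent candidates and the merging constraints are omitted here, and the similarity function is a stand-in, not the CAFE measure:

```python
def agglomerate(items, sim, delta=0.8, k=2):
    """Greedily merge the two closest clusters until k clusters remain or the
    smallest inter-cluster distance exceeds the upper bound delta.
    Distance = 1 - average pairwise similarity between cluster members."""
    clusters = [[x] for x in items]

    def dist(a, b):
        return 1 - sum(sim(u, v) for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > k:
        d, i, j = min((dist(a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        if d > delta:
            break            # no mergeable pair left within the bound
        clusters[i].extend(clusters.pop(j))
    return clusters

# Stand-in similarity: 1.0 within a known aspect, 0.0 across aspects.
aspects = {"screen": 0, "display": 0, "price": 1, "inexpensive": 1, "fast": 2}
sim = lambda u, v: 1.0 if aspects[u] == aspects[v] else 0.0
out = agglomerate(list(aspects), sim, delta=0.5, k=1)
```

With this toy similarity the loop stops at three clusters: {screen, display}, {price, inexpensive}, and {fast}, because every remaining pair is farther apart than δ.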
  38. 38. • General semantic similarities that are learned from thesaurus dictionaries or web corpus. ‒ The similarities between words/phrases are domain dependent. E.g., “ice cream sandwich'' and “operating system” (cell-phone domain) “smooth” and “speed” (cell-phone domain vs. hair dryer domain) • Domain-dependent similarities that are learned from a domain- specific corpus based on distributional information. ‒ Different aspects may share similar context. E.g., “great display”, “great price”, “great speed” ‒ The words describing the same aspect may not share similar context or co-occur. E.g., people use “is inexpensive” or “has great price” instead of “has inexpensive price”; “running fast” or “great speed” instead of “fast speed” Similarity Measures 38
  39. 39. Domain-specific Similarity • General similarity matrix G -- a n × n matrix, where Gij is the general semantic similarity between xi and xj , Gij ∈ [0, 1], Gij = 1 when i=j, and Gij = Gji. • Use UMBC Semantic Similarity Service to get G. • Statistical association matrix T -- a n × n matrix, where Tij is the pairwise statistical association between xi and xj in a domain-specific corpus, Tij ∈ [0, 1], Tij = 1 when i=j, and Tij = Tji. • Use normalized pointwise mutual information (NPMI) to get T. 39 - f(xi) (or f(xj)) is the number of documents where xi (or xj) appears, - f(xi, xj) is the number of documents where xi and xj co-occur in a sentence, - N is the total number of documents in the corpus. NPMI(xi, xj) ∈ [−1, 1], and we rescale the values of NPMI to the range of [0, 1].
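The T-matrix entries above can be computed directly from the document counts the slide defines. A short sketch; the linear rescaling from [−1, 1] to [0, 1] is one reasonable reading of "rescale", and the function name is ours:

```python
import math

def t_entry(f_i, f_j, f_ij, N):
    """T_ij from document frequencies: normalized PMI, rescaled to [0, 1].
    f_i, f_j: documents containing x_i / x_j; f_ij: documents where x_i and
    x_j co-occur in a sentence; N: total documents. Assumes 0 < f_ij < N."""
    p_i, p_j, p_ij = f_i / N, f_j / N, f_ij / N
    npmi = math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)  # NPMI in [-1, 1]
    return (npmi + 1) / 2                                   # rescaled to [0, 1]
```

Perfect co-occurrence gives 1.0 and statistical independence gives 0.5, matching the slide's requirement that T_ij ∈ [0, 1].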
40. • A candidate xi can be represented by the i-th row of G or T. • The domain-specific similarity between xi and xj is defined as the weighted sum of three similarity metrics: sim(xi, xj) = wg·simg(xi, xj) + wt·simt(xi, xj) + wgt·simgt(xi, xj), where wg, wt, and wgt are the weights of the three metrics. simg captures semantically similar/relevant words, e.g., “screen” and “display”, “speed” and “fast”. simt captures words sharing similar context, e.g., “ice cream sandwich” and “operating system”. simgt gets a high value when the terms strongly associated with xi (or xj) are semantically similar to xj (or xi), e.g., “smooth” and “speed”. Domain-specific Similarity 40
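The weighted combination can be sketched as follows. The weighted sum and the default weights follow the slides; the exact definitions of simg, simt, and simgt are not spelled out here, so the cosine-based forms below are assumptions chosen to match the behavior each metric is described as capturing:

```python
import numpy as np

def domain_similarity(G, T, i, j, wg=0.2, wt=0.2, wgt=0.6):
    """Sketch of the combined domain-specific similarity for x_i and x_j.

    G, T : n x n general-similarity and statistical-association matrices
    The component definitions are assumptions:
      sim_g  -> direct general semantic similarity G[i, j]
      sim_t  -> cosine of the association profiles (rows of T)
      sim_gt -> cross similarity between x_i's associates and x_j's
                semantic neighbors (and vice versa)
    """
    def cosine(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    sim_g = G[i, j]
    sim_t = cosine(T[i], T[j])
    sim_gt = (cosine(T[i], G[j]) + cosine(G[i], T[j])) / 2
    return wg * sim_g + wt * sim_t + wgt * sim_gt
```

With nonnegative, symmetric G and T and weights summing to 1, the result stays in [0, 1] and is symmetric in i and j.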
41. • We evaluate this approach on reviews from three different domains. • The default setting of CAFE (Clustering for Aspect and Feature Extraction): ‒ The number of aspects k = 50 ‒ Distance upper bound δ = 0.8 ‒ The number of candidates that are grouped first to generate seed clusters s = 500 ‒ The weights of the three similarity measures wg = wt = 0.2, wgt = 0.6 Data and Experimental Setting 41
42. • PROP: A double-propagation approach that extracts features using hand-crafted rules based on dependency relations between features and opinion words. (Qiu et al., IJCAI’09) • LRTBOOT: A bootstrapping approach that extracts features by mining pairwise feature-feature, feature-opinion, and opinion-opinion associations between terms in the corpus, where the association is measured by likelihood ratio tests. (Hai et al., CIKM’12) Evaluations on Feature Extraction – Methods 42
43. Evaluations on Feature Extraction – Results 43
44. • MuReinf: A clustering method that utilizes the mutual reinforcement association between features and opinion words to iteratively group them into feature clusters and opinion clusters. (Su et al., WWW’08) • L-EM: A semi-supervised learning method that adapts the Naive Bayes-based EM algorithm to group synonym features into categories. (Zhai et al., WSDM’11) • L-LDA: A baseline method used in (Zhai et al., WSDM’11), which is based on LDA. * Because MuReinf, L-EM, and L-LDA need another algorithm to extract features, both LRTBOOT and CAFE are applied as the feature extractor. Evaluations on Aspect Discovery – Methods 44
45. Evaluations on Aspect Discovery – Results The results showed the advantage of combining feature and aspect discovery over chaining them, and also implied the effectiveness of our domain-specific similarity measure in identifying synonym features in a particular domain. 45
46. Influence of Parameters Based on the experiments on three domains, the best results are achieved when the distance upper bound δ is set to a value between 0.76 and 0.84. CAFE generates better results by first clustering the top 10%-30% most frequent candidates. The best F-score and Rand Index are achieved when wgt is set to 0.5 or 0.6 across all three domains. 46
47. Sample Output 47
48. 3. Harnessing Public Opinion on Twitter to Predict Election Results Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012 U.S. Republican Presidential Primaries. Proceedings of the 4th International Conference on Social Informatics (SocInfo), 2012. How to derive public opinion about election candidates? Are opinion holders equal in predicting elections? 48
49. Overview [Diagram: from a user's tweets, each tweet is labeled with the candidate it mentions and the opinion polarity (tweet ID, candidate: XXX, opinion: positive), and each user is categorized along four dimensions: 1. Political Preference (e.g., right-leaning), 2. Engagement Degree (e.g., high engagement), 3. Content Type (e.g., opinion-prone), 4. Tweet Mode (e.g., original-tweet-prone). The system predicts which candidate each user supports, then aggregates the opinions of each user group to predict election results.] 49
50. Contributions • We introduce a new method to predict election results that: ‒ identifies which candidate is mentioned, and whether a positive or negative opinion is expressed towards that candidate in a tweet; ‒ predicts which candidate a user supports based on the opinions extracted from his/her tweets; and ‒ aggregates the opinions of all users in a group to predict which candidate will win the election. • We show that the opinion holders matter in predicting election results. ‒ We group users based on their political preference, engagement degree, tweet mode, and content type, and examine the predictive power of different user groups in predicting Super Tuesday results in 10 states. ‒ We evaluate the results in terms of both the accuracy of predicting winners and the error rate between the predicted votes and the actual votes for each candidate. 50
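The pipeline named in the contributions, from per-tweet opinions to a group-level prediction, can be sketched as follows. The +1/−1 opinion scoring and the simple majority aggregation are illustrative assumptions, not the exact method of the paper:

```python
from collections import Counter

def predict_user_vote(tweet_opinions):
    """Predict which candidate a user supports from (candidate, polarity)
    pairs extracted from the user's tweets. Hypothetical scoring: +1 for a
    positive opinion, -1 for a negative one; the user is assigned to the
    candidate with the highest net score."""
    scores = Counter()
    for candidate, polarity in tweet_opinions:
        scores[candidate] += 1 if polarity == "positive" else -1
    if not scores:
        return None  # no opinion tweets: this user casts no predicted vote
    return max(scores, key=scores.get)

def predict_winner(user_tweet_opinions):
    """Aggregate the predicted votes of all users in a group to predict the
    winner and each candidate's vote share."""
    votes = Counter()
    for opinions in user_tweet_opinions:
        vote = predict_user_vote(opinions)
        if vote is not None:
            votes[vote] += 1
    total = sum(votes.values())
    shares = {c: n / total for c, n in votes.items()}
    return max(votes, key=votes.get), shares
```

The predicted vote shares can then be compared against actual primary votes to compute the prediction error reported in the evaluation.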
51. Findings Revealing the challenge of identifying the opinion of the “silent majority.” Retweets may not necessarily reflect users' attitudes. Predicting a user's vote based on more opinion tweets is not necessarily more accurate than predicting it based on more information tweets. The right-leaning user group provides the most accurate prediction result. In the best case (56-day time window), it correctly predicts the winners in 8 out of 10 states with an average prediction error of 0.1. 51
52. 4. Religion and Subjective Well-being Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twitter. Proceedings of the 6th International Conference on Social Informatics (SocInfo), 2014. Lu Chen, Ingmar Weber, Adam Okulicz-Kozaryn, and Amit Sheth. Understanding the Effect of Religion on Happiness by Examining the Topic Preferences and Word Usage on Twitter. (in submission to PLOS ONE). How to use Twitter data to measure subjective well-being? How does the religious belief of users (holders) affect the happiness expressed in their tweets? 52
53. Overview [Diagram: for each user with an identified religious belief (e.g., Buddhism), three measures are computed from his/her tweets: happiness level h_avg(user), topic preference p(topic|user), and word preference p(word|topic, user); the measures of individual users are aggregated to obtain the group-level measures h_avg(group), p(topic|group), and p(word|topic, group).] 1. What is the effect of religion on happiness? 2. How do topic preference and word usage affect the happiness expressed by each group? 53
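The user-to-group aggregation step can be sketched as follows. The unweighted averaging over users is an assumption; the keys `h_avg` and `p_topic` mirror the per-user measures named above:

```python
def group_measures(users):
    """Aggregate per-user measures into group-level measures.

    users : list of dicts, each with
      'h_avg'   -> average happiness level of the user's tweets
      'p_topic' -> the user's topic distribution, as {topic: probability}
    Returns the group happiness level and group topic distribution,
    averaged (unweighted) over users.
    """
    n = len(users)
    h_group = sum(u["h_avg"] for u in users) / n
    # Union of all topics any user mentions; absent topics count as 0.
    topics = set().union(*(u["p_topic"] for u in users))
    p_topic_group = {t: sum(u["p_topic"].get(t, 0.0) for u in users) / n
                     for t in topics}
    return h_group, p_topic_group
```

Since each user's topic distribution sums to 1, the unweighted mean is again a valid distribution over the union of topics.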
54. Contributions • We provide a fresh perspective on happiness and religion, complementing traditional survey-based studies, by analyzing the topics and words naturally disclosed in people's social media messages. • We introduce a framework and methodology that explore the effect of social and demographic factors of a holder (e.g., a holder’s religious belief) on subjective well-being. • Our method also explores potential reasons for the variations in the level of happiness through the holder’s topic preferences and word usage on topics. 54
55. Findings • There is a significant difference among the seven groups (atheist, Buddhist, Christian, Hindu, Jew, Muslim, and random Twitter users) in the level of happiness (pleasant/unpleasant emotions) expressed in tweets. • Each user group has different topic preferences and different word usage on the same topic. However, the differences in word usage are small compared with the differences in topic distributions. • The users' topic preferences strongly correlate with the happiness expressed in their tweets. 55
56. Conclusion • This dissertation presents a unified framework that characterizes a subjective experience, such as sentiment, opinion, or emotion, in terms of an individual holding it, a target eliciting it, a set of expressions describing it, and a classification or assessment measuring it; • it describes new algorithms that automatically identify and extract sentiment expressions and opinion targets from user generated content with minimal human supervision; • it shows how to use social media data to predict election results and investigate religion and subjective well-being, by classifying and assessing subjective information in user generated content. 56
57. Future Directions 1. Detecting different types of subjectivity in text 2. Beyond sentiment and opinion 3. Towards dynamic modeling of subjective information: a subjective experience becomes a quintuple ⟨h, s, e, c, t⟩, where t is the time when the subjective experience occurs. 57
58. Publications
• Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of Aspects and Features from Reviews. Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2016. (Acceptance rate: 24%)
• Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twitter. Proceedings of the 6th International Conference on Social Informatics (SocInfo), 2014. (Acceptance rate: 23%)
• Justin Martineau, Lu Chen, Doreen Cheng and Amit Sheth. Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014. (Acceptance rate: 26%)
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Cursing in English on Twitter. Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), 2014. (Acceptance rate: 27%)
• Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Lu Chen, Hemant Purohit, Alan Smith, and Wenbo Wang. Twitris - A System for Collective Social Intelligence. Book chapter in Encyclopedia of Social Network Analysis and Mining, 2014.
• D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, D. Dave, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, R. Falck. PREDOSE: A Semantic Web Platform for Drug Abuse Epidemiology Using Social Media. Journal of Biomedical Informatics: Special Issue on Biomedical Information through the Implementation of Social Media Environments, 2013. PMID: 23892295.
• Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012 U.S. Republican Presidential Primaries. Proceedings of the 4th International Conference on Social Informatics (SocInfo), 2012. (Acceptance rate: 35%)
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter "Big Data" for Automatic Emotion Identification. Proceedings of the 4th ASE/IEEE International Conference on Social Computing (SocialCom), 2012.
• Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit Sheth. Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012. (Acceptance rate: 20%)
• Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical Informatics Insights (BII), 2012.
• R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, and A. Sheth. "I Just Wanted to Tell You That Loperamide WILL WORK": A Web-Based Study of Extra-Medical Use of Loperamide. Journal of Drug and Alcohol Dependence, 2012.
• R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Udayanga, L. Chen, A. Sheth. A Web-based Study of Self-treatment of Opioid Withdrawal Symptoms with Loperamide. The College on Problems of Drug Dependence (CPDD), 2012. 58
59. Media Coverage (1) Washington Post Washington Times La Croix MIT Technology Review Time 59
60. Media Coverage (2) Fast Company RAPPLER BuzzFeed The Times of India Huffington Post 60
61. Media Coverage (3) IN Gizmodo RNS NDTV World Religion News 61
62. Acknowledgement Dissertation Committee: Prof. Amit Sheth (Advisor), Prof. T.K. Prasad, Prof. Keke Chen, Dr. Shaojun Wang, Prof. Valerie Shalin. Co-authors and Collaborators: Dr. Ingmar Weber (QCRI), Dr. Justin Martineau (SRA), Dr. Meena Nagarajan (IBM Watson), Prof. Adam Okulicz-Kozaryn (Rutgers-Camden), Dr. Wenbo Wang (GoDaddy), Dr. Doreen Cheng (SRA), Prof. Raminta Daniulaityte, Dr. Delroy Cameron (Apple), Dr. Ming Tan (IBM Watson). 62
64. Acknowledgement This dissertation is based upon work supported by the National Science Foundation under Grants: • IIS-1111182 “SoCS: Collaborative Research: Social Media Enhanced Organizational Sensemaking in Emergency Response” and • CNS-1513721 “Context-Aware Harassment Detection on Social Media.” 64
