
  1. Cross-domain Sentiment Classification: Resource Selection and Algorithms. Natalia Ponomareva, Statistical Cybermetrics Research Group, University of Wolverhampton, UK. December 17, 2011
  2. Outline
     1 Background: Introduction; State-of-the-art research
     2 Preliminary experiments: In-domain study; Cross-domain experiments
     3 Modeling accuracy loss for cross-domain SC: Domain similarity; Domain complexity; Model construction and validation
     4 Graph-based algorithms: Comparison; Document similarity; Strategy for choosing the best parameters
  3. What is Sentiment Classification? A task within the research field of Sentiment Analysis: classifying documents by the overall sentiment expressed by their authors. Different scales can be used: positive/negative; positive, negative and neutral; rating: 1*, 2*, 3*, 4*, 5*.
     Example: "The film was fun and I enjoyed it." ⇒ positive. "The film lasted too long and I got bored." ⇒ negative.
  4. Applications: Business Intelligence
  5. Applications: Event prediction
  6. Applications: Opinion search
  7. Why challenging? Irony and humour.
     Example: "If you are reading this because it is your darling fragrance, please wear it at home exclusively and tape the windows shut."
     Generally positive words.
     Example: "This film should be brilliant. It sounds like a great plot, the actors are first grade, the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it cannot hold up."
  8. Why challenging? Context dependency.
     Example: "This is a great camera." "A great amount of money was spent for promoting this camera." "One might think this is a great camera. Well think again, because....."
     Rejection or advice?
     Example: "Go read the book."
  9. Approaches to Sentiment Classification: lexical approaches; supervised machine learning; semi-supervised and unsupervised approaches; cross-domain Sentiment Classification (SC).
  10-14. Lexical approaches
     Use dictionaries of sentiment words with a given semantic orientation.
     Dictionaries are built either manually or (semi-)automatically.
     A scoring function is applied to calculate the final semantic orientation (SO) of a text.
     Example: lightweight +3, good +4, ridiculous -2. "Lightweight, stores a ridiculous amount of books and good battery life."
     SO1 = (3 + 4 - 2) / 3 = 5/3 (average of the scores)
     SO2 = the score with the largest absolute value, keeping its sign: max{|3|, |4|, |-2|} = |4|, so SO2 = 4
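The two scoring functions from the example can be sketched as follows; the word scores are taken from the slide, while the whitespace tokenizer and function names are illustrative simplifications, not the author's actual implementation.

```python
# Toy sentiment lexicon from the slide's example
LEXICON = {"lightweight": 3, "good": 4, "ridiculous": -2}

def so_average(text):
    """SO1: average score over the sentiment words found in the text."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def so_max_magnitude(text):
    """SO2: the score with the largest absolute value, keeping its sign."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return max(scores, key=abs) if scores else 0.0

review = "lightweight stores a ridiculous amount of books and good battery life"
print(so_average(review))        # (3 + 4 - 2) / 3 = 5/3
print(so_max_magnitude(review))  # 4
```

The two functions can disagree on polarity strength: SO1 dilutes a strong word across many weak ones, while SO2 lets a single extreme word dominate.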
  15-19. Supervised machine learning
     Learn sentiment phenomena from an annotated corpus.
     Different machine learning methods have been tested (NB, SVM, ME); in the majority of cases SVM demonstrates the best performance.
     For review data the ML approach performs better than the lexical one when training and test data belong to the same domain.
     But it needs a substantial amount of annotated data.
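A minimal sketch of the supervised approach, assuming scikit-learn as one possible toolkit: an SVM over binary unigram features. The tiny training set here is invented for illustration and is not the review corpus used in the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy annotated corpus (stand-in for real labelled reviews)
train_texts = [
    "fun and enjoyable, I loved it",
    "great plot and brilliant acting",
    "too long and boring, I got bored",
    "awful, a waste of money",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Binary (presence/absence) unigram features
vec = CountVectorizer(binary=True)
X = vec.fit_transform(train_texts)

# Linear SVM, the method that performed best in most reported cases
clf = LinearSVC().fit(X, train_labels)

print(clf.predict(vec.transform(["the film was too long and boring"]))[0])
```

The same pipeline degrades when the test reviews come from a different domain, which is exactly the gap the cross-domain methods below try to close.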
  20-22. Semi-supervised and unsupervised approaches
     Require a small amount of annotated data, or none at all.
     Different techniques have been exploited:
     automatic extraction of sentiment words from the Web using seed words (Turney, 2002);
     spectral clustering and active learning (Dasgupta et al., 2009);
     co-training (Li et al., 2010);
     bootstrapping (Zagibalov, 2010);
     graph-based algorithms (Goldberg et al., 2006).
  23. Cross-domain SC. Main approaches: ensemble of classifiers (Read 2005, Aue and Gamon 2005); Structural Correspondence Learning (Blitzer 2007); graph-based algorithms (Wu 2009).
  24. Ensemble of classifiers
     Classifiers are learned on data belonging to different source domains. Various methods can be used to combine them:
     majority voting;
     weighted voting, where a development data set is used to learn credibility weights for each classifier;
     learning a meta-classifier on a small amount of target-domain data.
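The first two combination schemes can be sketched in a few lines; the per-classifier labels and credibility weights below are made-up values for illustration, not learned from any development set.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label per source-domain classifier."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """weights: credibility scores learned on a development set."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Three source-domain classifiers vote on one target-domain review
print(majority_vote(["pos", "neg", "pos"]))                   # pos
print(weighted_vote(["pos", "neg", "neg"], [0.9, 0.3, 0.4]))  # pos (0.9 vs 0.7)
```

Weighted voting can overrule the majority when one source domain is known to transfer well to the target, which is the point of learning the weights on held-out data.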
  25-28. Structural Correspondence Learning (Blitzer et al., 2007)
     Introduce pivot features that appear frequently in both source and target domains.
     Find projections of source features that co-occur with pivots in the target domain.
     Example: "The laptop is great, it is extremely fast." "The book is great, it is very engaging."
  29-33. Discussion
     Machine learning methods demonstrate very good performance, and when the size of the data is substantial they outperform lexical approaches.
     On the other hand, there is a plethora of annotated resources on the Web, and the ability to re-use them would be very beneficial.
     Structural Correspondence Learning and similar approaches are good for binary classification but difficult to apply to multi-class problems.
     That motivates us to exploit graph-based cross-domain algorithms.
  34-37. Data
     The corpus consists of Amazon product reviews on 7 different topics: books (BO), electronics (EL), kitchen & housewares (KI), DVDs (DV), music (MU), health & personal care (HE) and toys & games (TO).
     Reviews are rated either as positive or negative.
     Data within each domain are balanced: 1000 positive and 1000 negative reviews.
  38-39. Data statistics
     corpus   num words   mean words   vocab size   vocab size (>= 3)
     BO       364k        181.8        23k          8 256
     DV       397k        198.7        24k          8 632
     MU       300k        150.1        19k          6 163
     EL       236k        117.9        12k          4 465
     KI       198k         98.9        11k          4 053
     TO       206k        102.9        11k          4 018
     HE       188k         93.9        11k          4 022
     BO, DV, MU: longer reviews, richer vocabularies.
  40. 40. Background Preliminary experiments In-domain study Modeling accuracy loss for cross-domain SC Cross-domain experiments Graph-based algorithmsFeature selection We compared several characteristics of features: words vs. stems and lemmas; unigrams vs. unigrams + bigrams; binary weights vs. frequency, idf and tfidf; features filtered by presence of verbs, adjectives, adverbs and modal verbs vs. unfiltered features. Natalia Ponomareva Cross-domain Sentiment Classification
41. Feature selection
[figure: feature selection results]

42-44. Feature selection
- Filtering of features worsens the accuracy for all domains.
- Unigrams + bigrams generally perform significantly better than unigrams alone.
- Binary, idf and delta idf weights generally give better results than frequency, tfidf and delta tfidf weights.
45-47. Feature selection

domain  features preference  confidence interval, α = 0.01
BO      word ≈ lemma ≈ stem  inside
DV      word ≈ lemma ≈ stem  inside
MU      lemma > stem > word  boundary
EL      word > lemma ≈ stem  inside
KI      word ≈ lemma > stem  inside
TO      word ≈ stem > lemma  boundary
HE      stem > lemma > word  inside
48-52. 10 most discriminative positive features

BO                EL                KI                DV
highly recommend  plenty            perfect for       album
concise           plenty of         be perfect        magnificent
for anyone        highly recommend  favorite          superb
i highly          highly            highly recommend  debut
excellent         ps NUM            fiestaware        wolf
my favorite       please with       be easy           join
unique            very happy        easy to           charlie
inspiring         beat              perfect           love it
must read         glad              eliminate         highly recommend
and also          well as           easy              rare
53-55. 10 most discriminative negative features

BO             EL          KI               DV
poorly         refund      waste of         your money
disappointing  repair      return it        so bad
waste of       do not buy  it break         ridiculous
your money     waste of    refund           waste of
waste          waste       to return        waste
annoying       defective   waste            worst movie
bunch          forum       return           pointless
boring         junk        very disappoint  talk and
bunch of       stop work   worst            pathetic
to finish      worst       I return         horrible
56. Results
[figure: in-domain results]

57. Results for cross-domain SC
[figures: Accuracy, Accuracy drop]
58. Motivation
Cross-domain algorithms usually do not work well for very different source and target domains. Combinations of classifiers from different domains can in some cases perform much worse than a single classifier trained on the closest domain (Blitzer et al., 2007). Finding the closest domain can help to improve the results of cross-domain sentiment classification.
59-64. How to compare data sets?
Machine-learning techniques are based on the assumption that training and test data are drawn from the same probability distribution, and therefore perform much better when training and test sets are alike. The task of finding the best training data thus becomes the task of finding data whose feature distribution is similar to that of the test data.
We propose two characteristics to model accuracy loss: domain similarity and domain complexity (or, more precisely, domain complexity variance).
- Domain similarity approximates the similarity between distributions of frequent features.
- Domain complexity compares the tails of the distributions.
65-68. Domain similarity
We are not interested in all terms, but rather in those bearing sentiment. Studies on sentiment analysis suggest that adjectives, verbs and adverbs are the main indicators of sentiment, so we keep as features only unigrams and bigrams that contain those parts of speech. We compare different weighting schemes for computing corpus similarity: frequencies, TF-IDF and IDF.
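The POS filtering step described above can be sketched as follows; the tag names and the bigram rule (keep a bigram if either word carries a sentiment POS) are assumptions for illustration, not the authors' exact implementation:

```python
SENTIMENT_POS = {"ADJ", "VERB", "ADV"}  # assumed tag set

def sentiment_ngrams(tagged):
    """Keep unigrams and bigrams containing a sentiment-bearing POS.
    tagged: list of (word, pos) pairs; tag names are illustrative."""
    unigrams = [w for w, p in tagged if p in SENTIMENT_POS]
    bigrams = [f"{tagged[i][0]} {tagged[i + 1][0]}"
               for i in range(len(tagged) - 1)
               if tagged[i][1] in SENTIMENT_POS or tagged[i + 1][1] in SENTIMENT_POS]
    return unigrams + bigrams
```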
69-72. Measures of domain similarity
- χ², taken from corpus linguistics, where it was demonstrated to have the best correlation with the gold standard.
- Kullback-Leibler divergence (DKL) and its symmetric analogue, Jensen-Shannon divergence (DJS), borrowed from information theory.
- Jaccard coefficient (Jaccard) and cosine similarity (cosine), well-known similarity measures.
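As a rough illustration (not the authors' exact implementation), these measures can be computed from two domains' term-frequency dictionaries; the add-one smoothing is an assumption needed to keep DKL finite on disjoint vocabulary:

```python
import math

def similarity_measures(freq_a, freq_b):
    """Compare two term-frequency dicts with the measures listed above."""
    vocab = set(freq_a) | set(freq_b)
    na, nb = sum(freq_a.values()), sum(freq_b.values())
    # Add-one smoothed probability distributions over the joint vocabulary.
    pa = {w: (freq_a.get(w, 0) + 1) / (na + len(vocab)) for w in vocab}
    pb = {w: (freq_b.get(w, 0) + 1) / (nb + len(vocab)) for w in vocab}

    cosine = sum(freq_a.get(w, 0) * freq_b.get(w, 0) for w in vocab)
    cosine /= (math.sqrt(sum(v * v for v in freq_a.values())) *
               math.sqrt(sum(v * v for v in freq_b.values())))
    jaccard = len(set(freq_a) & set(freq_b)) / len(vocab)
    d_kl = sum(pa[w] * math.log2(pa[w] / pb[w]) for w in vocab)
    m = {w: (pa[w] + pb[w]) / 2 for w in vocab}
    d_js = (0.5 * sum(pa[w] * math.log2(pa[w] / m[w]) for w in vocab) +
            0.5 * sum(pb[w] * math.log2(pb[w] / m[w]) for w in vocab))
    return {"cosine": cosine, "Jaccard": jaccard, "DKL": d_kl, "DJS": d_js}
```

Identical corpora give cosine = Jaccard = 1 and DKL = DJS = 0; the more the distributions diverge, the larger the divergences and the smaller the similarities.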
73. Correlation for different domain similarity measures
Table: Correlation with accuracy drop

measure  R (freq)  R (filtr., freq)  R (filtr., TFIDF)  R (filtr., IDF)
cosine   -0.790    -0.840            -0.836             -0.863
Jaccard  -0.869    -0.879            -0.879             -0.879
χ²        0.855     0.869             0.876              0.879
DKL       0.734     0.827             0.676              0.796
DJS       0.829     0.833             0.804              0.876
74. Domain similarity: χ²inv
The boundary between similar and distinct domains approximately corresponds to χ²inv = 1.7.
75. Domain complexity
Similarity between domains is mostly controlled by frequent words, but the shape of the corpus distribution is also influenced by the rare words in its tail. It has been shown that richer domains with more rare words are more complex for SC. We also observed that the accuracy loss is higher in cross-domain settings when the source domain is more complex than the target one.
76-77. Measures of domain complexity
We propose several measures to approximate domain complexity:
- percentage of rare words;
- word richness (the proportion of vocabulary size to corpus size);
- relative entropy.
Correlation of domain complexity measures with in-domain accuracy:

% of rare words  word richness  rel. entropy
-0.904           -0.846         0.793
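A sketch of the three complexity measures for a tokenized corpus; the rare-word threshold and the exact relative-entropy definition (here, KL divergence of the unigram distribution from the uniform distribution over the vocabulary) are assumptions, not necessarily the slides' definitions:

```python
import math
from collections import Counter

def complexity(tokens, rare_threshold=3):
    """Domain-complexity measures for a tokenized corpus.
    rare_threshold and the relative-entropy definition are assumptions."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    # Share of vocabulary entries occurring fewer than rare_threshold times.
    pct_rare = 100 * sum(1 for c in counts.values() if c < rare_threshold) / v
    richness = v / n  # vocabulary size relative to corpus size
    # One plausible reading of "relative entropy": KL divergence of the
    # unigram distribution from the uniform distribution over the vocabulary.
    rel_entropy = sum((c / n) * math.log2((c / n) * v) for c in counts.values())
    return pct_rare, richness, rel_entropy
```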
78-79. Domain complexity

corpus  accuracy  % of rare words  word richness  rel. entropy
BO      0.786     64.77            0.064          9.23
DV      0.796     64.16            0.061          8.02
MU      0.774     67.16            0.063          8.98
EL      0.812     61.71            0.049          12.66
KI      0.829     61.49            0.053          14.44
TO      0.816     63.37            0.053          15.27
HE      0.808     61.83            0.056          15.82
80. Modeling accuracy loss
To model the performance drop we assume a linear dependency on domain similarity and complexity variance and propose the following linear regression model:

    F(s_ij, Δc_ij) = β0 + β1·s_ij + β2·Δc_ij,   (1)

where s_ij is the domain similarity (or distance) between target domain i and source domain j, and Δc_ij = c_i − c_j is the difference between domain complexities. The unknown coefficients β are solutions of the following system of linear equations:

    β0 + β1·s_ij + β2·Δc_ij = Δa_ij,   (2)

where Δa_ij is the accuracy drop when adapting the classifier from domain i to domain j.
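A least-squares fit of model (1) over all (source, target) pairs can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def fit_drop_model(s, dc, drop):
    """Least-squares fit of drop ≈ b0 + b1*s + b2*dc (model (1)-(2)).
    s: domain similarities, dc: complexity differences, drop: accuracy drops."""
    X = np.column_stack([np.ones(len(s)), s, dc])  # design matrix [1, s_ij, Δc_ij]
    beta, *_ = np.linalg.lstsq(X, np.asarray(drop, dtype=float), rcond=None)
    return beta  # (b0, b1, b2)
```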
81. Model evaluation
The evaluation of the constructed regression model includes the following steps:
- Global test (F-test) to verify the statistical significance of the regression model with respect to all its predictors.
- Test on individual variables (t-test) to reveal regressors that do not contribute significantly to the model.
- Leave-one-out cross-validation for the data set of 42 examples.
82. Global test
The null hypothesis for the global test states that there is no correlation between the regressors and the response variable. Our purpose is to demonstrate that this hypothesis must be rejected with a high level of confidence; in other words, we have to show that the coefficient of determination R² is high enough to consider its value significantly different from zero.

R²     R      F-value  p-value
0.873  0.935  134.60   << 0.0001
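The F-value above can be reproduced from R² alone using the standard overall-significance F-test for a regression with k predictors on n examples (the formula is standard, not taken from the slides):

```python
def f_statistic(r2, n, k):
    """Overall-significance F-test for a regression with k predictors
    fitted on n examples: F = (R²/k) / ((1-R²)/(n-k-1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# With R² = 0.873, n = 42 domain pairs and k = 2 predictors,
# F comes out near 134, matching the slide up to rounding of R².
```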
83. Test on individual coefficients

                β0         β1         β2
value           -8.67      27.71      -0.55
standard error  1.08       1.77       0.11
t-value         -8.00      15.67      -4.86
p-value         << 0.0001  << 0.0001  << 0.0001

All coefficients are statistically significant at a confidence level higher than 99.9%.
84-87. Leave-one-out cross-validation results

accuracy drop  standard error  standard deviation  max error, 95%
all data       1.566           1.091               3.404
< 5%           1.465           1.133               3.373
> 5%, < 10%    1.646           1.173               3.622
> 10%          1.556           1.166               3.519

We are able to predict accuracy loss with a standard error of 1.5% and a maximum error not exceeding 3.4%. Lower errors are observed for more similar domains. This is a strength of the model, as our main purpose is to identify the closest domains.
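The leave-one-out procedure itself is simple: fit the regression on 41 domain pairs and predict the drop for the held-out pair, repeating over all pairs. A minimal sketch, assuming a design matrix with columns [1, s_ij, Δc_ij]:

```python
import numpy as np

def loocv_error(X, y):
    """Mean absolute leave-one-out prediction error for a linear model:
    fit on all but one example, predict the held-out response, repeat."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i  # hold out example i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errors.append(abs(X[i] @ beta - y[i]))
    return float(np.mean(errors))
```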
88-89. Comparing actual and predicted drop
[figures: actual vs. predicted accuracy drop]
90. Graph-based algorithms: OPTIM
Goldberg et al., 2006: the algorithm is based on the assumption that the rating function is smooth with respect to the graph. The rating difference between the closest nodes is minimised, and the difference between a node's initial rating and its final value is also minimised. The result is the solution of an optimisation problem.
91. Graph-based algorithms: RANK
Wu et al., 2009: on each iteration of the algorithm, the sentiment scores of unlabeled documents are updated on the basis of the weighted sum of the sentiment scores of the nearest labeled neighbours and the nearest unlabeled neighbours. The process stops when convergence is achieved.
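The iterative update described above can be sketched as follows; the weighting parameter alpha and the matrix names are illustrative simplifications, not the paper's exact formulation:

```python
import numpy as np

def rank_propagate(W_lu, W_uu, y_l, alpha=0.5, tol=1e-6, max_iter=1000):
    """Sketch of RANK-style score propagation.
    W_lu: similarity of each unlabeled doc to labeled neighbours (rows nonzero),
    W_uu: similarity between unlabeled docs, y_l: labeled sentiment scores."""
    # Row-normalise so each update is a weighted average of neighbour scores.
    W_lu = W_lu / W_lu.sum(axis=1, keepdims=True)
    W_uu = W_uu / W_uu.sum(axis=1, keepdims=True)
    y_u = np.zeros(W_uu.shape[0])
    for _ in range(max_iter):
        y_new = alpha * W_lu @ y_l + (1 - alpha) * W_uu @ y_u
        if np.max(np.abs(y_new - y_u)) < tol:  # converged
            return y_new
        y_u = y_new
    return y_u
```

Because each score is pulled partly toward its labeled neighbours, the iteration is a contraction and converges to a fixed point.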
92. Comparison
OPTIM algorithm (Goldberg et al., 2006) vs. RANK algorithm (Wu et al., 2009)
93-94. Comparison
- The initial setting of RANK does not allow in-domain and out-of-domain neighbours to be treated differently: easy to change!
- The condition of smoothness of the sentiment function over the nodes is satisfied by both algorithms.
- Unlike RANK, OPTIM requires closeness between the initial sentiment values and the output values for unlabeled nodes. This last condition makes the OPTIM solution more stable.
What about the measure of similarity between graph nodes?
95-99. Document representation
We consider 2 types of document representation:
- feature-based, which involves weighted document features. Features are filtered by POS (adjectives, verbs and adverbs) and weighted using either tfidf or idf.
- sentiment-unit-based, which relies on the percentage of positive and negative units in a document. Units can be either sentences or words: PSP stands for positive-sentence percentage, PWP for positive-word percentage. A lexical approach was used to calculate the semantic orientation of sentiment units, with SentiWordNet and the SOCAL dictionary; the semantic orientation of a sentence is averaged over the number of its positive and negative words.
103. Results. Correlation between document ratings and document features/units:

domain   idf     tfidf   PSP SWN   PSP SOCAL   PWP SWN   PWP SOCAL
BO       0.387   0.377   0.034     0.206       0.067     0.252
DV       0.376   0.368   0.064     0.251       0.098     0.316
EL       0.433   0.389   0.048     0.182       0.043     0.196
KI       0.444   0.416   0.068     0.238       0.076     0.230

- The feature-based document representation with idf weights correlates with document ratings better than any other representation.
- SentiWordNet does not give good results for this task, probably due to the high level of noise introduced by its automatic construction.
- Document similarity is calculated using the cosine measure.
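With documents held as sparse idf-weighted vectors, the cosine measure mentioned above is just the normalized dot product. A minimal sketch over dict-based sparse vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as
    {term: weight} dicts; returns 0.0 for an empty vector."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```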
104. Best accuracy improvement achieved by the algorithms. We tested the performance of each algorithm for several values of its parameters. [Figure: best accuracy improvement obtained by each algorithm, OPTIM vs. RANK]
105. General observations. We selected and examined only those results that fell inside the confidence interval of the best accuracy at α = 0.01.
- RANK tends to depend strongly on the values of its parameters, and the most unstable results are obtained when the source and target domains are different.
- RANK achieves a great improvement when adapting the classifier from a more complex domain to a simpler one.
- OPTIM is stable, but its results are modest.
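The filtering step above can be sketched as follows. The slides do not specify how the confidence interval was constructed, so this assumes a standard normal-approximation binomial interval around the best accuracy (z ≈ 2.576 for a two-sided α = 0.01); the function name and the test-set size are illustrative.

```python
import math

def keep_within_ci(accuracies, n_test, z=2.576):
    """Keep the accuracies that are not significantly worse than the
    best one, using a normal-approximation binomial interval.
    n_test is the number of test documents; z=2.576 matches alpha=0.01."""
    best = max(accuracies)
    half_width = z * math.sqrt(best * (1 - best) / n_test)
    return [a for a in accuracies if a >= best - half_width]
```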
106. Analysis of RANK behaviour. [Figure]
110. Analysis of RANK behaviour.
- Within clusters of similar domains, the majority of good results have γ ≥ 0.9. This shows that the information provided by labeled data is more valuable.
- For dissimilar domains, when the source domain is more complex than the target one, the best results are achieved with a smaller γ, close to 0.5. This means the algorithm benefits greatly from unlabeled data.
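The role of γ as a trade-off between labeled (source) and unlabeled (target) neighbours can be illustrated with a hypothetical scoring function. This is not the actual RANK update, which is a graph-based ranking algorithm; it only shows how γ weighs the two kinds of evidence discussed above.

```python
def neighbour_score(labeled, unlabeled, gamma):
    """Illustrative gamma trade-off (not the real RANK update).
    labeled / unlabeled: lists of (similarity, sentiment) pairs,
    where sentiment is +1 (positive) or -1 (negative).
    gamma close to 1 trusts labeled source neighbours;
    gamma close to 0.5 lets unlabeled target neighbours weigh in."""
    lab = sum(sim * s for sim, s in labeled)
    unlab = sum(sim * s for sim, s in unlabeled)
    return gamma * lab + (1 - gamma) * unlab
```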
111. Analysis of RANK behaviour. [Figure]
115. Analysis of RANK behaviour.
- For dissimilar domains, when the target domain is more complex than the source one, γ tends to increase to 0.7, which gives preference to the simpler labeled data.
- The numbers of labeled and unlabeled neighbours are not equal: there is a clear tendency to prefer a smaller number of unlabeled and a larger number of labeled examples. A proportion of 50 against 150 seems ideal, covering most of the cases.