Mining Product Opinions and Reviews on the Web

  • To read this thesis you can download this PDF:
    http://www.rn.inf.tu-dresden.de/uploads/Studentische_Arbeiten/Masterarbeit_Mattosinho_Felipe.pdf

Mining Product Opinions and Reviews on the Web
Felipe Mattosinho

Agenda
  • Introduction
  • Basics
  • Requirements
  • Design
  • Implementation
  • Evaluation
  • Conclusion

Introduction

Introduction: What is Opinion Mining?
  • Not conventional data mining
  • Retrieves useful information from users' opinions
  • A sub-area of Web Mining
  • Sits at the crossroads of Information Retrieval (IR), Information Extraction (IE) and Data Mining (DM)

Introduction: Why use Opinion Mining?
Opinion overload
  • Users are not willing to read all the reviews
  • Difficult to find the necessary information
  • Difficult to draw conclusions
[Figure: screenshots from Amazon.com; numbers shown: 748, 1279, 100, 128, 576]
Structure data
  • Build intelligent applications (Web 3.0)

Introduction: Existing approaches
Ranking (source: Amazon.com)
  • Facts are different from opinions
Classification (source: CNET.com)
  • Asking for "pros" and "cons" can induce opinions
  • Tiresome task for users

Basics

Basics: Opinion Model
  • Opinions highlight strengths and weaknesses of the objects under discussion (OuD)
  • An object O is modeled as O: (T, A), where T is a taxonomy of components and A is a set of attributes of O
  • The word "feature" is used for both components and attributes, for simplicity
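
The O: (T, A) model can be pictured as a small tree. The sketch below is only an illustration under assumptions: the class name `Component`, its fields, and the sample camera are hypothetical and do not come from the slides. It shows how components and attributes collapse into the single notion of "feature".

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Component:
    """Hypothetical node of the taxonomy T: a component, its attributes, its sub-components."""
    name: str
    attributes: Set[str] = field(default_factory=set)
    parts: List["Component"] = field(default_factory=list)

    def features(self) -> Set[str]:
        """Flatten component names and attributes into one set of 'features'."""
        found = {self.name} | self.attributes
        for part in self.parts:
            found |= part.features()
        return found


# Illustrative object under discussion (assumed data, not from the slides).
camera = Component(
    name="camera",
    attributes={"price", "weight"},
    parts=[Component("lens", {"zoom"}), Component("LCD", {"resolution"})],
)
all_features = camera.features()
```

Treating the object's own name as a feature is a design choice made here so that general statements such as "this camera is great" have something to attach to.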

Basics: Levels of Sentiment Analysis
Opinion Level
  • Too coarse-grained; does not cover important information
Sentence Level
  • A better approach, but still does not cover everything
Feature Level
  • Optimal level, best coverage

Basics: Trends in Sentiment Analysis for Opinion Mining
  • Granularity level
  • Higher complexity
  • Lower performance

Requirements

Requirements: Target Audience

Requirements: Functional Requirements
  • Generate a feature-based summary for the user
  • The system administrator has control over core parameters and policy mechanisms

Requirements: Non-functional Requirements
  • Fedseeko compatibility
  • Fault tolerance
  • Performance
  • Interoperability

Design

Design: System Architecture
  • System Management Module
  • POS Tagging Module
  • Opinion Retriever Module
  • Opinion Mining Module

Design: System Management Module
  • Long jobs are handled asynchronously
  • Workers run concurrently, at different times of the day or on different machines

Design: Opinion Retrieval Module
  • Create task description
  • Web scraping optimization

Design: Opinion Composition Model
  • Other words are also special (negation words, orientation inverter words, "too" words)
  • Workers run concurrently, at different times of the day or on different machines

Design: Opinion Sentence
Review text:
    I needed to take pictures during my last travel to Italy. So far, I'm very happy with this camera. The picture quality is good and the zoom is powerful. One thing that I didn't like is the LCD resolution.
POS-tagged review:
    I_PRP needed_VBD to_TO take_VB pictures_NNS during_IN my_PRP$ last_RB travel_NN to_TO Italy_NNP ._. So_RB far_RB ,_, I_PRP 'm_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n't_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.
POS-tagged text without the first sentence:
    So_RB far_RB ,_, I_PRP 'm_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n't_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.
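
Tagged text in the `word_TAG` form shown above is easy to process further. The following is a minimal sketch, not the system's actual code: the function names `parse_tagged` and `nouns` are hypothetical, and it assumes Penn Treebank tags, where `.` marks sentence-final punctuation and noun tags start with `NN`.

```python
def parse_tagged(text):
    """Split 'word_TAG' tokens into sentences, each a list of (word, tag) pairs."""
    sentences, current = [], []
    for token in text.split():
        # rpartition splits on the LAST underscore, so '._.' becomes ('.', '.').
        word, _, tag = token.rpartition("_")
        current.append((word, tag))
        if tag == ".":  # sentence-final punctuation closes the sentence
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


def nouns(sentence):
    """Return the words whose tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in sentence if tag.startswith("NN")]


tagged = (
    "The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN "
    "is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n't_RB "
    "like_VB is_VBZ the_DT LCD_NNP resolution_NN ._."
)
sentences = parse_tagged(tagged)
noun_lists = [nouns(s) for s in sentences]
```

The noun lists produced this way are the raw candidates from which features can later be selected.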

Design: Feature Identification
    camera_NN (picture_NN quality_NN) zoom_NN thing_NN LCD_NNP resolution_NN wedding_NN car_NN photos_NNS dog_NN road_NN lot_NN [...]
    camera_NN (picture_NN quality_NN) flash_NN thing_NN England_NNP rehearsal_NN photos_NNS [...]
    horse_NN (picture_NN quality_NN) flash_NN farm_NN country_NN rehearsal_NN photos_NNS [...]
    camera_NN (picture_NN quality_NN) flash_NN photos_NNS [...]
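
One common way to pick features from noun lists like these is to keep the candidates that occur in a large enough share of the reviews. The sketch below assumes that approach; the function name `frequent_features`, the threshold value, and the per-review counting rule are illustrative assumptions rather than the system's documented behaviour. The sample data uses only the nouns visible above (the elided parts marked `[...]` are left out).

```python
from collections import Counter


def frequent_features(reviews, threshold):
    """Return candidates that occur in at least `threshold` (a fraction) of the reviews."""
    counts = Counter()
    for candidates in reviews:
        counts.update(set(candidates))  # count each candidate once per review
    min_reviews = threshold * len(reviews)
    return {name for name, count in counts.items() if count >= min_reviews}


reviews = [
    ["camera", "picture quality", "zoom", "thing", "LCD resolution",
     "wedding", "car", "photos", "dog", "road", "lot"],
    ["camera", "picture quality", "flash", "thing", "England", "rehearsal", "photos"],
    ["horse", "picture quality", "flash", "farm", "country", "rehearsal", "photos"],
]
features = frequent_features(reviews, threshold=0.6)
```

On a sample this small, generic nouns such as "thing" also pass the threshold, which shows why a purely frequency-based selection can pick up non-features.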

Design: Feature Identification
Pros
  • Customers use different words to refer to the same feature
  • Detects additional useful information (not part of the opinion model)
  • No manually annotated data
Cons
  • Does not detect infrequent features
  • Detects non-features

Design: Search Word Orientation Algorithm
  • Word searched: bad
  • Seed list before: good,1,nice,1
  • Seed list after: good,1,nice,1,bad,-1
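
The seed-list example can be sketched as a lookup that grows the list as it goes. The slides do not spell out the lookup itself; the version below assumes the usual lexicon-expansion idea, in which a word inherits the orientation of a known synonym or the opposite orientation of a known antonym. The function name and the toy `SYNONYMS` and `ANTONYMS` tables are hypothetical stand-ins for a real thesaurus such as WordNet.

```python
def find_orientation(word, seeds, synonyms, antonyms):
    """Return +1 or -1 for `word`, extending `seeds` on success, or None if unknown."""
    if word in seeds:
        return seeds[word]
    for synonym in synonyms.get(word, ()):
        if synonym in seeds:
            seeds[word] = seeds[synonym]  # same orientation as a known synonym
            return seeds[word]
    for antonym in antonyms.get(word, ()):
        if antonym in seeds:
            seeds[word] = -seeds[antonym]  # opposite orientation of a known antonym
            return seeds[word]
    return None  # no known neighbour yet; may succeed once the seed list has grown


seeds = {"good": 1, "nice": 1}
SYNONYMS = {"bad": ["poor"]}  # hypothetical toy thesaurus
ANTONYMS = {"bad": ["good"]}  # hypothetical toy thesaurus
orientation = find_orientation("bad", seeds, SYNONYMS, ANTONYMS)
```

Because each successful lookup adds the word to the seed list, later lookups have more known neighbours to match against.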

Design: Opinion Words in Context
Negation Rules
[Diagram: the negation words "not" and "no" combined with "good", "bad" and "problem"]
Too Rules
  • "Too" before an adjective usually denotes negative sentiment, e.g. "This camera is too small".

Design: Opinion Words in Context
Orientation Inverter / Sentiment Inverter Words
  • Find the sentiment/orientation of opinion words with unknown orientation
  • "The camera is nice, except for the initialization time which takes long"
  • "The autofocus is great, the battery life lasts long, but I find the functions a little complex."

Design: Aggregating opinions for a feature
"The image quality is amazing, but the autofocus is terrible."
Word positions: The (0) image (1) quality (2) is (3) amazing (4) but (5) the (6) autofocus (7) is (8) terrible (9)
Score(image quality) = (1 / |4 - 1|) + (-1 / |9 - 1|) = 0.3333 - 0.125 = 0.2083 (positive)
Score(autofocus) = (1 / |4 - 7|) + (-1 / |9 - 7|) = 0.3333 - 0.5 = -0.1667 (negative)
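
The context rules and the distance-weighted aggregation can be combined in a few lines. This is a sketch under assumptions, not the system's code: the `NEGATIONS` set, the `LEXICON` entries, and the function names are illustrative, a multi-word feature is located by its first word, and the token after "too" is assumed to be an adjective because no POS check is made. Positions follow the example sentence: "image quality" at 1, "amazing" at 4, "autofocus" at 7, "terrible" at 9.

```python
NEGATIONS = {"not", "no", "never", "n't"}  # assumed list of negation words


def contextual_orientation(tokens, i, lexicon):
    """Orientation of tokens[i] after applying simple 'too' and negation rules."""
    previous = tokens[i - 1] if i > 0 else ""
    if previous == "too":
        return -1  # "too <adjective>" is treated as negative
    base = lexicon.get(tokens[i], 0)
    if previous in NEGATIONS:
        return -base  # a negation word flips the orientation
    return base


def feature_score(tokens, feature_pos, lexicon):
    """Sum each opinion word's orientation divided by its distance to the feature."""
    score = 0.0
    for i in range(len(tokens)):
        if i == feature_pos:
            continue  # skip the feature itself (distance would be zero)
        orientation = contextual_orientation(tokens, i, lexicon)
        if orientation:
            score += orientation / abs(i - feature_pos)
    return score


LEXICON = {"amazing": 1, "terrible": -1, "good": 1, "bad": -1, "problem": -1}
tokens = "The image quality is amazing but the autofocus is terrible".split()
image_quality = feature_score(tokens, tokens.index("image"), LEXICON)
autofocus = feature_score(tokens, tokens.index("autofocus"), LEXICON)
```

Dividing by distance means that the opinion word closest to a feature dominates its score, so "amazing" decides the sign for "image quality" and "terrible" decides it for "autofocus".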

Implementation: Overview
Core Technologies
  • JRuby on Rails (JRuby 1.5.0 RC3 / Rails 2.3.8)
Ruby Gems (Libraries)
  • Mechanize
  • Nokogiri
  • Ruby-aaws
  • Delayed_Job
Java Libraries
  • Rita.Wordnet
  • Stanford POS Tagger API

Evaluation

Evaluation: Test Environment
  • AMD Turion(tm) 64 Mobile Technology ML-32 / 1 GB RAM
  • Ubuntu 9.04 (32-bit)
[Table: Sample data]
[Table: System Configuration]

Evaluation: Effectiveness of Feature Identification
[Chart: labels Threshold, Accuracy, Features]

Evaluation: Sentiment Classification Effectiveness
  • Xbox 360 shows the lowest effectiveness, due to wrong part-of-speech tagging
  • "Complex" sentences and domain-dependent sentences are also wrongly classified

Evaluation: System Efficiency
  • The lower the threshold, the higher the number of features and hence the number of sentences analyzed
  • What is the price of addressing many exceptions?

Evaluation: Considerations
  • Users may talk about other objects with similar features
  • Domain-dependent sentences (e.g. "The device heats very fast.")
Complex Sentences / Domain-Dependent Sentences / Exceptions
POS Tagging Errors
Pluralization cases
  • May not refer to the same OuD (e.g. "camera" and "cameras")
  • "This camera is GOOD"
  • "[...] the hard drive which comes with the device."

Conclusion
  • POECS performs well, with a good rate of accuracy
  • Observations show that many users write "simple", straightforward sentences, which are covered by POECS
  • Domain-specific annotations can help the system to be more effective
  • Human language is complex; covering many cases costs a lot of performance
[Table: Sample data]

Conclusion: Future Work
  • Minimize the number of manual annotations by recognizing reusable patterns of human language
  • Cope with common unsolved problems such as:
      • Safe ways to recognize which features belong to which object
      • Global opinion knowledge to help improve local analysis (sentence or feature level)

Conclusion: Questions?
