
Mining Product Opinions and Reviews on the Web



  1. 1. Mining Product Opinions and Reviews on The Web<br />Felipe Mattosinho<br />
  2. 2. Agenda<br />Introduction<br />Basics<br />Requirements<br />Design<br />Implementation<br />Evaluation<br />Conclusion<br />
  3. 3. Introduction<br />
  4. 4. Introduction<br />What is Opinion Mining?<br />Opinion Mining<br /><ul><li> not conventional data mining
  5. 5. retrieves useful information from users' opinions
  6. 6. a sub-area of Web Mining
  7. 7. sits at the crossroads of information retrieval (IR), information extraction (IE) and data mining (DM). </li></li></ul><li>Introduction<br />Why use Opinion Mining?<br />Opinion overload<br />748<br />Source: Amazon.com<br /><ul><li> Users are not willing to read them all
  8. 8. Difficult to find necessary information
  9. 9. Difficult to draw conclusions</li></ul>1279<br />100<br />128<br />Structure data<br />576<br /><ul><li> Build intelligent applications (Web 3.0)</li></li></ul><li>Introduction<br />Existing approaches<br />Source: Amazon.com<br />Ranking<br /><ul><li> Facts are different from opinions</li></ul>Classification<br /><ul><li> Asking „pros“ and „cons“ can induce opinions
  10. 10. Tiresome task for users</li></ul>Source: CNET.com<br />
  11. 11. Basics<br />
  12. 12. Basics<br />Opinion Model<br />Opinions highlight strengths and weaknesses of objects under discussion (OuD)<br />O: (T, A), where T is a taxonomy of components and A is a set of attributes of O<br />For simplicity, the word „feature“ is used for both components and attributes<br />
  13. 13. Basics<br /> Levels of Sentiment Analysis<br />Opinion Level<br /><ul><li> too coarse-grained, does not cover important information </li></ul>Sentence Level<br /><ul><li> better approach, but still does not cover everything</li></ul>Feature Level<br /><ul><li> Optimal level, best coverage </li></li></ul><li>Basics<br />Trends in Sentiment Analysis for Opinion Mining<br />Granularity Level<br />Higher Complexity<br />Lower Performance<br />
  14. 14. Requirements<br />
  15. 15. Requirements<br />Target Audience<br />
  16. 16. Requirements<br />Functional Requirements<br />Generate a feature-based summary for the user<br />System administrator has control over core parameters and policy mechanisms<br />Non-functional Requirements<br /><ul><li>Fedseeko compatibility
  17. 17. Fault-tolerance
  18. 18. Performance
  19. 19. Interoperability</li></li></ul><li>Design<br />
  20. 20. Design<br />System Architecture<br /><ul><li>System Management Module
  21. 21. POS Tagging Module
  22. 22. Opinion Retriever Module
  23. 23. Opinion Mining Module</li></li></ul><li>Design<br />System Management Module<br /><ul><li>Long jobs handled asynchronously
  24. 24. Workers run concurrently, at different times of the day or on different machines</li></li></ul><li>Design<br />Opinion Retrieval Module<br /><ul><li>Create task description
  25. 25. Web scraping optimization</li></li></ul><li>Design<br />Opinion Composition Model<br /><ul><li>Other words are also special (negation words, orientation inverter words, “too” words)
  26. 26. Workers run concurrently, at different times of the day or on different machines</li></li></ul><li>Design<br />Opinion Sentence<br />I needed to take pictures during my last travel to Italy. <br />So far, I’m very happy with this camera. The picture quality is good and the zoom is powerful. One thing that I didn’t like is the LCD resolution.<br />I_PRP needed_VBD to_TO take_VB pictures_NNS during_IN my_PRP$ last_RB travel_NN to_TO Italy_NNP ._. So_RB far_RB ,_, I_PRP ‘m_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n‘t_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.<br />So_RB far_RB ,_, I_PRP ‘m_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n‘t_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.<br />
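A minimal sketch of how the tagged text on this slide could be split into sentences and reduced to opinion sentences. The `word_TAG` format and the example text come from the slide (with plain ASCII apostrophes). The filter rule is an assumption: a sentence is kept when one of its nouns is a known feature word, which happens to reproduce the slide's result of dropping the first sentence. The names `parse_tagged`, `mentions_feature` and the `features` set are hypothetical, not part of POECS.

```python
def parse_tagged(text):
    """Turn 'word_TAG' tokens into sentences, each a list of (word, tag) pairs."""
    sentences, current = [], []
    for token in text.split():
        # Split on the last underscore so tags such as 'PRP$' stay intact.
        word, _, tag = token.rpartition("_")
        current.append((word, tag))
        if tag == ".":  # Penn Treebank tag for sentence-final punctuation
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def mentions_feature(sentence, features):
    """True if any noun in the sentence is one of the given feature words."""
    return any(tag.startswith("NN") and word.lower() in features
               for word, tag in sentence)


tagged = (
    "I_PRP needed_VBD to_TO take_VB pictures_NNS during_IN my_PRP$ last_RB "
    "travel_NN to_TO Italy_NNP ._. "
    "So_RB far_RB ,_, I_PRP 'm_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. "
    "The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ "
    "powerful_JJ ._. "
    "One_CD thing_NN that_IN I_PRP did_VBD n't_RB like_VB is_VBZ the_DT LCD_NNP "
    "resolution_NN ._."
)

# Assumed feature words; in a full system these would come from feature identification.
features = {"camera", "quality", "zoom", "resolution"}

sentences = parse_tagged(tagged)
opinion_sentences = [s for s in sentences if mentions_feature(s, features)]
```

With these inputs, four sentences are parsed and the three that mention a feature are kept.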
  27. 27. Design<br />Feature Identification<br />camera_NN (picture_NN quality_NN) zoom_NN thing_NN LCD_NNP resolution_NN wedding_NN car_NN photos_NNS dog_NN road_NN lot_NN [...]<br />camera_NN (picture_NN quality_NN) flash_NN thing_NN England_NNP rehearsal_NN photos_NNS [...]<br />horse_NN (picture_NN quality_NN) flash_NN farm_NN country_NN rehearsal_NN photos_NNS [...]<br />camera_NN (picture_NN quality_NN) flash_NN photos_NNS [...]<br />
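The four noun lists above can be read as one transaction per review, with features being the nouns that recur across reviews. The sketch below assumes a plain minimum-support count; the slides do not say which frequent-pattern method POECS uses, so `frequent_features` and the 0.75 threshold are illustrative only. Only the tokens visible on the slide are used; the elided `[...]` parts are left out.

```python
from collections import Counter


def frequent_features(transactions, min_support):
    """Return candidates that occur in at least `min_support` (a fraction) of the reviews."""
    counts = Counter()
    for nouns in transactions:
        counts.update(set(nouns))  # count each candidate once per review
    total = len(transactions)
    return {noun for noun, count in counts.items() if count / total >= min_support}


# Noun candidates per review, with the tags stripped and the noun phrase kept as one item.
transactions = [
    ["camera", "picture quality", "zoom", "thing", "LCD", "resolution",
     "wedding", "car", "photos", "dog", "road", "lot"],
    ["camera", "picture quality", "flash", "thing", "England", "rehearsal", "photos"],
    ["horse", "picture quality", "flash", "farm", "country", "rehearsal", "photos"],
    ["camera", "picture quality", "flash", "photos"],
]

features = frequent_features(transactions, min_support=0.75)
```

Lowering `min_support` admits rarer candidates such as "thing" or "rehearsal", which illustrates why frequency alone can pick up non-features.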
  28. 28. Design<br />Feature Identification<br />Pros<br />Customers use different words to refer to the same feature<br />Detects additional useful information (not part of the opinion model)<br />No manually annotated data required<br />Cons<br /><ul><li>Does not detect infrequent features
  29. 29. Detects non-features</li></li></ul><li>Design<br />Search Word Orientation Algorithm<br />bad<br />Seed List<br />good,1,nice,1<br />good,1,nice,1,bad,-1<br />
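The seed-list example above („bad“ is resolved to -1 and appended to the list) can be sketched as a lookup that falls back on synonyms and antonyms of already known words. The deck lists Rita.Wordnet as the lexical resource; here two small dictionaries stand in for it, and both their contents and the function name `word_orientation` are hypothetical.

```python
# Toy stand-ins for lexicon lookups (made-up entries, not real WordNet data).
SYNONYMS = {"bad": ["poor"], "great": ["good"], "poor": ["bad"]}
ANTONYMS = {"bad": ["good"], "poor": ["good"]}


def word_orientation(word, seed):
    """Return +1 or -1 for `word`, adding it to `seed` once resolved; None if unknown."""
    if word in seed:
        return seed[word]
    # A synonym of a known word shares its orientation.
    for synonym in SYNONYMS.get(word, []):
        if synonym in seed:
            seed[word] = seed[synonym]
            return seed[word]
    # An antonym of a known word has the opposite orientation.
    for antonym in ANTONYMS.get(word, []):
        if antonym in seed:
            seed[word] = -seed[antonym]
            return seed[word]
    return None


seed_list = {"good": 1, "nice": 1}
orientation = word_orientation("bad", seed_list)
```

After the call, `seed_list` holds good, nice and bad with orientations 1, 1 and -1, matching the slide. Because resolved words are stored, later lookups can chain through them.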
  30. 30. Design<br />Opinion Words in Context<br />Negation Rules<br />good<br />not<br />not<br />bad<br />problem<br />no<br />Too Rules<br /><ul><li>„Too“ before an adjective usually denotes negative sentiment, e.g. „This camera is too small“.</li></li></ul><li>Design<br />Opinion Words in Context<br />Orientation Inverter / Sentiment Inverter Words<br /><ul><li>Find sentiment/orientation for opinion words with unknown orientation
  31. 31. „The camera is nice, except for the initialization time which takes long.“
  32. 32. „The autofocus is great, the battery life lasts long, but I find the functions a little complex.“</li></li></ul><li>Design<br />Aggregating opinions for a feature<br />0 1 2 3 4 5 6 7 8 9<br />“The image quality is amazing, but the autofocus is terrible.”<br />Score(image quality) = (1 / |4 - 1|) + (-1 / |9 - 1|) = 0.3333 - 0.125 = 0.2083 (positive)<br />Score(autofocus) = (1 / |4 - 7|) + (-1 / |9 - 7|) = 0.3333 - 0.5 = -0.1667 (negative)<br />
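The aggregation on this slide sums, for each opinion word in the sentence, its orientation divided by its token distance to the feature. A minimal sketch follows, assuming the token positions 0 to 9 shown on the slide (the comma is not counted) and the same opinion-word positions for both features: 4 for „amazing“ and 9 for „terrible“. The function name `feature_score` is hypothetical.

```python
def feature_score(feature_pos, opinion_words):
    """Sum orientation / distance over all opinion words in the sentence."""
    score = 0.0
    for word_pos, orientation in opinion_words:
        distance = abs(word_pos - feature_pos)
        if distance == 0:
            continue  # skip a word sitting on the feature itself to avoid dividing by zero
        score += orientation / distance
    return score


# 0 The, 1 image, 2 quality, 3 is, 4 amazing, 5 but, 6 the, 7 autofocus, 8 is, 9 terrible
opinions = [(4, +1), (9, -1)]  # (position, orientation) for "amazing" and "terrible"

image_quality = feature_score(1, opinions)  # 1/3 - 1/8
autofocus = feature_score(7, opinions)      # 1/3 - 1/2
```

With these positions the image quality score is about 0.208 (positive) and the autofocus score about -0.167 (negative): the nearer opinion word dominates each feature's score.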
  33. 33. Implementation<br />Overview<br />Core Technologies<br /><ul><li> JRuby on Rails (JRuby 1.5.0 RC3 / Rails 2.3.8 )</li></ul>Ruby Gems (Libraries)<br /><ul><li> Mechanize
  34. 34. Nokogiri
  35. 35. Ruby-aaws
  36. 36. Delayed_Job</li></ul>Java Libraries<br /><ul><li> Rita.Wordnet
  37. 37. Stanford POS Tagger API</li></li></ul><li>Implementation<br />Overview<br />
  38. 38. Implementation<br />Overview<br />
  39. 39. Evaluation<br />
  40. 40. Evaluation<br />Test Environment<br /><ul><li>AMD Turion(tm) 64 Mobile Technology ML-32 / 1GB RAM
  41. 41. Ubuntu 9.04, 32-bit</li></ul>Sample data<br />System Configuration<br />
  42. 42. Evaluation<br />Effectiveness of Feature Identification<br />Threshold<br />Accuracy<br />Features<br />
  43. 43. Evaluation<br />Sentiment Classification Effectiveness<br /><ul><li>Xbox 360 shows the lowest effectiveness due to incorrect part-of-speech tagging
  44. 44. „Complex sentences“ and domain-dependent sentences are also wrongly classified</li></li></ul><li>Evaluation<br />System Efficiency<br /><ul><li>The lower the threshold, the higher the number of features and hence the number of sentences analyzed
  45. 45. What is the price of addressing many exceptions? </li></li></ul><li>Evaluation<br />Considerations<br /><ul><li>Users may talk about other objects with similar features
  46. 46. Domain-dependent sentences (e.g. „The device heats very fast.“)</li></ul>Complex Sentences / Domain Dependent Sentences / Exceptions<br />POS Tagging Errors<br />Pluralization cases<br /><ul><li>May not refer to the same OuD (e.g. „camera“ and „cameras“)
  47. 47. „This camera is GOOD“
  48. 48. „[...] the hard drive which comes with the device.“</li></li></ul><li>Conclusion<br />POECS performs well with a good rate of accuracy.<br />Observations show that many users write „simple“, straightforward sentences, which are covered by POECS.<br />Domain-specific annotations can help the system to be more effective.<br />Human language is complex; covering many cases comes at a considerable cost in performance.<br />Sample data<br />
  49. 49. Conclusion<br />Future Work<br />Minimize the number of manual annotations by recognizing reusable patterns of human language<br />Cope with common unsolved problems such as <br />Safe ways to recognize which features belong to which object<br />Use global opinion knowledge to improve local analysis (sentence or feature level)<br />
  50. 50. Conclusion<br />Questions?<br />
