
Fypca4

Published in: Technology, Education
  1. Mining User’s Opinions in Hotel
     TEY JUN HONG, U095074X, National University of Singapore
  2. Content
     1. Background
     2. Formulating the problem
     3. Data Mining Process
     4. Techniques
     5. Analysis
  3. What is Data Mining?
     • Extraction of meaningful, useful, or interesting patterns from a large volume of data sources
     • In this project, the source is a large volume of web hotel review data
     • Data mining is one of the top ten emerging technologies (MIT’s Technology Review, 2004)
  4. What is Data Mining?
     • Process of exploration and analysis
     • By automatic or semi-automatic means
     • With little or no human interaction
     • To discover meaningful patterns and rules
     (Mastering Data Mining, Berry and Linoff, 2000)
  5. User’s Opinions in Hotel
     • Increase in social media and web users
     • Increase in valuable opinion-oriented hotel data due to web expansion
     • Identify potential hotels to stay in by looking at their aspects
     • Overall sentiments on hotels are greatly sought on the web for sentiment analysis
  6. What can Data Mining do?
     • Identify best prospects (ASPECTS), and retain customers
     • Predict what ASPECTS customers like and promote accordingly
     • Learn parameters influencing trends in sales and margins
     • Identify customers’ opinions
     Sentiment Analysis!!!
  7. What are the problems?
     • Exponential growth of users’ opinions
     • Limitations of human analysis
     • Accuracy of human analysis
     Machines can be trained to take over human analysis with advanced computer technology, and this is done at LOW COST
  8. Some limitations of machines
     • Unable to read like a human
     • No emotions
     • Cannot detect sarcasm
     • Expression of sentiments differs across topics and domains
     • Polarity analysis
     • Facts vs. opinion
  9. Some machine limitation examples
     • “The service is as good as none”. Negation is not obvious to a machine.
     • “Swimming pool is big enough to swim with comfort”, “There is a big crowd at the counter complaining”. Polarity might change with context.
     • “The room is warmer than the lobby”. Comparisons are hard to classify.
  10. Sentiment Analysis
      • Machine learning
      • Pattern recognition
      • Statistics
      • Databases
  11. Machine Learning
      • A tool for data mining and intelligent decision support
      • Application of computer algorithms that improve automatically through experience
      (Mastering Data Mining, Berry and Linoff, 2000)
  12. Types of Machine Learning
      • Supervised Learning: a training set (data with correct answers) is provided and used to mine for known patterns
      • Unsupervised Learning: data are provided with no prior knowledge of the hidden patterns they contain
      • Semi-Supervised Learning
  13. Supervised Learning techniques
      • Rule Mining and Rule Learning
      • Bayesian Networks
      • Support Vector Machine
  14. Project Objective
      • Prediction of sentence polarity
      • Classification of polarity for the sentiment lexicon
      • Detection of relations
  15. Pre-requisite
      • Large data set
      • Relevant prior knowledge of the domain, in our case the hotel domain (e.g. rating)
      • Sentiment lexicon for sentiment analysis
      • Data selection for reliability and standards
  16. Data Mining Process
  17. Cleaning the “Dirty” Data (60% of effort)
      Frequent problem: data inconsistencies
      • Duplicate data
      • Spelling errors != trim from data
      • Foreign accents and characters
      • Singular / plural conversion
      • Punctuation removal / replacement
      • Noise and incomplete data
      • Naming conventions misused: same name but different meaning
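A few of the cleaning steps listed on this slide can be sketched in Python. This is only a minimal illustration, not the project's actual pipeline: the function names `strip_accents`, `clean_review`, and `clean_reviews` are hypothetical, and spelling correction and singular/plural conversion are deliberately left out.

```python
import re
import unicodedata


def strip_accents(text):
    """Replace accented characters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_review(text):
    """Lowercase, drop accents, replace punctuation with spaces, collapse whitespace."""
    text = strip_accents(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_reviews(reviews):
    """Clean every review, then drop empty entries and exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for raw in reviews:
        text = clean_review(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    print(clean_reviews(["Nice room.", "nice  room", "", "Café!"]))
```

Duplicates are detected after normalization, so reviews that differ only in case, spacing, or punctuation count as the same review.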
  18. Data Preprocessing (Laundering)
      • Part-of-speech (POS) tagging using the Brill Tagger
      • Polarity tagging using a sentiment lexicon
  19. Findings
      • Part-of-speech (POS) tagging using the Brill Tagger: NO PROBLEM
        - 95% accuracy in POS tagging words after data cleaning
  20. Findings
      • Polarity tagging using the sentiment lexicon: BIG PROBLEM
        - 40% of sentiment words are not found in the sentiment lexicon
        - 10% of sentiment words with a positive or negative polarity are found in the neutral section of the sentiment lexicon
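The kind of lookup behind these findings can be sketched as follows. The toy lexicon, the word list, and the function names `tag_polarity` and `coverage_report` are all hypothetical and serve only to show how the share of missing and neutral entries might be measured; they are not the project's data.

```python
def tag_polarity(words, lexicon):
    """Tag each word with its lexicon polarity, or 'unknown' if it is missing."""
    return [(word, lexicon.get(word.lower(), "unknown")) for word in words]


def coverage_report(words, lexicon):
    """Return the fraction of words that are missing from, or neutral in, the lexicon."""
    tags = tag_polarity(words, lexicon)
    total = len(tags)
    if total == 0:
        return {"unknown": 0.0, "neutral": 0.0}
    unknown = sum(1 for _, tag in tags if tag == "unknown")
    neutral = sum(1 for _, tag in tags if tag == "neutral")
    return {"unknown": unknown / total, "neutral": neutral / total}


if __name__ == "__main__":
    # Hypothetical general-purpose lexicon and hotel-review sentiment words.
    toy_lexicon = {"clean": "positive", "dirty": "negative", "big": "neutral"}
    toy_words = ["clean", "big", "spacious", "dirty", "musty"]
    print(tag_polarity(toy_words, toy_lexicon))
    print(coverage_report(toy_words, toy_lexicon))
```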
  21. Problems
      • The sentiment lexicon is not comprehensive enough for the machine learning technique adopted
      • Domain-dependent sentiment words are found in the neutral section of the sentiment lexicon
      • The polarity of sentiment words can also change within the domain, even though they are domain dependent
      EXPANSION OF LEXICON!!!
  22. Solution
      • Classify the polarity of unlabeled sentiment words using rule-based mining
      • Classify domain-dependent sentiment words
      • Establish word relations between labeled and unlabeled sentiment words
  23. Data Processing
      • Rule-based mining using conjunction and punctuation

      Polarity Assignment   Rules
      Same                  Adj AND/OR Adj
      Opposite              Neg-Adj AND/OR Adj / Adj AND/OR Neg-Adj
      Same                  Neg-Adj AND/OR Neg-Adj
      Opposite              Adj BUT/NOR Adj
      Same                  Neg-Adj BUT/NOR Adj / Adj BUT/NOR Neg-Adj
      Opposite              Neg-Adj BUT/NOR Neg-Adj
      Same                  Adj , Adj
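The rule table on this slide follows a parity pattern: a contrastive conjunction (BUT/NOR) and each negated adjective flip the relation once. A minimal sketch of that reading is below; the function names `relation` and `infer_polarity`, and the representation of a pair as a conjunction plus two negation flags, are assumptions for illustration.

```python
CONTRASTIVE = {"but", "nor"}        # conjunctions that flip the relation
COORDINATING = {"and", "or", ","}   # conjunctions (and comma) that keep it


def relation(conj, neg_left=False, neg_right=False):
    """Return 'same' or 'opposite' for two adjectives joined by `conj`.

    Each of the three conditions (contrastive conjunction, negated left
    adjective, negated right adjective) flips the relation once.
    """
    conj = conj.lower()
    if conj not in CONTRASTIVE | COORDINATING:
        raise ValueError(f"unsupported connector: {conj!r}")
    flipped = (conj in CONTRASTIVE) ^ bool(neg_left) ^ bool(neg_right)
    return "opposite" if flipped else "same"


def infer_polarity(known_polarity, conj, neg_left=False, neg_right=False):
    """Infer the polarity of an unlabeled adjective from its labeled partner."""
    if relation(conj, neg_left, neg_right) == "same":
        return known_polarity
    return "negative" if known_polarity == "positive" else "positive"


if __name__ == "__main__":
    # "clean but musty": "clean" is known positive, so "musty" is inferred negative.
    print(infer_polarity("positive", "but"))
    # "not clean and musty": the negated left adjective flips the relation.
    print(infer_polarity("positive", "and", neg_left=True))
```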
  24. Data Processing
      • Relation network: aspect and sentiment word pairs
  25. Data Processing
      • Relation network: aspect and sentiment word pairs
  26. Analysis
      • Using the expanded sentiment lexicon, we analyze sentiment polarity by doing a sentiment lookup using a Bayesian network
  27. Bayesian
      • To determine the polarity of sentiments: P(X | Y) = P(X) P(Y | X) / P(Y)
      • Probability that a sentence is positive or negative, given its contents
      • Assumption: there is no link between words (each word is treated as independent)
      • P(sentiment | sentence) = P(sentiment) P(sentence | sentiment) / P(sentence)
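Under the word-independence assumption, the formula above reduces to a naive Bayes classifier. The sketch below is one possible reading, not the project's implementation: the training sentences are invented, add-one (Laplace) smoothing is an added assumption, and P(sentence) is dropped because it is the same for every label.

```python
import math
from collections import Counter, defaultdict


def train(labeled_sentences):
    """Count labels and per-label word frequencies from (sentence, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in labeled_sentences:
        words = sentence.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(sentence, model):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior P(sentiment)
        # Add-one smoothing (an assumption) keeps unseen words from zeroing the product.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in sentence.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical training data, for illustration only.
    data = [
        ("clean room friendly staff", "positive"),
        ("great location clean pool", "positive"),
        ("dirty room rude staff", "negative"),
        ("noisy dirty lobby", "negative"),
    ]
    model = train(data)
    print(classify("clean friendly staff", model))
    print(classify("dirty noisy room", model))
```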
  28. Validation
      • Precision = N (agree & found) / N (found)
      • High precision means most of the sentiment words found by the system are correctly labeled
      • Recall = N (agree & found) / N (agree)
      • High recall means most of the correct sentiment words are found by the system
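The two ratios can be computed directly from sets. In this sketch, `found` is assumed to hold the items the system labeled and `agreed` the items the human judges consider correct; both names and the toy data are hypothetical.

```python
def precision_recall(found, agreed):
    """Compute (precision, recall) from two collections of items.

    precision = N(agree & found) / N(found)
    recall    = N(agree & found) / N(agree)
    """
    found, agreed = set(found), set(agreed)
    hits = len(found & agreed)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(agreed) if agreed else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: 4 words labeled by the system, 5 judged correct by humans.
    system_found = {"clean", "big", "warm", "noisy"}
    judges_agreed = {"clean", "noisy", "rude", "musty", "big"}
    print(precision_recall(system_found, judges_agreed))
```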
  29. Validation Results
      • Out of the 350 aspect / unlabelled sentiment word pairs, only 194 are found by the methods. Thus, the precision is about 57%.
      • The recall is also not very high; only 126 words are correctly labelled by the system, which is about 63%.
  30. Discussion
      • The results will improve if more rules are applied, such as the inclusion of more adverbs like “excessively” as negation words.
      • There might not be enough data for the system to work on. There are only 350 aspect / unlabelled sentiment word pairs for the application to work with.
      • This, however, requires more human judges to validate the data.
  31. Conclusion
      • A comprehensive sentiment lexicon is a simple yet effective solution to sentiment analysis, as it does not require prior training
      • The current sentiment lexicon does not capture the domain and context sensitivities of sentiment expressions
  32. Conclusion
      • This leads to poor coverage
      • Thus, expanding the general sentiment lexicon to capture domain and context sensitivities of sentiment expressions is advocated
  33. Questions? DEMO
