Fypca4
  • What can we infer from user opinions of hotels?
  • What can data mining do in the hotel domain? In other words, learn the market.
  • It is impossible for humans to read every single opinion, and humans are biased toward reading certain opinions. Machines allow fast access to vast amounts of data and make computationally intensive algorithms and statistical methods practical.
  • There are many fields of data mining; in this project we will focus on these 4.
  • Growing data volume, the limitations of humans, and low cost to humans.
  • The goal of unsupervised learning is to discover these patterns. Semi-supervised: knowledge is known and applied from one data collection in order to mine, classify, analyze, and interpret a related data collection.
  • Some of the problems to be solved by data mining: prediction of sentence polarity, classification of polarity for the sentiment lexicon, and detection of relations.
  • Data inconsistencies: for example, the title says good but the review says bad.
  • Assigning a label to every word in the text to allow the machine to do something with it.
  • POS tagging goes wrong for some words, such as “heart”, that can take more than one tag.
  • For example, in the domain of handheld devices, the word “large” can express positivity for screen size but negativity for phone size.
  • After establishing relations, we have a graph of nodes (sentiments / aspects). Determine the probability that a node is positive or negative given its surrounding nodes. Start with a high-frequency unlabelled sentiment word-aspect pair; then, based on the aspect and its labelled sentiment pairs, determine the polarity of the unlabelled word. This process iterates until every unlabelled word has been assigned a polarity.
  • A comprehensive sentiment lexicon can provide a simple yet effective solution to sentiment analysis, because it is general and does not require prior training. Therefore, attention and effort have been paid to the construction of such lexicons. However, a significant challenge to this approach is that the polarity of many words is domain and context dependent. For example, ‘long’ is positive in ‘long battery life’ and negative in ‘long shutter lag.’ Current sentiment lexicons do not capture such domain and context sensitivities of sentiment expressions. They either exclude such domain and context dependent sentiment expressions or tag them with an overall polarity tendency based on statistics gathered from certain corpus such as the world wide web accessed via the internet. While excluding such expressions leads to poor coverage, simply tagging them with a polarity tendency leads to poor precision.
  • Transcript

    • 1. User’s Opinions in Hotel TEY JUN HONG U095074X National University Of Singapore
    • 2. Content 1. Background 2. Formulating the problem 3. Data Mining Process 4. Techniques 5. Analysis
    • 3. What is Data Mining? • Extraction of meaningful / useful / interesting patterns from a large volume of data sources • In this project, the source will be a large volume of WEB HOTEL REVIEWS data • Data mining is one of the top ten emerging technologies (MIT’s TECHNOLOGY REVIEW 2004)
    • 4. What is Data Mining? • Process of exploration and analysis • By automatic / semi-automatic means • With little or no human interaction • To discover meaningful patterns and rules (MASTERING DATA MINING BY BERRY AND LINOFF, 2000)
    • 5. User’s Opinions in Hotel • Increase in social media and web users • Increase in valuable opinion-oriented data in Hotel due to web expansion • Identify potential hotel to stay by looking at the aspects • Overall Sentiments on hotel are greatly sought on the web for
    • 6. What can Data Mining do? • Identify best prospects (ASPECTS), and retain customers • Predict what ASPECTS customers like and promote accordingly • Learn parameters influencing trends in sales and margins • Identification of opinions for customers
    • 7. What are the problems? • Exponential growth of user’s opinions • Limitations of human analysis • Accuracy of human analysis • Machines can be trained to take over human analysis with advanced computer technology and it is done with LOW
    • 8. Some Limitations of machines • Unable to read like a human • No emotions • Cannot detect sarcasm • Expression of sentiments in different topic and domain • Polarity analysis • Facts Vs Opinion
    • 9. Some machine limitation examples • “The service is as good as none”. Negation not obvious to machine • “Swimming pool is big enough to swim with comfort”, “There is a big crowd at the counter complaining”. Polarity might change with context.
    • 10. Sentiment Analysis
    • 11. Machine Learning • A tool for data mining and intelligent decision support • Application of computer algorithms that improve automatically through experience (MASTERING DATA MINING BY BERRY AND LINOFF, 2000)
    • 12. Types of Machine learning • Supervised Learning • A training set is provided (data with correct answers) which is used to mine for known patterns • Unsupervised Learning • Data are provided with no prior knowledge of the hidden patterns that they contain.
    • 13. Supervised Learning techniques • Rule Mining and Rule learning • Bayesian Networks • Support Vector Machine
    • 14. Project Objective • Prediction of sentence polarity • Classification of polarity for sentiment lexicon • Detection of relations
    • 15. Pre-requisite • Large data set • Relevant Prior Knowledge to domain, in our case the hotel domain • Eg. Rating • Sentiment lexicon for sentiment analysis • Data selection for reliability and standards
    • 16. Data Mining Process
    • 17. Cleaning the “Dirty” Data (60% of effort) • Frequent problem: Data inconsistencies • Duplicate data • Spelling Errors != Trim from data • Foreign accent and characters • Singular / Plural conversion • Punctuations removal / replacement • Noise and incomplete data • Naming convention misused,
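
A few of the cleaning steps on slide 17 (duplicate removal, accent stripping, punctuation removal, singular/plural conversion) can be sketched as below. This is a minimal illustration and not the project's actual pipeline: the function names `clean_review`, `singularize` and `clean_corpus` are hypothetical, and the plural rule is deliberately naive.

```python
import re
import unicodedata


def clean_review(text):
    """Strip accents, lowercase, replace punctuation, and collapse whitespace."""
    # Decompose accented characters and drop the combining marks ("é" -> "e").
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every character that is not a word character or whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", unaccented.lower()).replace("_", " ")
    return " ".join(no_punct.split())


def singularize(word):
    """Very naive plural-to-singular rule (an assumption, not a real stemmer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and len(word) > 3 and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def clean_corpus(reviews):
    """Clean every review and drop duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for review in reviews:
        normalized = " ".join(singularize(t) for t in clean_review(review).split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

Normalizing before deduplicating means that reviews differing only in case, punctuation, or plural forms collapse into one entry.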
    • 18. Data Preprocessing (Laundering) • Part of Speech Tagging (POS) using Brill Tagger • Polarity tagging using
    • 19. Findings • Part of Speech Tagging (POS) using Brill Tagger - NO PROBLEM - 95% accuracy POS tagging words after data cleaning
    • 20. Findings • Polarity tagging using sentiment lexicon – BIG PROBLEM - 40% sentiment words not found in sentiment lexicon - 10% sentiment words with a positive or negative polarity found are in the neutral section of sentiment lexicon
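
The kind of lexicon lookup behind the findings on slide 20 can be sketched as follows. The tiny `LEXICON` dictionary and the names `tag_polarity` and `coverage_report` are assumptions made for illustration; the actual sentiment lexicon used in the project is not named in the slides.

```python
from collections import Counter

# Hypothetical toy lexicon; a real sentiment lexicon would be far larger.
LEXICON = {
    "clean": "positive",
    "friendly": "positive",
    "dirty": "negative",
    "rude": "negative",
    "big": "neutral",
    "small": "neutral",
}


def tag_polarity(words, lexicon=LEXICON):
    """Attach a polarity tag to each word, or 'not_found' if it is missing."""
    return [(w, lexicon.get(w.lower(), "not_found")) for w in words]


def coverage_report(words, lexicon=LEXICON):
    """Return the fraction of words that fall under each tag."""
    if not words:
        return {}
    counts = Counter(tag for _, tag in tag_polarity(words, lexicon))
    return {tag: n / len(words) for tag, n in counts.items()}
```

A report of this shape makes the two problems on the slide measurable: the `not_found` share corresponds to missing sentiment words, and the `neutral` share contains the domain-dependent words such as "big".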
    • 21. Problems • Sentiment lexicon is not comprehensive enough for the machine learning technique adopted • Domain-dependent sentiment words are found in the neutral section of the sentiment lexicon • Polarity of domain-dependent sentiment words can also change within the domain
    • 22. Solution • Classify the polarity of unlabeled sentiment words using rule based mining • Classify domain dependent sentiment words • Establish word relations between labeled and unlabeled sentiment words
    • 23. Data Processing • Rule based mining using conjunction and punctuation • Polarity: Assignment Rules • Same: Adj – AND/OR – Adj • Opposite: Neg-Adj – AND/OR – Adj / Adj – AND/OR – Neg-Adj • Same: Neg-Adj – AND/OR – Neg-Adj • Opposite: Adj – BUT/NOR – Adj • Same: Neg-Adj – BUT/NOR – Adj / Adj – BUT/NOR – Neg-Adj • Opposite: Neg-Adj – BUT/NOR – Neg-Adj • Same: Adj , Adj
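
The rule table on slide 23 can be read as a parity rule: each negation and each contrastive conjunction (BUT/NOR) flips the relation between the two adjectives once, and an odd number of flips gives "opposite". The sketch below encodes that reading; the function name `infer_polarity` and the +1 / -1 encoding are assumptions for illustration, not the project's code.

```python
CONTRAST = {"but", "nor"}       # conjunctions that signal opposite polarity
AGREEMENT = {"and", "or", ","}  # conjunctions or punctuation that signal same polarity


def infer_polarity(known_polarity, conjunction, known_negated=False, unknown_negated=False):
    """Infer the polarity (+1 or -1) of an unlabelled adjective.

    known_polarity is the lexicon polarity of the labelled adjective it is
    joined to; the *_negated flags mark a negation word in front of either one.
    """
    conjunction = conjunction.lower()
    if conjunction not in CONTRAST and conjunction not in AGREEMENT:
        raise ValueError(f"unsupported conjunction: {conjunction!r}")
    # Count the polarity flips: contrastive conjunction plus each negation.
    flips = (conjunction in CONTRAST) + known_negated + unknown_negated
    return known_polarity if flips % 2 == 0 else -known_polarity
```

For example, in "clean but noisy" with "clean" labelled positive, `infer_polarity(+1, "but")` assigns "noisy" a negative polarity, matching the "Opposite: Adj – BUT/NOR – Adj" row.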
    • 24. Data Processing • Relation Network – Aspect – Sentiment word pair
    • 25. Data Processing • Relation Network – Aspect – Sentiment word pair
    • 26. Analysis • Using the expanded sentiment lexicon, we analyze sentiment polarity by doing a sentiment lookup using a Bayesian Network
    • 27. Bayesian • To determine polarity of sentiments: P(X | Y) = P(X) P(Y | X) / P(Y) • Probability that a sentiment is positive or negative, given its contents • Assumptions: There is no link between words • P(sentiment | sentence) =
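
Under the stated assumption that there is no link between words, Bayes' rule reduces to a naive Bayes classifier: the score of a polarity is its prior multiplied by the per-word likelihoods. The sketch below is one common way to implement that idea, with add-one smoothing added as an assumption of this example; the names `train` and `classify` are hypothetical, and this is not necessarily the exact model used in the project.

```python
import math
from collections import Counter


def train(labelled_sentences):
    """Count labels and words from (list_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for words, label in labelled_sentences:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(words, model):
    """Return the label that maximizes log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for w in words:
            # Add-one smoothing so unseen words do not zero out the product.
            score += math.log((word_counts[label][w] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

P(Y), the probability of the sentence itself, is the same for every label, so it can be dropped when only the most likely polarity is needed.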
    • 28. Validation • Precision = N (agree & found) / N (found) • High precision means most of the sentiment words found by the system are correct • Recall = N (agree & found) / N (agree) • High recall means most of
    • 29. Validation Results • It is found that out of the 350 aspect-unlabelled sentiment word pairs, • only 194 are found by the methods. Thus, the precision is about 57%. • The recall is also not very high; only 126 words are correctly labelled by the system, which is about 63%.
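
The precision and recall definitions from slide 28 can be computed from two sets, as in the sketch below. The function name `precision_recall` and the set-based representation are assumptions; the example in the note uses made-up word sets, not the project's validation data.

```python
def precision_recall(found, agreed):
    """Compute precision and recall from two sets of items.

    found  -- items the system labelled
    agreed -- items in the human-agreed reference labelling
    """
    hits = len(found & agreed)  # N(agree & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(agreed) if agreed else 0.0
    return precision, recall
```

With `found = {"big", "long", "cheap", "quiet"}` and `agreed = {"long", "cheap", "warm"}`, two items overlap, so precision is 2/4 and recall is 2/3.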
    • 30. Discussion • The results will improve if more rules are applied, such as the inclusion of more adverbs like “excessively” as negation words. • There might not be enough data for the system to work on. There are only 350 aspect-unlabelled sentiment word pairs for the application to work with. • This, however requires more
    • 31. Conclusion • A comprehensive sentiment lexicon is a simple yet effective solution to sentiment analysis as it does not require prior training • Current sentiment lexicons do not capture the domain and context sensitivities of sentiment expressions
    • 32. Conclusion • This leads to poor coverage • Thus, expanding the general sentiment lexicon to capture the domain and context sensitivities of sentiment expressions is advocated
    • 33. Questions? DEMO
