Web Opinion Mining

Dupré Marc-Antoine, Alexander Patronas, Erhard Dinhobl, Ksenija Ivekovic, Martin Trenkwalder
TU Wien, Wintersemester 2009/10
marcantoine.dupre@gmail.com, e0425487@student.tuwien.ac.at, e0525938@student.tuwien.ac.at, xenia.ivekovic@gmail.com, trenkwaldermartin@gmail.com

Abstract. This paper gives an overview of Web Opinion Mining: the structure of an opinion, several different approaches, opinion spam and its analysis, and existing tools that use sentiment analysis techniques to gather opinions from different sources. Web 2.0 has dramatically changed the way people communicate with each other. People write their point of view about every topic imaginable on the web, for example opinions about people, products, websites or specific services. The need for good opinion mining is growing rapidly. Market analysts and companies capitalize on these techniques; for such companies it is very valuable to know what people, i.e. the market, currently think about a product they have just released. For individuals, too, gathering opinions from several product reviews is very useful.

Keywords: Data Mining, Opinion Mining, Sentiment Analysis, Opinion Mining Tools, Sentiment Analysis Tools

Introduction

Think about everything that is posted on blogs, Facebook feeds, Twitter and so on. Users express there what they think: their opinions, and perhaps also their political or religious point of view. There are also websites, such as Wikipedia or research information sites, that describe facts. So we can distinguish between opinions and facts on the web [14]. Data that is read and declared as a fact must be assumed to be true. Current search engines search and index facts; facts can be associated with keywords and tags and can be grouped by topics [14]. Opinions, however, present a more complex situation. They usually answer a question such as "What do people think of Motorola cell phones?" or "What do people in America think about Barack Obama?" [14]. Today's search algorithms are not designed to retrieve opinions. In most cases it is very difficult to locate such data, and user opinion data is mostly part of the deep web [14] (Bing Liu calls it user-generated content, which is exactly what it is, and it mostly resides in the deep web, though there is other content as well). It is not part of the global scope of the web but belongs more to one's circle of friends. Most of this data lies in
review sites, forums, blogs, bulletin boards and so on. This type of information is also called "word of mouth". Mining the opinions expressed in such content requires some kind of artificial-intelligence algorithm [14], which is not easy. In practice, though, it would be very useful, for example in market intelligence, helping organisations and companies serve better product and service advertising. People may also be interested in others' opinions when purchasing products or discussing political topics. It is also interesting for overall search functions like "Opinions: Motorola cell phones" or "BMW vs. Porsche". From these data types, two kinds of opinions crystallize: direct opinions and comparisons. The former is an expression about an object such as a product, event or person. The latter describes a relation between objects, usually an ordering of them, like "product x is more expensive than y" [14]. These relations can be objective, like prices, but also subjective.

Opinion mining concept

To arrive at a realizable approach to opinion mining, the process must be formalized. The basic components of an opinion are [14]:
 • Opinion holder: the person or organization that has written an opinion on the web
 • Object: the object on which the opinion holder expressed the opinion
 • Opinion: the content about the object from the opinion holder

Model

An object is an entity such as a product or an event and represents a hierarchy of components, where each component is associated with attributes [14]. O is the root node. There may also exist sub-events or sub-topics. To represent the whole component tree we use the term "feature"; expressing an opinion on a feature then makes it unnecessary to distinguish between components and attributes, and in this sense the object itself is also a feature. So the object O is defined by a finite set of features F = {f1, f2, f3, …, fn}. Every feature fi ∈ F can be expressed by a set Wi of synonymous words or phrases, with W = {W1, W2, W3, …, Wn}. An opinion holder j comments on a subset of features Sj ⊆ F of O. For each feature fk ∈ Sj that j comments on, j chooses a word or phrase from Wk to denote the feature and expresses a positive, negative or neutral opinion on fk.
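To make the formalization concrete, here is a minimal Python sketch of the components defined above (all class and field names are our own illustration, not taken from [14]):

    from dataclasses import dataclass, field

    @dataclass
    class Feature:
        name: str                                   # canonical feature name fi
        synonyms: set = field(default_factory=set)  # the synonym word set Wi

    @dataclass
    class Opinion:
        holder: str        # opinion holder j
        feature: Feature   # commented feature fk from Sj
        expression: str    # the word/phrase from Wk used in the text
        orientation: str   # "positive", "negative" or "neutral"

    # The object O is a finite feature set F; O itself is also a feature.
    camera = Feature("camera", {"camera", "cam"})
    picture = Feature("picture", {"picture", "photo", "image"})
    F = [camera, picture]

    op = Opinion(holder="reviewer42", feature=picture,
                 expression="photo", orientation="positive")
    print(op)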
Task

The opinion mining task, seen as sentiment classification, is performed on three levels [14]. The first is the document level, under the assumption that one document contains only a single opinion from a single opinion holder. In many cases, such as forums, this is not true, so the document must be separated first. At this level the opinion is assigned the class it belongs to: positive, negative or neutral. This level is too coarse-grained for most applications. The second is mining at the sentence level, which involves two tasks: first, determining the sentence type (objective or subjective); second, determining the sentence class (positive, negative or neutral). The assumption is that a sentence contains only one opinion, but this is not easy to meet either, so working with clauses or phrases may be useful, with a focus on identifying subjective sentences. The third and finest level of the mining task is the feature level. Overall, the task focuses on sentiment words like great, excellent, horrible, bad and worst, whereas in topic-based classification topic words are important. In summary:
 1. document level – class determination (one opinion from one opinion holder)
 2. sentence level (one opinion)
    a. sentence type determination (objective or subjective)
    b. sentence class determination (neutral, positive, negative)
 3. feature level – determining words and phrases

Words and Phrases

The basic question is how to perform the sentiment classification at the document and sentence level [14]. A negative sentiment does not mean that the opinion holder dislikes every feature of the product or the whole product, and a positive one does not mean that he or she likes everything; there is more to it. Sentiment words are often context-dependent, for example "long": a long runtime of a benchmark on a graphics card would be very bad, but a long runtime of a battery would be very nice. There are three approaches to obtaining such word and phrase lists:
 1. manual approach: manual creation of the list; a one-time effort
 2. corpus-based approach: the text is analyzed for co-occurrence patterns; domain-dependent
 3. dictionary-based approach: using constraints on connectives between words to identify opinion words, for example "This camera is beautiful AND spacious", where "and" implies the same orientation. Such constraints can also be applied to OR, BUT, EITHER-OR and NEITHER-NOR. For this learning approach there exists a database which contained 21 million words in 1987; a good online resource is WordNet. A sketch of this conjunction heuristic follows the list.
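As a minimal illustration of the dictionary-based idea (our own sketch, not an algorithm from [14]): starting from a few seed adjectives of known orientation, AND propagates the same orientation to the conjoined word, while BUT suggests the opposite one.

    import re

    seed = {"beautiful": "positive", "horrible": "negative"}

    # Toy corpus of adjective pairs joined by a connective.
    corpus = [
        "beautiful and spacious",
        "spacious but heavy",
        "horrible and noisy",
    ]

    orientations = dict(seed)
    flip = {"positive": "negative", "negative": "positive"}
    for phrase in corpus:
        m = re.match(r"(\w+) (and|but) (\w+)", phrase)
        if not m:
            continue
        left, conn, right = m.groups()
        # propagate orientation across the connective
        if left in orientations and right not in orientations:
            o = orientations[left]
            orientations[right] = o if conn == "and" else flip[o]
        elif right in orientations and left not in orientations:
            o = orientations[right]
            orientations[left] = o if conn == "and" else flip[o]

    print(orientations)
    # {'beautiful': 'positive', 'horrible': 'negative',
    #  'spacious': 'positive', 'heavy': 'negative', 'noisy': 'negative'}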
Document-level sentiment analysis

To analyse the overall opinion of documents, most research studies use classifiers. A classifier is an algorithm, or a program based on one, that, given a set of documents, assigns each document to one of two classes: positive or negative (a neutral class is seldom used). A document classified as positive expresses an overall positive opinion; a document classified as negative expresses an overall negative opinion. Such a classifier cannot determine who the holders of the opinions are or which objects the opinions target. The set of documents therefore has to be chosen wisely; for example, all documents could be about a single object. It is assumed that a single document expresses the opinion of only a single holder. Several approaches exist to perform sentiment classification at the document level; we describe three of them below [14, 30].

Classification based on sentiment phrases

This approach comes from the research of Turney [28] and can be divided into three steps.

First, the document is tagged using a part-of-speech (POS) tagger [30], which basically annotates each word with a linguistic category according to its syntactic or morphological behavior. For instance, JJ means adjective and VBN means past-participle verb. It has been shown [29] that, for sentiment classification purposes, adjectives are the most relevant words. Nevertheless, an adjective may have several semantic orientations depending on the context: "unpredictable" might be negative in an automotive review but positive in a movie review [29]. That is why, using the POS tags, pairs of words are extracted according to precise patterns in order to determine the semantic orientation of the adjectives more precisely. The following table contains a simplified version of the patterns used for extracting two-word phrases (NN are nouns, RB adverbs, VB verbs and JJ adjectives):

    First word   Second word   Third word (not extracted)
    JJ           NN            anything
    RB           JJ            not NN
    JJ           JJ            not NN
    NN           JJ            not NN
    RB           VB            anything

For example, in the sentence "This camera produces beautiful pictures", the phrase "beautiful pictures" will be extracted (first pattern: JJ + NN).
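A minimal sketch of this extraction step over already POS-tagged text (the helper names are our own, and Turney's full pattern table has a few more tag variants):

    # Each token is a (word, POS) pair; the patterns mirror the table above.
    PATTERNS = [
        ("JJ", "NN", None),     # third tag unconstrained
        ("RB", "JJ", "NN"),     # third tag must NOT be NN
        ("JJ", "JJ", "NN"),
        ("NN", "JJ", "NN"),
        ("RB", "VB", None),
    ]

    def extract_phrases(tagged):
        phrases = []
        for i in range(len(tagged) - 1):
            (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
            t3 = tagged[i + 2][1] if i + 2 < len(tagged) else ""
            for p1, p2, excluded in PATTERNS:
                if t1 == p1 and t2 == p2 and (excluded is None or t3 != excluded):
                    phrases.append(f"{w1} {w2}")
        return phrases

    tagged = [("This", "DT"), ("camera", "NN"), ("produces", "VBZ"),
              ("beautiful", "JJ"), ("pictures", "NN")]
    print(extract_phrases(tagged))   # ['beautiful pictures']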
The second step is based on a measure called pointwise mutual information (PMI). The idea is to determine whether a given phrase is more likely to co-occur with the word "excellent" or with the word "poor" on the web:

    PMI(term1, term2) = log2( Pr(term1 ∧ term2) / (Pr(term1) · Pr(term2)) )

Pr(term1 ∧ term2) is the probability that term1 and term2 co-occur; Pr(term1) · Pr(term2) is the probability that they would co-occur if they were statistically independent. The ratio thus measures the statistical dependence between the two terms. Turney proposes to compute the semantic orientation (SO) of a phrase as:

    SO(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")

By using the number of hits returned by a search engine to estimate the probabilities, the SO equation becomes:

    SO(phrase) = log2( hits(phrase NEAR "excellent") · hits("poor")
                     / (hits(phrase NEAR "poor") · hits("excellent")) )

The last step of Turney's algorithm is, given a review, to compute the average SO of all extracted phrases in the review. If it is greater than zero, the review expresses a positive opinion; otherwise it expresses a negative one. Final classification accuracy on reviews from various domains ranges from 84% for automobile reviews down to 66% for movie reviews [29, 30].
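A small sketch of the hit-count version of SO, assuming a hits(query) function backed by some search engine (no real search API is used here; the counts are hard-coded for illustration):

    import math

    # Hypothetical hit counts a search engine might return for NEAR queries.
    HITS = {
        'low fees NEAR "excellent"': 8_000,
        'low fees NEAR "poor"': 1_000,
        '"excellent"': 1_000_000,
        '"poor"': 2_000_000,
    }

    def hits(query: str) -> int:
        return HITS[query]

    def semantic_orientation(phrase: str) -> float:
        return math.log2(
            (hits(f'{phrase} NEAR "excellent"') * hits('"poor"'))
            / (hits(f'{phrase} NEAR "poor"') * hits('"excellent"'))
        )

    print(semantic_orientation("low fees"))  # 4.0 -> clearly positive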
Classification using text classification methods

Sentiment classification can also be tackled as a topic-based text classification problem, so all the usual text classification algorithms can be used, e.g. naïve Bayes, SVM, kNN, etc. This approach was tested by Pang et al. [31]. They classified 1400 movie reviews from IMDb.com against a random-choice baseline of 50%, using three algorithms: SVM, naïve Bayes and maximum entropy, each of which usually produces good results on text classification problems. With various pre-processing options and 3-fold cross-validation, the results range from 72.8% to 82.9%; the best result is achieved by the SVM algorithm on unigram data. All results are above the random-choice baseline and the human bag-of-words baselines (58% and 64%), and superior to Turney's PMI-IR algorithm on movie reviews (66%). Still, the three algorithms would be expected to reach around 90% on topic-based text classification problems; sentiment classification is thus a more difficult task, because of the varied semantic values and uses of sentiment phrases.

Classification using a score function

Another approach, by Dave et al. [32], uses a score function. The first step is to score each term of the learning set with the following score function:

    score(t) = ( Pr(t|C) − Pr(t|C') ) / ( Pr(t|C) + Pr(t|C') )

The score lies between −1 and 1 and indicates toward which class, C or C', the term is more likely to belong. A learning set is a set of reviews that have been labeled manually, so statistics such as Pr(t|C), the probability that term t appears in a review belonging to class C, can be computed. A document is then classified according to the sum of the scores of all its terms. On a large set of reviews from the web (more than 13,000), working with bigrams and trigrams, the classification rate is between 84.6% and 88.3%.
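A minimal sketch of this scoring scheme (the add-one smoothing is our own simplification, added so that unseen terms do not divide by zero):

    from collections import Counter

    pos_reviews = [["great", "camera"], ["great", "battery", "life"]]
    neg_reviews = [["poor", "battery"], ["blurry", "pictures"]]
    vocab = {t for r in pos_reviews + neg_reviews for t in r}

    def term_probs(reviews):
        counts = Counter(t for r in reviews for t in r)
        total = sum(counts.values())
        # add-one smoothing over the joint vocabulary
        return {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}

    p_pos, p_neg = term_probs(pos_reviews), term_probs(neg_reviews)

    def score(t):
        return (p_pos[t] - p_neg[t]) / (p_pos[t] + p_neg[t])  # in [-1, 1]

    def classify(review):
        s = sum(score(t) for t in review if t in vocab)
        return "positive" if s > 0 else "negative"

    print(classify(["great", "pictures"]))   # positive
    print(classify(["blurry", "battery"]))   # negative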
Sentence-level sentiment analysis

Sentiment classification at the document level is the most prominent field of web opinion mining. However, for most applications the document level is too coarse, so finer analysis can be performed at the sentence level. Research in this field mostly focuses on classifying sentences by whether they hold objective or subjective speech; the aim is to recognise subjective sentences in news articles, not to extract them. The sentiment classification described in the document-level part also exists at the sentence level, using the same approaches as Turney's algorithm, based on likelihood ratios. Because that approach has already been described in this paper, this part focuses on objective/subjective sentence classification and presents two methods to tackle this issue.

The first method is based on a bootstrapping approach using learned patterns: the method is self-improving and relies on phrase patterns that are learned automatically. It comes from the study of Wiebe & Riloff [33]. The input of the method is a known subjective vocabulary and a collection of unannotated texts; the process works as follows:
 • High-precision (HP) classifiers decide whether sentences are objective or subjective based on the input vocabulary. High precision means their behaviour is stable and reproducible: they cannot classify all the sentences, but they make almost no errors.
 • Phrase patterns that are assumed to represent a subjective sentence are then extracted and applied to the sentences the HP classifiers have left unlabeled.
 • The system is self-improving, as the newly found subjective sentences and patterns are used in a loop on the unlabeled data.
This algorithm was able to recognise 40% of the subjective sentences in a test set of 2197 sentences (59% of which are subjective) with 90% precision. For comparison, the HP subjective classifier alone recognises 33% of the subjective sentences with 91% precision. A sketch of the high-precision decision rule is given below.
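A minimal sketch of such a high-precision rule in the spirit of Wiebe & Riloff (the thresholds and the tiny lexicon are illustrative assumptions, not the exact rules from [33]): a sentence is labeled subjective only when it contains several strong subjective clues, objective only when it contains none, and is otherwise left unlabeled for the bootstrapping loop.

    SUBJECTIVE_CLUES = {"awful", "amazing", "hate", "love", "terrible", "great"}

    def hp_classify(sentence: str):
        words = sentence.lower().split()
        clues = sum(w.strip(".,!?") in SUBJECTIVE_CLUES for w in words)
        if clues >= 2:
            return "subjective"     # confident: several strong clues
        if clues == 0:
            return "objective"      # confident: no clue at all
        return None                 # left for the pattern learner

    print(hp_classify("I love this amazing camera!"))   # subjective
    print(hp_classify("The camera weighs 200 grams."))  # objective
    print(hp_classify("The flash is terrible."))        # None (unlabeled)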
Alongside this original method, more classical data mining algorithms are also used, such as the naïve Bayes classifier in the research of Yu & Hatzivassiloglou [34]. Naïve Bayes is a supervised learning method which is simple and efficient, especially for text classification problems (i.e. when the number of attributes is huge). To cope with an important and unavoidable approximation in their training data (made to avoid human labeling of an enormous data set), they use a multiple naïve Bayes classifier method. The general concept is to split each sentence into features, such as presence of words, presence of n-grams, and heuristics from other studies in the field, and to use the statistics of the training data set about those features to classify new sentences. Their results show that the more features, the better: they achieved at best 80-90% recall and precision for subjective/opinion sentences and around 50% recall and precision for objective/fact sentences. Sentence-level sentiment classification methods keep improving; these results from research studies in 2003 show that they were already quite efficient then and that the task is feasible.

Feature-based opinion mining

The main objective of feature-based opinion mining is to find out what reviewers (opinion holders) like and dislike about the observed object. This process consists of the following tasks:
 1. extract object features that have been commented on in each review
 2. determine whether the opinions on the features are positive, negative or neutral
 3. group feature synonyms
 4. produce a feature-based opinion summary
There are three main review formats on the web, which may require different techniques to perform the above tasks:
 1. Format 1 – Pros and Cons: The reviewer is asked to describe Pros and Cons separately. Example: Cnet.com
 2. Format 2 – Pros, Cons and detailed review: The reviewer is asked to describe Pros and Cons separately and also to write a detailed review. Example: Epinions.com
 3. Format 3 – free format: The reviewer can write freely; there is no separation of Pros and Cons. Example: Amazon.com

Analysing reviews of formats 1 and 3

The summarization is performed in three main steps.

1) Mining product features that have been commented on by customers:
 • part-of-speech tagging: Product features are usually nouns or noun phrases in review sentences. Each review text is segmented into sentences, and a part-of-speech tag is produced for each word. Each sentence is saved in the review database along with the POS tag information of each word in the sentence. An example of a sentence with POS tags:
    <S><NG><W C=PRP L=SS T=w S=Y> I </W></NG>
    <VG><W C=VBP> am </W><W C=RB> absolutely </W></VG>
    <W C=IN> in </W>
    <NG><W C=NN> awe </W></NG>
    <W C=IN> of </W>
    <NG><W C=DT> this </W><W C=NN> camera </W></NG>
    <W C=.> . </W></S>

 • frequent feature identification: Frequent features are those that are talked about by many customers. To identify them, association mining is used. However, not all candidate frequent features generated by association mining are genuine features, so two types of pruning are used to remove unlikely ones. Compactness pruning checks features that contain at least two words, called feature phrases, and removes those that are likely to be meaningless. In redundancy pruning, redundant features that contain single words are removed. Redundant features are described with the concept of p-support (pure support): the p-support of a feature ftr is the number of sentences in which ftr appears as a noun or noun phrase, where these sentences contain no feature phrase that is a superset of ftr. A minimum p-support value is used to prune redundant features.
 • infrequent feature generation: To generate infrequent features, the following algorithm is applied (a runnable sketch follows below):

    for each sentence in the review database:
        if (it contains no frequent feature but one or more opinion words) {
            find the nearest noun/noun phrase around the opinion word;
            store that noun/noun phrase in the feature set as an infrequent feature;
        }
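A minimal runnable version of this step (the opinion-word list, the noun-lookup shortcut standing in for a POS tagger, and the word-distance measure are our own simplifications):

    OPINION_WORDS = {"amazing", "great", "terrible", "stingy"}
    FREQUENT_FEATURES = {"picture", "battery"}
    NOUNS = {"software", "picture", "battery", "strap", "camera"}  # stand-in for a POS tagger

    def infrequent_features(sentences):
        found = set()
        for sentence in sentences:
            words = [w.strip(".,!").lower() for w in sentence.split()]
            if any(w in FREQUENT_FEATURES for w in words):
                continue                      # already covered by a frequent feature
            for i, w in enumerate(words):
                if w in OPINION_WORDS:
                    # nearest noun around the opinion word
                    nearest = min(
                        (j for j, x in enumerate(words) if x in NOUNS),
                        key=lambda j: abs(j - i),
                        default=None,
                    )
                    if nearest is not None:
                        found.add(words[nearest])
        return found

    print(infrequent_features(["The software is amazing.",
                               "The battery is great."]))
    # {'software'} -- the second sentence contains the frequent feature "battery"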
2) Identifying the orientation of an opinion sentence: To determine the orientation of a sentence, the dominant orientation of the opinion words (e.g. adjectives) in the sentence is used. If positive opinion prevails, the sentence is regarded as positive, and vice versa.

3) Summarizing the results: The following example shows a summary for the feature "picture" of a digital camera:

    Feature: picture
    Positive: 12
    • Overall this is a good camera with a really good picture clarity.
    • The pictures are absolutely amazing - the camera captures the minutest of details.
    • After nearly 800 pictures I have found that this camera takes incredible pictures.
    …
    Negative: 2
    • The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
    • Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.

Analysing reviews of format 2

Features are extracted based on the principle that each sentence segment contains at most one product feature. Sentence segments are separated by ',', '.', 'and', and 'but'. For extracting product features, supervised rule discovery is used. First, a training dataset has to be prepared, in the following steps:
 • perform part-of-speech tagging, e.g.
   <N> Battery <N> usage <V> included <N> MB <V> is <Adj> stingy
 • replace the actual feature words in a sentence with [feature], e.g.
   <N> [feature] <N> usage <V> included <N> [feature] <V> is <Adj> stingy
 • use n-grams to produce shorter segments from long ones, e.g.
   <V> included <N> [feature] <V> is
   <N> [feature] <V> is <Adj> stingy
After these steps, rule generation can be performed, i.e. the definition of extraction patterns, for example:
   <JJ> <NN> [feature]
   easy to <VB> [feature]
The resulting patterns are used to match and identify features from new reviews; a small matching sketch is given below.
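A minimal sketch of applying such extraction patterns to a POS-tagged segment (the tag format and the matcher are our own illustration, not the rule engine from the original study):

    # Patterns are sequences of POS tags and/or literal words ending at the feature slot.
    PATTERNS = [
        ["JJ", "NN", "[feature]"],
        ["easy", "to", "VB", "[feature]"],
    ]

    def match_features(tagged):
        """tagged: list of (word, tag); lowercase pattern items match the word itself."""
        features = []
        for pat in PATTERNS:
            for start in range(len(tagged) - len(pat) + 1):
                window = tagged[start:start + len(pat)]
                ok = True
                for (word, tag), item in zip(window, pat):
                    if item == "[feature]":
                        continue                   # slot: any word
                    if item.isupper():             # POS-tag item
                        ok = tag == item
                    else:                          # literal-word item
                        ok = word.lower() == item
                    if not ok:
                        break
                if ok:
                    features.append(window[-1][0])  # word in the feature slot
        return features

    segment = [("easy", "JJ"), ("to", "TO"), ("use", "VB"), ("menu", "NN")]
    print(match_features(segment))   # ['menu']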
Sometimes mistakes made during extraction have to be corrected, e.g. when there are two or more candidate features in one sentence segment, or when a sentence segment contains a feature that is not extracted by any pattern. The first problem can be solved with an iterative algorithm that resolves the ambiguity by remembering occurrence counts. The orientation (positive or negative) of an extracted feature is easy to determine, since we know whether the feature comes from the Pros or the Cons part of a review. These features are usually used to compare consumers' opinions of different products.

Opinion Spam and Analysis

The web has dramatically changed the way people express themselves and interact with others. They are now able to post reviews of products at merchant sites and interact with others via blogs and forums. Reviews contain rich user opinions on products and services. They are used by potential customers to find the opinions of existing users before deciding to purchase a product, and they are also helpful for product manufacturers to identify product problems and to gather marketing intelligence about their competitors. But since there is no quality control, anyone can write anything on the web, which results in many low-quality reviews and review spam.
It is now very common for people to read opinions on the web. For example, if someone wants to buy a product and sees that the reviews of the product are mostly positive, they are very likely to buy it; if the reviews are mostly negative, they are very likely to choose another product. There are generally three types of spam reviews:
 1. Untruthful opinions: the reviewer gives an unjustly positive review to a product or object in order to promote it (hype spam), or gives an unjustly negative comment in order to damage the product (defaming spam).
 2. Reviews on brands only: comments that address only the brand, the seller or the manufacturer but not the specific product or object. In some cases this is useful, but it is considered spam because it does not focus on the specific product.
 3. Non-reviews: comments that are not related to the product, for example advertisements, questions, answers and random text.
In general, spam detection can be regarded as a classification problem with two classes, spam and non-spam. However, due to the specific nature of the different spam types, they have to be handled differently. Spam reviews of types 2 and 3 can be detected with traditional classification learning using manually labeled spam and non-spam reviews, because these two types are recognizable manually; quite a lot of reviews of these two types are duplicates and easy to detect. To detect the remaining spam reviews, a model is built from the following three groups of features:
 • the content of the review: e.g. number of helpful feedbacks, length of the review title, length of the review body, position of the review, textual features, etc.
 • the reviewer who wrote the review: e.g. number of reviews by the reviewer, average rating given by the reviewer, standard deviation in ratings
 • the product being reviewed: e.g. price of the product, average rating, standard deviation in ratings
Feeding this model into logistic regression produces a probability estimate of each review being spam. It was evaluated on 470 spam reviews found on amazon.com, with the following results (a sketch of such a model follows the table):

    Spam type     Num. reviews   AUC     AUC (text features only)   AUC (w/o feedbacks)
    Types 2 & 3   470            98.7%   90%                        98%
    Type 2 only   221            98.5%   88%                        98%
    Type 3 only   249            99.0%   92%                        98%
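A minimal sketch of such a spam classifier using scikit-learn's logistic regression (the feature vectors and values here are invented for illustration; the study itself used the R statistical package):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: [helpful feedbacks, title length, body length,
    #            reviews by this reviewer, rating deviation from product mean]
    X = np.array([
        [12, 6, 240, 30, 0.3],   # ordinary review
        [ 0, 2,  15,  1, 3.5],   # short, extreme-rating outlier
        [25, 8, 510, 12, 0.1],   # ordinary review
        [ 1, 3,  20,  2, 4.0],   # short, extreme-rating outlier
    ])
    y = np.array([0, 1, 0, 1])   # 1 = spam, 0 = non-spam

    model = LogisticRegression().fit(X, y)

    new_review = np.array([[0, 2, 18, 1, 3.8]])
    print(model.predict_proba(new_review)[0, 1])  # probability of being spam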
The logistic regression was performed with the statistical package R (http://www.r-project.org/). The AUC (area under the ROC curve) is a standard measure used in machine learning for assessing model quality. Without the feedback features, almost the same result is reached as with them included; this is important because feedbacks can be spammed too.

For the first type of spam, however, manual labeling by simply reading the reviews is practically impossible: the point is to distinguish the untruthful review of a spammer from an innocent review. The only way is to build a logistic regression model using duplicates as positive training examples and the rest of the reviews as negative training examples. The model was evaluated on a total of 223,002 reviews, of which 4,488 were duplicate spam reviews and 218,514 were other reviews:

    Features used               AUC
    All features                78%
    Only review features        75%
    Only reviewer features      72.5%
    Without feedback features   77%
    Only text features          63%

The table shows that review-centric features are the most helpful. Using only text features gives just 63% AUC, which demonstrates that it is very difficult to identify spam reviews from text content alone. Combining all the features gives the best result.
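Since duplicates serve as the positive training examples here, a simple way to find near-duplicate reviews is word-shingle similarity; the following is a minimal sketch (the 2-gram shingling and Jaccard measure are our own choice of technique, not taken from the study):

    def shingles(text, n=2):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    r1 = "This camera takes amazing pictures, buy it now"
    r2 = "This camera takes amazing pictures, buy it today"
    r3 = "The battery died after two days of light use"

    s1, s2, s3 = shingles(r1), shingles(r2), shingles(r3)
    print(jaccard(s1, s2))   # high -> near-duplicate, candidate spam pair
    print(jaccard(s1, s3))   # zero -> unrelated reviews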
Opinion Mining Tools

Below is a categorized list of several different tools which can be used for opinion mining, with a short review of each tool.

APIs

Evri
[15] Evri is a semantic search engine that automatically reads web content in a way similar to how humans do. It performs a deep linguistic analysis of many millions of documents, from which it builds a large set of semantic relationships expressing grammatical subject-verb-object style clause-level relationships. Evri offers a comprehensive API for developers, with which it is easy to automatically, cost-effectively and scalably analyze text, get recommendations, discover relationships, mine facts and get popularity data. It also provides widgets for different purposes, one of which covers the sentiment aspect: an example of the sentiment widget displays, as a percentage bar, the positive and negative aspects of opinions on "Android", the new mobile operating system running on the Linux kernel [16].

OpenDover
[17] OpenDover is a Java-based web service that makes it easy to integrate semantic features into your blog, content management system, website or application. Basically, your content is sent through a web service to their (OpenDover) servers, which process and analyze it using their linguistic processing technologies. After processing, the content is sent back, tagged with emotions along with a value indicating how positive or negative it is. The service can be tested without any effort at a live demo site on their website [17]. As an example, we chose an arbitrary review of a camera from amazon.com:

"...the L20 is unisex and its absolutely right in line with the jeweled quality of Nikon. I was able to use the camera right out of the box without having to read the instruction manual, its that easy to use. ... The camera feels good in my hands and the controls are easy to find without having to take your eyes off your subject... The Nikon L20 comes with a one year manufactures warranty - "Not that you would need a warranty for a Nikon camera" - Impressive warranty details, I was amazed that any camera manufacturer would offer a one year on a point and shoot but Nikon has such a good reputation and so I doubt very much that you would even need to use it. In a nutshell, I love this camera so much that I would recommend this Nikon L20 to my friends, family and anyone else looking to buy. Its a real beauty!"

The first BaseTag was set to "camera", the second to "Nikon L20", which the product review was about. The mode was set to "Accurate" and the selected subject domain was "camera". The output is the emotion-tagged text: it recognizes positive and negative words as well as the object. The result of their algorithm is good; for example, positive words like "easy to use", "good", "impressive" and "love" are marked in green.
Twitter/Blogosphere

RankSpeed
[18] RankSpeed is a sentiment search tool for the blogosphere/twittersphere. It finds the best websites, the most useful web apps, the most secure web services and so on with the help of sentiment analysis. It is possible to search any website category using tags and rank the results by any desired criterion, such as good, useful, easy or secure. A statistical analysis computes the percentage of bloggers/users who agree with the desired criterion; the resulting list of links is then sorted in descending order by that percentage.

Twittratr
[19] Twittratr is a simple search tool for answering questions like "Are tweets about Obama generally positive or negative?". Its functionality is kept simple: it is based on a list of positive and negative keywords. Twitter is searched for these keywords, and the results are cross-referenced against the adjective lists, then displayed accordingly.

TwitterSentiment
[20] "Twitter Sentiment is a graduate school project from Stanford University. It started as a Natural Language Processing class project in Spring 2009 and will continue as a Natural Language Understanding CS224U project in Winter 2010." Twitter Sentiment was created by three computer science graduate students at Stanford University: Alec Go, Richa Bhayani and Lei Huang. It is an academic project that performs sentiment analysis on tweets from Twitter. [27] Their approach differs from other sentiment analysis sites for the following reasons:
 • Use of classifiers built with machine learning algorithms. Other sites tend to use a keyword-based approach, which is much simpler; it may have higher precision, but lower recall.
 • Transparency in how individual tweets are classified. Other sites often do not display the classification of individual tweets, showing only aggregated numbers, which makes it almost impossible to assess how accurate the classifiers are.

WE twendz pro
[21] Waggener Edstrom's twendz pro service is a Twitter monitoring and analytics web application. It enables the user to easily measure the impact of a specific message within key audiences. It uses a keyword-based approach to determine the general emotion: meaningful words are compared against a dictionary of thousands of words associated with positive or negative emotion. Each word has a specific score; combined with the other scored words, this yields an educated guess at the overall emotion.
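A minimal sketch of this kind of keyword scoring (the word scores are invented for illustration; real dictionaries like the one described above contain thousands of entries):

    WORD_SCORES = {"love": 3, "good": 2, "easy": 1,
                   "bad": -2, "hate": -3, "blurry": -1}

    def emotion_score(text: str) -> float:
        words = [w.strip(".,!?").lower() for w in text.split()]
        scored = [WORD_SCORES[w] for w in words if w in WORD_SCORES]
        # average of the scored words; 0.0 means no emotion detected
        return sum(scored) / len(scored) if scored else 0.0

    print(emotion_score("I love how easy this camera is!"))  # 2.0 -> positive
    print(emotion_score("Bad camera, blurry pictures."))     # -1.5 -> negative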
Newspapers

Newssift
[22] Newssift is a sentiment search tool for newspapers and a product of the Financial Times. It indexes content from major news and business sources. A query, for example about brands, legal risks or environmental impact, is matched against business topics, which provides information about how issues change over time for a company or product.

Applications

LingPipe
[23] "LingPipe is a state-of-the-art suite of natural language processing tools written in Java that performs tokenization, sentence detection, named entity detection, coreference resolution, classification, clustering, part-of-speech tagging, general chunking, fuzzy dictionary matching. These general tools support a range of applications." The idea behind sentiment analysis with LingPipe's language classification framework is to perform two classification tasks:
 • separating subjective from objective sentences
 • separating positive from negative reviews
A tutorial describing how to use LingPipe for sentiment analysis is available on their website [23].

Radian6
[24] Radian6 is a commercial social media monitoring application with extensive functionality, such as dashboards and widgets. Radian6 gathers discussions and opinions from blogs, comments, multimedia, forums and communities like Twitter, and gives businesses the ability to analyze, manage, track and report on their social media engagement and monitoring efforts.

RapidMiner
[26] RapidMiner is an open-source system (at least the Community Edition) for data mining and machine learning. It is available as a stand-alone application for data analysis and as a data mining engine for integration into one's own products. Sentiment analysis is also supported. It is used both for real-world data mining and in research.
References / Further Readings

 1. Liu, Bing. Mining Opinion Features in Customer Reviews. Department of Computer Science, University of Illinois at Chicago.
 2. Liu, Bing. Mining and Summarizing Opinions on the Web. Department of Computer Science, University of Illinois at Chicago.
 3. Liu, Bing. From Web Content Mining to Natural Language Processing. Department of Computer Science, University of Illinois at Chicago.
 4. Liu, Bing. Mining and Searching Opinions in User-Generated Contents. Department of Computer Science, University of Illinois at Chicago.
 5. Hu, Minqing, Liu, Bing. Mining and Summarizing Customer Reviews. Department of Computer Science, University of Illinois at Chicago.
 6. Ding, Xiaowen, Liu, Bing, Zhang, Lei. Entity Discovery and Assignment for Opinion Mining Applications. Department of Computer Science, University of Illinois at Chicago.
 7. Liu, Bing. Opinion Mining. Department of Computer Science, University of Illinois at Chicago.
 8. Liu, Bing. Opinion Mining and Search. Department of Computer Science, University of Illinois at Chicago.
 9. Ding, Xiaowen, Liu, Bing, Yu, Philip S. A Holistic Lexicon-Based Approach to Opinion Mining. Department of Computer Science, University of Illinois at Chicago.
 10. Liu, Bing. Opinion Mining & Summarization – Sentiment Analysis. Department of Computer Science, University of Illinois at Chicago.
 11. Jindal, Nitin, Liu, Bing. Opinion Spam and Analysis. Department of Computer Science, University of Illinois at Chicago.
 12. Liu, Bing. Web Content Mining. Department of Computer Science, University of Illinois at Chicago.
 13. Liu, Bing, Hu, Minqing, Cheng, Junsheng. Opinion Observer: Analyzing and Comparing Opinions on the Web. Department of Computer Science, University of Illinois at Chicago.
 14. Liu, Bing. Web Data Mining – Exploring Hyperlinks, Contents and Usage Data – Lecture Slides. Springer, Dec. 2006.
 15. Evri, semantic web search engine; [cited 2010 Jan 19]. <http://www.evri.com/>.
 16. Evri, widget sentiment analysis example on "Android"; [cited 2010 Jan 19]. <http://www.evri.com/widget_gallery/single_subject?widget=sentiment&entity_uri=/product/android-0xf14fe&entity_name=Android>.
 17. OpenDover, sentiment analysis web service; [cited 2010 Jan 19]. <http://www.opendover.nl/>.
 18. RankSpeed, sentiment analysis on the blogosphere and twittersphere; [cited 2010 Jan 19]. <http://www.rankspeed.com/>.
 19. Twittratr; [cited 2010 Jan 19]. <http://twitrratr.com/>.
 20. Twitter Sentiment, a sentiment analysis tool; [cited 2010 Jan 19]. <http://twittersentiment.appspot.com/>.
 21. WE twendz pro service, influence analytics for Twitter; [cited 2010 Jan 19]. <https://wexview.waggeneredstrom.com/twendzpro/default.aspx>.
 22. Newssift, sentiment analysis based on newspapers; [cited 2010 Jan 19]. <http://www.newssift.com/>.
 23. LingPipe, Java libraries for the linguistic analysis of human language; [cited 2010 Jan 19]. <http://alias-i.com/lingpipe/index.html>.
 24. Radian6, social media monitoring and engagement; [cited 2010 Jan 19]. <http://www.radian6.com/>.
 25. Sysomos, business intelligence for social media; [cited 2010 Jan 19]. <http://sysomos.com/>.
 26. RapidMiner, environment for machine learning and data mining experiments; [cited 2010 Jan 19]. <http://rapid-i.com/>.
 27. Go, Alec, Bhayani, Richa, Huang, Lei. Twitter Sentiment Classification using Distant Supervision. Stanford University; [cited 2010 Jan 19]. <http://www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf>.
 28. Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), 2002.
 29. Liu, Bing. Web Data Mining – Exploring Hyperlinks, Contents and Usage Data. Springer, 2007.
 30. Santorini, B. Part-of-speech Tagging Guidelines for the Penn Treebank Project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania, 1990.
 31. Pang, B., Lee, L., Vaithyanathan, S. Thumbs Up? Sentiment Classification Using Machine Learning Techniques. In Proc. of EMNLP-02, 2002.
 32. Dave, K., Lawrence, S., Pennock, D. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. In Proc. of WWW'03, 2003.
 33. Wiebe, J., Riloff, E. Learning Extraction Patterns for Subjective Expressions. In Proc. of EMNLP-03, 2003.
 34. Yu, H., Hatzivassiloglou, V. Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences. In Proc. of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 129-136, 2003.