Sentiment classification for product reviews (documentation)

The documentation of the pre-master graduation project prepared by myself and my colleagues Mostafa Ameen, Mai M. Farag and Mohamed Abd El kader.


Transcript

  • 1. 2013 Sentiment Classification For Product Reviews, by Mahmoud Mohamed Hassan, Mostafa Mohamed Ameen, Mohamed Abdelkader Hamed and Mai Mohamed Mahmoud. Supervisor: Dr. Mohamed Farouk. Cairo University - ISSR
  • 2. Table of Contents
Abstract, p. 5
Chapter 1 Introduction, p. 6
    1.1 Motivations, p. 6
Chapter 2 Sentiment Analysis, p. 7
    2.1 Sentiment Analysis Applications, p. 7
    2.2 Sentiment Analysis Research, p. 8
    2.3 Different Levels of Analysis, p. 8
        2.3.1 Document level, p. 8
        2.3.2 Sentence level, p. 8
        2.3.3 Entity and Aspect level, p. 8
    2.4 Sentiment Lexicon and Its Issues, p. 9
    2.5 Natural Language Processing Issues, p. 10
    2.6 Opinion Spam Detection, p. 10
Chapter 3 Machine Learning Approaches, p. 11
    3.1 Data preprocessing, p. 11
        3.1.1 Feature extraction, p. 11
        3.1.2 Feature selection (dimensionality reduction), p. 13
    3.2 Classification, p. 18
        3.2.1 Support vector machine classification (SVM), p. 18
        3.2.2 Naïve Bayes classification, p. 25
        3.2.3 Conditional Independence, p. 29
        3.2.4 Bayes Theorem, p. 31
        3.2.5 Maximum A Posteriori (MAP) Hypothesis, p. 32
        3.2.6 Maximum Likelihood (ML) Hypothesis, p. 33
        3.2.7 Naïve Bayesian Classification, p. 34
    3.3 Evaluation measures, p. 36
        3.3.1 Precision, p. 36
        3.3.2 Recall (Sensitivity), p. 37
        3.3.3 Accuracy, p. 37
        3.3.4 F1-measure, p. 37
  • 3. Chapter 4 Opinion lexicons, p. 38
    4.1 Definition, p. 38
    4.2 SENTIWORDNET 3.0, p. 38
    4.3 (Bing Liu) Opinion lexicon, p. 39
        4.3.1 Who is Dr. Bing Liu?, p. 39
        4.3.2 (Bing Liu) Opinion lexicon, p. 39
Chapter 5 Experimental results, p. 40
    5.1 Data collection, p. 40
    5.2 Feature extraction, p. 41
        5.2.1 Unigram, p. 41
        5.2.2 Bigram, p. 42
    5.3 Feature selection, p. 43
    5.4 Results of the classifiers, p. 47
        5.4.1 Unigram, p. 47
        5.4.2 Bigram, p. 50
    5.5 Comparison, p. 53
    5.6 Charts, p. 54
    5.7 UI for predictions preview, p. 57
    5.8 Application for live sentiment analysis, p. 58
Chapter 6 Conclusion, p. 60
References, p. 61
  • 4. Acknowledgment
First, we thank Allah for helping us to complete this project. We would like to thank our project supervisor, Dr. Mohamed Farouk, for his efforts and great support, which helped us improve our performance and knowledge from the very beginning. He helped us discover the world of machine learning and left us looking forward to exploring this overwhelming science further. All thanks and respect also go to the staff members who taught us during the four semesters of the diploma and the two semesters of the pre-master courses. We also thank Professor Hisham Hefny for his inspiration and support in teaching us many subjects and opening our minds to new trends and challenges in the computer science field. Special thanks go to our parents for their endless efforts and unconditional support in all phases of our lives, especially in our education, which led us this far. Project team
  • 5. Abstract
Sentiment classification concerns the use of automatic methods for predicting the orientation of subjective content in text documents, with applications in a number of areas including recommender and advertising systems, customer intelligence and information retrieval. This research presents the results of applying two different approaches to the problem of automatic sentiment classification of product reviews. The first approach uses opinion lexicons (SentiWordNet 3.0 (Stefano Baccianella) and Bing Liu’s opinion lexicon). The second approach uses supervised machine learning (the NaiveBayes and LibSVM algorithms). This research also attempts to present these results in a comparative manner, to help the reader understand the effect of varying the number of selected attributes, the attribute extraction method and the classification algorithm.
  • 6. Chapter 1 Introduction
Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. It represents a large problem space. There are also many names and slightly different tasks, e.g., sentiment analysis, opinion mining, opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis, review mining, etc. However, they are now all under the umbrella of sentiment analysis or opinion mining. In industry, the term sentiment analysis is more commonly used, while in academia both sentiment analysis and opinion mining are frequently employed. They basically represent the same field of study. The term sentiment analysis perhaps first appeared in (Nasukawa and Yi, 2003), and the term opinion mining first appeared in (Dave, Lawrence and Pennock, 2003). However, the research on sentiments and opinions appeared earlier (Das and Chen, 2001; Morinaga et al., 2002; Pang, Lee and Vaithyanathan, 2002; Tong, 2001; Turney, 2002; Wiebe, 2000).
Although linguistics and natural language processing (NLP) have a long history, little research had been done about people’s opinions and sentiments before the year 2000. Since then, the field has become a very active research area. There are several reasons for this. First, it has a wide range of applications, in almost every domain. The industry surrounding sentiment analysis has also flourished due to the proliferation of commercial applications. This provides a strong motivation for research. Second, it offers many challenging research problems, which had never been studied before. Third, for the first time in human history, we now have a huge volume of opinionated data in the social media on the Web. Without this data, a lot of research would not have been possible.
Not surprisingly, the inception and the rapid growth of sentiment analysis coincide with those of the social media. In fact, sentiment analysis is now right at the center of social media research. Hence, research in sentiment analysis not only has an important impact on NLP, but may also have a profound impact on management sciences, political science, economics, and social sciences, as they are all affected by people’s opinions. Although sentiment analysis research mainly started from early 2000, there was some earlier work on the interpretation of metaphors, sentiment adjectives, subjectivity, viewpoints, and affects (Hatzivassiloglou and McKeown, 1997; Hearst, 1992; Wiebe, 1990; Wiebe, 1994; Wiebe, Bruce and O'Hara, 1999).
1.1 Motivations
With the explosive growth of social media (e.g., reviews, forum discussions, blogs, micro-blogs, Twitter, comments, and postings in social network sites) on the Web, individuals and organizations are increasingly using the content in these media for decision making. Nowadays, if one wants to buy a consumer product, one is no longer limited to asking one’s friends and family for opinions because there are many user reviews and discussions in public forums on the Web about the product. For an organization, it may no longer be necessary to conduct surveys, opinion polls, and focus groups in order to gather public opinions because there is an abundance of such information publicly available. However, finding and monitoring opinion sites on the Web and distilling the information contained in them remains a formidable task because of the proliferation of diverse sites. Each site typically contains a huge volume of opinion text that is not always easily deciphered in
  • 7. long blogs and forum postings. The average human reader will have difficulty identifying relevant sites and extracting and summarizing the opinions in them. Automated sentiment analysis systems are thus needed. In recent years, we have witnessed that opinionated postings in social media have helped reshape businesses and sway public sentiments and emotions, which has profoundly impacted our social and political systems. Such postings have also mobilized masses for political changes such as those that happened in some Arab countries in 2011. It has thus become a necessity to collect and study opinions on the Web. Of course, opinionated documents do not exist only on the Web (called external data); many organizations also have their internal data, e.g., customer feedback collected from emails and call centers or results from surveys conducted by the organizations. Due to these applications, industrial activities have flourished in recent years. Sentiment analysis applications have spread to almost every possible domain, from consumer products, services, healthcare, and financial services to social events and political elections.
Chapter 2 Sentiment Analysis
Opinions are central to almost all human activities because they are key influencers of our behaviors. Whenever we need to make a decision, we want to know others’ opinions. In the real world, businesses and organizations always want to find consumer or public opinions about their products and services. Individual consumers also want to know the opinions of existing users of a product before purchasing it, and others’ opinions about political candidates before making a voting decision in a political election. In the past, when an individual needed opinions, he/she asked friends and family. When an organization or a business needed public or consumer opinions, it conducted surveys, opinion polls, and focus groups.
Acquiring public and consumer opinions has long been a huge business itself for marketing, public relations, and political campaign companies.
2.1 Sentiment Analysis Applications
Apart from real-life applications, many application-oriented research papers have also been published. For example, in (Liu et al., 2007), a sentiment model was proposed to predict sales performance. In (McGlohon, Glance and Reiter, 2010), reviews were used to rank products and merchants. In (Hong and Skiena, 2010), the relationships between the NFL betting line and public opinions in blogs and Twitter were studied. In (O'Connor et al., 2010), Twitter sentiment was linked with public opinion polls. In (Tumasjan et al., 2010), Twitter sentiment was also applied to predict election results. In (Chen et al., 2010), the authors studied political standpoints. In (Yano and Smith, 2010), a method was reported for predicting comment volumes of political blogs. In (Asur and Huberman, 2010; Joshi et al., 2010; Sadikov, Parameswaran and Venetis, 2009), Twitter data, movie reviews and blogs were used to predict box-office revenues for movies. In (Miller et al., 2011), sentiment flow in social networks was investigated. In (Mohammad and Yang, 2011), sentiments in mails were used to find how genders differed on emotional axes. In (Mohammad, 2011), emotions in novels and fairy tales were tracked. In (Bollen, Mao and Zeng, 2011), Twitter moods were used to predict the stock market. In (Bar-Haim et al., 2011; Feldman et al., 2011), expert investors in microblogs were identified and sentiment analysis of stocks was performed. In (Zhang and Skiena, 2010), blog and news sentiment was used to study trading strategies. In (Sakunkoo and Sakunkoo, 2009), social influences in online book reviews were studied. In (Groh and Hauffa, 2011), sentiment
  • 8. analysis was used to characterize social relations. A comprehensive sentiment analysis system and some case studies were also reported in (Castellanos et al., 2011).
2.2 Sentiment Analysis Research
As discussed above, pervasive real-life applications are only part of the reason why sentiment analysis is a popular research problem. It is also highly challenging as an NLP research topic, and covers many novel sub-problems, as we will see later. Additionally, there was little research before the year 2000 in either NLP or linguistics. Part of the reason is that before then there was little opinion text available in digital form. Since the year 2000, the field has grown rapidly to become one of the most active research areas in NLP. It is also widely researched in data mining, Web mining, and information retrieval. In fact, it has spread from computer science to management sciences (Archak, Ghose and Ipeirotis, 2007; Chen and Xie, 2008; Das and Chen, 2007; Dellarocas, Zhang and Awad, 2007; Ghose, Ipeirotis and Sundararajan, 2007; Hu, Pavlou and Zhang, 2006; Park, Lee and Han, 2007).
2.3 Different Levels of Analysis
In general, sentiment analysis has been investigated mainly at three levels:
2.3.1 Document level: The task at this level is to classify whether a whole opinion document expresses a positive or negative sentiment (Pang, Lee and Vaithyanathan, 2002; Turney, 2002). For example, given a product review, the system determines whether the review expresses an overall positive or negative opinion about the product. This task is commonly known as document-level sentiment classification. This level of analysis assumes that each document expresses opinions on a single entity (e.g., a single product). Thus, it is not applicable to documents which evaluate or compare multiple entities.
2.3.2 Sentence level: The task at this level goes to the sentences and determines whether each sentence expresses a positive, negative, or neutral opinion.
Neutral usually means no opinion. This level of analysis is closely related to subjectivity classification (Wiebe, Bruce and O'Hara, 1999), which distinguishes sentences that express factual information (called objective sentences) from sentences that express subjective views and opinions (called subjective sentences). However, we should note that subjectivity is not equivalent to sentiment, as many objective sentences can imply opinions, e.g., “We bought the car last month and the windshield wiper has fallen off.” Researchers have also analyzed clauses (Wilson, Wiebe and Hwa, 2004), but the clause level is still not enough, e.g., “Apple is doing very well in this lousy economy.”
2.3.3 Entity and Aspect level: Neither the document-level nor the sentence-level analysis discovers what exactly people liked and did not like. Aspect level performs finer-grained analysis. Aspect level was earlier called feature level (feature-based opinion mining and summarization) (Hu and Liu, 2004). Instead of looking at language constructs (documents, paragraphs, sentences, clauses or phrases), aspect level directly looks at the opinion itself. It is based on the idea that an opinion consists of a sentiment (positive or negative) and a target (of opinion). An
  • 9. opinion without its target being identified is of limited use. Realizing the importance of opinion targets also helps us understand the sentiment analysis problem better. For example, although the sentence “although the service is not that great, I still love this restaurant” clearly has a positive tone, we cannot say that this sentence is entirely positive. In fact, the sentence is positive about the restaurant (emphasized), but negative about its service (not emphasized). In many applications, opinion targets are described by entities and/or their different aspects. Thus, the goal of this level of analysis is to discover sentiments on entities and/or their aspects. For example, the sentence “The iPhone’s call quality is good, but its battery life is short” evaluates two aspects, call quality and battery life, of iPhone (entity). The sentiment on iPhone’s call quality is positive, but the sentiment on its battery life is negative. The call quality and battery life of iPhone are the opinion targets. Based on this level of analysis, a structured summary of opinions about entities and their aspects can be produced, which turns unstructured text into structured data and can be used for all kinds of qualitative and quantitative analyses. Both the document-level and sentence-level classifications are already highly challenging. The aspect level is even more difficult. It consists of several sub-problems. To make things even more interesting and challenging, there are two types of opinions, i.e., regular opinions and comparative opinions (Jindal and Liu, 2006b). A regular opinion expresses a sentiment only on a particular entity or an aspect of the entity, e.g., “Coke tastes very good,” which expresses a positive sentiment on the aspect taste of Coke.
A comparative opinion compares multiple entities based on some of their shared aspects, e.g., “Coke tastes better than Pepsi,” which compares Coke and Pepsi based on their tastes (an aspect) and expresses a preference for Coke.
2.4 Sentiment Lexicon and Its Issues
Not surprisingly, the most important indicators of sentiments are sentiment words, also called opinion words. These are words that are commonly used to express positive or negative sentiments. For example, good, wonderful, and amazing are positive sentiment words, and bad, poor, and terrible are negative sentiment words. Apart from individual words, there are also phrases and idioms, e.g., cost someone an arm and a leg. Sentiment words and phrases are instrumental to sentiment analysis for obvious reasons. A list of such words and phrases is called a sentiment lexicon (or opinion lexicon). Over the years, researchers have designed numerous algorithms to compile such lexicons. Although sentiment words and phrases are important for sentiment analysis, only using them is far from sufficient. The problem is much more complex. In other words, we can say that a sentiment lexicon is necessary but not sufficient for sentiment analysis. Below, we highlight several issues:
1. A positive or negative sentiment word may have opposite orientations in different application domains. For example, “suck” usually indicates negative sentiment, e.g., “This camera sucks,” but it can also imply positive sentiment, e.g., “This vacuum cleaner really sucks.”
2. A sentence containing sentiment words may not express any sentiment. This phenomenon happens frequently in several types of sentences. Question (interrogative) sentences and conditional sentences are two important types, e.g., “Can you tell me which Sony camera is good?” and “If I can find a good camera in the shop, I will buy it.” Both these sentences contain the sentiment word “good”, but neither expresses a positive or negative opinion on any specific camera. However, not all conditional sentences
  • 10. or interrogative sentences express no sentiments, e.g., “Does anyone know how to repair this terrible printer” and “If you are looking for a good car, get Toyota Camry.”
3. Sarcastic sentences with or without sentiment words are hard to deal with, e.g., “What a great car! It stopped working in two days.” Sarcasm is not so common in consumer reviews about products and services, but is very common in political discussions, which makes political opinions hard to deal with.
4. Many sentences without sentiment words can also imply opinions. Many of these sentences are actually objective sentences that are used to express some factual information. Again, there are many types of such sentences. Here we just give two examples. The sentence “This washer uses a lot of water” implies a negative sentiment about the washer since it uses a lot of a resource (water). The sentence “After sleeping on the mattress for two days, a valley has formed in the middle” expresses a negative opinion about the mattress. This sentence is objective as it states a fact. Neither of these sentences contains a sentiment word.
2.5 Natural Language Processing Issues
Finally, we must not forget that sentiment analysis is an NLP problem. It touches every aspect of NLP, e.g., coreference resolution, negation handling, and word sense disambiguation, which add more difficulties since these are not solved problems in NLP. However, it is also useful to realize that sentiment analysis is a highly restricted NLP problem because the system does not need to fully understand the semantics of each sentence or document but only needs to understand some aspects of it, i.e., positive or negative sentiments and their target entities or topics. In this sense, sentiment analysis offers a great platform for NLP researchers to make tangible progress on all fronts of NLP with the potential of making a huge practical impact.
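The lexicon issues listed in section 2.4 can be illustrated with a minimal sketch. The word lists below are a tiny hypothetical lexicon invented for this example (they are not SentiWordNet or Bing Liu's lexicon), and `lexicon_score` is an assumed helper name. It does nothing more than count positive words minus negative words.

```python
import re

# Hypothetical toy lexicon, for illustration only; a real opinion
# lexicon contains thousands of entries.
POSITIVE = {"good", "great", "wonderful", "amazing", "love"}
NEGATIVE = {"bad", "poor", "terrible", "sucks"}

def lexicon_score(sentence):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    return positive_hits - negative_hits

examples = [
    "This camera sucks",
    "Can you tell me which Sony camera is good?",
    "What a great car! It stopped working in two days.",
    "This washer uses a lot of water",
]
for sentence in examples:
    print(lexicon_score(sentence), sentence)
```

Under these assumptions the question and the sarcastic sentence both receive a positive score, and the washer sentence scores zero, although none of these readings matches the opinion actually expressed. This is the sense in which a lexicon alone is not sufficient.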
2.6 Opinion Spam Detection
A key feature of social media is that it enables anyone from anywhere in the world to freely express his/her views and opinions without disclosing his/her true identity and without the fear of undesirable consequences. These opinions are thus highly valuable. However, this anonymity also comes with a price. It allows people with hidden agendas or malicious intentions to easily game the system to give people the impression that they are independent members of the public and post fake opinions to promote or to discredit target products, services, organizations, or individuals without disclosing their true intentions, or the person or organization that they are secretly working for. Such individuals are called opinion spammers and their activities are called opinion spamming (Jindal and Liu, 2008; Jindal and Liu, 2007). Opinion spamming has become a major issue. Apart from individuals who give fake opinions in reviews and forum discussions, there are also commercial companies that are in the business of writing fake reviews and bogus blogs for their clients. Several high-profile cases of fake reviews have been reported in the news. It is important to detect such spamming activities to ensure that the opinions on the Web are a trusted source of valuable information. Unlike extraction of positive and negative opinions, opinion spam detection is not just an NLP problem as it involves the analysis of people’s posting behaviors. It is thus also a data mining problem.
  • 11. Chapter 3 Machine Learning Approaches
3.1 Data preprocessing
Sentiment classification, or opinion classification, is a text classification problem in the first place. The performance of a text classification task is directly affected by the representation of the data. Once features are appropriately selected, even simple classifiers may produce good classification results. Several feature selection and extraction methods have been proposed in the literature. Feature selection merely selects a good subset of the original features, whereas feature extraction allows for arbitrary new features based on the original ones.
3.1.1 Feature extraction
The most commonly used features to represent words, Term Frequency (TF) and Inverse Document Frequency (IDF), may not always be appropriate. Choosing an appropriate representation of words in text documents is crucial to obtaining good classification performance. Researchers have used different representations to maximize the accuracy of machine learning algorithms. The “bag of words” representation is widely used to represent text documents. In this representation, a document is considered to be an unordered collection of words, and the position of words in the document bears no importance. “Bag of words” is the simplest representation of textual data. The number of occurrences of each word in the document is represented by term frequency (TF), which is a document-specific measure of the importance of a term. The collection of documents under consideration is called a corpus. The importance of a term in a document is measured by its weight in the document. A number of term weighting techniques have been proposed in the literature. In the vector space model [2], a document is represented by a document vector whose components are term weights. A document using term frequency as term weights can be represented in vector form as {tf_1, tf_2, tf_3, ..., tf_n}, where tf_i is the term frequency of the i-th term and n is the total number of terms in the document.
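As a rough illustration of the term frequency vector described above, the following sketch builds bag-of-words vectors for a two-document corpus. The function name `term_frequency_vectors`, the whitespace tokenization and the choice to index every vector by a vocabulary shared across the corpus are assumptions made for this example, not the project's actual preprocessing.

```python
from collections import Counter

def term_frequency_vectors(documents):
    """Map each document to a list of term counts over a shared vocabulary."""
    # Assumption: lowercase and split on whitespace; word order is discarded.
    tokenized = [document.lower().split() for document in documents]
    # The vocabulary fixes which term each vector component refers to.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a missing term counts as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

corpus = ["good camera good price", "bad battery"]
vocabulary, vectors = term_frequency_vectors(corpus)
print(vocabulary)
print(vectors)
```

Each component of a vector is the tf value of one vocabulary term in that document, so the word "good" contributes 2 to the first document's vector and 0 to the second.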
Lengths of documents in a corpus may vary, and longer documents usually have higher term frequencies and more unique terms than shorter documents [18].
3.1.1.1 Text Feature Generators
Before we address the question of how to discard words, we must first determine what shall count as a word. For example, is ‘HP-UX’ one word, or is it two words? What about ‘650-857-1501’? When it comes to programming, a simple solution is to take any contiguous sequence of alphabetic characters, or alphanumeric characters to include identifiers such as ‘ioctl32’, which may sometimes be useful. By using the Posix regular expression \p{L&}+ we avoid breaking ‘naïve’ in two, as well as many accented words in French, German, etc. But what about ‘win 32’, ‘can’t’ or words that may be hyphenated over a line break? Like most data cleaning endeavors, the list of exceptions is endless, and one must simply draw a line somewhere and hope for an 80%-20% tradeoff. Fortunately, semantic errors in word parsing are usually only seen by the core learning algorithm, and it is their statistical properties that matter, not their readability or intuitiveness to people. Our purpose is to offer a range of feature generators so that the feature selector
  • 12. may discover the strongly predictive features. The most beneficial feature generators will vary according to the characteristics of the domain text.

Word Merging

One method of reducing the size of the feature space somewhat is to merge word variants together and treat them as a single feature. More importantly, this can also improve the predictive value of some features. Forcing all letters to lowercase is a nearly ubiquitous practice. It normalizes for capitalization at the beginning of a sentence, which does not otherwise affect the word’s meaning, and helps reduce the dispersion issue mentioned in the introduction. For proper nouns, it occasionally conflates other word meanings, e.g. ‘Bush’ or ‘LaTeX.’ Likewise, various word stemming algorithms can be used to merge multiple related word forms. For example, ‘cat,’ ‘cats,’ ‘catlike’ and ‘catty’ may all be merged into a common feature. Stemming typically benefits recall but at a cost of precision. If one is searching for ‘catty’ and the word is treated the same as ‘cat,’ then a certain amount of precision is necessarily lost. For extremely skewed class distributions, this loss may be unsupportable. Stemming algorithms make both over-stemming errors and under-stemming errors, but again, the semantics are less important than the feature’s statistical properties.

Word Phrases

Whereas merging related words together can produce features with more frequent occurrence (typically with greater recall and lower precision), identifying multiple word phrases as a single term can produce rarer, highly specific features (which typically aid precision and have lower recall), e.g. ‘John Denver’ or ‘user interface.’ Rather than require a dictionary of phrases as above, a simple approach is to treat all consecutive pairs of words as a phrase term, and let feature selection determine which are useful for prediction.
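As a rough sketch of the generators discussed so far (letter-run tokenization, lowercasing, and consecutive word pairs as phrase terms), one might write the following. The pattern `[^\W\d_]+` is a Python stand-in for "a run of letters" and is an assumption, as are the function names; stemming is left out because the standard library has no stemmer.

```python
import re

# Runs of word characters that are neither digits nor underscores,
# which approximates "contiguous alphabetic characters" for Unicode text.
WORD_RE = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return [word.lower() for word in WORD_RE.findall(text)]

def word_bigrams(tokens):
    """Treat every pair of consecutive words as one phrase feature."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = tokenize("The User Interface is great")
features = tokens + word_bigrams(tokens)
```

Feature selection would then decide which of the single words and word pairs are worth keeping.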
This can be extended to phrases of three or more words, with occasionally more specificity but with strictly decreasing frequency. Most of the benefit is obtained by two-word phrases. This is in part because portions of the phrase may already have the same statistical properties, e.g. the four-word phrase ‘United States of America’ is covered already by the two-word phrase ‘United States.’ In addition, the reach of a two-word phrase can be extended by eliminating common stop words, e.g. ‘head of the household’ becomes ‘head household.’ Stop word lists are language specific, unfortunately. Their primary benefit to classification is in extending the reach of phrases, rather than eliminating commonly useless words, which most feature selection methods can already remove in a language-independent fashion.

Character N-grams

The word identification methods above fail in some situations, and can miss some good opportunities for features. For example, languages such as Chinese and Japanese do not use a space character. Segmenting such text into words is complex, whereas nearly equivalent accuracy may be obtained by simply using every pair of adjacent Unicode characters as features (n-grams). Certainly many of the combinations will be meaningless, but feature selection can identify the most predictive ones. For languages that use the Latin character set, 3-grams or 6-grams may be appropriate. For example, n-grams would capture the essence of common technical text patterns such as ‘HP-UX 11.0’, ‘while (<>) {’, ‘#!/bin/’, and ‘ :)’. Phrases of two adjacent n-grams simply correspond to (2n)-grams. Note that while the number of
  • 13. potential n-grams grows exponentially with n, in practice only a small fraction of the possibilities occur in actual training examples, and only a fraction of those will be found predictive.

Multi-Field Records

Although most research deals with training cases as a single string, many applications have multiple text (and non-text) fields associated with each record. In document management, these may be title, author, abstract, keywords, body, and references. In technical support, they may be title, product, keywords, engineer, customer, symptoms, problem description, and solution. Multi-field records are common in applications, even though the bulk of text classification research treats only a single string. Furthermore, when classifying long strings, e.g. arbitrary file contents, the first few kilobytes may be treated as a separate field and may often prove sufficient for generating adequate features, avoiding the overhead of processing huge files, such as tar or zip archives.

Feature Values

Once a decision has been made about what to consider as a feature term, the meaning of the numerical feature must be determined. For some purposes, a binary value is sufficient, indicating whether the term appears at all. This representation is used by the Bernoulli formulation of the Naive Bayes classifier. Many other classifiers use the term frequency tf(t,k) (the count of term t in document k) directly as the feature value.

3.1.2 Feature selection (dimensionality reduction)

The total number of features in a corpus of text documents is the number of unique words present in all documents. Documents belonging to different categories share fewer words, which produces a large number of unique words in the whole corpus. High dimensionality is thus inherent to text classification. Researchers have been trying to filter out terms which are not important for classification or are redundant.
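Looking back at the character n-gram generator and the two feature-value conventions (binary presence versus term frequency) described under Text Feature Generators, a minimal sketch could look like this. The function names and the example string are illustrative assumptions.

```python
from collections import Counter

def char_ngrams(text, n):
    """Every run of n adjacent characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def count_features(text, n):
    """Feature value = number of times the n-gram occurs (term frequency)."""
    return Counter(char_ngrams(text, n))

def binary_features(text, n):
    """Feature value = 1 if the n-gram appears at all (Bernoulli style)."""
    return {gram: 1 for gram in count_features(text, n)}

counts = count_features("banana", 2)
binary = binary_features("banana", 2)
```

Both representations share the same feature names; only the value attached to each feature differs.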
Techniques used for filtering terms in text classification are based on the assumption that very rare and very common terms do not help in discriminating documents of different categories. Very common terms occurring in all documents are treated as stop words and removed. Rarely occurring terms, which appear in only 2 or 3 documents, are also not considered. One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables are important for understanding the underlying phenomena of interest. While certain computationally expensive novel methods can construct predictive models with high accuracy from high-dimensional data, it is still of interest in many applications to reduce the dimension of the original data prior to any modeling. Feature selection reduces both the size of the data and the computational complexity. The reduced dataset is more efficient to process, and the selected feature subset can itself be informative.
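The filtering rule above (drop terms that occur in nearly every document, and terms that occur in too few) might be sketched as follows. The thresholds `min_df` and `max_df_ratio`, the function names, and the toy documents are assumptions chosen for a four-document example, not values used in the project.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Number of documents in which each term occurs."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return df

def filter_vocabulary(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare nor too common."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

docs = [
    "the screen is great".split(),
    "the battery is poor".split(),
    "the screen cracked".split(),
    "the battery is great".split(),
]
kept = filter_vocabulary(docs)
```

Here "the" is dropped because it occurs in every document, while "poor" and "cracked" are dropped because each occurs in only one.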
  • 14. There are four main steps in a feature selection method (see Figure 1):

Generation = select feature subset candidate.
Evaluation = compute relevancy value of the subset.
Stopping criterion = determine whether subset is relevant.
Validation = verify subset validity.

Figure 1

3.1.2.1 Information Gain (IG)

Entropy and Information Gain

The entropy (very common in Information Theory) characterizes the impurity of an arbitrary collection of examples. Information Gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute.

$$\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the proportion of examples in $S$ that belong to class $i$.
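To make the two quantities concrete, here is a small sketch that scores a binary "term present" feature by information gain. It assumes the usual definitions (entropy computed from class proportions with a base-2 logarithm, gain as the entropy of the whole set minus the size-weighted entropy of each partition); the names and the four-review toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a collection of class labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(feature_values, labels):
    """Expected reduction in entropy from partitioning on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

labels = ["pos", "pos", "neg", "neg"]
has_great = [1, 1, 0, 0]   # term appears only in positive reviews
has_the = [1, 0, 1, 0]     # term is spread evenly over both classes
```

A term that splits the classes perfectly gains the full entropy of the label set, while an evenly spread term gains nothing, so ranking terms by this score favours the discriminating ones.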
  • 15. 3.1.2.2 Principal Component Analysis (PCA)

Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent only if the data set is jointly normally distributed. PCA was invented in 1901 by Karl Pearson,[1] as an analogue of the principal axes theorem in mechanics; it was later independently developed (and named) by Harold Hotelling in the 1930s.[2] The method is mostly used as a tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute.[3] The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
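PCA is mathematically defined[7] as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider a data matrix, $X$, with zero empirical mean (the empirical (sample) mean of the distribution has been subtracted from the data set), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of datum (say, the results from a particular probe). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights or loadings $\mathbf{w}_{(k)} = (w_1, \ldots, w_p)_{(k)}$ that map each row vector $\mathbf{x}_{(i)}$ of $X$ to a new vector of principal component scores $\mathbf{t}_{(i)} = (t_1, \ldots, t_p)_{(i)}$, given by

$$t_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)}$$

in such a way that the individual variables of $\mathbf{t}$ considered over the data set successively inherit the maximum possible variance from $\mathbf{x}$, with each loading vector $\mathbf{w}$ constrained to be a unit vector.

3.1.2.2.1 First component

The first loading vector $\mathbf{w}_{(1)}$ thus has to satisfy

$$\mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\| = 1} \left\{ \sum_i \left(t_1\right)^2_{(i)} \right\} = \arg\max_{\|\mathbf{w}\| = 1} \left\{ \sum_i \left(\mathbf{x}_{(i)} \cdot \mathbf{w}\right)^2 \right\}$$

Equivalently, writing this in matrix form gives

$$\mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\| = 1} \left\{ \|X\mathbf{w}\|^2 \right\} = \arg\max_{\|\mathbf{w}\| = 1} \left\{ \mathbf{w}^T X^T X \mathbf{w} \right\}$$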
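  • 16. Since $\mathbf{w}_{(1)}$ has been defined to be a unit vector, it equally must also satisfy

$$\mathbf{w}_{(1)} = \arg\max \left\{ \frac{\mathbf{w}^T X^T X \mathbf{w}}{\mathbf{w}^T \mathbf{w}} \right\}$$

The quantity to be maximized can be recognized as a Rayleigh quotient. A standard result for a symmetric matrix such as $X^T X$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when $\mathbf{w}$ is the corresponding eigenvector. With $\mathbf{w}_{(1)}$ found, the first component of a data vector $\mathbf{x}_{(i)}$ can then be given as a score $t_{1(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(1)}$ in the transformed co-ordinates, or as the corresponding vector in the original variables, $\{\mathbf{x}_{(i)} \cdot \mathbf{w}_{(1)}\}\, \mathbf{w}_{(1)}$.

3.1.2.2.2 Further components

The $k$th component can be found by subtracting the first $k-1$ principal components from $X$:

$$\hat{X}_k = X - \sum_{s=1}^{k-1} X \mathbf{w}_{(s)} \mathbf{w}_{(s)}^T$$

and then finding the loading vector which extracts the maximum variance from this new data matrix

$$\mathbf{w}_{(k)} = \arg\max_{\|\mathbf{w}\| = 1} \left\{ \|\hat{X}_k \mathbf{w}\|^2 \right\} = \arg\max \left\{ \frac{\mathbf{w}^T \hat{X}_k^T \hat{X}_k \mathbf{w}}{\mathbf{w}^T \mathbf{w}} \right\}$$

It turns out that this gives the remaining eigenvectors of $X^T X$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues. The $k$th principal component of a data vector $\mathbf{x}_{(i)}$ can therefore be given as a score $t_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)}$ in the transformed co-ordinates, or as the corresponding vector in the space of the original variables, $\{\mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)}\}\, \mathbf{w}_{(k)}$, where $\mathbf{w}_{(k)}$ is the $k$th eigenvector of $X^T X$. The full principal components decomposition of $X$ can therefore be given as

$$T = XW$$

where $W$ is a $p$-by-$p$ matrix whose columns are the eigenvectors of $X^T X$.

3.1.2.2.3 Covariance

$X^T X$ itself can be recognized as proportional to the empirical sample covariance matrix of the dataset $X$. The sample covariance $Q$ between two of the different principal components over the dataset is given by

$$\begin{aligned} Q(PC_{(j)}, PC_{(k)}) &\propto (X\mathbf{w}_{(j)})^T (X\mathbf{w}_{(k)}) \\ &= \mathbf{w}_{(j)}^T X^T X \mathbf{w}_{(k)} \\ &= \mathbf{w}_{(j)}^T \lambda_{(k)} \mathbf{w}_{(k)} \\ &= \lambda_{(k)} \mathbf{w}_{(j)}^T \mathbf{w}_{(k)} \end{aligned}$$

where the eigenvector property of $\mathbf{w}_{(k)}$ has been used to move from line 2 to line 3. However, eigenvectors $\mathbf{w}_{(j)}$ and $\mathbf{w}_{(k)}$ corresponding to eigenvalues of a symmetric matrix are orthogonal (if the eigenvalues are different), or can be orthogonalized (if the vectors happen to share an equal repeated value). The product in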
  • 17. the final line is therefore zero; there is no sample covariance between different principal components over the dataset. Another way to characterize the principal components transformation is therefore as the transformation to coordinates which diagonalize the empirical sample covariance matrix. In matrix form, the empirical covariance matrix for the original variables can be written

$$Q \propto X^T X = W \Lambda W^T$$

The empirical covariance matrix between the principal components becomes

$$W^T Q W \propto W^T W \Lambda W^T W = \Lambda$$

where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_{(k)}$ of $X^T X$ ($\lambda_{(k)}$ being equal to the sum of the squares over the dataset associated with each component $k$: $\lambda_{(k)} = \sum_i t_{k(i)}^2 = \sum_i (\mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)})^2$).

3.1.2.2.4 Dimensionality reduction

The faithful transformation $T = XW$ maps a data vector $\mathbf{x}_{(i)}$ from an original space of $p$ variables to a new space of $p$ variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation

$$T_L = X W_L$$

where the matrix $T_L$ now has $n$ rows but only $L$ columns. By construction, of all the transformed data matrices with only $L$ columns, this score matrix maximizes the variance in the original data that has been preserved, while minimizing the total squared reconstruction error $\|T W^T - T_L W_L^T\|_2^2$. Such dimensionality reduction can be a very useful step for visualizing and processing high-dimensional datasets, while still retaining as much of the variance in the dataset as possible.
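The first loading vector and the L = 1 truncated scores can be approximated without a linear algebra library. The sketch below uses power iteration on X^T X, which is a substitute technique chosen here for brevity rather than the eigenvalue or singular value decomposition named in the text; it assumes the largest eigenvalue is distinct and the data is not all zeros. The data and function names are made up for illustration.

```python
import math

def center(X):
    """Subtract the column means so the data has zero empirical mean."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def project(X, w):
    """Scores t_(i) = x_(i) . w for every row of X."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]

def first_loading_vector(X, iterations=200):
    """Power iteration on X^T X: converges to the unit vector maximizing ||Xw||^2."""
    p = len(X[0])
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        scores = project(X, w)                                   # X w
        v = [sum(t * row[j] for t, row in zip(scores, X)) for j in range(p)]  # X^T (X w)
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    return w

X = center([[2.0, 1.9], [0.0, 0.1], [-2.0, -2.0], [1.0, 1.0], [-1.0, -1.0]])
w1 = first_loading_vector(X)
scores = project(X, w1)   # the single column of T_L for L = 1
```

For this nearly diagonal cloud of points, `w1` comes out close to the direction (1, 1)/sqrt(2), and the sum of squared scores matches the largest eigenvalue of X^T X.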
For example, selecting L=2 and keeping only the first two principal components finds the two-dimensional plane through the high-dimensional dataset in which the data is most spread out. If the data contains clusters, these too may be most spread out, and therefore most visible when plotted in a two-dimensional diagram; whereas if two directions through the data (or two of the original variables) are chosen at random, the clusters may be much less spread apart from each other, and may in fact be much more likely to substantially overlay each other, making them indistinguishable. Similarly, in regression analysis, the larger the number of explanatory variables allowed, the greater the chance of overfitting the model, producing conclusions that fail to generalize to other datasets. One approach, especially when there are strong correlations between different possible explanatory variables, is to reduce them to a few principal components and then run the regression against them, a method called principal component regression. Dimensionality reduction may also be appropriate when the variables in a dataset are noisy. If each column of the dataset contains independent identically distributed Gaussian noise, then the columns of T will also contain similarly identically distributed Gaussian noise (such a distribution is invariant under the effects of the matrix W, which can be thought of as a high-dimensional rotation of the co-ordinate axes). However, with more of the total variance concentrated in the first few principal components compared to the same noise variance, the proportionate effect of the noise is less: the first components achieve a higher signal-to-noise ratio. PCA thus can have the effect of concentrating much of the signal into the first few principal components, which can
  • 18. usefully be captured by dimensionality reduction; while the later principal components may be dominated by noise, and so disposed of without great loss.

3.2 Classification

3.2.1 Support vector machine classification (SVM)

3.2.1.1 Description

Support Vector Machines (SVM's) are a relatively new learning method used for binary classification. The basic idea is to find a hyperplane which separates the d-dimensional data perfectly into its two classes. However, since example data is often not linearly separable, SVM's introduce the notion of a "kernel induced feature space" which casts the data into a higher dimensional space where the data is separable. Typically, casting into such a space would cause problems computationally, and with overfitting. The key insight used in SVM's is that the higher-dimensional space doesn't need to be dealt with directly (as it turns out, only the formula for the dot-product in that space is needed), which eliminates the above concerns. Furthermore, the VC-dimension (a measure of a system's likelihood to perform well on unseen data) of SVM's can be explicitly calculated, unlike other learning methods like neural networks, for which there is no measure. Overall, SVM's are intuitive, theoretically well-founded, and have been shown to be practically successful. SVM's have also been extended to solve regression tasks (where the system is trained to output a numerical value, rather than a "yes/no" classification).

3.2.1.2 History

Support Vector Machines were introduced by Vladimir Vapnik and colleagues. The earliest mention was in (Vapnik, 1979), but the first main paper seems to be (Vapnik, 1995).

3.2.1.3 Mathematics

We are given $l$ training examples $\{\mathbf{x}_i, y_i\}$, $i = 1, \ldots, l$, where each example has $d$ inputs ($\mathbf{x}_i \in \mathbb{R}^d$), and a class label with one of two values ($y_i \in \{-1, 1\}$).
Now, all hyperplanes in $\mathbb{R}^d$ are parameterized by a vector ($\mathbf{w}$) and a constant ($b$), expressed in the equation

$$\mathbf{w} \cdot \mathbf{x} + b = 0 \qquad (1)$$

(Recall that $\mathbf{w}$ is in fact the vector orthogonal to the hyperplane.) Given such a hyperplane ($\mathbf{w}$, $b$) that separates the data, this gives the function

$$f(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b) \qquad (2)$$

This correctly classifies the training data (and hopefully other "testing" data it hasn't seen yet). However, a given hyperplane represented by ($\mathbf{w}$, $b$) is equally expressed by all pairs $(\lambda\mathbf{w}, \lambda b)$ for $\lambda \in \mathbb{R}^+$. So we define the canonical hyperplane to be that which separates the data from the hyperplane by a "distance" of at least 1. That is, we consider those that satisfy:

$$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 \quad \text{when } y_i = +1 \qquad (3)$$
  • 19.

$$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 \quad \text{when } y_i = -1 \qquad (4)$$

or more compactly:

$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 \quad \forall i \qquad (5)$$

All such hyperplanes have a "functional distance" ≥ 1 (quite literally, the function's value is ≥ 1). This shouldn't be confused with the "geometric" or "Euclidean distance" (also known as the margin). For a given hyperplane ($\mathbf{w}$, $b$), all pairs $(\lambda\mathbf{w}, \lambda b)$ define the exact same hyperplane, but each has a different functional distance to a given data point. To obtain the geometric distance from the hyperplane to a data point, we must normalize by the magnitude of $\mathbf{w}$. This distance is simply:

$$d\big((\mathbf{w}, b), \mathbf{x}_i\big) = \frac{y_i(\mathbf{x}_i \cdot \mathbf{w} + b)}{\|\mathbf{w}\|} \geq \frac{1}{\|\mathbf{w}\|} \qquad (6)$$

Intuitively, we want the hyperplane that maximizes the geometric distance to the closest data points. (See figure 1.)

Figure 1: Choosing the hyperplane that maximizes the margin.
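The difference between the functional and the geometric distance is easy to check numerically. In this sketch the hyperplane, the data point, and the function names are illustrative assumptions; the point is only that rescaling (w, b) changes the first quantity and leaves the second untouched.

```python
import math

def functional_distance(w, b, x, y):
    """y * (x . w + b): depends on how (w, b) happens to be scaled."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_distance(w, b, x, y):
    """Functional distance normalized by the magnitude of w (eq. 6)."""
    return functional_distance(w, b, x, y) / math.sqrt(sum(wi * wi for wi in w))

w, b = [0.5, 0.5], 0.0        # hypothetical canonical hyperplane
x, y = [1.0, 1.0], +1         # a point lying exactly on the margin
w_scaled, b_scaled = [5.0, 5.0], 0.0   # same hyperplane, scaled by 10

canonical = (functional_distance(w, b, x, y), geometric_distance(w, b, x, y))
rescaled = (functional_distance(w_scaled, b_scaled, x, y),
            geometric_distance(w_scaled, b_scaled, x, y))
```

The functional distance grows from 1 to 10 under the rescaling, while the geometric distance stays at sqrt(2), which equals 1/||w|| for the canonical pair.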
  • 20. From the equation we see this is accomplished by minimizing $\|\mathbf{w}\|$ (subject to the distance constraints). The main method of doing this is with Lagrange multipliers. (See (Vapnik, 1995), or (Burges, 1998) for derivation details.) The problem is eventually transformed into:

minimize:

$$W(\alpha) = -\sum_{i=1}^{l} \alpha_i + \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j)$$

subject to:

$$\sum_{i=1}^{l} y_i \alpha_i = 0, \qquad 0 \leq \alpha_i \leq C \quad (\forall i)$$

where $\alpha$ is the vector of $l$ non-negative Lagrange multipliers to be determined, and $C$ is a constant (to be explained later). We can define the matrix $(H)_{ij} = y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$ and introduce more compact notation:

minimize:

$$W(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha \qquad (7)$$

subject to:

$$\alpha^T \mathbf{y} = 0 \qquad (8)$$

$$\mathbf{0} \leq \alpha \leq C\mathbf{1} \qquad (9)$$

(This minimization problem is what is known as a Quadratic Programming Problem (QP). Fortunately, many techniques have been developed to solve them.) In addition, from the derivation of these equations, it was seen that the optimal hyperplane can be written as:

$$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i \qquad (10)$$

That is, the vector $\mathbf{w}$ is just a linear combination of the training examples. Interestingly, it can also be shown that

$$\alpha_i \left( y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right) = 0 \quad (\forall i)$$

which is just a fancy way of saying that when the functional distance of an example is strictly greater than 1 (when $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$) then $\alpha_i = 0$. So only the closest data points contribute to $\mathbf{w}$. These training examples for which $\alpha_i > 0$ are termed support vectors. They are the only ones needed in defining (and finding) the optimal hyperplane. Intuitively, the support vectors are the "borderline cases" in the decision function we are trying to learn. Even more interesting is that $\alpha_i$ can be thought of as a "difficulty rating" for the example $\mathbf{x}_i$: how important that example was in determining the hyperplane. Assuming we have the optimal $\alpha$ (from which we construct $\mathbf{w}$), we must still determine $b$ to fully specify the hyperplane. To do this, take any "positive" and "negative" support vector, $\mathbf{x}^+$ and $\mathbf{x}^-$, for which we know
  • 21.

$$\mathbf{w} \cdot \mathbf{x}^+ + b = +1$$

$$\mathbf{w} \cdot \mathbf{x}^- + b = -1$$

Solving these equations gives us

$$b = -\frac{1}{2}\left(\mathbf{w} \cdot \mathbf{x}^+ + \mathbf{w} \cdot \mathbf{x}^-\right) \qquad (11)$$

Now, you may have wondered about the need for the constraint (eq. 9)

$$\alpha_i \leq C \quad (\forall i)$$

When $C = \infty$, the optimal hyperplane will be the one that completely separates the data (assuming one exists). For finite $C$, this changes the problem to finding a "soft-margin" classifier, which allows for some of the data to be misclassified. One can think of $C$ as a tunable parameter: higher $C$ corresponds to more importance on classifying all the training data correctly, lower $C$ results in a more "flexible" hyperplane that tries to minimize the margin error (how badly $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) < 1$) for each example. Finite values of $C$ are useful in situations where the data is not easily separable (perhaps because the input data $\{\mathbf{x}_i\}$ are noisy).

3.2.1.4 The Generalization Ability of Perfectly Trained SVM's

Suppose we find the optimal hyperplane separating the data, and of the $l$ training examples, $N_s$ of them are support vectors. It can then be shown that the expected out-of-sample error (the portion of unseen data that will be misclassified), $\Pi$, is bounded by

$$\Pi \leq \frac{N_s}{l - 1} \qquad (12)$$

This is a very useful result. It ties together the notions that simpler systems are better (Ockham's Razor principle) and that for SVM's, fewer support vectors are in fact a more "compact" and "simpler" representation of the hyperplane and hence should perform better. If the data cannot be separated, however, no such theorem applies, which at this point seems to be a potential setback for SVM's.

3.2.1.5 Mapping the Inputs to other dimensions – the use of Kernels

Now just because a data set is not linearly separable doesn't mean there isn't some other concise way to separate the data. For example, it might be easier to separate the data using polynomial curves, or circles. However, finding the optimal curve to fit the data is difficult, and it would be a shame not to use the method of finding the optimal hyperplane that we investigated in the previous section.
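Returning briefly to eqs. 2, 10 and 11: once the multipliers are known, the hyperplane and the classifier follow mechanically. The sketch below skips the quadratic program entirely and plugs in multipliers for a two-point toy problem where they can be worked out by hand; the data, the multiplier values, and the function names are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alphas, ys, xs):
    """Eq. 10: w is a linear combination of the training examples."""
    d = len(xs[0])
    return [sum(a * y * x[j] for a, y, x in zip(alphas, ys, xs)) for j in range(d)]

def bias(w, x_pos, x_neg):
    """Eq. 11: b from one positive and one negative support vector."""
    return -0.5 * (dot(w, x_pos) + dot(w, x_neg))

def classify(w, b, x):
    """Eq. 2: the sign of w . x + b (zero is treated as +1 here)."""
    return 1 if dot(w, x) + b >= 0 else -1

xs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [+1, -1]
alphas = [0.25, 0.25]   # hand-derived multipliers for this toy set, not solver output

w = weight_vector(alphas, ys, xs)
b = bias(w, xs[0], xs[1])
margins = [y * (dot(w, x) + b) for x, y in zip(xs, ys)]
```

Both training points end up with functional distance exactly 1, so both are support vectors, and the multipliers satisfy the equality constraint of eq. 8.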
Indeed there is a way to "preprocess" the data in such a way that the problem is transformed into one of finding a simple
  • 22. hyperplane. To do this, we define a mapping $\mathbf{z} = \phi(\mathbf{x})$ that transforms the $d$-dimensional input vector $\mathbf{x}$ into a (usually higher) $d'$-dimensional vector $\mathbf{z}$. We hope to choose a $\phi()$ so that the new training data $\{\phi(\mathbf{x}_i), y_i\}$ is separable by a hyperplane. (See Figure 2.)

This method looks like it might work, but there are some concerns. Firstly, how do we go about choosing $\phi()$? It would be a lot of work to have to construct one explicitly for any data set we are given. Not to fear: if $\phi(\mathbf{x})$ casts the input vector into a high enough space ($d' \gg d$), the data should eventually become separable. So maybe there is a standard $\phi()$ that does this for most data... But casting into a very high dimensional space is also worrisome. Computationally, this creates much more of a burden. Recall that the construction of the matrix $H$ requires the dot products $(\mathbf{x}_i \cdot \mathbf{x}_j)$. If $d'$ is exponentially larger than $d$ (and it very well could be), the computation of $H$ becomes prohibitive (not to mention the extra space requirements). Also, by increasing the complexity of our system in such a way, overfitting becomes a concern. By casting into a high enough dimensional space, it is a fact that we can separate any data set. How can we be sure that the system isn't just fitting the idiosyncrasies of the training data, but is actually learning a legitimate pattern that will generalize to other data it hasn't been trained on? As we'll see, SVM's avoid these problems.

Given a mapping $\mathbf{z} = \phi(\mathbf{x})$, to set up our new optimization problem we simply replace all occurrences of $\mathbf{x}$ with $\phi(\mathbf{x})$. Our QP problem (recall eq. 7) would still be

minimize:

$$W(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha$$

but instead of $(H)_{ij} = y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$, it is $(H)_{ij} = y_i y_j (\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j))$. Eq. 10 would be
  • 23.

$$\mathbf{w} = \sum_{i} \alpha_i y_i \phi(\mathbf{x}_i)$$

And eq. 2 would be

$$f(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \phi(\mathbf{x}) + b) = \text{sign}\left(\Big[\sum_i \alpha_i y_i \phi(\mathbf{x}_i)\Big] \cdot \phi(\mathbf{x}) + b\right) = \text{sign}\left(\sum_i \alpha_i y_i \big(\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x})\big) + b\right)$$

The important observation in all this is that any time a $\phi(\mathbf{x}_a)$ appears, it is always in a dot product with some other $\phi(\mathbf{x}_b)$. That is, if we knew the formula (called a kernel) for the dot product in the higher dimensional feature space,

$$K(\mathbf{x}_a, \mathbf{x}_b) = \phi(\mathbf{x}_a) \cdot \phi(\mathbf{x}_b) \qquad (13)$$

we would never need to deal with the mapping $\mathbf{z} = \phi(\mathbf{x})$ directly. The matrix in our optimization would simply be $(H)_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$. And our classifier would be

$$f(\mathbf{x}) = \text{sign}\left(\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$$

Once the problem is set up in this way, finding the optimal hyperplane proceeds as usual, only the hyperplane will be in some unknown feature space. In the original input space, the data will be separated by some curved, possibly non-continuous contour. It may not seem obvious why the use of a kernel alleviates our concerns, but it does. Earlier, we mentioned that it would be tedious to have to design a different feature map for any training set we are given, in order to linearly separate the data. Fortunately, useful kernels have already been discovered. Consider the "polynomial kernel"

$$K(\mathbf{x}_a, \mathbf{x}_b) = (\mathbf{x}_a \cdot \mathbf{x}_b + 1)^p \qquad (14)$$

where $p$ is a tunable parameter, which in practice varies from 1 to ~10. Notice that evaluating $K$ involves only an extra addition and exponentiation more than computing the original dot product. You might wonder what the implicit mapping $\phi()$ was in the creation of this kernel. Well, if you were to expand the dot product inside $K$

$$K(\mathbf{x}_a, \mathbf{x}_b) = (x_{a1} x_{b1} + x_{a2} x_{b2} + \ldots + x_{ad} x_{bd} + 1)^p \qquad (15)$$

and multiply these $(d+1)$ terms by each other $p$ times, it would result in a large number of terms (on the order of $\binom{d+p}{p}$ distinct monomials), each of which are polynomials of varying degrees of the input vectors. Thus, one can think of this polynomial kernel as the dot product of two exponentially large $\mathbf{z}$ vectors. By using a larger value of $p$ the dimension of the feature space is implicitly larger, where the data will likely be easier to separate. (However, in a larger dimensional space, there might be more support vectors, which we saw leads to worse generalization.)

3.2.1.6 Other kernels

Another popular one is the Gaussian RBF Kernel
  • 24.

$$K(\mathbf{x}_a, \mathbf{x}_b) = \exp\left(-\frac{\|\mathbf{x}_a - \mathbf{x}_b\|^2}{2\sigma^2}\right) \qquad (16)$$

where $\sigma$ is a tunable parameter. Using this kernel results in the classifier

$$f(\mathbf{x}) = \text{sign}\left(\sum_i \alpha_i y_i \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2}\right) + b\right)$$

which is really just a Radial Basis Function, with the support vectors as the centers. So here, a SVM was implicitly used to find the number (and location) of centers needed to form the RBF network with the highest expected generalization performance. At this point one might wonder what other kernels exist, and if making your own kernel is as simple as just dreaming up some function $K(\mathbf{x}_a, \mathbf{x}_b)$. As it turns out, $K$ must in fact be the dot product in a feature space for some $\phi()$, if all the theory behind SVM's is to go through. Now there are two ways to ensure this. The first is to create some mapping $\mathbf{z} = \phi(\mathbf{x})$ and then derive the analytic expression for $K(\mathbf{x}_a, \mathbf{x}_b) = \phi(\mathbf{x}_a) \cdot \phi(\mathbf{x}_b)$. This kernel is most definitely the dot product in a feature space, since it was created as such. The second way is to dream up some function $K$ and then check if it is valid by applying Mercer's condition. Without giving too many details, the condition states: Suppose $K$ can be written as

$$K(\mathbf{x}_a, \mathbf{x}_b) = \sum_{i=1}^{\infty} f_i(\mathbf{x}_a) f_i(\mathbf{x}_b)$$

for some choice of the $f_i$'s. If $K$ is indeed a dot product then

$$\int\!\!\int K(\mathbf{x}_a, \mathbf{x}_b)\, g(\mathbf{x}_a)\, g(\mathbf{x}_b)\, d\mathbf{x}_a\, d\mathbf{x}_b \geq 0$$

for every function $g$ for which $\int g(\mathbf{x})^2\, d\mathbf{x}$ is finite. The mathematically inclined reader interested in the derivation details is encouraged to see (Cristianini, Shawe-Taylor, 2000). It is indeed a strange mathematical requirement. Fortunately for us, the polynomial and RBF kernels have already been proven to be valid, and most of the literature presenting results using SVM's uses these two simple kernels. So most SVM users need not be concerned with creating new kernels, and checking that they meet Mercer's condition. (Interestingly though, kernels satisfy many closure properties. That is, addition, multiplication, and composition of valid kernels all result in valid kernels. Again, see (Cristianini, Shawe-Taylor, 2000).)
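The claim that a kernel computes a dot product in a larger feature space without ever building that space can be verified directly for a small case. The sketch below assumes the polynomial kernel of eq. 14 with p = 2 and d = 2, for which one explicit six-dimensional feature map can be written out by hand, and the common exp(-||x_a - x_b||^2 / (2 sigma^2)) form of the Gaussian RBF kernel. The function names, the feature map `phi_poly2`, and the sample vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(xa, xb, p=2):
    """Eq. 14: (xa . xb + 1)^p."""
    return (dot(xa, xb) + 1) ** p

def rbf_kernel(xa, xb, sigma=1.0):
    """Gaussian RBF kernel, assuming the 2*sigma^2 parameterization."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xa, xb))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def phi_poly2(x):
    """One explicit feature map whose dot product equals the p = 2, d = 2 kernel."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

def kernel_classifier(alphas, ys, xs, b, kernel, x):
    """sign(sum_i alpha_i y_i K(x_i, x) + b), with zero treated as +1."""
    total = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, xs)) + b
    return 1 if total >= 0 else -1

xa, xb = [1.0, 2.0], [3.0, -1.0]
implicit = polynomial_kernel(xa, xb)            # never leaves 2 dimensions
explicit = dot(phi_poly2(xa), phi_poly2(xb))    # same value via 6 dimensions
```

The six components of `phi_poly2` are exactly the distinct monomials produced by expanding eq. 15 for this case, which is why the two values agree.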
  • 25. 3.2.2 Naïve Bayes classification

3.2.2.1 Introduction to Bayesian Classification

Bayesian classification is both a supervised learning method and a statistical method for classification. It assumes an underlying probabilistic model, which allows us to capture uncertainty about the model in a principled way by determining probabilities of the outcomes. It can solve diagnostic and predictive problems. The method is named after Thomas Bayes (1702 - 1761), who proposed Bayes' Theorem. Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined. It also provides a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and it is robust to noise in input data.

Uses of Naive Bayes classification:

1. Naive Bayes text classification (http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html) Bayesian classification is used as a probabilistic learning method (Naive Bayes text classification). Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents.

2. Spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering) Spam filtering is the best known use of Naive Bayesian text classification. It makes use of a Naive Bayes classifier to identify spam e-mail. Bayesian spam filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email (sometimes called "ham" or "bacn").[4] Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs. Server-side email filters, such as DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within mail server software itself.

3.
Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering (http://eprints.ecs.soton.ac.uk/18483/) Recommender systems apply machine learning and data mining techniques to filter unseen information and can predict whether a user would like a given resource. A unique switching hybrid recommendation approach is proposed that combines a Naive Bayes classification approach with collaborative filtering. Experimental results on two different data sets show that the proposed algorithm is scalable and provides better performance, in terms of accuracy and coverage, than other algorithms, while at the same time eliminating some recorded problems with recommender systems.
  • 26. 4. Online applications (http://www.convo.co.uk/x02/) This online application has been set up as a simple example of supervised machine learning and affective computing. Using a training set of examples which reflect nice, nasty or neutral sentiments, we're training Ditto to distinguish between them. Simple Emotion modeling combines a statistically based classifier with a dynamical model. The Naive Bayes classifier employs single words and word pairs as features. It allocates user utterances into nice, nasty and neutral classes, labeled +1, -1 and 0 respectively. This numerical output drives a simple first-order dynamical system, whose state represents the simulated emotional state of the experiment's personification, Ditto the donkey.

Independence

Example

Suppose there are two events:
M: Manuela teaches the class (otherwise it's Andrew)
S: It is sunny
"The sunshine levels do not depend on and do not influence who is teaching."

Theory:

From P(S | M) = P(S), the rules of probability imply:
P(~S | M) = P(~S)
P(M | S) = P(M)
P(M ^ S) = P(M) P(S)
P(~M ^ S) = P(~M) P(S)
P(M ^ ~S) = P(M) P(~S)
P(~M ^ ~S) = P(~M) P(~S)

Theory applied on previous example:

"The sunshine levels do not depend on and do not influence who is teaching." can be specified very simply: P(S | M) = P(S)

"Two events A and B are statistically independent if the probability of A is the same value when B occurs, when B does not occur or when nothing is known about the occurrence of B"
  • 27. 3.2.2.2 Conditional Probability

3.2.2.3 Simple Example:

H = "Have a headache"
F = "Coming down with Flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

"Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."

P(H|F) = Fraction of flu-inflicted worlds in which you have a headache
= (#worlds with flu and headache) / (#worlds with flu)
= (Area of "H and F" region) / (Area of "F" region)
= P(H ^ F) / P(F)

3.2.2.4 Theory:

P(A|B) = Fraction of worlds in which B is true that also have A true

P(A|B) = P(A ^ B) / P(B)

Corollary: P(A ^ B) = P(A|B) P(B)

P(A|B) + P(~A|B) = 1

$$\sum_{k=1}^{n} P(A_k \mid B) = 1$$

where $A_1, \ldots, A_n$ are mutually exclusive outcomes that together cover all possibilities.
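The headache and flu numbers can be pushed through the definition and its corollary with exact fractions. This is just the arithmetic of the example above; the variable names are arbitrary.

```python
from fractions import Fraction

p_h = Fraction(1, 10)          # P(H): have a headache
p_f = Fraction(1, 40)          # P(F): coming down with flu
p_h_given_f = Fraction(1, 2)   # P(H|F)

# Corollary: P(H ^ F) = P(H|F) P(F)
p_h_and_f = p_h_given_f * p_f

# Definition applied the other way round: P(F|H) = P(F ^ H) / P(H)
p_f_given_h = p_h_and_f / p_h

# Complement rule: P(H|F) + P(~H|F) = 1
p_not_h_given_f = 1 - p_h_given_f
```

So one world in eighty has both flu and a headache, and a headache raises the chance of flu from 1/40 to 1/8.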
  • 28. 3.2.2.5 Detailed Example
M : Manuela teaches the class
S : It is sunny
L : The lecturer arrives slightly late.
Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela. Let's begin by writing down the knowledge:
P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6
Lateness is not independent of the weather and is not independent of the lecturer. Therefore lateness depends on both the weather and the lecturer.
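Given P(S | M) = P(S), the joint probabilities of M and S follow from the product rule for independent events. A minimal sketch, assuming only the two values stated on this slide (P(S) = 0.3, P(M) = 0.6); the dictionary layout is an illustrative choice.

```python
import math

p_s = 0.3  # P(S): it is sunny
p_m = 0.6  # P(M): Manuela teaches the class

# For independent events, P(A ^ B) = P(A) * P(B) for every combination.
joint = {
    ("M", "S"): p_m * p_s,
    ("~M", "S"): (1 - p_m) * p_s,
    ("M", "~S"): p_m * (1 - p_s),
    ("~M", "~S"): (1 - p_m) * (1 - p_s),
}

for event, p in joint.items():
    print(event, round(p, 2))

# Recover P(S | M) from the joint table; it should equal P(S).
p_s_given_m = joint[("M", "S")] / p_m
print(round(p_s_given_m, 2))
```

The four joint probabilities sum to 1, and conditioning on M leaves the probability of sunshine unchanged, which is exactly the independence statement.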
  • 29. 3.2.3 Conditional Independence 3.2.3.1 Example: Suppose we have these three events: M : Lecture taught by Manuela L : Lecturer arrives late R : Lecture concerns robots Suppose: Andrew has a higher chance of being late than Manuela. Andrew has a higher chance of giving robotics lectures. 29
  • 30. 3.2.3.2 Theory:
R and L are conditionally independent given M if for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)
P(A|B) = P(A ^ B) / P(B)
Therefore P(A ^ B) = P(A|B).P(B), also known as the Chain Rule
Also P(A ^ B) = P(B|A).P(A)
Therefore P(A|B) = P(B|A).P(A) / P(B)
P(A,B|C) = P(A ^ B ^ C) / P(C)
= P(A|B,C).P(B ^ C) / P(C) (applying the chain rule)
= P(A|B,C).P(B|C)
= P(A|C).P(B|C), if A and B are conditionally independent given C.
This can be extended to n variables as P(A1, A2, ..., An | C) = P(A1|C).P(A2|C). ... .P(An|C) if A1, A2, ..., An are conditionally independent given C.
3.2.3.3 Theory applied on previous example:
For the previous example, we can use the following notations:
P(R | M,L) = P(R | M) and P(R | ~M,L) = P(R | ~M)
We express this in the following way: "R and L are conditionally independent given M"
  • 31. 3.2.4 Bayes Theorem
Bayesian reasoning is applied to decision making and to inferential statistics that deal with probability inference. It uses the knowledge of prior events to predict future events. Example: predicting the color of marbles in a basket.
3.2.4.1 Example: Table1: Data table
3.2.4.2 Theory: The Bayes Theorem:
P(h/D) = P(D/h) P(h) / P(D)
  • 32. P(h) : Prior probability of hypothesis h
P(D) : Prior probability of training data D
P(h/D) : Probability of h given D
P(D/h) : Probability of D given h
3.2.4.3 Theory applied on previous example:
D : 35 year old customer with an income of $50,000 per annum
h : Hypothesis that our customer will buy our computer
P(h/D) : Probability that customer D will buy our computer given that we know his age and income (Posterior Probability)
P(h) : Probability that any customer will buy our computer regardless of age and income (Prior Probability)
P(D/h) : Probability that the customer is 35 yrs old and earns $50,000, given that he has bought our computer (Likelihood)
P(D) : Probability that a person from our set of customers is 35 yrs old and earns $50,000
3.2.5 Maximum A Posteriori (MAP) Hypothesis
3.2.5.1 Example:
h1 : Customer buys a computer = Yes
h2 : Customer buys a computer = No
where h1 and h2 are subsets of our hypothesis space 'H'
Final Outcome = arg max { P(D/h1) P(h1), P(D/h2) P(h2) }
P(D) can be ignored as it is the same for both terms.
3.2.5.2 Theory:
Generally we want the most probable hypothesis given the training data:
hMAP = arg max P(h/D) (where h belongs to H and H is the hypothesis space)
hMAP = arg max P(D/h) P(h) / P(D)
hMAP = arg max P(D/h) P(h)
  • 33. 3.2.6 Maximum Likelihood (ML) Hypothesis
3.2.6.1 Example: Table 2
3.2.6.2 Theory:
If we assume P(hi) = P(hj), i.e. the prior probabilities are the same for all hypotheses, further simplification leads to:
hML = arg max P(D/hi) (where hi belongs to H)
3.2.6.3 Theory applied on previous example:
P (buys computer = yes) = 5/10 = 0.5
P (buys computer = no) = 5/10 = 0.5
P (customer is 35 yrs & earns $50,000) = 4/10 = 0.4
P (customer is 35 yrs & earns $50,000 / buys computer = yes) = 3/5 = 0.6
P (customer is 35 yrs & earns $50,000 / buys computer = no) = 1/5 = 0.2
Customer buys a computer: P(h1/D) = P(h1) * P(D/h1) / P(D) = 0.5 * 0.6 / 0.4 = 0.75
Customer does not buy a computer: P(h2/D) = P(h2) * P(D/h2) / P(D) = 0.5 * 0.2 / 0.4 = 0.25
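The posterior calculation on this slide can be reproduced in a few lines. This is a minimal sketch using only the priors and likelihoods stated above; the dictionary keys and variable names are illustrative, and P(D) is recomputed from the law of total probability rather than read from the table.

```python
import math

# Values from the customer example on this slide.
prior = {"yes": 0.5, "no": 0.5}       # P(h)
likelihood = {"yes": 0.6, "no": 0.2}  # P(D/h)

# P(D) = sum over h of P(D/h) * P(h)
evidence = sum(prior[h] * likelihood[h] for h in prior)

# Bayes theorem: P(h/D) = P(D/h) * P(h) / P(D)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}

h_map = max(posterior, key=posterior.get)   # MAP hypothesis
h_ml = max(likelihood, key=likelihood.get)  # ML hypothesis (ignores priors)

print(evidence, posterior, h_map, h_ml)
```

Because the two priors are equal here, the MAP and ML hypotheses coincide; with unequal priors they can differ.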
  • 34. Final Outcome = arg max {P(h1/D), P(h2/D)} = max(0.75, 0.25) => Customer buys a computer. Because the priors are equal, comparing the likelihoods alone, max(0.6, 0.2), selects the same hypothesis.
3.2.7 Naïve Bayesian Classification
It is based on the Bayes theorem.
It is particularly suited when the dimensionality of the inputs is high.
Parameter estimation for naive Bayes models uses the method of maximum likelihood.
In spite of its over-simplified assumptions, it often performs well in complex real-world situations.
Advantage: Requires a small amount of training data to estimate the parameters.
3.2.7.1 Example
X = (age = youth, income = medium, student = yes, credit_rating = fair)
Will a person belonging to tuple X buy a computer?
3.2.7.2 Theory: Derivation:
D : Set of tuples
Each tuple is an 'n'-dimensional attribute vector X : (x1, x2, x3, ..., xn)
Let there be 'm' classes : C1, C2, C3, ..., Cm
The Naïve Bayes classifier predicts that X belongs to class Ci iff
  • 35. P(Ci/X) > P(Cj/X) for 1 <= j <= m, j <> i
Maximum Posteriori Hypothesis:
P(Ci/X) = P(X/Ci) P(Ci) / P(X)
Maximize P(X/Ci) P(Ci), as P(X) is constant.
With many attributes, it is computationally expensive to evaluate P(X/Ci).
Naïve assumption of "class conditional independence":
$P(X/C_i) = \prod_{k=1}^{n} P(x_k/C_i)$
P(X/Ci) = P(x1/Ci) * P(x2/Ci) * ... * P(xn/Ci)
3.2.7.3 Theory applied on previous example:
P(C1) = P(buys_computer = yes) = 9/14 = 0.643
P(C2) = P(buys_computer = no) = 5/14 = 0.357
P(age=youth / buys_computer = yes) = 2/9 = 0.222
P(age=youth / buys_computer = no) = 3/5 = 0.600
P(income=medium / buys_computer = yes) = 4/9 = 0.444
P(income=medium / buys_computer = no) = 2/5 = 0.400
P(student=yes / buys_computer = yes) = 6/9 = 0.667
P(student=yes / buys_computer = no) = 1/5 = 0.200
P(credit rating=fair / buys_computer = yes) = 6/9 = 0.667
P(credit rating=fair / buys_computer = no) = 2/5 = 0.400
P(X / Buys a computer = yes) = P(age=youth / buys_computer = yes) * P(income=medium / buys_computer = yes) * P(student=yes / buys_computer = yes) * P(credit rating=fair / buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
P(X / Buys a computer = No) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019
Find the class Ci that maximizes P(X/Ci) * P(Ci):
=> P(X / Buys a computer = yes) * P(buys_computer = yes) = 0.028
=> P(X / Buys a computer = No) * P(buys_computer = no) = 0.007
Prediction : Buys a computer for Tuple X
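The worked example on this slide can be re-run as a short script. This is a minimal sketch of the naive Bayes decision rule using the fractions quoted above; the dictionary layout and the function name `score` are illustrative choices, not part of the original project.

```python
from fractions import Fraction as F
from math import prod

# Class priors P(Ci) from the slide.
priors = {"yes": F(9, 14), "no": F(5, 14)}

# Class-conditional probabilities P(xk / Ci) from the slide.
cond = {
    "yes": {"age=youth": F(2, 9), "income=medium": F(4, 9),
            "student=yes": F(6, 9), "credit_rating=fair": F(6, 9)},
    "no": {"age=youth": F(3, 5), "income=medium": F(2, 5),
           "student=yes": F(1, 5), "credit_rating=fair": F(2, 5)},
}

# Tuple X to classify.
x = ["age=youth", "income=medium", "student=yes", "credit_rating=fair"]

def score(label):
    """P(X/Ci) * P(Ci) under the class conditional independence assumption."""
    return priors[label] * prod(cond[label][attr] for attr in x)

scores = {label: score(label) for label in priors}
prediction = max(scores, key=scores.get)

for label, s in scores.items():
    print(label, round(float(s), 3))
print("prediction:", prediction)
```

P(X) is never computed because it is the same for both classes and does not change which score is larger.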
  • 36. 3.3 Evaluation measures
Text classification rules are typically evaluated using performance measures from information retrieval. Common metrics for text categorization evaluation include recall, precision, accuracy, error rate and F1. Given a test set of N documents, a two-by-two contingency table (see Table 1) with four cells can be constructed for each binary classification problem. The cells contain the counts for true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), respectively. Clearly, N = TP + FP + TN + FN.
Table 1
                                     predicted class (expectation)
                                     positive                                  negative
actual class (observation) positive  TP (true positive): correct result        FN (false negative): missing result
actual class (observation) negative  FP (false positive): unexpected result    TN (true negative): correct absence of result
The terms true positives, true negatives, false positives, and false negatives compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction (sometimes known as the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the observation). Hence the metrics for binary decisions are defined as:
3.3.1 Precision
Precision is the proportion of Predicted Positive cases that are correctly Real Positives. This is what Machine Learning, Data Mining and Information Retrieval focus on, but it is totally ignored in ROC analysis. It can however analogously be called True Positive Accuracy (tpa), being a measure of accuracy of Predicted Positives in contrast with the rate of discovery of Real Positives (tpr). More precisely, it represents how many of the returned documents or topics are correctly predicted.
Precision = TP / (TP + FP)
Issue in Precision: When a system outputs only the topics it is confident about, the precision easily reaches a high percentage.
  • 37. 3.3.2 Recall (Sensitivity)
Recall is the proportion of Real Positive cases that are correctly Predicted Positive. This measures the coverage of the Real Positive cases by the +P (Predicted Positive) rule. Its desirable feature is that it reflects how many of the relevant cases the +P rule picks up. More precisely, it represents what percentage of the positive cases were caught.
Recall = TP / (TP + FN)
When a system outputs loosely, i.e. labels many cases as positive, the recall easily reaches a high percentage.
3.3.3 Accuracy
Accuracy represents what percentage of the predictions were correct, i.e. the rate of correctly predicted topics.
Figure 2
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Issue in Accuracy: When a certain topic (e.g., not-spam) is a majority, the accuracy easily reaches a high percentage.
3.3.4 F1-measure
The F1 measure effectively references the True Positives to the arithmetic mean of Predicted Positives and Real Positives, being a constructed rate normalized to an idealized value. Expressed in this form, it is known in statistics as a Proportion of Specific Agreement, as it is applied to a specific class, here the Positive class.
F1 measure = 2 * Recall * Precision / (Recall + Precision)
Since there is a trade-off between recall and precision, the F-measure is widely used to evaluate text classification systems.
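The four measures of Section 3.3 can be computed directly from the contingency table counts. The sketch below is illustrative rather than the project's own code: the function name is hypothetical, accuracy is taken as (TP + TN) / N (the standard definition, assumed here), and the counts are those of the "+" class in the unigram NaiveBayes confusion matrix reported in Section 5.4.1 (TP = 93, FN = 7, FP = 14, TN = 86).

```python
def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F1 from contingency table counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Counts for the "+" class of the unigram NaiveBayes run in Section 5.4.1.
m = binary_metrics(tp=93, fp=14, tn=86, fn=7)
for name, value in m.items():
    print(name, round(value, 3))
```

The rounded values agree with the "+" row of the corresponding Weka output (precision 0.869, recall 0.930, F-measure 0.899) and with the 89.5% of correctly classified instances.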
  • 38. Chapter 4 Opinion lexicons
4.1 Definition
Opinion lexicons are resources that associate words with a sentiment orientation. Their use in opinion mining research stems from the hypothesis that individual words can be considered as a unit of opinion information, and therefore may provide clues to document sentiment and subjectivity. Manually created opinion lexicons were applied to sentiment classification as seen in [13], where a prediction of document polarity is given by counting positive and negative terms. A similar approach is presented in the work of Kennedy and Inkpen [10], this time using an opinion lexicon based on the combination of other existing resources. Manually built lexicons, however, tend to be constrained to a small number of terms. By its nature, building manual lists is a time-consuming effort, and may be subject to annotator bias. To overcome these issues, lexical induction approaches have been proposed in the literature with a view to extending the size of opinion lexicons from a core set of seed terms, either by exploring term relationships or by evaluating similarities in document corpora. Early work in this area, seen in [9], extends a list of positive and negative adjectives by evaluating conjunctive statements in a document corpus. Another common approach is to derive opinion terms from the WordNet database of terms and relationships [12], typically by examining the semantic relationships of a term such as synonyms and antonyms. In this work two commonly used opinion lexicons are used: the first is SentiWordNet 3.0 and the second is the opinion lexicon created by Dr. Bing Liu.
4.2 SENTIWORDNET 3.0
An enhanced lexical resource explicitly devised for supporting sentiment classification and opinion mining applications (Pang and Lee, 2008).
SENTIWORDNET 3.0 is an improved version of SENTIWORDNET 1.0 (Esuli and Sebastiani, 2006), a lexical resource publicly available for research purposes, currently licensed to more than 300 research groups and used in a variety of research projects worldwide. SENTIWORDNET is the result of the automatic annotation of all the synsets of WORDNET according to the notions of “positivity”, “negativity”, and “neutrality”. Each synset s is associated with three numerical scores Pos(s), Neg(s), and Obj(s) which indicate how positive, negative, and “objective” (i.e., neutral) the terms contained in the synset are. Different senses of the same term may thus have different opinion-related properties. For example, in SENTIWORDNET 1.0 the synset [estimable(J,3)], corresponding to the sense “may be computed or estimated” of the adjective estimable, has an Obj score of 1.0 (and Pos and Neg scores of 0.0), while the synset [estimable(J,1)], corresponding to the sense “deserving of respect or high regard”, has a Pos score of 0.75, a Neg score of 0.0, and an Obj score of 0.25. Each of the three scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for each synset. This means that a synset may have nonzero scores for all three categories, which would indicate that the corresponding terms have, in the sense indicated by the synset, each of the three opinion-related properties to a certain degree. Each set of terms sharing the same meaning in SentiWordNet (a synset) is thus associated with numerical scores ranging from 0 to 1 that indicate the synset’s positive and negative bias. The scores reflect the agreement amongst the classifier committee on the positive or negative label for a term. One distinct aspect of SentiWordNet is therefore that a term can have non-zero values for both the positive and the negative score, according to the formula:
Pos. Score(term) + Neg. Score(term) + Objective Score(term) = 1
  • 39. Terms in the SentiWordNet database follow the categorization into parts of speech derived from WordNet. Therefore, to correctly apply scores to terms, a part-of-speech tagger was applied to the polarity data set. In our experiment, the Stanford Part of Speech Tagger was used. The opinion lexicon can be downloaded freely for research purposes from the following link: http://sentiwordnet.isti.cnr.it/
4.3 (Bing Liu) Opinion lexicon
4.3.1 Who is Dr. Bing Liu?
Dr. Bing Liu is a professor in the Department of Computer Science at the University of Illinois at Chicago (UIC) whose research interests are Sentiment Analysis, Opinion Mining, Data and Web Mining, Machine and Constraint satisfaction, AI scheduling. He has published extensively, especially in the field of opinion mining and in data mining in general. The following are examples of his publications in this field:
Mining and summarizing customer reviews
Opinion observer: analyzing and comparing opinions on the Web
Mining opinion features in customer reviews
Sentiment analysis and subjectivity
A holistic lexicon-based approach to opinion mining
Opinion spam and analysis
The following link contains most of his publications: http://www.cs.uic.edu/~liub/publications/papers_chron.html
4.3.2 (Bing Liu) Opinion lexicon
The opinion lexicon is a list of positive and negative opinion words (sentiment words) for English. It holds around 6800 words, divided into two separate files: one contains the positive words and the other contains the negative words. This list was compiled over many years starting from his first paper (Hu and Liu, KDD-2004). The opinion lexicon can be downloaded freely for research purposes from the following link: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
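A lexicon made of one positive and one negative word list lends itself to the term-counting approach mentioned in Section 4.1. The sketch below only illustrates that idea and is not necessarily the procedure used in the experiments: the function name, the tiny word sets (stand-ins for the two lexicon files) and the tie rule are all assumptions.

```python
def lexicon_polarity(tokens, positive_words, negative_words):
    """Classify a tokenized review by counting lexicon hits.

    Returns "+" or "-" for the majority polarity and "0" on a tie
    (the tie rule is an assumption made for this sketch).
    """
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    if pos > neg:
        return "+"
    if neg > pos:
        return "-"
    return "0"

# Hypothetical miniature lexicon; the real lists hold around 6800 words.
positive_words = {"good", "great", "excellent"}
negative_words = {"bad", "poor", "blurry"}

review = "great camera but blurry photos and poor battery".split()
print(lexicon_polarity(review, positive_words, negative_words))
```

Here the review contains one positive and two negative lexicon words, so the sketch labels it negative.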
  • 40. Chapter 5 Experimental results
5.1 Data collection
The data set used in this work is the Amazon product review data (Jindal and Liu, WSDM-2008) used in (Jindal and Liu, WWW-2007, WSDM-2008; Lim et al, CIKM-2010; Jindal, Liu and Lim, CIKM-2010; Mukherjee et al. WWW-2011; Mukherjee, Liu and Glance, WWW-2012) for opinion spam (fake review) detection. The dataset consists of more than 2.8 million product reviews in multiple domains. From it we extracted 2000 reviews (1000 positive and 1000 negative) in the digital cameras domain. The extracted data were later split as follows: 90% of the data (1800 reviews) for training the classifiers and 10% of the data (200 reviews) as a test set. The full dataset is free for research purposes and can be downloaded from the following link: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
The following steps were applied to the dataset to extract the data of the desired domain:
Converted to a database using the SQL Server 2008 Import and Export Wizard tool
Figure 3
  • 41. 5.2 Feature extraction
We used the Weka filter “StringToWordVector” to extract the features as unigrams and bigrams.
5.2.1 Unigram: Figure 4
  • 42. 5.2.2 Bigram: Figure 5
All stop words were excluded from both the uni-gram and the bi-gram feature vectors. The stop words removed were gathered from the following link: https://code.google.com/p/textminingtools/source/browse/trunk/data/stopwords.txt?r=5
For both feature vectors (unigram and bigram) the “IteratedLovinsStemmer” stemmer was chosen, and the tokenizer for the bigram feature vector produced both one-word tokens (unigrams) and two-word tokens. TFTransform and IDFTransform were also applied to both feature vectors.
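The idea of a feature vector that holds both one-word and two-word tokens can be sketched outside Weka. The code below is not the StringToWordVector filter; it is a simplified illustration with a hypothetical function name and a three-word stop list, and it leaves out stemming and the TF/IDF transforms.

```python
def extract_ngrams(text, stop_words, max_n=2):
    """Return unigram and bigram features after dropping stop words.

    Simplification: stop words are removed before the n-grams are built.
    """
    tokens = [t for t in text.lower().split() if t not in stop_words]
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

stop_words = {"the", "is", "a"}  # hypothetical miniature stop list
feats = extract_ngrams("The lens is sharp", stop_words)
print(feats)
```

With `max_n=1` the same function yields the unigram-only feature set.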
  • 43. 5.3 Feature selection
For dimensionality reduction, the following patterns were removed from the feature vector (using Weka).
Patterns excluded from the uni-gram feature vector:
Numbers only: ([0-9]+)
Contains special characters: (.*[^a-z0-9 ]+.*)
One character: (.)
Two characters: (..)
These patterns remove features that consist of numbers only, features that include special characters, and features that consist of only one or two characters.
Patterns excluded from the bi-gram feature vector:
Numbers only: ([0-9]+)
Contains special characters: (.*[^a-z0-9 ]+.*)
One character: (.)
Two characters: (..)
One word is one letter: ([a-z] .*)|(.* [a-z])
One word is two letters: ([a-z][a-z] .*)|(.* [a-z][a-z])
2nd word is a number: (.* [0-9]+)
These patterns remove features that consist of numbers only, features that include special characters, features that consist of only one or two characters, features in which one word is a single character, features in which one word has only two characters, and features whose second word is a number.
A further dimensionality reduction is applied to the uni-gram and bi-gram feature vectors using the information gain algorithm. The following are the results of running the information gain algorithm on both feature vectors (unigram and bi-gram respectively) in Weka.
The results of applying the information gain attribute selection algorithm on the unigram feature vector:
=== Run information ===
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search: weka.attributeSelection.Ranker -T 0.0 -N -1
Instances: 2000
Attributes: 8081
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method: Attribute ranking. Threshold for discarding attributes: 0
Attribute Evaluator (supervised, Class (nominal): 1 ReviewClass): Information Gain Ranking Filter
Selected attributes: 1034
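The exclusion patterns listed in Section 5.3 can be tried out with ordinary regular expressions. The sketch below copies the patterns from the slide and assumes they must match the whole feature name (hence `re.fullmatch`); the helper name and the sample features are hypothetical, and the Weka filter that was actually used is not reproduced here.

```python
import re

# Patterns excluded from the uni-gram feature vector (from the slide).
UNIGRAM_PATTERNS = [
    r"([0-9]+)",            # numbers only
    r"(.*[^a-z0-9 ]+.*)",   # contains special characters
    r"(.)",                 # one character
    r"(..)",                # two characters
]

# Additional patterns excluded from the bi-gram feature vector.
BIGRAM_PATTERNS = UNIGRAM_PATTERNS + [
    r"([a-z] .*)|(.* [a-z])",            # one word is one letter
    r"([a-z][a-z] .*)|(.* [a-z][a-z])",  # one word is two letters
    r"(.* [0-9]+)",                      # 2nd word is a number
]

def keep(feature, patterns):
    """True if the feature matches none of the exclusion patterns."""
    return not any(re.fullmatch(p, feature) for p in patterns)

unigrams = ["camera", "12", "a", "ok", "wi-fi", "zoom"]
bigrams = ["battery life", "a camera", "my camera", "zoom 10", "great lens"]

kept_uni = [f for f in unigrams if keep(f, UNIGRAM_PATTERNS)]
kept_bi = [f for f in bigrams if keep(f, BIGRAM_PATTERNS)]
print(kept_uni)
print(kept_bi)
```

Only features made of lowercase words with at least three letters survive, which is the effect described in the text.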
  • 44. The results of applying the information gain attribute selection algorithm on the bi-gram feature vector:
=== Run information ===
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search: weka.attributeSelection.Ranker -T 0.0 -N -1
Instances: 2000
Attributes: 107765
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method: Attribute ranking. Threshold for discarding attributes: 0
Attribute Evaluator (supervised, Class (nominal): 1 ReviewClass): Information Gain Ranking Filter
Selected attributes: 3740
A further dimensionality reduction is applied to the uni-gram and bi-gram feature vectors using the principal components attribute selection algorithm. The following are the results of running the PCA algorithm on both feature vectors (unigram and bi-gram respectively) in Weka.
The results of applying the PCA attribute selection algorithm on the unigram feature vector:
=== Run information ===
Evaluator: weka.attributeSelection.PrincipalComponents -R 0.2 -A 5
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Instances: 2000
Attributes: 1035
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method: Attribute ranking.
Attribute Evaluator (unsupervised): Principal Components Attribute Transformer
Correlation matrix (abridged in the source; further visible entries: 0.14, 0.19, 0.17, 0.23)
 1     0.22  0.13  ...
 0.22  1     0.13  ...
 ...
eigenvalue  proportion  cumulative
16.34076    0.0158      0.0158      -0.117mod-0.1featur-0.099shoot-0.097control-0.093ma...
 9.9449     0.00962     0.02542     -0.3rebuilt-0.3unreasolut-0.287apar-0.278bureau 0.2...
 7.42351    0.00718     0.0326      0.241tech+0.231awar+0.224proper+0.221easyshar+0.204p...
 6.98087    0.00675     0.03935     0.329deceit+0.321downright+0.318death+0.303cnet+0.27...
 6.53621    0.00632     0.04567     0.319ent+0.313cluster+0.312shiver+0.286univer+0.28 k...
 6.13852    0.00594     0.05161     0.158len-0.136card-0.123usb+0.111foc-0.111vid...
  • 45.
 5.64292    0.00546     0.05707     -0.358dieg-0.358transcript-0.356raynox-0.329unedit-....
 ........   ........    ........    .........................................
Ranked attributes:
0.984   1  -0.117mod-0.1featur-0.099shoot-0.097control-0.093manu...
0.975   2  -0.3rebuilt-0.3unreasolut-0.287apar-0.278bureau-0.268gask...
. . .
0.798  55  0.116piec+0.111algorithm+0.11 snapshot-0.108laser+0.106doubl...
Selected attributes: 55
The results of applying the PCA attribute selection algorithm on the bi-gram feature vector:
Selected attributes: 58
The following table (Table 2) summarizes the evolution of the feature vectors through the different phases of feature extraction and feature selection (dimensionality reduction) applied to the data set used in this work, starting from the original feature vector and ending with the smallest feature vector obtained. This is also visualized in Figure 6 and Figure 7 below.
Table 2: feature vector size
The applied feature selection / extraction                   Unigram   Bigram
Original feature vector                                      12974     165788
After removing stop words                                    12733     165547
After removing patterns: [0-9]*                              12294     165349
After removing patterns: [.]                                 12280     165311
After removing patterns: [..]                                11805     164765
After removing patterns: (.*[^a-z0-9 ]+.*)                   8081      146376
After removing patterns: ([a-z] .*)|(.* [a-z])               n/a       132961
After removing patterns: ([a-z][a-z] .*)|(.* [a-z][a-z])     n/a       100665
After removing patterns: (.* [0-9]+)                         n/a       97631
After applying information gain attribute selection          1034      3740
After applying principal components analysis                 55        58
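The information gain ranking used in Section 5.3 scores each attribute by how much it reduces the entropy of the class label. The sketch below shows that computation for a discrete feature on a toy example; the function names and the toy data are hypothetical, and it does not reproduce Weka's InfoGainAttributeEval (for instance, how numeric attributes are handled is not modeled).

```python
from math import log2, isclose

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(feature_values, labels):
    """H(class) minus the expected entropy of the class given the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: four reviews, two positive and two negative.
labels = ["+", "+", "-", "-"]
informative = [1, 1, 0, 0]    # feature present exactly in positive reviews
uninformative = [1, 0, 1, 0]  # feature unrelated to the class

ig_good = information_gain(informative, labels)
ig_bad = information_gain(uninformative, labels)
print(ig_good, ig_bad)
```

With a discard threshold of 0, as in the Ranker settings above, an attribute like the second one (zero gain) would be dropped while the first would be kept.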
  • 46. Figure 6 Figure 7 46
  • 47. 5.4 Results of the classifiers
5.4.1 Unigram
NaiveBayes without attributes selection
=== Summary ===
Correctly Classified Instances        179        89.5 %
Incorrectly Classified Instances       21        10.5 %
Kappa statistic                         0.79
Mean absolute error                     0.105
Root mean squared error                 0.324
Relative absolute error                21 %
Root relative squared error            64.8074 %
Coverage of cases (0.95 level)         89.5 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.930    0.140    0.869      0.930   0.899      0.792  0.917     0.864     +
               0.860    0.070    0.925      0.860   0.891      0.792  0.948     0.920     -
Weighted Avg.  0.895    0.105    0.897      0.895   0.895      0.792  0.933     0.892
=== Confusion Matrix ===
  a   b   <-- classified as
 93   7 | a = +
 14  86 | b = -
NaiveBayes after attribute selection (using Information gain only)
=== Summary ===
Correctly Classified Instances        186        93 %
Incorrectly Classified Instances       14         7 %
Kappa statistic                         0.86
Mean absolute error                     0.07
Root mean squared error                 0.2646
Relative absolute error                14 %
Root relative squared error            52.915 %
Coverage of cases (0.95 level)         93 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.950    0.090    0.913      0.950   0.931      0.861  0.930     0.893     +
               0.910    0.050    0.948      0.910   0.929      0.861  0.930     0.908     -
Weighted Avg.  0.930    0.070    0.931      0.930   0.930      0.861  0.930     0.900
=== Confusion Matrix ===
  a   b   <-- classified as
 95   5 | a = +
  9  91 | b = -
  • 48. NaiveBayes after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances        176        87.1287 %
Incorrectly Classified Instances       26        12.8713 %
Kappa statistic                         0.7426
Mean absolute error                     0.1269
Root mean squared error                 0.3397
Relative absolute error                25.3785 %
Root relative squared error            67.9319 %
Coverage of cases (0.95 level)         91.5842 %
Mean rel. region size (0.95 level)     53.7129 %
Total Number of Instances             202
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.941    0.198    0.826      0.941   0.880      0.750  0.957     0.958     +
               0.802    0.059    0.931      0.802   0.862      0.750  0.957     0.959     -
Weighted Avg.  0.871    0.129    0.879      0.871   0.871      0.750  0.957     0.959
=== Confusion Matrix ===
  a   b   <-- classified as
 95   6 | a = +
 20  81 | b = -
SVM without attributes selection
=== Summary ===
Correctly Classified Instances        186        93 %
Incorrectly Classified Instances       14         7 %
Kappa statistic                         0.86
Mean absolute error                     0.07
Root mean squared error                 0.2646
Relative absolute error                14 %
Root relative squared error            52.915 %
Coverage of cases (0.95 level)         93 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.950    0.090    0.913      0.950   0.931      0.861  0.930     0.893     +
               0.910    0.050    0.948      0.910   0.929      0.861  0.930     0.908     -
Weighted Avg.  0.930    0.070    0.931      0.930   0.930      0.861  0.930     0.900
=== Confusion Matrix ===
  a   b   <-- classified as
 95   5 | a = +
  9  91 | b = -
  • 49. SVM after attributes selection (using information gain)
=== Summary ===
Correctly Classified Instances        189        94.5 %
Incorrectly Classified Instances       11         5.5 %
Kappa statistic                         0.89
Mean absolute error                     0.055
Root mean squared error                 0.2345
Relative absolute error                11 %
Root relative squared error            46.9042 %
Coverage of cases (0.95 level)         94.5 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.960    0.070    0.932      0.960   0.946      0.890  0.945     0.915     +
               0.930    0.040    0.959      0.930   0.944      0.890  0.945     0.927     -
Weighted Avg.  0.945    0.055    0.945      0.945   0.945      0.890  0.945     0.921
=== Confusion Matrix ===
  a   b   <-- classified as
 96   4 | a = +
  7  93 | b = -
SVM after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances        186        92.0792 %
Incorrectly Classified Instances       16         7.9208 %
Kappa statistic                         0.8416
Mean absolute error                     0.0792
Root mean squared error                 0.2814
Relative absolute error                15.8416 %
Root relative squared error            56.2878 %
Coverage of cases (0.95 level)         92.0792 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             202
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.980    0.139    0.876      0.980   0.925      0.848  0.921     0.869     +
               0.861    0.020    0.978      0.861   0.916      0.848  0.921     0.911     -
Weighted Avg.  0.921    0.079    0.927      0.921   0.921      0.848  0.921     0.890
=== Confusion Matrix ===
  a   b   <-- classified as
 99   2 | a = +
 14  87 | b = -
  • 50. 5.4.2 Bigram
NaiveBayes without attributes selection
=== Summary ===
Correctly Classified Instances        183        91.5 %
Incorrectly Classified Instances       17         8.5 %
Kappa statistic                         0.83
Mean absolute error                     0.085
Root mean squared error                 0.2915
Relative absolute error                17 %
Root relative squared error            58.3095 %
Coverage of cases (0.95 level)         91.5 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.930    0.100    0.903      0.930   0.916      0.830  0.920     0.883     +
               0.900    0.070    0.928      0.900   0.914      0.830  0.938     0.907     -
Weighted Avg.  0.915    0.085    0.915      0.915   0.915      0.830  0.929     0.895
=== Confusion Matrix ===
  a   b   <-- classified as
 93   7 | a = +
 10  90 | b = -
NaiveBayes after attribute selection (with IG)
=== Summary ===
Correctly Classified Instances        181        90.5 %
Incorrectly Classified Instances       19         9.5 %
Kappa statistic                         0.81
Mean absolute error                     0.095
Root mean squared error                 0.3082
Relative absolute error                19 %
Root relative squared error            61.6441 %
Coverage of cases (0.95 level)         90.5 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.970    0.160    0.858      0.970   0.911      0.817  0.913     0.855     +
               0.840    0.030    0.966      0.840   0.898      0.817  0.904     0.891     -
Weighted Avg.  0.905    0.095    0.912      0.905   0.905      0.817  0.908     0.873
=== Confusion Matrix ===
  a   b   <-- classified as
 97   3 | a = +
 16  84 | b = -
  • 51. NaiveBayes after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances        195        96.5347 %
Incorrectly Classified Instances        7         3.4653 %
Kappa statistic                         0.9307
Mean absolute error                     0.0366
Root mean squared error                 0.1802
Relative absolute error                 7.3105 %
Root relative squared error            36.0455 %
Coverage of cases (0.95 level)         97.0297 %
Mean rel. region size (0.95 level)     51.2376 %
Total Number of Instances             202
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.980    0.050    0.952      0.980   0.966      0.931  0.982     0.973     +
               0.950    0.020    0.980      0.950   0.965      0.931  0.983     0.986     -
Weighted Avg.  0.965    0.035    0.966      0.965   0.965      0.931  0.983     0.980
=== Confusion Matrix ===
  a   b   <-- classified as
 99   2 | a = +
  5  96 | b = -
SVM without attributes selection
=== Summary ===
Correctly Classified Instances        129        64.5 %
Incorrectly Classified Instances       71        35.5 %
Kappa statistic                         0.29
Mean absolute error                     0.355
Root mean squared error                 0.5958
Relative absolute error                71 %
Root relative squared error           119.1638 %
Coverage of cases (0.95 level)         64.5 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               1.000    0.710    0.585      1.000   0.738      0.412  0.645     0.585     +
               0.290    0.000    1.000      0.290   0.450      0.412  0.645     0.645     -
Weighted Avg.  0.645    0.355    0.792      0.645   0.594      0.412  0.645     0.615
=== Confusion Matrix ===
  a   b   <-- classified as
100   0 | a = +
 71  29 | b = -
  • 52. SVM after attributes selection (IG)
=== Summary ===
Correctly Classified Instances        191        95.5 %
Incorrectly Classified Instances        9         4.5 %
Kappa statistic                         0.91
Mean absolute error                     0.045
Root mean squared error                 0.2121
Relative absolute error                 9 %
Root relative squared error            42.4264 %
Coverage of cases (0.95 level)         95.5 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             200
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.980    0.070    0.933      0.980   0.956      0.911  0.955     0.925     +
               0.930    0.020    0.979      0.930   0.954      0.911  0.955     0.945     -
Weighted Avg.  0.955    0.045    0.956      0.955   0.955      0.911  0.955     0.935
=== Confusion Matrix ===
  a   b   <-- classified as
 98   2 | a = +
  7  93 | b = -
SVM after using Principal components analysis (PCA)
=== Summary ===
Correctly Classified Instances        195        96.5347 %
Incorrectly Classified Instances        7         3.4653 %
Kappa statistic                         0.9307
Mean absolute error                     0.0347
Root mean squared error                 0.1862
Relative absolute error                 6.9307 %
Root relative squared error            37.2309 %
Coverage of cases (0.95 level)         96.5347 %
Mean rel. region size (0.95 level)     50 %
Total Number of Instances             202
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.980    0.050    0.952      0.980   0.966      0.931  0.965     0.943     +
               0.950    0.020    0.980      0.950   0.965      0.931  0.965     0.956     -
Weighted Avg.  0.965    0.035    0.966      0.965   0.965      0.931  0.965     0.949
=== Confusion Matrix ===
  a   b   <-- classified as
 99   2 | a = +
  5  96 | b = -
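The Kappa statistic and MCC values that Weka prints in these summaries can be cross-checked from a confusion matrix alone. The sketch below uses the standard definitions of Cohen's kappa and the Matthews correlation coefficient (assumed to be what Weka reports) with a hypothetical helper name, applied to the unigram NaiveBayes matrix without attribute selection from Section 5.4.1 (93, 7 / 14, 86).

```python
from math import sqrt

def kappa_and_mcc(tp, fn, fp, tn):
    """Cohen's kappa and Matthews correlation for a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return kappa, mcc

# Row "a = +": 93 classified as +, 7 as -; row "b = -": 14 as +, 86 as -.
kappa, mcc = kappa_and_mcc(tp=93, fn=7, fp=14, tn=86)
print(round(kappa, 2), round(mcc, 3))
```

Both rounded values match the reported output for that run (Kappa statistic 0.79, MCC 0.792).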
  • 53. 5.5 Comparison
The following table (Table 3) summarizes the classification results of the classifiers used in this work, which are also shown in Figures 8 to 12.
Table 3
N-Gram   Attributes                   Classifier/Lex  Accuracy   RMS       Precision  Recall   F-Measure
Uni      All Attributes               Naive Bayes     89.50%     0.324     0.897      0.895    0.895
Uni      All Attributes               SVM             93.00%     0.2646    0.931      0.93     0.93
Uni      Selected Attrs (Info. Gain)  Naive Bayes     89.50%     0.3242    0.902      0.895    0.895
Uni      Selected Attrs (Info. Gain)  SVM             94.50%     0.2345    0.945      0.945    0.945
Uni      Using PCA                    Naive Bayes     87.13%     0.3397    0.879      0.871    0.871
Uni      Using PCA                    SVM             92.08%     0.2814    0.927      0.921    0.921
Bi+Uni   All Attributes               Naive Bayes     91.50%     0.2915    0.915      0.915    0.915
Bi+Uni   All Attributes               SVM             64.50%     0.5958    0.792      0.645    0.594
Bi+Uni   Selected Attrs (Info. Gain)  Naive Bayes     90.50%     0.3082    0.912      0.905    0.905
Bi+Uni   Selected Attrs (Info. Gain)  SVM             95.50%     0.2121    0.956      0.955    0.955
Bi+Uni   Using PCA                    Naive Bayes     (96.53%)   (0.1802)  (0.966)    (0.965)  (0.965)
Bi+Uni   Using PCA                    SVM             (96.53%)   0.1862    (0.966)    (0.965)  (0.965)
Lexicon  -                            SentiWordNet    72.50%     0.5244    0.6552     0.95     0.776
Lexicon  -                            Bing Liu's      76.00%     0.4899    0.6857     0.96     0.8
  • 54. 5.6 Charts
Figure 8: Accuracy for Lex, B-P, B-S, B-A, U-P, U-S and U-A (series: Naive Bayes/SWN and SVM/Bing Liu's)
Figure 9: Root mean squared error (RMS) for the same configurations and series
  • 55. Figure 10: Precision for the same configurations and series
Figure 11: Recall for the same configurations and series
  • 56. Figure 12: F-Measure for the same configurations and series
Where:
U-A = Unigram - All attributes
U-S = Unigram - Selected attributes
U-P = Unigram - PCA
B-A = Bigram - All attributes
B-S = Bigram - Selected attributes
B-P = Bigram - PCA
Lex = Lexicon
  • 57. 5.7 UI for predictions preview
To make it easier to inspect the detailed predictions of each classifier and lexicon, we developed a C# application (see Figure 13) that navigates through the product reviews, lists the prediction of every classifier and lexicon for each review, and compares it with the review's actual class. The test set used in this work (200 product reviews from the cameras domain) is loaded into the application, so we can step through each review and display its actual class alongside the predictions of the different classifiers.

Figure 13

As shown in Figure 13, the displayed review has a negative actual class and is classified correctly by the following: Naïve Bayes with all attributes on unigram or bigram feature vectors; Naïve Bayes after attribute selection using information gain on unigram or bigram feature vectors; SVM with IG attribute selection, with PCA, or without attribute selection, on either unigram or bigram feature vectors; Naïve Bayes with PCA on the bigram feature vector; and the SentiWordNet lexicon. The review was misclassified by Naïve Bayes after PCA on the unigram feature vector and by the opinion lexicon (Bing Liu's).
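The comparison that the preview application performs for each review can be sketched as follows. This is only an illustration in Python, not the C# application itself; the function name, the short classifier labels and the dictionary layout are hypothetical, while the predictions mirror the example described for Figure 13.

```python
def compare_predictions(actual, predictions):
    """Split classifier names into those matching the actual class and those not."""
    correct = sorted(name for name, label in predictions.items() if label == actual)
    wrong = sorted(name for name, label in predictions.items() if label != actual)
    return correct, wrong


# The review discussed for Figure 13: actual class is negative ('-').
# Labels follow the abbreviations of section 5.6 (U = unigram, B = bigram,
# A = all attributes, S = selected attributes, P = PCA).
figure_13_example = {
    "NB U-A": "-", "NB B-A": "-",
    "NB U-S": "-", "NB B-S": "-",
    "NB U-P": "+", "NB B-P": "-",
    "SVM U-A": "-", "SVM B-A": "-",
    "SVM U-S": "-", "SVM B-S": "-",
    "SVM U-P": "-", "SVM B-P": "-",
    "SentiWordNet": "-",
    "Opinion lexicon": "+",
}

correct, wrong = compare_predictions("-", figure_13_example)
print("Correct:", ", ".join(correct))
print("Misclassified:", ", ".join(wrong))
```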
  • 58. 5.8 Application for live sentiment analysis
Because analyzing live product reviews is of interest, we developed a Java application (see Figure 14) that fetches all the reviews of a product from its web page and analyzes them. The product is identified by its URL on the Amazon or GSMArena websites; alternatively, the user can type a review manually. Using the chosen lexicon (SentiWordNet or Bing Liu's), the application analyzes the review(s) and displays how positive and/or negative they are, along with the score for every reviews file or URL.

Figure 14

The input to this application can be a reviews file in .arff format or the URL of a product reviews page on Amazon.com or gsmarena.com. The output can take three forms (see Figure 15): first, it may be exported to an external XML file; second, the sentiment results for the document or URL may be shown as a pie chart; third, the results may be shown as a table, as in Figure 15.
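The lexicon-based scoring step can be illustrated with a small sketch. It is written in Python rather than Java, and it is not the application's actual code: the two tiny word lists stand in for a real lexicon such as SentiWordNet or Bing Liu's, and the function names and the simple count-based score are assumptions made for illustration.

```python
import re
from collections import Counter

# Toy stand-ins for a real opinion lexicon (hypothetical word lists).
POSITIVE = {"good", "great", "excellent", "sharp", "love"}
NEGATIVE = {"bad", "poor", "blurry", "terrible", "broken"}


def score_review(text):
    """Score = number of positive words minus number of negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def label(score):
    """Map a score to '+', '-' or '0' (no polarity detected)."""
    if score > 0:
        return "+"
    if score < 0:
        return "-"
    return "0"


def summarize(reviews):
    """Share of positive, negative and neutral reviews (e.g. for a pie chart)."""
    counts = Counter(label(score_review(r)) for r in reviews)
    total = len(reviews)
    return {k: counts[k] / total for k in ("+", "-", "0")}


reviews = [
    "Great camera, sharp pictures",
    "Terrible battery and blurry photos",
    "It arrived on Monday",
]
for review in reviews:
    print(label(score_review(review)), review)
print(summarize(reviews))
```

`summarize` expects at least one review; an empty list would need separate handling.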
  • 59. Figure 15
  • 60. Chapter 6 Conclusion
The experimental work shows that investing effort in the preprocessing phase, and carefully applying appropriate attribute extraction and attribute selection methods, leads to better classification results even with fewer features and lower classification cost. In our experiments, principal components analysis used for attribute selection suited this text classification task well, giving the highest classification accuracy, precision, recall and F-measure.

We applied two different approaches to sentiment analysis: the opinion lexicon approach (SentiWordNet and Bing Liu's) and the supervised machine learning approach (Naive Bayes and SVM). The supervised machine learning approach demonstrated high-quality results, reaching 96.53% at best, with precision of 88 to 96.6% and accuracy of 87 to 96.5% on camera and photo product reviews, compared with the relatively low measures given by the opinion lexicon approach. As mentioned in chapter one, the lexicon approaches classify poorly because an opinion lexicon is necessary but not sufficient for sentiment classification.

However, from our initial experience with sentiment detection, we have identified a few areas of potentially substantial improvement in opinion lexicon classification. First, we expect that applying negation detection would give better polarity detection with the opinion lexicon approach, and thus better analysis results. Second, more advanced sentiment patterns currently require a fair amount of manual validation. Although some human expert involvement may be inevitable in validation to handle the semantics accurately, we plan further research on increasing the accuracy of the sentiment analysis.

Beyond the potential improvements above, it is also important to state that some issues in the opinion lexicon classification field remain very hard for researchers to solve; some of these are discussed in chapter two, section 2.4.
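To make the negation detection idea concrete, the sketch below flips the polarity of opinion words that appear shortly after a negator. It is a simple window heuristic chosen for illustration, not a method evaluated in this work; the word lists, the function names and the window size of two tokens are all assumptions.

```python
import re

# Toy stand-ins for a real opinion lexicon (hypothetical word lists).
POSITIVE = {"good", "great", "excellent", "sharp"}
NEGATIVE = {"bad", "poor", "blurry", "terrible"}


def is_negator(token):
    """Treat 'not', 'no', 'never' and contractions ending in n't as negators."""
    return token in {"not", "no", "never"} or token.endswith("n't")


def score_with_negation(text, window=2):
    """Lexicon score where words within `window` tokens after a negator are flipped."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    score = 0
    flips_left = 0  # number of upcoming tokens still under a negation
    for token in tokens:
        if is_negator(token):
            flips_left = window
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if flips_left > 0:
            polarity = -polarity
            flips_left -= 1
        score += polarity
    return score


print(score_with_negation("The pictures are good"))
print(score_with_negation("The lens is not good"))
print(score_with_negation("The photos are never blurry"))
```

A fixed window is a crude choice: it misses negators that sit far from the opinion word and can wrongly flip words in a following clause, which is part of why more careful negation handling remains future work.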