Reaction Paper Discussing Articles in Fields of Outlier Detection & Sentiment Analysis

Khan Mostafa
Graduate Student, Computer Science, Stony Brook University, NY 11794, USA
Email: khan.mostafa@stonybrook.edu
Student ID# 109365509

ABSTRACT

This reaction paper is submitted as an assignment to critique and brainstorm upon reading a few papers.

1 INTRODUCTION

In this article I discuss four papers, two in the field of outlier detection and two in the field of sentiment analysis. The first publication introduced the Local Outlier Factor (LOF), a density-based approach to detecting outliers. LOF is a useful and widely employed approach, although it does not work well in high dimensions. The second article proposes using angular measures to detect outliers in high dimensions. The remaining two articles address sentiment analysis: one examines appraisal taxonomies for sentiment analysis, and the other uses Twitter as a corpus for sentiment analysis.

2 RELATED TERMS

2.1 Outlier detection

Two of the discussed papers address outlier detection. An outlier is an object significantly different from the rest (the "normal" objects). In clustering, an outlier is a point that does not fit into any cluster. Outliers are of interest in many settings; in particular, it is often important to detect anomalies. Anomalous events or objects cannot be detected using supervised learning, as the nature of the anomalies is unknown in advance, so an unsupervised method is more suitable. Outlier detection can also be applied before clustering a dataset: removing outlying objects helps produce better clusters. Sometimes, outliers are the outstanding or crucial points of a system.

2.2 Sentiment Analysis

Identifying sentiment is important to many parties. In particular, corporations, politicians, and banks want to know how people feel about a certain product, campaign, or other entity. Sentiment is generally an expression of emotion or feeling regarding some object. A human can usually identify sentiment by reading text.
But to understand public opinion, applications need to extract sentiment from massive amounts of text. To do this, approaches are drawn from fields spanning data mining, natural language processing, and statistics. The trend in sentiment analysis is to identify whether a text is subjective and whether it conveys positive or negative sentiment.

[Reaction Paper submitted for CSE590 Networks and Data Mining Techniques on 22/10/2013]

3 REACTIONS AND OUTLINES

In this section the method presented in each paper is briefly outlined and then reacted upon.

3.1 Local Outlier Factor for Outlier Detection

Earlier approaches to outlier detection considered outliers globally. However, a more appropriate way of measuring outliers is to measure how far they deviate from the cluster they would belong to if they were not outliers; that is, outlierness should be calculated locally, based on how deviant a point is from its neighbors. One early approach to considering outliers locally was by Knorr and Ng (Knorr and Ng, Finding Intensional Knowledge of Distance-Based Outliers 1999) (Knorr and Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets 1998), who proposed the notion of distance-based outlier detection. A more efficient algorithm was later proposed that considers the distance to the k nearest neighbors (Ramaswamy, Rastogi and Shim 2000). However, distance is not an appropriate measure when the densities of clusters vary. The work being examined (Breunig, et al. 2000) advanced the local approach by introducing a density-based concept, the Local Outlier Factor (LOF). The authors posit that "being outlying is not a binary property". Hence, for each point a score, the LOF, is calculated, which estimates the degree of being an outlier. To compute it, they calculate the reachability distance of each point and then estimate its reachability density: the inverse of the average reachability distance with respect to the k (= MinPts) nearest neighbors. The LOF of a point p is then calculated as "the average of the ratio of the local reachability density of p and those of p's MinPts-nearest neighbors". LOF is higher when a point's local density is lower than that of its neighbors.
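The definitions above can be condensed into a minimal pure-Python sketch. This is an illustration only, not an optimized implementation (it recomputes neighborhoods naively), and the toy data set is hypothetical:

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (excluding itself)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    return [j for j in order if j != i][:k]

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return math.dist(points[i], points[knn(points, i, k)[-1]])

def reach_dist(points, i, j, k):
    """Reachability distance of i w.r.t. j: max(k-distance(j), d(i, j))."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))

def lrd(points, i, k):
    """Local reachability density: inverse of the mean reachability distance."""
    nbrs = knn(points, i, k)
    return 1.0 / (sum(reach_dist(points, i, j, k) for j in nbrs) / k)

def lof(points, i, k):
    """LOF: average ratio of the neighbours' lrd to the point's own lrd."""
    nbrs = knn(points, i, k)
    return sum(lrd(points, j, k) for j in nbrs) / (k * lrd(points, i, k))

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = [lof(pts, i, 3) for i in range(len(pts))]
# points inside the cluster score close to 1; the isolated point scores far above 1
```

On this toy data the five clustered points receive scores near 1 while the isolated point (8, 8) scores an order of magnitude higher, matching the paper's characterization of outlierness as a degree rather than a binary property.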
A point deep inside a cluster (a point that is not outlying) has a reachability distance similar to that of its neighbors. Hence, it has been shown that the LOF of non-outlying points is approximately one. Estimation of LOF is largely influenced by the parameter MinPts, the number of neighboring points with respect to which the reachability density is measured. If MinPts is larger than the number of points in some cluster C, then all points in C will have LOF much larger than 1. Conversely, if MinPts is too small, then outliers that are neighbors to n < MinPts other outlying points may have an LOF score of approximately one. Therefore, it is suggested that MinPts be chosen heuristically.

Since being proposed, LOF has gained a lot of attention and has been widely studied over the last decade. Because LOF depends on the parameter MinPts, another approach was suggested (Papadimitriou, et al. 2003) that calculates the Local Correlation Integral (LOCI). Here, a sampling neighborhood of radius r and a counting neighborhood of radius αr are considered. For a point p, the points in its sampling neighborhood are taken, and for each such point the number of points within its counting neighborhood is computed; these counts, combined with their mean and standard deviation, are used to flag outliers. The authors also introduced the concept of the LOCI plot. However, LOCI estimates depend on the input parameter α. Some other variations and extensions that have been studied are:
  • Outlier Detection using In-degree Number (ODIN) (Hautamaki, Karkkainen and Franti 2004)
  • Connectivity-based Outlier Factor (COF) (Tang, et al. 2002)
  • Using a probabilistic suffix tree (PST) for detecting nearest neighbors
  • An approach that enhances efficiency by first creating micro-clusters (Jin, Tung and Han 2001)
  • A study covering the case when clusters of different densities lie close together

Many other studies cover nearly every variation and extension that comes to mind; I will not survey them all. LOF is a very useful measure, as it can identify outliers in a local context. It also covers global outliers, since every global outlier is a local outlier. A major weakness, however, is its computational cost of O(n²). To reduce this complexity, one approach might be to use some kind of locality hashing, where a prior pass hashes each point into a bucket of neighboring points. A grid-based approach can also be employed: for a point, k (= MinPts) points are chosen randomly from the bucket (or grid cell) it belongs to, and if a cell contains fewer than k points, nearby cells can be consulted. Another improvement could be to calculate a reachability distance for each grid cell a priori while assigning points to cells.

Some normal points can have very few neighbors; in such cases LOF may yield a high score for them, wrongly indicating that they are outlying. LOF is density based, and density is defined in terms of distance. In higher dimensions, distances become almost uniform across points (the curse of dimensionality), so LOF cannot be employed directly; feature bagging is often suggested instead. In high dimensions, when there is a need to select a few features, LOF can itself be used: a few features can be used at a time to estimate LOFs, and those feature sets that yield less diverse LOFs (i.e. yield high LOF for fewer points) are potentially good feature subsets. LOF is a spatial algorithm, so it cannot be used in situations where there is no distance measure.

LOFs can also be used to cluster points. In this case, hierarchical clustering can be employed: when a point is calculated to have LOF approximately 1, it can be assigned to the cluster its neighbors belong to. LOF can also identify anomalies within clusters. Say a small portion of a cluster is significantly more or less dense than the rest; the points in it will receive LOF scores that differ from those of the other points in the cluster.

3.2 Angle Based Outlier Detection in Higher Dimension

In higher dimensions, distances become uniform. However, Kriegel, Schubert and Zimek (2008) assumed that angles are more stable than distances and that outliers reside on the periphery of the data. Their method (ABOD) considers, for the point in question, the angles to pairs of other points. A point is deemed outlying if most other points lie to one side of it, i.e. if the angles seen from this point vary little, whereas normal points see a broad spread of angles.
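A minimal, unweighted sketch of this idea follows. The full ABOD score additionally weights the angle terms by the distances involved; here only the variance of the angle cosines is computed, and the toy data set is illustrative:

```python
import math
from itertools import combinations

def abof(points, i):
    """Unweighted angle-based outlier factor: variance of the cosines of the
    angles between all pairs of difference vectors seen from points[i]."""
    px, py = points[i]
    others = [q for j, q in enumerate(points) if j != i]
    cosines = []
    for (ax, ay), (bx, by) in combinations(others, 2):
        va, vb = (ax - px, ay - py), (bx - px, by - py)
        dot = va[0] * vb[0] + va[1] * vb[1]
        cosines.append(dot / (math.hypot(*va) * math.hypot(*vb)))
    mean = sum(cosines) / len(cosines)
    return sum((c - mean) ** 2 for c in cosines) / len(cosines)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
spread = [abof(pts, i) for i in range(len(pts))]
# the isolated point sees all others on one side, so its angle spread is smallest
```

Note the inverted convention relative to LOF: a *low* angle spread marks an outlier, since all other points fall within a narrow cone as seen from it.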
ABOD's drawback is its very high computational complexity. This can be mitigated by selecting random points: angles from an outlying point will be similar even if the reference points are chosen at random. Another way could be to divide the features into subspaces and calculate angles within each subspace; these measures can then be used collectively.

3.3 Appraisal Taxonomies in Sentiment Analysis

Sentiment analysis has been heavily investigated for more than a decade, instigated by Pang et al. (Pang, Lee and Vaithyanathan 2002) when they treated sentiment classification as a case of topic-based categorization. Sentiment, however, at any level of granularity (article, paragraph or sentence) is generally perceived by humans as appraisal: many words and phrases are used to praise things, and many to express negative comments about them. This case was investigated by Whitelaw et al. (Whitelaw, Garg and Argamon 2005). They identified the need for semantic analysis of attitude expressions and hypothesized that the atomic units of sentiment expression are not individual words but rather appraisal groups, characterized by Attitude, Orientation, Graduation and Polarity [see Appraisal Theory (Martin and White 2005)]. Based on WordNet and two other thesauri, they constructed a lexicon, using coarse relevance ranking to enlist terms; the final set of terms was produced by manual examination. They then tested several feature sets and found that the union of bag-of-words and appraisal groups by attitude and orientation (BoW + GAO) yields the best result. The proposed approach is not very scalable, as it requires a lot of manual labor and cannot produce appraisal estimates for a large vocabulary. It also employs a computation-intensive classification technique. Still, this investigation highlights the fact that appraisal is not expressed through adjectives alone.
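The shape of the BoW + GAO feature union can be sketched as follows. The miniature lexicon here is a hypothetical stand-in for their WordNet-derived, manually filtered lexicon, and only the attitude/orientation attributes are modeled:

```python
# A hypothetical miniature appraisal lexicon: word -> (attitude type, orientation).
LEXICON = {
    "brilliant": ("affect", "positive"),
    "dull": ("appreciation", "negative"),
}

def bow_gao_features(text):
    """Union of bag-of-words features and appraisal-group features (BoW + GAO)."""
    feats = {}
    for tok in text.lower().split():
        feats["bow=" + tok] = feats.get("bow=" + tok, 0) + 1  # lexical feature
        if tok in LEXICON:
            attitude, orientation = LEXICON[tok]
            key = "gao=%s/%s" % (attitude, orientation)      # appraisal feature
            feats[key] = feats.get(key, 0) + 1
    return feats

feats = bow_gao_features("A brilliant but dull story")
# feats mixes surface features (bow=brilliant, bow=story, ...) with abstract
# appraisal features (gao=affect/positive, gao=appreciation/negative)
```

The abstract gao=* features let a classifier generalize across documents that use different vocabulary to express the same kind of appraisal, which is the motivation for going beyond plain bag-of-words.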
Other parts of speech in a sentence are also responsible for its sentiment, and several studies have tried to employ adverbs and verbs along with adjectives to estimate it. Overall, it can be said that sentiment is expressed through the tone of the sentence, and different parts of speech occur with different frequencies in positive and negative statements. Hence, the subjectivity of statements can be scored by part-of-speech tagging followed by some classifier. Furthermore, nouns and names can also carry polarity, especially when comparative phrases are used. The same word can express different feelings in different contexts; one approach could be to record appraisal scores of words together with their contexts. Yet some problems remain when qualifiers indicate opposite feelings depending on the word they modify (e.g. "fast access" as opposed to "fast heating" in a description of PC RAM).

3.4 Sentiment Analysis in Twitter

Twitter is a widely used micro-blogging platform where people often convey sentiment, and several studies have tried to analyze sentiment on it. One of them is by A. Pak and P. Paroubek (Pak and Paroubek 2010), in which the authors build a sentiment corpus from tweets. They exploited the fact that users put emoticons in tweets, using the emoticons to label the data and build a sentiment lexicon. With this corpus they trained classifiers based on n-grams and part-of-speech tags and used them to estimate the sentiment of tweets.
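The emoticon-labeling pipeline can be sketched as a tiny distant-supervision example. The emoticon sets, the corpus, and the smoothing details below are illustrative assumptions; the paper's actual classifier also uses n-grams beyond unigrams and POS tags:

```python
import math
from collections import Counter

# Emoticons act as noisy class labels; a unigram Naive Bayes classifier is
# then trained on the auto-labelled tweets.
POS_EMO, NEG_EMO = {":)", ":-)", ":D"}, {":(", ":-("}

def tokens(tweet):
    # strip the label-bearing emoticons so the classifier cannot cheat
    return [w.lower() for w in tweet.split() if w not in POS_EMO | NEG_EMO]

def label(tweet):
    words = set(tweet.split())
    if words & POS_EMO:
        return "pos"
    if words & NEG_EMO:
        return "neg"
    return None  # no emoticon: tweet is not used for training

def train(tweets):
    counts = {"pos": Counter(), "neg": Counter()}
    for t in tweets:
        y = label(t)
        if y:
            counts[y].update(tokens(t))
    return counts

def classify(counts, tweet):
    vocab = len(set(counts["pos"]) | set(counts["neg"]))
    scores = {}
    for y, c in counts.items():
        total = sum(c.values())
        # add-one smoothed log-likelihood of the unigrams under class y
        scores[y] = sum(math.log((c[w] + 1) / (total + vocab))
                        for w in tokens(tweet))
    return max(scores, key=scores.get)

corpus = ["great day :)", "love this phone :D", "awful service :(", "so sad :("]
model = train(corpus)
```

With this toy model, classify(model, "great phone") picks the positive class because "great" and "phone" co-occurred with positive emoticons during training, even though no human labeled any tweet.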
Their work is elegant in that it can estimate sentiment in real time. However, an extension could be to output a sentiment score instead of a strict classification (negative, positive, objective). The approach also cannot deal with contextual sentiment dependency. One way to address this could be to first extract keywords from the tweet and then associate appraisal keywords with objective key terms, estimating how each subjective word expresses sentiment for that particular key term. The key terms could even be used to identify the context category first. Sentiment analysis also struggles with sarcastic and ironic speech; although there are some studies addressing this, it requires more investigation, and more rigorous language-processing and logic techniques may be needed to estimate irony effectively. Hence, a perfect sentiment analysis tool is yet to emerge.

4 CONCLUSION

In this paper, four articles are discussed. The first pair of papers, on outlier detection, address different aspects of the same problem. LOF is widely studied and used; hence there is a multitude of approaches to enhance and extend it, as well as to improve its computational complexity. The second article (ABOD) is also motivated by LOF. At present, not much connection has been found between sentiment analysis and outlier detection. However, outlier detection can be useful in opinion mining of mass data: when opinion about some entity is mined, the first step is to pull statements about that entity, and the pulled statements may include some that are not actually about that very entity. These outlying statements can be filtered out to better reflect sentiment about the entity.

5 BIBLIOGRAPHY

Breunig, Markus M., Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. "LOF: Identifying Density-Based Local Outliers." International Conference on Management of Data SIGMOD. ACM. 93-104.
Hautamaki, Ville, Ismo Karkkainen, and Pasi Franti. 2004. "Outlier detection using k-nearest neighbour graph." Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004. IEEE. 430-433.

Jin, Wen, Anthony K. H. Tung, and Jiawei Han. 2001. "Mining top-n local outliers in large databases." Knowledge Discovery and Data Mining - KDD. 293-298.

Knorr, Edwin M., and Raymond T. Ng. 1998. "Algorithms for Mining Distance-Based Outliers in Large Datasets." Very Large Data Bases - VLDB. 392-403.

—. 1999. "Finding Intensional Knowledge of Distance-Based Outliers." Very Large Data Bases - VLDB. 211-222.

Kriegel, Hans-Peter, Matthias Schubert, and Arthur Zimek. 2008. "Angle-based outlier detection in high-dimensional data." Knowledge Discovery and Data Mining - KDD. 444-452.
Martin, J. R., and P. R. R. White. 2005. Language of Evaluation: Appraisal in English. London: Palgrave. http://grammatics.com/appraisal/.

Pak, Alexander, and Patrick Paroubek. 2010. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining." Language Resources and Evaluation. 1320-1326.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. "Thumbs up? Sentiment Classification using Machine Learning Techniques." Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Philadelphia, PA, USA: Association for Computational Linguistics. 79-86.

Papadimitriou, Spiros, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Faloutsos. 2003. "LOCI: Fast outlier detection using the local correlation integral." Proceedings, 19th International Conference on Data Engineering. IEEE. 315-326.

Ramaswamy, Sridhar, Rajeev Rastogi, and Kyuseok Shim. 2000. "Efficient Algorithms for Mining Outliers from Large Data Sets." Proc. ACM SIGMOD Int. Conf. on Management of Data. ACM. 427-438.

Tang, Jian, Zhixiang Chen, Ada Wai-Chee Fu, and David W. Cheung. 2002. "Enhancing effectiveness of outlier detections for low density patterns." In Advances in Knowledge Discovery and Data Mining, 535-548. Springer Berlin Heidelberg.

Whitelaw, Casey, Navendu Garg, and Shlomo Argamon. 2005. "Using appraisal groups for sentiment analysis." Proceedings of the 14th ACM International Conference on Information and Knowledge Management. ACM. 625-631.
