This review is negative (-) based on the following clues:
- The critique mentions it presents a "cool idea" in a "very bad package"
- It says "these folks just didn't snag this one correctly" implying they failed to execute the concept well
- It directly asks "what are the problems with the movie?" suggesting there are significant issues
- The rest of the critique goes on to list specific problems with the movie's execution
The tone is one of disappointment in a missed opportunity rather than praise for a job well done. Words like "bad", "didn't snag correctly" and the focus on problems indicate negative sentiment toward the film.
Various examples of observational studies, mostly for the analysis of social media.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Basic concepts about natural experiments, based mostly on Dunning's book.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Predictions of links in graphs based on content and information propagations.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Brief tutorial on Influence and Homophily in social networks. Key concepts. How to distinguish influence from correlation. Information diffusion processes. Influence Maximization Problem
and viral marketing.
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
A Bayesian classifier works well in some domains and poorly in others. Its performance suffers in domains that involve correlated features. Feature selection is beneficial in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase in the dimensionality of data poses a hard challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this paper, a Bayesian classifier with Correlation Based Feature Selection is introduced, which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of our method is presented through broad.
Identifying Prominent Life Events on Twitter - K-Cap 2015Tom Dickinson
Social media is a common place for people to post and share
digital reflections of their life events, including major events
such as getting married, having children, graduating, etc.
Although the creation of such posts is straightforward, the
identification of events on online media remains a challenge.
Much research in recent years focused on extracting major
events from Twitter, such as earthquakes, storms, and
floods. This paper, however, targets the automatic detection
of personal life events, focusing on five events that psychologists
found to be the most prominent in people's lives. We
define a variety of features (user, content, semantic and interaction)
to capture the characteristics of those life events
and present the results of several classification methods to
automatically identify these events in Twitter. Our proposed
classification methods obtain results between 0.84 and
0.92 F1-measure for the different types of life events. A novel
contribution of this work also lies in a new corpus of tweets,
which has been annotated by using crowdsourcing and that
constitutes, to the best of our knowledge, the first publicly
available dataset for the automatic identification of personal
life events from Twitter.
UNIT V TEXT AND OPINION MINING
Text Mining in Social Networks -Opinion extraction – Sentiment classification and clustering -
Temporal sentiment analysis - Irony detection in opinion mining - Wish analysis – Product review mining – Review Classification – Tracking sentiments towards topics over time
Social Media Mining - Chapter 9 (Recommendation in Social Media)SocialMediaMining
R. Zafarani, M. A. Abbasi, and H. Liu, Social Media Mining: An Introduction, Cambridge University Press, 2014.
Free book and slides at http://socialmediamining.info/
Practical Opinion Mining for Social MediaDiana Maynard
This tutorial will introduce the concepts of sentiment analysis and opinion mining from unstructured text in social media, looking at why they are useful and what tools and techniques are available. It will cover both rule-based and machine learning techniques, provide some background information on the key underlying NLP processes required, and look in detail at some of the major problems and solutions, such as detection of sarcasm, use of informal language, spam opinion detection, trustworthiness of opinion holders, and so on. The techniques will be demonstrated with real applications developed in GATE, an open-source language processing toolkit. Links are provided to some hands-on material to try at home.
These are 5 fraud cases discovered by means of "Data Trends". All 5 were identified simply by analyzing data. Confirming the fraud required more effort, which in some cases meant reviewing system reports and conducting interviews.
I conclude that Audit Trails in ERP applications can be a vital starting point when investigating suspicious activities by End Users
The Employee Point of View: The Economic DownturnCitrix Online
This new SHRM Research white paper takes a close look at how employees are responding to the economic recession and anticipates likely workforce trends during the recovery.
A publication of the leading Greek newspaper "Proto Thema" on Dimitris Tsigos's participation in the G20Y and G20YES conferences, representing the young entrepreneurs of Europe.
Sentiment analysis, also known as opinion mining, is a field of computer science that focuses on automatically identifying the opinions and feelings expressed in text, audio and video. It aims to determine whether a document expresses a subjective view (positive, negative, or neutral) or presents objective facts.
Sentiment analysis involves determining the sentiment expressed by a writer in a document. The objective of the opinion-mining field is to conduct subjectivity analysis, indicating whether a document is subjective or objective. Subjectivity implies the presence of sentiment, while objectivity signifies content devoid of sentiment. Currently, an abundance of information about a specific product is available, with a single product often garnering hundreds of reviews across various webpages. Numerous websites, such as imdb.com, amazon.com, idlebrain.com, among others, aggregate user information and expert opinions to publish reviews. Experts meticulously analyze reviews, extract opinions, and generate ratings related to the dataset provided by the requesting agencies. However, handling the vast amount of data is a labor-intensive task for experts. The continuously growing volume of web data poses challenges in extracting precise opinions from content. Hence, there is a need to design a system that can efficiently perform these tasks with human-like accuracy.
In this research work, the proposed approach is capable of handling and analyzing large numbers of reviews. The reviews under analysis are pre-analyzed with existing algorithms and further processed through the approach proposed in the present research work. The proposed approach extracts sentiment from the available content (dataset) and determines polarity degree using sentiment polarity and degree management. It also measures sentiment degrees based on user-provided target document features. The outcome is a summary comprising highly sentiment-related sentences, providing valuable insights to the users. The goal is to streamline sentiment analysis processes and enhance accuracy in a manner that aligns with human-like comprehension.
Lean Analytics: Using Data to Build a Better Business FasterLean Startup Co.
Alistair Croll, Solve for Interesting , @acroll
At the core of Lean Startup approaches is a continuous cycle of measurement and learning. But what should you measure? To find the right metric, you need to understand the stage you’re at and the business model you’re in, as well as where to draw the line so you know when to cut your losses—and when to step on the gas. In these two sessions, entrepreneur and best-selling author of Lean Analytics Alistair Croll will show you how to put data to work.
How to think about data and what makes a good metric
The importance of cohorts and proper analysis
The five stages every startup goes through
Six business model archetypes and how to find your own
What “good enough” looks like and how to run experiments
What works for larger organizations trying to change and innovate.
This session is relevant for both early-stage founders and intrapreneurs in large organizations. Based on interviews with over 130 analysts, entrepreneurs, and investors, this session is packed with practical information, hard numbers, and concrete steps you can put to work immediately. Attendees need not be technical but should come armed with a basic understanding of web analytics, business metrics, and their current business model, plus a willingness to share with one another.
This workshop is sponsored by Amplitude.
We determine whether the sentiment of a sentence is positive or negative based on its part-of-speech
tags and the emoticons present in the sentence. For this research we use the most popular microblogging site,
Twitter, for sentiment orientation. In this paper we extract tweets from Twitter related to products
such as mobile phones, home appliances, vehicles, etc. After retrieving the tweets we perform some preprocessing:
we remove retweets, remove tweets containing few words (with a minimum threshold of length five), and remove tweets
containing only URLs. The remaining tweets are then transformed to lower case and stripped of punctuation,
because punctuation reduces the accuracy of the result.
After this we remove extra white space from the tweets, then apply a POS tagger to tag each word. The tuple
obtained after applying the above steps contains (word, POS tag, English-word, stop-word). We are interested only in
tweets that contain opinion and eliminate the remaining non-opinion tweets from the data set. For this we use
the Naïve Bayes classification algorithm. After this we apply short text classification to the tweets, since a word can have
a different meaning in a different domain. To address this problem we use two different feature selection
algorithms: mutual information (MI) and χ² (chi-square) feature selection. The final stage predicts the orientation of
an opinion sentence, that is, positive or negative, as mentioned above. For this we use two models: the unigram
model and the opinion miner.
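The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `preprocess`, the "RT " prefix test for retweets, and the reading of the length threshold as "at least five words" are all assumptions.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(tweets, min_words=5):
    """Filter and normalize raw tweet strings (hypothetical helper)."""
    cleaned = []
    for tweet in tweets:
        # Assumption: a retweet is recognizable by a leading "RT ".
        if tweet.startswith("RT "):
            continue
        # Drop tweets that contain nothing but URLs.
        if not URL_RE.sub("", tweet).strip():
            continue
        # Assumption: the threshold of five is counted in words.
        if len(tweet.split()) < min_words:
            continue
        text = tweet.lower().translate(PUNCT_TABLE)  # lower case, no punctuation
        cleaned.append(" ".join(text.split()))       # collapse extra white space
    return cleaned

sample = [
    "RT great phone love it so much",
    "http://t.co/abc",
    "too short",
    "I LOVE my new phone,   it is great!",
]
print(preprocess(sample))
```

Only the last sample tweet survives the filters; POS tagging and the opinion classifier would run on the cleaned output.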
Twitter Sentiment Analysis Project Done using R.
In this project we deal with the tweets database made available to us by Twitter. We clean the tweets, break them into tokens, and then analyze each word using the bag-of-words concept, rating each word as positive, negative, or neutral on the basis of its score.
We used the Naive Bayes classifier as our base.
Sentiment Analysis Using Hybrid Approach: A SurveyIJERA Editor
Sentiment analysis is the process of identifying people’s attitudes and emotional states from language. The main objective is realized by identifying a set of potential features in the review and extracting opinion expressions about those features by exploiting their associations. Opinion mining, also known as sentiment analysis, plays an important role in this process. It is the study of emotions, i.e. sentiments and expressions that are stated in natural language. Natural language techniques are applied to extract emotions from unstructured data. There are several techniques which can be used to analyze this type of data. Here, we categorize these techniques broadly as ”supervised learning”, ”unsupervised learning” and ”hybrid techniques”. The objective of this paper is to provide an overview of sentiment analysis, its challenges, and a comparative analysis of its techniques in the field of Natural Language Processing.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
Normal Labour/ Stages of Labour/ Mechanism of LabourWasim Ak
Normal labor is also termed spontaneous labor, defined as the natural physiological process through which the fetus, placenta, and membranes are expelled from the uterus through the birth canal at term (37 to 42 weeks
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
These slides describe the basic concepts of ICT, basics of email, emerging technology and digital initiatives in education. This presentation aligns with the UGC Paper I syllabus.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Digital Artifact 2 - Investigating Pavilion Designs
Text mining and analytics v6 - p2
1. Tutorial: Text Data Mining and Analytics: Part 2 HICSS 44 – January 2011 Dave King
2. Text Mining: Payoff from Simple Approaches Many of the applications of data mining to text “have proved remarkably successful without understanding specific properties of text such as the concepts of grammar or the meaning of words. Strictly low-level frequency information is used, such as the number of times a word appears in a document, and then well-known methods of machine learning are applied.” Source: S. Weiss, et. al. Text Mining: Predictive Methods for Analyzing Unstructured Information, 2005
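As a rough illustration of the "low-level frequency information" mentioned in this slide, the sketch below counts words per document and arranges the counts as a term-document matrix. The function names and the simple regular-expression tokenizer are assumptions made for the example, not something taken from the tutorial.

```python
import re
from collections import Counter

def term_frequencies(doc):
    """Count how many times each lowercase word appears in one document."""
    return Counter(re.findall(r"[a-z']+", doc.lower()))

def term_document_matrix(docs):
    """Return (vocabulary, rows), where rows[d][t] is the count of term t in doc d."""
    counts = [term_frequencies(d) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

vocab, rows = term_document_matrix(
    ["stock market investing", "investing for dummies"]
)
print(vocab)
print(rows)
```

Each row of counts can then be passed to a standard machine learning method without any knowledge of grammar or word meaning.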
4. Text Mining:Here’s a fun job! Google News is a computer-generated news site that aggregates headlines from news sources worldwide, groups similar stories together and displays them according to each reader's personalized interests…Google News has no human editors …
5. Text Mining:Text Categorization (Classification) Probably the most frequently used TM technique. Often employed in applications where there is a flow of dynamic information (emails, news articles, blogs, scientific articles, patents, medical claims, legal data …), requiring automated handling and routing. [Figure: News Articles → Category ?]
6. Text Mining:Text Categorization (Classification) Inductive, supervised machine learning process that classifies or categorizes a given document instance (of unknown classification) into one of a set of predetermined categories. [Figure labels: docs w/ known classification (training corpora); documents w/ unknown classification; train, validate, test; feature extraction/learning; feature extraction; classification algorithm; predetermined categories 1, 2, 3, … n]
9. Text Categorization:An Example “We invite you to come see the 2020 and hear about the DECSystem-20 family.’’ Gary Thuerk, DEC Marketing, 1978 DECSYSTEM-2020: a bit-slice processor with up to 512 kilowords of solid state RAM Source: http://www.newyorker.com/reporting/2007/08/06/070806fa_fact_specter#ixzz16zE3E2zO
11. Spam Detection:Size of the Problem 90 trillion – The number of emails sent on the Internet in 2009. 247 billion – Average number of email messages per day. 1.4 billion – The number of email users worldwide. 100 million – New email users since the year before. 81% – The percentage of emails that were spam. 92% – Peak spam levels late in the year. 24% – Increase in spam since last year. 200 billion – The number of spam emails per day (assuming 81% are spam).
12. Spam Detection:Size of the Problem Estimated Annual Costs of Spam in the US (in $billions) Source: blog.epostmarks.com/team-blog/2009/3/21/the-true-corporate-and-consumer-costs-of-spam.html
13. Spam Detection:Size of the Problem (Yale Univ.) Measured in millions http://www.yale.edu/its/metrics/email/index.html
15. Spam Detection:General Approaches Rules Is this email from someone@spam.com? Blacklists & Whitelists Check the subject and body of the message for particular words or phrases Problem: Need new rules to handle dynamic data Ways to alter the data (add spaces at random, non-alpha characters, misspellings, composite words, …)
17. Beginning Example:Yale University Spam Management Blocks messages from known spammers using a service called SpamHaus, a real-time database of IP addresses of verified spam sources. Content-based, central spam detection using SpamAssassin. Messages scored as spam are moved away from a user’s inbox to the Tagged-Spam folder on the server. Rules used for tagging spam are conservative. For that reason some spam gets through the first two levels of filtering. End users should train email clients to recognize and manage spam. Mail clients like Eudora or Outlook have built-in spam filters that you can train to filter messages you consider spam.
18. Spam Detection: Yale University Spam Management A set of Perl programs that uses the combined score from multiple types of checks to determine if a given message is spam including Bayesian filtering. Microsoft Outlook utilizes its SmartScreen Technology which is based on a machine-learning Bayesian technology that employs a probability-based algorithm, to determine whether email is legitimate or spam.
19. Spam Detection:Genesis of Content-Based Control “I think it’s possible to stop spam, and that content-based filters are the way to do it. The Achilles’ heel of the spammers is their message. They can circumvent any other barrier you set up. But they have to deliver their message, whatever it is. There is no way they can get around that… I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.” Paul Graham, A Plan for Spam, 2002
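A "Bayesian combination of the spam probabilities of individual words" can be sketched as below. The sketch assumes the words are independent and that spam and non-spam are equally likely a priori; the function name and the example probabilities are made up for illustration.

```python
from math import prod

def combine(word_probs):
    """Combine per-word spam probabilities into one message-level probability.

    Assumes independent words and equal priors for spam and non-spam.
    """
    spam = prod(word_probs)                  # evidence that the message is spam
    ham = prod(1 - p for p in word_probs)    # evidence that it is not
    return spam / (spam + ham)

# Hypothetical probabilities for two words found in a message.
print(combine([0.9, 0.8]))
```

Two moderately suspicious words reinforce each other, so the combined probability is higher than either input; a word with probability 0.5 leaves the result unchanged.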
21. Spam Detection:Naïve Bayesian Classifier P(H|D) = P(D|H) * P(H) / P(D), where H is the hypothesis and D is the data. P(H) is the prior probability of H: the probability that H is correct before the data D are seen. P(D|H) is the conditional probability of seeing the data D given that the hypothesis H is true. This conditional probability is called the likelihood. P(D) is the marginal probability of D. P(H|D) is the posterior probability: the probability that the hypothesis is true, given the data and the previous state of belief about the hypothesis. (Thomas Bayes)
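A toy naïve Bayesian classifier built on this rule might look like the sketch below. It treats each label as a hypothesis H and the words of a message as the data D, multiplies the per-word likelihoods under an independence assumption, and drops P(D) because it is the same for every label. The training messages, the function names, and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label) pairs. Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log P(H) + sum of log P(word|H)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)            # log prior P(H)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Likelihood P(word|H) with add-one smoothing (an assumption here).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("win cash now", "spam"),
    ("cheap cash offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
print(predict("cash offer now", model))
print(predict("meeting tomorrow", model))
```

Working in log space avoids underflow when many small probabilities are multiplied.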
24. Sentiment Analysis:The Issues and Payoffs Every hour of every day they share their opinions, issues, thoughts and sentiments about products, brands, services and companies.
25. Sentiment Analysis:Some Survey Data Activity 81% of Internet users (or 60% of Americans) have done online research on a product at least once 20% (15% of all Americans) do so on a typical day 32% have provided a rating on a product, service, or person via an online ratings system, and 30% (including 18% of online senior citizens) have posted an online comment or review regarding a product or service. Impact Among readers of online reviews of restaurants, hotels, and various services (e.g., travel agencies or doctors), between 73% and 87% report that reviews had a significant influence on their purchase Consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than a 4-star-rated item (the variance stems from what type of item or service is considered) Pew Internet & American Life Project Report, 2008.
26. Sentiment Analysis:The Issues and Payoff This evaluative text data is extremely valuable to customer-facing organizations Marketing -- Inform targeted marketing and help determine which marketing messages resonate with customers Service -- Provide more rapid response to perceived customer issues and determine the steps to take to satisfy customers Products -- Quickly determine whether there are emerging product issues, how to position products and where development dollars should be focused. It is also very voluminous – beyond addressing with armies of staff manually sifting through the data
27. Sentiment Analysis:What is it? Also called opinion mining or voice of the customer (VOC) Involves using text mining to classify subjective opinions in text into categories like "positive" or "negative" and to extract various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.
28. Sentiment Analysis: How do you know if the review is “-” or “+”
Review 1: plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-xxx movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it's simply too jumbled .
Review 2: having not seen , " who framed roger rabbit " in over 10 years , and not remembering much besides that i liked it then , i decided to rent it recently . watching it i was struck by just how brilliant a film it is . aside from the fact that it's a milestone in animation in movies ( it's the first film to combine real actors and cartoon characters , have them interact , and make it convincingly real ) and a great entertainment it's also quite an effective comedy/mystery . while the plot may be somewhat familiar the characters are original , especially baby herman , and watching them together is a lot of fun . … `who framed roger rabbit' is a rare film . one that not only presented a great challenge to the filmmakers but one that can be enjoyed by the whole family ( although some very young viewers may be a little scared by judge doom ) . do yourself a favor and rent it , `p-p-p-p-please . "
29. Sentiment Analysis:Underlying Assumption There are opinion words (aka polar words, opinion-bearing words, and sentiment words) used to express state. Positive opinion words are used to express desired states (e.g. beautiful, wonderful, good, and amazing) Negative opinion words are used to express undesired states (bad, poor, and terrible) There are also opinion phrases and idioms ( e.g. cost someone an arm and a leg) Collectively, they are called the Opinion Lexicon.
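The opinion-lexicon assumption can be illustrated with a minimal scorer that counts positive and negative opinion words. The word sets below contain only the examples listed on this slide; a real opinion lexicon is far larger, and the function name and the "+" / "-" / "0" labels are assumptions for the sketch.

```python
import string

# Only the example words from the slide, not a full lexicon.
POSITIVE = {"beautiful", "wonderful", "good", "amazing"}
NEGATIVE = {"bad", "poor", "terrible"}

def polarity(text):
    """Return "+", "-", or "0" from counts of positive and negative opinion words."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "+"
    if score < 0:
        return "-"
    return "0"

print(polarity("a very cool idea , but presents it in a very bad package"))
print(polarity("a wonderful and amazing film"))
```

Note that "cool" is not in this tiny lexicon, so the first sentence is scored on "bad" alone; coverage of the lexicon drives the quality of this approach.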
30. Sentiment Analysis:Types Sentiment Classification – document level, classified as positive or negative Feature-based opinion – sentence level, determines which aspects of an object people like or dislike Comparative sentence and relationship mining – sentence level comparisons of one object against another (to determine which is better than the other)
31. Sentiment Analysis:Which type is best? From one type to the next (classification, features, comparisons), it becomes more complex to extract the information needed to perform the analysis. However, once extracted, standard text mining techniques can be used to classify and compare the opinions expressed in the documents, statements, sentences, and phrases. Simple techniques (like naïve Bayesian) often produce excellent results (e.g. 80+% accuracy)
32. Text Mining and Analytics:Applications JetBlue Airways Uses Attensity to analyze the large volume of e-mail messages it receives from customers. By matching specific comments and comment patterns with structured data, airline personnel can solve problems rapidly, before they jeopardize the carrier's satisfaction rating. Rosetta Stone Uses IBM SPSS text analytics software to analyze answers to open-ended questions in surveys of current and potential customers. Combines text analysis along with other identification information (e.g. products purchased, demographics) to drive decisions on advertising, marketing and product development as well as strategic planning. Gaylord Hotels Uses Clarabridge software to make sense of thousands of customer satisfaction surveys gathered each day Spots positive and negative comments that helps track trends in customer satisfaction and spot problems -- as well as best practices -- tied to particular properties, departments or employees.
33. Text Mining:Clustering (Setting the Stage) A common problem: Establishing categories or topic structures for Free-form survey data Customer complaints/comments, incident reports and warranty claims Blogs and discussion forums Search results Common answer: Clustering
34. Text Mining:Clustering (Defined) The unsupervised, automated grouping of records, observations, or cases into classes of similar objects called clusters. Similarities are stronger within clusters than between them (i.e. distances are shorter). [Figure: a document collection passes through a clustering algorithm to form clusters C1, C2, C3, plotted on axes Freq W1 and Freq W2]
35. Text Mining:Clustering (Measuring Distance) In a term-doc matrix treat the docs as vectors and the topics as variables and measure the distance/similarity between them. Euclidean Distance: SQRT(Sum((Xi-Yi)^2)) [Figure: documents D1, D2, D3 plotted on axes T1 and T2]
36. Text Mining:Clustering (Measuring Distance) Squared Euclidean: Sum of squared differences City Block or Manhattan: Sum of absolute differences Minkowski: hth root of the sum of absolute differences raised to the hth power Matching Distance: For binary – number of (mis)matches divided by number of comparisons (like Jaccard Similarity) Correlation: 1 – 2r where r is corr. coeff. Cosine: angle between the vectors
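Several of the measures listed on these two slides can be written in a few lines each. The sketch below is illustrative: the function names are chosen for the example, vectors are plain tuples of equal length, and the cosine function returns the similarity (the cosine of the angle) rather than a distance.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """City block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, h):
    """h-th root of the sum of absolute differences raised to the h-th power."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def matching_distance(x, y):
    """For binary vectors: number of mismatches divided by number of comparisons."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
```

Minkowski with h = 1 reduces to the city block distance and with h = 2 to the Euclidean distance.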
37. Text Mining:Clustering Methods Hierarchical: produces a tree-like structure of clusters (divisive and agglomerative). Partitioning: organizes objects into k partitions (k<=n) where each partition is a cluster.
39. Text Mining:Clustering (Simple Example) T1 - The Neatest Guide to Stock Market Investing; T2 - Investing For Dummies, 4th Edition; T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of StockMarket Returns; T4 - The Book of ValueInvesting; T5 - ValueInvesting: From Graham to Buffett and Beyond; T6 - RichDad'sGuide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!; T7 - Investing in RealEstate, 5th Edition; T8 - StockInvesting For Dummies; T9 - RichDad's Advisors: The ABC's of RealEstateInvesting: The Secrets of Finding Hidden Profits Most Investors Miss. Focused on (exact) indexed words: a word is indexed if it appears in at least 2 titles and is not a stop word.
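The indexing rule on this slide (keep a word if it appears in at least 2 titles and is not a stop word) can be sketched as follows. This is an illustrative assumption of how such a term-doc matrix might be built, not the procedure used in the lecture: the tokenizer is deliberately crude, the stop-word list is a small made-up sample, and the three demo titles are written with ordinary spacing.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the lecture's list).
STOP_WORDS = {"the", "to", "for", "of", "in", "and", "a", "an", "from"}

def tokenize(title):
    """Lowercase the title and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())

def build_term_doc_matrix(titles, min_docs=2):
    """Return (vocab, matrix), where matrix[d][t] counts vocab[t] in title d."""
    tokenized = [tokenize(t) for t in titles]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per title
    vocab = sorted(
        w for w, df in doc_freq.items()
        if df >= min_docs and w not in STOP_WORDS
    )
    matrix = [[tokens.count(w) for w in vocab] for tokens in tokenized]
    return vocab, matrix

titles = [
    "Investing For Dummies",
    "Stock Investing For Dummies",
    "The Book of Value Investing",
]
vocab, matrix = build_term_doc_matrix(titles)
print(vocab)   # indexed words
print(matrix)  # one row of counts per title
```

Each row of the matrix is a document vector that the distance measures and clustering methods on the neighbouring slides can consume directly.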
40. Text Mining:Clustering Method - Hierarchical Calculate distances between docs. Select the 2 closest docs and put them into a cluster. Now determine the closest doc among the remaining individual docs and existing clusters [utilizing either single (nearest), complete (farthest) or average linkage]. Repeat the process until a single cluster is formed. [Figure: Level Plot]
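The agglomerative steps above can be sketched in plain Python. This is a minimal, unoptimized illustration under stated assumptions: docs are given as numeric vectors, distance is Euclidean, and the function names (`linkage_distance`, `agglomerative`) are hypothetical. It records each merge with its distance, which is the information a level plot or dendrogram would display.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(c1, c2, vectors, linkage):
    """Distance between two clusters (tuples of doc indices)."""
    dists = [euclidean(vectors[i], vectors[j]) for i in c1 for j in c2]
    if linkage == "single":      # nearest pair of members
        return min(dists)
    if linkage == "complete":    # farthest pair of members
        return max(dists)
    if linkage == "average":     # mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError("unknown linkage: %s" % linkage)

def agglomerative(vectors, linkage="single"):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], vectors, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical doc vectors: two tight pairs that are far from each other.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
for left, right, dist in agglomerative(points, "single"):
    print(left, right, round(dist, 3))
```

The choice of linkage only changes how the distance between two clusters is measured; the merge loop itself stays the same.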
42. Text Mining:Clustering Method – K-Means Determine the number of clusters “k” (k<=n). Randomly select k docs to be the initial cluster center locations (centroids). Repeat until termination: for each doc, calculate the (Euclidean) distance from each center location and assign the doc to the cluster with the nearest center; for every cluster, recompute the centroid based on its current members; check for termination (minimal or no changes in doc assignments). Return the list of clusters.
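The K-Means loop described on this slide can be sketched as follows. It is a minimal illustration, not the lecture's implementation: the `seed` and `max_iter` parameters are assumptions added for reproducibility and safety, and a cluster that loses all its members simply keeps its previous centroid.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(vectors, k, seed=None, max_iter=100):
    """Return (clusters, centroids); clusters hold doc indices."""
    rng = random.Random(seed)
    # Randomly pick k docs as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign every doc to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        # Terminate when no doc changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    clusters = [
        [i for i, a in enumerate(assignments) if a == c] for c in range(k)
    ]
    return clusters, centroids

# Hypothetical doc vectors: two well-separated groups.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
clusters, centroids = k_means(points, k=2, seed=0)
print(clusters)
print(centroids)
```

Because the starting centroids are random, the order of the returned clusters can differ between runs, and on less well-separated data the grouping itself can differ too.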
43. Text Mining:Clustering (K-Means Example) Cluster 1 (T1, T3): T1 - The Neatest Guide to Stock Market Investing; T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of StockMarket Returns. Cluster 2 (T6, T7, T9): T6 - RichDad'sGuide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!; T7 - Investing in RealEstate, 5th Edition; T9 - RichDad's Advisors: The ABC's of RealEstateInvesting: The Secrets of Finding Hidden Profits Most Investors Miss. Cluster 3 (T2, T4, T5, T8): T2 - Investing For Dummies, 4th Edition; T4 - The Book of ValueInvesting; T5 - ValueInvesting: From Graham to Buffett and Beyond; T8 - StockInvesting For Dummies.
51. Text Mining:Clustering Process Many people imagine that it will produce neatly separated clusters like those that (appear in relatively simple examples), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from “find the natural clusters in the data” to “organize the cases into groups that are similar in some way.” Cook and Swayne, Interactive and Dynamic Graphics for Data Analysis
52. Text Mining:Real World Clustering Example “Text Mining Warranty and Call Center Data: Early Warning for Product Quality Awareness” (Wallace & Cermack, SUGI29, 2004). Goal: develop an early-warning alerting system for product quality problems (for American Honda Motors). Problem: most of the information is in text documents. Warranty: when dealers complete warranty service claims, a comment field is available to further describe the problem. Customer Relations: the call center logs parts of conversations and written communications with customers. Techline: calls from dealer service technicians to specialized mechanics create more text data.
54. Text Mining:Real World Clustering Example Changes in cluster size; appearance of new words; changes in shape; alerts.
55. Text Mining:Real World Clustering Example SAS Warranty Analysis 4.2: integrated warranty business rules; emerging issues; drill-to from emerging issues; drill on multiple points; analyze by alert; ad hoc analysis; advanced warranty analysis.
59. Text Mining:Information Extraction (Goals) Information Extraction (IE): a type of IR. The goal is to automatically extract structured information (e.g. entities, concepts and topics), i.e. contextually and semantically well-defined data, from unstructured text, usually from a well-defined domain (sometimes called content analysis). Named-Entity Recognition: a subtask of IE, also known as entity identification and entity extraction. It seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons, organizations, locations, dates, quantities, monetary values, percentages and so on). The end goal is usually to fill in templates codifying the extracted information (e.g. entity relationship structures <entity><rel><entity>).
63. Information Extraction:Process (Part-of-Speech Tagging) Part-of-speech tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on. Variety of tagging strategies, most of which are “trainable.”
64. Information Extraction:Process (Part-of-Speech Tagging) Sample sentence: “The pilot had to bank the plane because it was headed right for the downtown branch bank which was located next to the river bank.” Taggers (examples): N-Gram Taggers (sequences of N words): Trigram, Bigram, Unigram. Training employs training and test sets like other classification systems, and utilizes various classification algorithms for training and then for the actual classification.
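To make the idea of a trainable N-gram tagger concrete, here is a minimal sketch of the simplest case, a unigram tagger that assigns each word its most frequent tag in the training data. It is an illustration only: the hand-tagged training sentences, the function names and the default tag "NN" for unseen words are all assumptions, not material from the lecture or any toolkit's API.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Map each (lowercased) word to its most frequent tag in the training set."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default_tag="NN"):
    """Tag a list of words; unseen words receive the default tag."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]

# Hypothetical hand-tagged training data built around the word "bank".
training = [
    [("the", "DT"), ("pilot", "NN"), ("had", "VBD"), ("to", "TO"),
     ("bank", "VB"), ("the", "DT"), ("plane", "NN")],
    [("the", "DT"), ("river", "NN"), ("bank", "NN")],
    [("the", "DT"), ("branch", "NN"), ("bank", "NN")],
]
model = train_unigram_tagger(training)
print(tag(["to", "bank", "the", "plane"], model))
```

Because a unigram tagger ignores context, it tags "bank" as a noun even after "to", where the training data used it as a verb. Bigram and trigram taggers address this by also conditioning on the tags of the preceding one or two words.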
65. Information Extraction:Process (Part-of-Speech Tagging) Sample sentence: CVS Caremark Corporation agreed to buy the Medicare Part D unit of Universal American Financial Corporation for about $1.25 billion. Tagged sentence: [('CVS', 'NNP'), ('Caremark', 'NNP'), ('Corporation', 'NNP'), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), ('Medicare', 'NNP'), ('Part', 'NNP'), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), ('Universal', 'NNP'), ('American', 'NNP'), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')]
66. Information Extraction:Process (Entity Recognition) Chunking: a basic technique which segments and labels multi-token sequences. Sequences are non-overlapping. Usually employs a combination of a “templated” grammar couched as regular expressions along with tagger & classification processes to do the segmenting. Simple example (NP chunker): grammar = "NP:{<DT>?<JJ.*>*<NN.*>+}"
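The NP grammar above reads: an optional determiner, any number of adjectives, then one or more nouns. The sketch below emulates that rule with Python's standard `re` module applied to the sequence of tags. It is a stand-in written for illustration, not the toolkit's own chunk parser, and the names `NP_PATTERN` and `np_chunks` are hypothetical. The input is the tagged CVS Caremark sentence from slide 65.

```python
import re

# Regex equivalent of the tag pattern <DT>?<JJ.*>*<NN.*>+ over a tag string.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[^>]*>)*(?:<NN[^>]*>)+")

def np_chunks(tagged):
    """Return the non-overlapping noun-phrase chunks as lists of words."""
    # Encode the tags as one string, e.g. "<DT><NN><IN>...".
    tag_string = "".join("<%s>" % t for _, t in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each "<" marks one token, so counting them maps back to indices.
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        chunks.append([word for word, _ in tagged[start:start + length]])
    return chunks

tagged_sentence = [
    ("CVS", "NNP"), ("Caremark", "NNP"), ("Corporation", "NNP"),
    ("agreed", "VBD"), ("to", "TO"), ("buy", "VB"), ("the", "DT"),
    ("Medicare", "NNP"), ("Part", "NNP"), ("D", "NNP"), ("unit", "NN"),
    ("of", "IN"), ("Universal", "NNP"), ("American", "NNP"),
    ("Financial", "NNP"), ("Corporation", "NNP"), ("for", "IN"),
    ("about", "IN"), ("$", "$"), ("1.25", "CD"), ("billion", "CD"),
]
for chunk in np_chunks(tagged_sentence):
    print(" ".join(chunk))
```

On this sentence the rule groups the two company names and "the Medicare Part D unit" into chunks, which shows both the appeal and the limit of the approach: the chunks are plausible entity candidates, but nothing yet says which of them are organizations.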
68. Information Extraction:Process (Entity Recognition) Named Entity Recognition: identify all textual mentions of the named entities. It is hard to rely on precompiled lists of names, locations, … especially in dynamically changing domains. A starting point is provided by the “named” entity chunkers found in toolkits like NLTK.
71. Text Mining & Analysis:Tools kdnuggets.com/software/text.html digitalresearchtools.pbworks.com/
72. Text Mining and Analysis:Lessons Learned There are practical applications in business, scientific and government arenas with substantial payback. Text can be analyzed with many of the same analytical (data mining) techniques applied to structured data, although the text must first be transformed into structured data for this to occur. Many practical applications of text analysis and mining rest on treating documents as a “bag of words” and on utilizing simpler rather than more complex mining techniques. These simpler techniques often have the same payoffs as more complex techniques.