NE7012- SOCIAL NETWORK ANALYSIS

PREPARED BY: A. RATHNADEVI, A.V.C COLLEGE OF ENGINEERING

UNIT V TEXT AND OPINION MINING
Text Mining in Social Networks -Opinion extraction – Sentiment classification and clustering -
Temporal sentiment analysis - Irony detection in opinion mining - Wish analysis – Product
review mining – Review Classification – Tracking sentiments towards topics over time
5.1 Text Mining in Social Networks
5.1.1 Text mining definition
 The objective of text mining is to exploit the information contained in textual documents in
various ways, including the discovery of patterns and trends in data, associations among
entities, and predictive rules.
 The results can be important both for:
 the analysis of the collection, and
 providing intelligent navigation and browsing methods
5.1.2 Text mining pipeline
5.1.3 Motivation for Text Mining
 Approximately 90% of the world’s data is held in unstructured formats (source:
Oracle Corporation)
 Information-intensive business processes demand that we move beyond simple
document retrieval to “knowledge” discovery.
 The justification for the interest in text mining is the same as for the interest in
knowledge retrieval (search and categorization).
 The sheer amount of unstructured data (mostly textual) calls for more than just
document retrieval. Tools and techniques exist to mine this data and realize value in the
same way that data mining taps structured data for business intelligence and knowledge
discovery.
5.1.4 Text mining process
 Text preprocessing
- Syntactic/Semantic text analysis
 Feature Generation
- Bag of words
 Feature Selection
- Simple counting
- Statistics
 Text/Data Mining
- Classification- Supervised learning
- Clustering- Unsupervised learning
 Analyzing results
- Mapping/Visualization
- Result interpretation
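The first three stages above (preprocessing, bag-of-words feature generation, and feature selection by simple counting) can be illustrated with a minimal sketch. The tokenizer, the `min_docs` threshold and the three toy documents are assumptions made for illustration only; they are not part of the original notes.

```python
import re
from collections import Counter

def tokenize(text):
    """Text preprocessing: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(docs):
    """Feature generation: one term-frequency Counter per document."""
    return [Counter(tokenize(d)) for d in docs]

def select_features(bags, min_docs=2):
    """Feature selection by simple counting: keep terms that occur
    in at least `min_docs` documents (threshold is an assumption)."""
    doc_freq = Counter(term for bag in bags for term in bag)
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)

def vectorize(bags, vocab):
    """Turn each bag into a fixed-length count vector over the vocabulary."""
    return [[bag[t] for t in vocab] for bag in bags]

# Toy documents (hypothetical).
docs = [
    "The battery life is great",
    "Great screen, poor battery",
    "The screen is poor",
]
bags = bag_of_words(docs)
vocab = select_features(bags)
vectors = vectorize(bags, vocab)
print(vocab)
print(vectors)
```

The resulting vectors are what a classification or clustering step would consume in the "Text/Data Mining" stage.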
5.1.5 Challenges in text mining
 The data collection is “free text” and is not well organized (semi-structured or unstructured)
 There is no uniform access over all sources; each source has separate storage and algebra,
examples: email, databases, applications, web
 A quintuple heterogeneity: semantic, linguistic, structure, format, size of the unit of information
 Learning techniques for processing text typically need annotated training data
 XML as the common model allows:
o Manipulating data with standards
o Mining to become more like data mining
o RDF is emerging as a complementary model
 The more structure you can explore, the better you can do mining
5.1.6 Text mining actors

5.1.7 Text mining tasks
5.1.8 Applications of Text Mining
 Keyword Search
 Classification
 Clustering
 Linkage-based Cross Domain Learning
5.1.8.1 Keyword Search
 A simple but user-friendly interface for information retrieval on the Web.
 It also proves to be an effective method for accessing structured data.
 The challenges lie in three aspects:
o Query semantics
o Ranking strategy
o Query efficiency
Keyword Search Algorithms
 Query Semantics and Answer Ranking
 Keyword search over XML and relational data
 Keyword search over graph data
5.1.8.2 Classification Algorithms
 Content-based text classification
o Naive Bayes classifier, TF-IDF classifier and Probabilistic Indexing classifier
 Challenges in the context of text classification:
o Social networks contain a much larger and non-standard vocabulary
o The labels in social networks may often be quite sparse
o The use of content can greatly improve the effectiveness of the link-based
classification process
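As an illustration of content-based text classification, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The function names, the smoothing choice and the four training snippets are assumptions made for this example; the notes do not prescribe a particular implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the model counts."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the class with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # add-one smoothing avoids zero probabilities
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training = [
    ("great match tonight".split(), "sports"),
    ("team wins the match".split(), "sports"),
    ("new phone released".split(), "tech"),
    ("phone battery update".split(), "tech"),
]
model = train_naive_bayes(training)
print(classify("the team match".split(), model))
print(classify("phone update".split(), model))
```

With the sparse labels and non-standard vocabulary typical of social networks, many test words fall outside the training vocabulary, which is one reason link information is often combined with content.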
5.1.8.3 Clustering Algorithms
 Related to the traditional problem of graph partitioning
 The problem of graph partitioning is NP-hard and often does not scale very well to large
networks.
 Methods:
o The Kernighan-Lin algorithm
o Link-based clustering
o Clustering graph streams
 These methods use only the structure of the network for the clustering process.
 The quality of clustering can be improved by using the text content in the nodes of the social
network.
 Content-based methods use a number of variants of traditional clustering algorithms for
multi-dimensional data.
 Most of these methods are variants of the k-means method
o Start off with a set of k seeds and build the clusters iteratively around these seeds.
o The seeds and cluster membership are iteratively defined with respect to each
other, until we converge to an effective solution.
 Clustering can also be performed with the use of both content and structure information.
 One such approach constructs a new graph which takes into account both the structure and
attribute information.
 Such a graph has two kinds of edges:
 structure edges from the original graph, and
 attribute edges, which are based on the nature of the attributes in the different nodes.
 A random walk approach is used over this graph in order to define the underlying
clusters.
 Each edge is associated with a weight, which is used in order to control the probability of
the random walk across the different nodes.
 These weights are updated during an iterative process, and the clusters and the weights
are successively used in order to refine each other.
 The weights and the clusters naturally converge as the clustering process progresses.
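Returning to the k-means variants mentioned earlier in this subsection, the seed-based iteration can be sketched as follows. This is a plain k-means on toy 2-D points with explicitly chosen seeds; in a real setting the points would be text feature vectors of the nodes, and the data and seed choice here are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, seeds, max_iter=100):
    """Alternate between assigning points to the nearest seed and
    recomputing each seed as the mean of its members, until the
    assignment stops changing."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old seed if a cluster becomes empty
                centroids[c] = mean(members)
    return assignment, centroids

# Toy data (hypothetical): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = k_means(points, seeds=[points[0], points[3]])
print(assignment)
print(centroids)
```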
5.2 Sentiment analysis
5.2.1 Introduction

 Sentiment analysis (opinion mining): Computational and automatic study of people’s
opinions expressed in written language or text.
 Two types of information are in text data:
 Objective information: facts.
 Subjective information: opinions.
 The focus of sentiment analysis:
 The subjective part of text → identify opinionated information rather than mine and retrieve
factual information.
 Sentiment analysis brings together various fields of research: text mining, Natural
Language Processing, Data mining.
5.2.2 APPLICATIONS
 Review summarizations.
- Review-oriented search engines.
- Search for people’s opinions: What do people think about the iPhone 5s?
 Recommendation systems.
- If you can do sentiment analysis, then the recommendation system can recommend
items with positive feedback and not recommend items with negative feedback.
 Information extraction systems.
- These systems focus on objective parts to extract factual information.
- They can discard subjective sentences.
 Question-answering systems.
- Different types of questions: definitional and opinion oriented questions.
- Both individuals and organizations can take advantage of sentiment analysis.
5.2.3 Levels Of Sentiment Analysis
 Document level
- Identify the opinion orientation of the whole document.
 Sentence level
- Identify whether the sentence is subjective or objective.
- Identify the opinion orientation of subjective sentences.
 Aspect level
- Identify the aspects that the users are commenting on.
- Identify the opinion orientation about each aspect.
5.2.4 System process

5.2.5 ASPECT IDENTIFICATION
 Using clustering to find similar sentences.
 It is likely that similar sentences are about similar aspects.
 For sentence clustering the method that we use for representing each sentence is
important.
 The major reason that regular clustering algorithms did not work (Gamon et al. [2005]) is
the lack of a proper method to represent each sentence.
 Sentence representation
 BOW (bag of words) representation: considers all terms in the sentence.
 BON (bag of nouns) representation: considers only the nouns of the sentence.
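The difference between the two sentence representations can be made concrete with a small sketch. It assumes the sentences have already been part-of-speech tagged (the tags below are written by hand, not produced by a tagger), and it uses cosine similarity as the sentence similarity measure, which is an assumption rather than something the notes specify.

```python
import math
from collections import Counter

# Hypothetical pre-tagged sentences: lists of (token, part-of-speech) pairs.
s1 = [("the", "DET"), ("battery", "NOUN"), ("lasts", "VERB"), ("long", "ADV")]
s2 = [("i", "PRON"), ("love", "VERB"), ("the", "DET"), ("battery", "NOUN")]
s3 = [("the", "DET"), ("screen", "NOUN"), ("looks", "VERB"), ("dull", "ADJ")]

def bow(tagged_sentence):
    """Bag of words: every term in the sentence."""
    return Counter(tok for tok, _ in tagged_sentence)

def bon(tagged_sentence):
    """Bag of nouns: only the terms tagged as nouns."""
    return Counter(tok for tok, pos in tagged_sentence if pos == "NOUN")

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(bow(s1), bow(s2)), cosine(bow(s1), bow(s3)))
print(cosine(bon(s1), bon(s2)), cosine(bon(s1), bon(s3)))
```

In this toy case the BOW vectors make the battery sentence look partly similar to the screen sentence because both contain "the", while the BON vectors separate the two aspects cleanly.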
5.2.6 Sentiment Identification
 The machine learning approach treats sentiment identification as a classification
problem and makes use of manually labeled training data.
 Two major tasks in designing a classifier
 Feature extraction: come up with a set of features that represents your problem properly.
 Classifier selection: choose a classifier among KNN, Naïve Bayes, SVM, Maximum
Entropy.
 Our approaches are related to feature extraction steps.
 Support Vector Machines are widely used in text classification. We use SVM as well.

5.2.7 Sentiment classification
 Classify sentences/documents (e.g. reviews)/features based on the overall sentiments
expressed by authors
o positive, negative and (possibly) neutral
 Similar to topic-based text classification
o Topic-based classification: topic words are important
o Sentiment classification: sentiment words are more important (e.g: great,
excellent, horrible, bad, worst)
 In summary, approaches used in sentiment classification
o Unsupervised – e.g. NLP patterns, or NLP patterns with a lexicon
o Supervised – e.g. SVM, Naive Bayes, etc. (with varying features such as POS tags and
word phrases)
o Semi-supervised – e.g. lexicon + classifier
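A minimal sketch of the unsupervised, lexicon-based approach is given below. The lexicon contains only the example sentiment words listed above; the scoring rule (count of positive words minus count of negative words) is an assumption, and the sketch deliberately ignores negation and intensity.

```python
import re

# Tiny lexicon built from the example sentiment words in the notes.
POSITIVE = {"great", "excellent"}
NEGATIVE = {"horrible", "bad", "worst"}

def classify_sentiment(text):
    """Label a text positive, negative or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("The camera is great and the screen is excellent"))
print(classify_sentiment("Worst battery ever"))
print(classify_sentiment("It arrived on Monday"))
```

A semi-supervised variant could use the labels produced this way as initial training data for a classifier.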
1) Supervised Learning
 Supervised learning (also called classification) is one of the major tasks in research
areas such as machine learning, artificial intelligence, and data mining.
 A supervised learning algorithm first trains a classifier (or inferred function)
by analyzing the given training data, and then classifies (assigns class labels to) the test
data.
 A typical example of supervised learning in web mining: given many web pages with
known labels (i.e., topics in Yahoo!), how can labels be assigned automatically to
new web pages?
 Some of the most commonly used techniques for supervised learning are listed below.
Many other strategies and algorithms exist.
 Nearest Neighbor Classifiers
 Decision Tree
 Bayesian Classifiers
 Neural Networks Classifier
2) Unsupervised Learning
 This section introduces the major techniques of unsupervised learning (or
clustering).
 Among the large number of approaches that have been proposed, three
representative unsupervised learning strategies are k-means, hierarchical clustering and
density-based clustering.

3) Semi-supervised Learning
 The previous two sections covered learning from labeled data
(supervised learning or classification) and from unlabeled data (unsupervised learning or
clustering).
 This section presents the basic learning techniques used when both kinds
of data are available.
 The intuition is that a large amount of unlabeled data is easy to obtain (e.g., pages crawled
by Google), yet only a small part of it can be labeled because of resource limitations.
 This line of research is called semi-supervised learning (or semi-supervised classification);
it aims to build better classifiers by using a large amount of unlabeled data together
with the labeled data.
 Many approaches have been proposed for semi-supervised classification; representative
ones are self-training, co-training, generative models, and graph-based methods.
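Of the approaches listed, self-training is the simplest to sketch: train on the labeled data, label the unlabeled items the classifier is confident about, add them to the training set, and repeat. The sketch below uses a nearest-centroid classifier and a distance-margin confidence rule; both choices, the `margin` parameter and the 1-D toy data are assumptions made for illustration. It expects at least two classes.

```python
def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """labeled: dict mapping label -> list of points (two or more labels).
    unlabeled: list of points. Returns the grown labeled set and leftovers."""
    labeled = {lab: list(pts) for lab, pts in labeled.items()}
    pool = list(unlabeled)
    for _ in range(max_rounds):
        # "Train": recompute one centroid per class from the current labels.
        centroids = {lab: centroid(pts) for lab, pts in labeled.items()}
        remaining, added = [], 0
        for p in pool:
            ranked = sorted(centroids, key=lambda lab: distance(p, centroids[lab]))
            best, runner_up = ranked[0], ranked[1]
            # Confident only if the best class is closer by at least `margin`.
            gap = distance(p, centroids[runner_up]) - distance(p, centroids[best])
            if gap >= margin:
                labeled[best].append(p)
                added += 1
            else:
                remaining.append(p)
        pool = remaining
        if added == 0:
            break  # nothing new was labeled; stop
    return labeled, pool

# Toy 1-D data (hypothetical): one labeled example per class.
seed_labels = {"pos": [(0.0,)], "neg": [(10.0,)]}
unlabeled_points = [(1.0,), (9.0,), (4.0,), (5.0,)]
grown, leftover = self_train(seed_labels, unlabeled_points)
print(grown)
print(leftover)
```

The point at 5.0 is equidistant from both classes in the first round and is left unlabeled; after the centroids shift in the second round it can be labeled, which is the effect self-training relies on.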
5.3 Temporal sentiment analysis
5.3.1 Overview
 The method produces a topic graph and a sentiment graph by using sentiment phrases, which
are patterns of sentiment expression such as “happy” or “delighted at”.
 We extracted 383 sentiment phrases from Japanese news articles manually, and classified
them into eight categories: anxiety, sorrow, anger, happiness, suffering, fatigue,
complaint, and shock.

5.3.2 Procedure for Making a Topic Graph
The procedure for making a topic graph is as follows.
Given: a sentiment category S specified by the user, and a period of time D = (d1, d2, …, dl).
Step 1: For each day di in D, retrieve the articles containing sentiment phrases of category S.
Step 2: Extract keywords from the retrieved articles using a keyword extraction system called
GENSEN-Web, which can extract compound nouns as keywords.
Step 3: For each extracted keyword wj (j = 1, 2, …, N), calculate the average correlation c between
wj and the sentiment phrases contained in S. We use the Dice coefficient for calculating correlation.
Step 4: Extract the top n keywords according to the score defined by the product of (1) the number of
days on which the keyword appears, (2) the inverse frequency of the number of days, and (3) the score
provided by GENSEN-Web.
Step 4’ (optional): Put keywords into clusters based on the correlation
coefficient over the timeline and the Dice coefficient within an article.
Step 5: Generate a temporal graph for each of the n keywords (or clusters). For readability of the
graph, we apply a moving average.
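Two building blocks of this procedure, the Dice coefficient (Step 3) and the moving average (Step 5), can be sketched as follows. The Dice coefficient is computed here over sets of article identifiers, and the moving average uses a trailing window of three days; the set-based formulation, the window shape and size, and the toy data are assumptions, since the notes do not give these details.

```python
def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two sets,
    e.g. articles containing a keyword vs. articles containing a phrase."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def moving_average(values, window=3):
    """Trailing moving average; early days use the shorter available window."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical article ids and daily keyword counts.
keyword_articles = {1, 2, 3, 5}
phrase_articles = {2, 3, 4}
correlation = dice(keyword_articles, phrase_articles)
daily_counts = [0, 3, 6, 3, 0]
smoothed = moving_average(daily_counts)
print(correlation)
print(smoothed)
```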
5.3.4 Procedure for Making a Sentiment Graph
The procedure for making a sentiment graph is as follows.
Given: a keyword w specified by the user, and a period of time D = (d1, d2, …, dl).
Step 1: Retrieve the articles containing keyword w for each day di (i = 1, 2, …, l).
Step 2: For each article, calculate the sum of the frequencies of sentiment phrases for each sentiment
category.
Step 3: Generate a temporal graph of the frequency of sentiment phrases for each sentiment
category. Then, a moving average is applied to the graph.
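Steps 1 and 2 of the sentiment graph procedure amount to counting, per day and per category, the sentiment phrases that occur in articles mentioning the keyword. A minimal sketch follows. The phrase list is a tiny stand-in for the 383-phrase lexicon: “happy” and “delighted at” come from the notes, while “worried about”, the phrase-to-category mapping and the sample articles are hypothetical. Plain substring matching is used, which would also count “happy” inside “unhappy”.

```python
from collections import Counter

# Stand-in lexicon: phrase -> sentiment category (mapping is an assumption).
SENTIMENT_PHRASES = {
    "happy": "happiness",
    "delighted at": "happiness",
    "worried about": "anxiety",
}

def sentiment_series(articles_by_day, keyword):
    """articles_by_day: one list of article texts per day.
    Returns one Counter per day mapping category -> phrase frequency,
    counted only over articles that contain the keyword."""
    series = []
    for articles in articles_by_day:
        counts = Counter()
        for text in articles:
            lowered = text.lower()
            if keyword.lower() not in lowered:
                continue  # Step 1: keep only articles containing the keyword
            for phrase, category in SENTIMENT_PHRASES.items():
                counts[category] += lowered.count(phrase)  # Step 2
        series.append(counts)
    return series

# Hypothetical two-day article collection.
days = [
    ["Fans are happy about the election result.", "Markets closed higher."],
    ["Voters worried about the election are happy to wait.",
     "The election: many delighted at turnout."],
]
series = sentiment_series(days, "election")
print(series)
```

Plotting each category's daily values over time, after smoothing, gives the sentiment graph of Step 3.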

5.4 Irony detection in opinion mining
 In video/spoken discourse, especially in a conversational context, we are usually able to
detect a variety of external clues (e.g. facial expression, intonation, pause duration) that
enable the perception of irony. In written text, a set of more or less explicit linguistic
strategies is also used to express irony. In the next subsections, we describe eight
linguistic patterns that we have previously identified to be related to the expression of
irony (Table 1). Some are specific to Portuguese (e.g. morphological patterns) while
others seem to be language independent (e.g. emoticons).
1. P𝑑𝑖𝑚: Diminutive Forms
Diminutives are commonly used in Portuguese, often with the purpose of expressing
positive sentiments, like affect, tenderness and intimacy. However, they can also be
sarcastically and ironically used for expressing an insult or depreciation towards the
entity they represent. This is especially so when diminutives are found in named entities (NE)
referring to well-known personalities, such as political entities (e.g. “Socratezinho” for the
current Portuguese prime minister, José Sócrates).
2. P𝑑𝑒𝑚: Demonstrative Determiners
In Portuguese, the occurrence of any demonstrative form – namely, “este” (this), “esse”
and “aquele” (that) – before a human NE usually indicates that such entity is being
negatively or pejoratively mentioned. In some cases, demonstratives (DEM) are the
only explicit clue that signals the presence of irony (e.g. “Este Sócrates é muito
amigo do Sr. Jack” / “This Sócrates is a very good friend of Mr. Jack”).
3. P𝑖𝑡𝑗 : Interjections
Interjections abound in subjective texts, particularly in user-generated content (UGC), carrying
valuable information concerning authors’ emotions, feelings and attitudes. We believe that some
interjections can be used as potential clues for irony detection, when they appear in
specific contexts, such as the ones represented in the pattern P𝑖𝑡𝑗. Since we are especially
interested in recognizing irony in prior positive text, we confined our analysis to a small
set of interjections that are commonly used to express positive sentiments, namely:
“bravo”, “força”, “muito obrigado/a”, “obrigado/a”, “obrigadinho/a”, “parabéns”,
“muitos parabéns” and “viva”.
4. P𝑣𝑒𝑟𝑏: Verb Morphology
The type of pronoun used for addressing people can also be an important clue for irony
detection in UGC, especially in languages like Portuguese, where the choice of a specific
pronoun or way of expression (e.g. “tu” vs. “você”, both translatable by “you”) may
depend on the degree of proximity/familiarity between the speaker and the NE it refers
to. The pronoun “tu” is used in a familiar context (e.g. with friends and family). In our
experiments, we analyze to what extent the use of the pronoun “tu” for addressing a
well-known named entity can be used as a clue for irony detection in UGC. As represented
in P𝑣𝑒𝑟𝑏, the pronoun can be either explicitly mentioned in the text or it can be embedded
in the morphology of the verb (which is in the second-person singular). We confined the
analysis to the verb “ser” (to be).
5. P𝑐𝑟𝑜𝑠𝑠: Cross-constructions
In Portuguese, evaluative adjectives with a prior positive or neutral polarity usually take a
negative or ironic interpretation whenever they appear in cross-constructions, where
adjectives relate to the noun they modify through the preposition “de” (e.g. “O comunista
do ministro” / “The communist of the minister”) [2]. Pattern P𝑐𝑟𝑜𝑠𝑠 recognizes
cross-constructions headed by a positive or neutral adjective (ADJ𝑝𝑜𝑠 or ADJ𝑛𝑒𝑢𝑡,
respectively), which modify a human NE. Adjectives are preceded by a demonstrative
(DEM) or an article (ART) determiner.
6. P𝑝𝑢𝑛𝑐𝑡: Heavy Punctuation
In UGC, punctuation is frequently used both for verbalizing users’ immediate emotions and
feelings and for intentionally signaling humorous or ironic text. We assume that the
presence in a sentence of a sequence composed of more than one exclamation point
and/or question mark can be used as a clue for irony detection.
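The heavy-punctuation clue translates directly into a regular expression: a run of two or more characters, each an exclamation point or a question mark. The sketch below is one possible reading of the pattern, not the authors' implementation; the function name and test sentences are made up for illustration.

```python
import re

# Two or more consecutive '!' and/or '?' characters.
HEAVY_PUNCT = re.compile(r"[!?]{2,}")

def has_heavy_punctuation(sentence):
    """True if the sentence contains a sequence of more than one
    exclamation point and/or question mark."""
    return HEAVY_PUNCT.search(sentence) is not None

print(has_heavy_punctuation("Great job, minister!!!"))
print(has_heavy_punctuation("Really?!"))
print(has_heavy_punctuation("Is that so?"))
```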
7. P𝑞𝑢𝑜𝑡𝑒: Quotation Marks
Quotation marks are also frequently used to express and emphasize an ironic content,
especially if the content has a prior positive polarity (e.g. positive adjective qualifying an
entity). In our experiments, we tried to find possible ironic sentences by searching for quoted
sequences composed of one or two words, at least one of which is a
positive adjective or noun.
8. P𝑙𝑎𝑢𝑔ℎ: Laughter Expressions
Internet slang contains a variety of widespread expressions and symbols that typically
represent a sensory expression, suggesting different attitudes or emotions. In our
experiments, we considered (i) the acronym “lol” and corresponding variations (LOL),
(ii) onomatopoeic expressions such as “ah”, “eh” and “hi” (AH) and (iii) the prior
positive emoticons “:)”, “;-)” and “:P” (EMO+). In this particular case, we did not
constrain the polarity of elements contained in the sentence. We assume that laughter
expressions are intrinsically positive or ironic.
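The three kinds of laughter clue can be approximated with regular expressions, as in the sketch below. The exact set of “variations” of “lol” and of the onomatopoeic forms is not given in the notes, so the patterns (repeated letters, repeated syllables such as “ahahah”) are assumptions. Note also that “hi” is treated as laughter because the patterns target Portuguese text; on English text it would match the greeting.

```python
import re

# Assumed patterns; only the emoticons are taken literally from the notes.
LOL = re.compile(r"\blo+l+\b", re.IGNORECASE)          # lol, LOL, loool
AH = re.compile(r"\b(?:ah|eh|hi)+\b", re.IGNORECASE)   # ah, eh, hi, ahahah
EMO_POS = re.compile(r":\)|;-\)|:P")                   # :)  ;-)  :P

def laughter_clues(sentence):
    """Return the names of the laughter clues found in the sentence."""
    clues = []
    if LOL.search(sentence):
        clues.append("LOL")
    if AH.search(sentence):
        clues.append("AH")
    if EMO_POS.search(sentence):
        clues.append("EMO+")
    return clues

print(laughter_clues("Grande ministro, lol :)"))
print(laughter_clues("ahahah, claro"))
print(laughter_clues("O ministro falou hoje."))
```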
5.5 Product review mining
5.5.1 Motivation
 E-commerce is expanding rapidly, with more and more products sold via online
portals (Amazon, eBay, …)
 Online product reviews thus become an important resource:
o Customers can share and find opinions about products easily
o Producers can obtain a certain degree of feedback
5.5.2 Related works
 Single-document summarization
o Extractive-based approach
 Sentence score + ranking
 Machine learning technique
o Abstractive-based approach
 Template
 Concept hierarchy
 Multi-document summarization
o Extractive-based approach
 Sentence score + ranking + MMR + Ordering
o Abstractive-based approach
 Template
 Concept hierarchy
 Sentence fusion with paraphrasing rules
 Sentiment analysis
o Reviews polarity classification
o PROS/ CONS identification
o Mining review opinions
 Identify product facets
 Identify opinion orientation on the facet
5.5.3 Process

5.5.4 Product facets identification
o Association rule mining
 Each transaction consists of nouns/noun phrases from single sentence
 The frequent itemsets are the candidate product facets
o Redundancy pruning
 Removing redundant facets that contain only single words (e.g. life -> battery life)
o Compactness pruning
 Removing meaningless facets that contain multiple words
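The facet identification steps can be sketched on a toy example. Frequent itemsets are found by brute-force counting of noun combinations (adequate for short sentences; a real system would use an association rule miner such as Apriori), and redundancy pruning is implemented with a "pure support" rule: a single word is dropped if it rarely occurs outside a frequent multi-word facet that contains it. The pure-support rule, both thresholds and the sample transactions are assumptions; compactness pruning is not shown because it needs word positions within the sentence.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count every noun combination up to max_size and keep the frequent ones.
    Each transaction is the list of nouns/noun phrases from one sentence."""
    counts = Counter()
    for nouns in transactions:
        unique = sorted(set(nouns))
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return {items: n for items, n in counts.items() if n >= min_support}

def redundancy_prune(frequent, transactions, min_pure_support=2):
    """Drop a single-word facet when it is covered by a frequent multi-word
    facet and appears on its own in fewer than min_pure_support sentences."""
    kept = {}
    for items, support in frequent.items():
        if len(items) == 1:
            word = items[0]
            supersets = [set(o) for o in frequent if len(o) > 1 and word in o]
            pure = sum(
                1 for nouns in transactions
                if word in nouns and not any(s <= set(nouns) for s in supersets)
            )
            if supersets and pure < min_pure_support:
                continue  # redundant, e.g. "life" given "battery life"
        kept[items] = support
    return kept

# Hypothetical noun transactions, one per review sentence.
transactions = [
    ["battery", "life"],
    ["battery", "life", "screen"],
    ["screen"],
    ["battery", "life"],
    ["screen", "price"],
    ["battery", "charger"],
    ["battery"],
]
frequent = frequent_itemsets(transactions)
facets = redundancy_prune(frequent, transactions)
print(sorted(facets))
```

Here "life" never appears without "battery" and is pruned, while "battery" is kept because it also occurs on its own.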
