2. Content
• Introduction to text mining in relation to social media
• Unique features of texts in social media
• Applying Text analytics in social media
• Example of text analytics in social media
3. Text Mining and Social Media
The picture here shows the 10 top sites that generate the most traffic. The majority fall under the social media umbrella.
Social media can therefore be described as a medium through which information and communication can be accessed, shared, and discussed.
4. Text Mining and Social Media
Category | Representative sites
Wiki | Wikipedia, Scholarpedia
Blogging | Blogger, LiveJournal, WordPress
Social News | Digg, Briefing, Mixx, Slashdot
Microblogging | Twitter, Google Buzz
Opinion & Reviews | Epinions, Yelp
Question Answering | Stack Overflow, Yahoo! Answers, Quora
Media Sharing | Flickr, YouTube
Social Bookmarking | Delicious, CiteULike
Social Networking | Facebook, LinkedIn, MySpace
The table shows the categories into which social media can be classified.
Social media comprises many types of services, which results in a variety of data formats.
Most of the information on social media sites is in text format.
5. Text Mining and Social Media
• Given the current interest in data mining techniques and in deriving business intelligence from data, a question arises for social media:
“How can I get valuable information from the texts on social media platforms?”
6. Unique features of texts in social media
• With different kinds of social media, the text has distinct characteristics, both in its form and in how it occurs.
• Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.
• This section hints at how to answer the previous question.
8. Unique features of texts in social media
• Text preprocessing: making the input more consistent to facilitate text representation. Text preprocessing methods include stop word removal and stemming.
• Feature generation / text representation: the most common approach is to transform texts into numeric vectors. This representation is called the bag-of-words (BOW) model or vector space model (VSM).
• Knowledge discovery: applying machine learning or data mining methods to discover patterns or insights.
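The first two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presentation's own implementation: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def naive_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def bag_of_words(docs):
    """Return a sorted vocabulary and one term-count vector per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors

docs = ["The users are posting reviews", "Users posted a review of the hotel"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['hotel', 'post', 'review', 'user']
print(vectors)  # [[0, 1, 1, 1], [1, 1, 1, 1]]
```

The resulting vectors are the input to the knowledge discovery step, for example a clustering or classification algorithm.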
9. Unique features of texts in social media
• Time Sensitivity
An important and common feature of many social media services is their real-time nature. Bloggers may update their posts every few days, but most social networking sites receive updates continually, often within minutes.
Because of this timeliness, the text in social media can no longer be treated as independent and identically distributed (i.i.d.) data.
10. Unique features of texts in social media
• Short Length
Short messages encourage user participation on social media sites, but they pose a great challenge for clustering and classification: unlike longer texts, they do not provide sufficient context information for effective similarity measures, which are the basis of many text processing methods.
Example: Twitter is limited to 140 characters and Windows Live Messenger to 512 characters, whereas Facebook allows 63,206 characters.
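The effect of short length on similarity measures can be seen with a small sketch. It assumes a plain bag-of-words cosine similarity over whitespace tokens, which is one common choice and not necessarily the measure intended in this presentation. Two short messages with related meaning but no shared words get a similarity of zero.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Related meaning, but no word in common: the measure sees nothing shared.
sim_related = cosine_similarity("great movie tonight", "awesome film this evening")

# Only literal word overlap raises the score.
sim_overlap = cosine_similarity("great movie tonight", "great movie")

print(sim_related)  # 0.0
print(round(sim_overlap, 3))
```

This is the semantic gap that the illustrative example later in the presentation tries to bridge with external knowledge.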
11. Unique features of texts in social media
• Unstructured Phrases
The main challenge posed by content on social media sites is that its quality has high variance, ranging from very high-quality items to low-quality ones. This can be attributed to people’s attitudes when posting a microblogging message or answering a question in a forum.
The difficulty is how to accurately identify the semantic meaning of a message in which more than one word has been abbreviated.
12. Applying Text analytics in social media
• Event detection:
• Event detection aims to monitor a data source and detect the occurrence of an event that is captured within that source.
• Collaborative question answering:
• Analyzing the differences between conversational questions and informational questions.
13. Illustrative Example.
• This example illustrates how to use text analytics to solve problems identified when applying it to social media.
• We want to improve the quality of short text representation by integrating semantic knowledge resources, which have been found useful in bridging the semantic gap.
• This has 3 steps:
• Seed Phrase Extraction
• Semantic Features Generation
• Feature Space Construction
14. Seed Phrase Extraction
• Problem Statement
• Given a sentence-level feature T = {t1, t2, …, tn}, where each ti is a phrase-level feature contained in T, the similarity between ti and the other features in {t1, t2, …, tn} is given by:
$$\mathrm{InfoScore}(t_i) = \sum_{j=1,\, j \neq i}^{n} \mathrm{sem}(t_i, t_j)$$
$$t^{*} = \arg\max_{t_i \in \{t_1, t_2, \ldots, t_n\}} \mathrm{InfoScore}(t_i)$$
where t* denotes the phrase-level feature with the highest InfoScore.
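The two formulas above can be sketched as follows. The presentation does not define sem, so the `sem` function here is a hypothetical stand-in (Jaccard overlap of the words in two phrases), and the example phrases are invented for illustration.

```python
def sem(phrase_a, phrase_b):
    """Hypothetical similarity: Jaccard overlap of the words in two phrases."""
    a = set(phrase_a.lower().split())
    b = set(phrase_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def info_score(i, phrases):
    """InfoScore(t_i): sum of sem(t_i, t_j) over all j != i."""
    return sum(sem(phrases[i], phrases[j]) for j in range(len(phrases)) if j != i)

def select_phrase(phrases):
    """Return t*, the phrase with the highest InfoScore."""
    best = max(range(len(phrases)), key=lambda i: info_score(i, phrases))
    return phrases[best]

# Invented phrase-level features for one short text.
phrases = ["text mining", "social media text mining", "social media", "weather"]
print(select_phrase(phrases))  # social media text mining
```

Any other similarity measure can be substituted for `sem` without changing the rest of the sketch.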
15. Semantic features Generation
• The seed phrases have now been extracted in the first step.
• This step aims to generate semantic features from the seed phrases. The seed phrases give us an informative and effective basic representation of the input text.
• We use Wikipedia as our target social media resource.
17. Feature Space Construction
• For the sake of data quality, effectiveness, and preserving valuable original information, we carry out 2 further basic steps in this process:
• Feature filtering, to filter out meaningless features
• Feature selection, to avoid aggravating the “curse of dimensionality”
18. Feature Space Construction
• Feature Filtering
For the Wikipedia example, we formulate rules to refine the unstructured features. Some possible rules:
Remove features generated from overly general seed phrases.
Transform features, e.g. “List of hotels” >>> “hotels”.
Remove features related to chronology.
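The three rules above could be applied as in the sketch below. The list of general seed phrases, the chronology pattern, and the candidate features are all hypothetical placeholders chosen for illustration; the presentation does not specify them.

```python
import re

# Hypothetical list of overly general seed phrases.
GENERAL_SEEDS = {"thing", "people", "time"}

# Hypothetical pattern for chronology-related features (years, decades, centuries).
CHRONOLOGY = re.compile(r"\b(\d{4}s?|\d{1,2}(st|nd|rd|th) century)\b", re.IGNORECASE)

def filter_features(features_by_seed):
    """Apply the three filtering rules and return the surviving features."""
    kept = []
    for seed, features in features_by_seed.items():
        # Rule 1: drop everything generated from an overly general seed phrase.
        if seed.lower() in GENERAL_SEEDS:
            continue
        for feature in features:
            # Rule 3: drop features related to chronology.
            if CHRONOLOGY.search(feature):
                continue
            # Rule 2: transform "List of X" into "X".
            feature = re.sub(r"^list of\s+", "", feature, flags=re.IGNORECASE)
            kept.append(feature)
    return kept

candidates = {
    "hotel": ["List of hotels", "Hotel rating", "2008 in tourism"],
    "time": ["Time zone"],
}
filtered = filter_features(candidates)
print(filtered)  # ['hotels', 'Hotel rating']
```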
19. Feature Space Construction
• Feature Selection
• We need to select semantic features to construct the feature space for various tasks.
• The number of features needed is determined by the specific task.
20. Feature Space Construction
• First, we calculate the tf-idf weights of all generated features. Term frequency–inverse document frequency (tf-idf) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
• One seed phrase may generate k semantic features, denoted by {fi1, fi2, …, fik}.
• The selection rule here is one seed phrase, one feature:
$$f_i^{*} = \arg\max_{f_{ij} \in \{f_{i1}, f_{i2}, \ldots, f_{ik}\}} \mathrm{tf\_idf}(f_{ij})$$
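A minimal sketch of this selection rule is shown below. It assumes one common tf-idf variant (relative term frequency times the natural log of N/df); the presentation does not say which variant is used, and the corpus and seed phrases are invented for illustration.

```python
import math

def tf_idf(feature, doc_features, corpus):
    """tf-idf of a feature for one document, using tf * ln(N / df)."""
    tf = doc_features.count(feature) / len(doc_features)
    df = sum(1 for doc in corpus if feature in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

def select_per_seed(features_by_seed, doc_features, corpus):
    """For each seed phrase, keep the one feature with the highest tf-idf."""
    return {
        seed: max(features, key=lambda f: tf_idf(f, doc_features, corpus))
        for seed, features in features_by_seed.items()
    }

# Invented corpus: each document is the list of semantic features generated for it.
corpus = [
    ["hotels", "travel", "hotels", "tourism"],
    ["travel", "airlines"],
    ["travel", "tourism", "restaurants"],
]
features_by_seed = {"hotel": ["hotels", "travel"], "trip": ["tourism", "travel"]}
selected = select_per_seed(features_by_seed, corpus[0], corpus)
print(selected)  # {'hotel': 'hotels', 'trip': 'tourism'}
```

Here "travel" appears in every document, so its idf is zero and it is never selected, which is the behaviour tf-idf is meant to provide.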
21. Feature Space Construction
• Second, the top n features are extracted from the remaining semantic features based on their frequency.
• These frequently appearing features, together with the m features from the first step, make up the m + n semantic features of the feature space.
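The second step, and the combination into m + n features, might look like the sketch below. The function name `build_feature_space` and the sample data are assumptions made for illustration only.

```python
from collections import Counter

def build_feature_space(selected, all_features, n):
    """Combine the m selected features with the n most frequent remaining ones."""
    chosen = list(selected)
    chosen_set = set(chosen)
    # Only features not already picked in the first step compete for the top n.
    remaining = [f for f in all_features if f not in chosen_set]
    top_n = [feature for feature, _ in Counter(remaining).most_common(n)]
    return chosen + top_n

# m = 2 features picked in the first step (one per seed phrase).
selected = ["hotels", "tourism"]
# Every generated semantic feature, with repeats reflecting frequency.
all_features = ["hotels", "travel", "tourism", "travel",
                "resorts", "travel", "resorts", "airlines"]

feature_space = build_feature_space(selected, all_features, n=2)
print(feature_space)  # ['hotels', 'tourism', 'travel', 'resorts']
```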
22. Finally
• With all these steps completed and the feature space generated, we can then apply text clustering or any other text analytics method.
• In conclusion, although research on this subject is still active, this short presentation has shown how text analytics can be applied to social media resources.