1. User-Generated Content
on Social Media
Challenges, Opportunities
Meena Nagarajan, KNO.E.SIS, Wright State
meena@knoesis.org, http://knoesis.wright.edu/researchers/meena/
1
Tuesday, October 27, 2009 1
2. The Shift.. in the rules of the game
• Online Media: Packaged Goods Media
to a Conversational Media
• Variety of networked
interactions, many in near
real-time
• Information economy: from dearth of
signals to plenty much!
2 http://gregverdino.typepad.com/greg_verdinos_blog/images/2007/07/09/web2_logos.jpg
Tuesday, October 27, 2009 2
3. Social Media Investigations
• Network: Social structure
emerges from the aggregate of
relationships (ties)
• People: poster identities, the !"
!"
active effort of accomplishing
interaction
• Content : studying the content of
communication."Who says what, to
whom, why, to what extent and with
what effect?" [Laswell]
3
Tuesday, October 27, 2009 3
4. Effects of Networked Publics
• Certain social phenomenon admittedly
more complex
• begs for a people-content-network confluence
• Micro-level variations of Content-
people-network on macro-level features
• “How do the topic of discussion, emotional charge
of a conversation, poster characteristics & network
connections affect ....?”
4
Tuesday, October 27, 2009 4
5. People-Content-Network - Possibilities
• Emerging social order in online
conversations
• How are the people-content-network
dynamics shaping online
conversations?
• Can we understand the Influentials
theory, information diffusion properties
in networks (etc.) while taking people
and content into account?
5
Tuesday, October 27, 2009 5
6. Stand on the shoulder of micro-giants
• The point is that we need a strong grasp
on the micro-level variables of the
content, people and network
dimensions to begin explaining what
they are doing to any social
phenomenon...
• My focus is on the micro-level variables
in the content dimension.
6
Tuesday, October 27, 2009 6
8. Dimensions Of Analysis
• Named Entity Identification
WHAT and Disambiguation
• Cultural Named Entities
• Music artist, track named
• What are the Named Entities
entities (IBM) [ISWC09a,
and topics that people are
VLDB09], Movie named
making references to?
entities (MSR) [WWW2010]
• How are they interpreting any
• Summaries of user
situation in local contexts and
perceptions behind real-time
supporting them in their
events from Twitter
variable observations?
8
Tuesday, October 27, 2009 8
9. Dimensions Of Analysis
WHAT
• What are the Named Entities
www.evri.com and topics that people are
http://memetracker.org/ making references to?
• How are they interpreting any
situation in local contexts and
supporting them in their
variable observations?
9
Tuesday, October 27, 2009 9
10. Dimensions Of Analysis
WHY
• What are the diverse intentions
that produce the diverse content
on social media?
• Why we share by looking at what
we predominantly do with the
medium. Value derived,
repurposing..
• Emotion, sentiment expressions..
10
Tuesday, October 27, 2009 10
11. Dimensions Of Analysis
• Mapping User Intentions WHY
• Information Seeking, Sharing,
Transactional intents [WI09]
• What is the intention landscape of
social media
• where is the monetization potential
11
Tuesday, October 27, 2009 11
12. Dimensions Of Analysis
• Self-presentation in Online
HOW Dating Profiles
(with Prof. Marti Hearst, UC
Berkeley) [ICWSM09]
• What do word usages tell us
about an active population,
or about the medium?
• Dynamics of a conversation -
snubs, flaming words,
coordination.. or lack
thereof!
12
Tuesday, October 27, 2009 12
13. Dimensions Of Analysis
HOW
• What do word usages tell us
about an active population?
• Self-presentation
• Dynamics of a conversation -
snubs, flaming words,
coordination.. or lack thereof!
http://wordwatchers.wordpress.com/2008/10/06/language-in-speeches-and-interviews-summary-comparisons/
13
Tuesday, October 27, 2009 13
14. The Social Media Content
Landscape..
14
Tuesday, October 27, 2009 14
15. +,'((*-./&0)*
!"#$%$#&'()* 9"$:/.,*7'.83*-./&0)*
;353./83"3/&)**
1'.$'2(3*4/"5.$2&6/"*7'.83*-./&0)* 1'.$'2(3*4/"5.$2&6/"**
7'.83*-./&0)*
Population, Medium Diversity
Some mediation
Rate of exchange (asynchronous, synchronous)
Many-to-many reach
Shared Contexts
Slangs, abbreviations, grammar, spelling, media-specific vocabulary
Interpersonal interactions
15
!"#$%%&&&'()*+,'*-.%#!-/-0%123,45--/%676879:8;8%0)<40%-%==
Tuesday, October 27, 2009 15
16. Variety & Formality
Formality Score = (noun frequency + adjective freq. + preposition freq. + article freq. – pronoun
freq. – verb freq. – adverb freq. – interjection freq. + 100)/2 *
TYPE OF DATA FORMALITY
Nat Broadcast Reportage 62.2
Informational writing 61
Academic Social Science 60.6
Writing 58
Professional Letters 57.5
Non Acad Social Science 56.9
Broadcasts 55
Blog corpus 53.3
Scripted Speech 53 TYPE OF DATA FORMALITY
Email Corpus 50.8 Prepared speeches 50
Personal Letters 49.7
Imaginative writing 47
Fiction Prose 46.3
Interviews 46
Unscripted Speeches 44.4
Spontaneous speech 44
Conversations 38
Phone Conversations 36
* Heylighen, F. & Dewaele, J. Variation in the contextuality of language: An empirical measure Foundations of Science, 2002, 293-340
16
Weblogs, Genres and Individual Differences: How bloggers write for who they write for; Scott Nowson
Tuesday, October 27, 2009 16
17. Variety & Formality
Formality Score = (noun frequency + adjective freq. + preposition freq. + article freq. – pronoun
freq. – verb freq. – adverb freq. – interjection freq. + 100)/2 *
TYPE OF DATA FORMALITY
Nat Broadcast Reportage 62.2 TYPE OF DATA FORMALITY
Informational writing 61 Broadcasts 55
Academic Social Science 60.6 Blog corpus 53.3
Writing 58 Scripted Speech 53
Professional Letters 57.5 Email Corpus 50.8
Non Acad Social Science 56.9 Critic Music reviews from
50.13
www.metacritic.com/music/
Broadcasts 55
Yahoo Personals AboutMe 50.10
Blog corpus 53.3
MySpace About Me 50.07
Scripted Speech 53 TYPE OF DATA FORMALITY
MySpace - comments on Artist Pages 50.06
Email Corpus 50.8 Prepared speeches 50
Prepared speeches 50
Personal Letters 49.7
Personal Letters 49.7
Imaginative writing 47
Twitter 49.46
Fiction Prose 46.3
Facebook posts 48.20
Interviews 46
Imaginative writing 47
Unscripted Speeches 44.4
Fiction Prose 46.3
Spontaneous speech 44
Conversations 38
Phone Conversations 36
* Heylighen, F. & Dewaele, J. Variation in the contextuality of language: An empirical measure Foundations of Science, 2002, 293-340
16
Weblogs, Genres and Individual Differences: How bloggers write for who they write for; Scott Nowson
Tuesday, October 27, 2009 16
18. Making up for lack of context..
• Supplement what the data is showing
you with what you already know..
• Statistical NLP + Contextual
Knowledge
• Ontologies, Taxonomies, Dictionaries,
social medium, shared spatio-
temporal contexts..
17
Tuesday, October 27, 2009 17
20. Cultural NER
WHAT
19
Tuesday, October 27, 2009 19
21. Cultural NER
WHAT
It was THE HANGOVER of the
year..lasted forever.. so I went
to the movies..bad choice
picking “GI Jane” worse now
19
Tuesday, October 27, 2009 19
22. Cultural NER
WHAT
It was THE HANGOVER of the
year..lasted forever.. so I went
to the movies..bad choice
picking “GI Jane” worse now
LOVED UR MUSIC YESTERDAY!
19
Tuesday, October 27, 2009 19
23. Cultural NER
WHAT
I decided to check out the Wanted demo today
even though I really did not like the movie
It was THE HANGOVER of the minus Mrs Jolie a.k.a Fox of course!
year..lasted forever.. so I went
to the movies..bad choice
picking “GI Jane” worse now
LOVED UR MUSIC YESTERDAY!
19
Tuesday, October 27, 2009 19
24. Cultural NER
WHAT
I decided to check out the Wanted demo today
even though I really did not like the movie
It was THE HANGOVER of the minus Mrs Jolie a.k.a Fox of course!
year..lasted forever.. so I went
to the movies..bad choice
picking “GI Jane” worse now
LOVED UR MUSIC YESTERDAY!
Obama the Dark Knight of
socialism.. the man is not as
impressive as Ledger yea
19
Tuesday, October 27, 2009 19
25. Intuitions..
• Spotting and Sense Identification
It was THE HANGOVER of the
year..lasted forever.. so I went to the
• Open vs. Closed world movies..bad choice picking “GI
Jane” worse now
• unlike person, location, named entities, contexts and
senses change fairly rapidly
• We assume an open-world wrt senses
• No comprehensive sense knowledge base
• Reduce it to a spotting and binary sense classification
problem
20
Tuesday, October 27, 2009 20
26. Two flavors..
• Artist and tracks spotting in MySpace
music forums
• using the MusicBrainz Taxonomy
• with Daniel Gruhl, Jan Pieper, Christine Robson, IBM
Almaden, Amit Sheth, Knoesis [ISWC09a]
• on Thursday Oct 29, Session: Discovering Semantics
• Movie names from Weblogs
• with Amir Padovitz, Social Streams MSR, [WWW2010]
21
Tuesday, October 27, 2009 21
27. Cultural NER in Weblogs
• Goal: Supplement classifiers with
information that will help them
disambiguate the reference of a term better!
• A Complexity of Extraction measure
associated with an entity in target sense in
a corpus
• with all cues equal, systems that are ‘complexity
aware’ will treat cues differently
22
Tuesday, October 27, 2009 22
28. Measure of Extraction Complexity
• Feature extraction: Graph-based
spreading activation and Extracted
Complexity
clustering (general weblogs)
Time Travellerʼs Wife
Angels and Demons
• entity sense definition from ..
Wikipedia + evidence a corpus The Hangover
..
presents for the target sense of the Star Trek
..
entity Wanted
Up
Twilight
• Ranked list speaks for itself ...
• More varied senses and contexts,
implies higher extraction complexity
23
Tuesday, October 27, 2009 23
29. Feature as a Prior
Decision Tree and Boosting Classifiers
X axis: precision
1500+ hand-labeled data points Y axis: recall
Blue: basic features
Red: with Entropy baseline
Green: with our Complexity of Extraction feature
24
Tuesday, October 27, 2009 24
30. As a Prior in Binary Classification
Average F-measure over 1000 decision tree, boosting models
Average Accuracy over 1000 decision tree, boosting models
1500+ hand-labeled data points
Blue: basic features
Red: with Entropy baseline
Green: with our Complexity of
25 Extraction feature
Tuesday, October 27, 2009 25
31. To chew on..
• The concept of ‘Extraction Complexity’
as an additional prior is very promising
• applies to general NER
26
Tuesday, October 27, 2009 26
32. User Intention Mapping
WHY
• Unlike Web search intent, entity alone is
in-sufficient to characterize intent here..
• Three broad intentions: information
seeking, sharing, transactional,
combinations thereof.
• ‘i am thinking of getting X’ (transactional)
‘i like my new X’ (information sharing)
‘what do you think about X’ (information seeking)
27
Tuesday, October 27, 2009 27
33. Action Patterns
• Resorted to ‘action patterns’ surrounding named
entities
• “where can i find a psp cam..”
• A minimally supervised bootstrapping
algorithm
• 10 seed action patterns, learn new ones from unannotated corpus,
relying on a empirical and semantic similarity with seed patterns
• semantic similarity from communicative functions of words
Linguistic Inquiry Word Count (www.LIWC.net)
28
Tuesday, October 27, 2009 28
34. Information Seeking, Transactional
• Patterns learned using 8000 uncategorized posts
on MySpace forums
Sample learned patterns
does anyone know how
know where i can
was wondering if someone
Im not sure how
someone tell me how
• Intent recognition recall using pre-classified user
posts from Facebook Marketplace (to buy): 81%
29
Tuesday, October 27, 2009 29
35. Impact on Online Advertising?
• Generate ads from user profile
(interests, hobbies) or from posts with
monetizable intents?
30
Tuesday, October 27, 2009 30
36. Targeted Content Delivery Platform
• Of all the ads generated using profile (hobbies,
interests) information, 7% received attention
• Ads generated using authored, monetizable
posts, 59% received attention What
More at [WI09], Beyond Search and Internet Economics Workshop, Why
MSR, Redmond, WA
http://research.microsoft.com/en-us/um/redmond/about/collaboration/awards/beyondsearchawards.aspx
31
Tuesday, October 27, 2009 31
37. Self-Presentation
HOW
• On Online-dating profiles [ICWSM09] (with Prof.
Marti Hearst, UCB)
• quantifying usages of words from linguistic,
personal and psychological categories in LIWC
• Exploratory Factor Analysis to identify
systematic co-occurrence patterns among LIWC
variables
• grouping user profiles on the basis of their
shared multi-dimensional features to compare
and contrast self-presentation
32
Tuesday, October 27, 2009 32
38. Imitate to Impress !?
• More similarities than differences
• Men displaying a higher usage of tentative
words (maybe, perhaps..)
• typically attributed to feminine discourse
• Many similarities in word combinations and
words used!
• Perhaps, self-expression tends towards
attempting homophily in online dating..
33
Tuesday, October 27, 2009 33
40. BBC SoundIndex, IBM
“A pioneering project to tap into the online buzz surrounding
What
artists and songs, by leveraging several popular online sources” Why
De-spam, slang transliterations, entity identification, voting theory When,
to combine multi-modal online data sources [ICSC08a,VLDB09] Where,
http://www.almaden.ibm.com/cs/projects/iis/sound/ Who
35
Tuesday, October 27, 2009 35
41. Twitris: Kno.e.sis
Real-time user perceptions as the fulcrum for
browsing the Web [ISWC09b] What
When,
Where
36
Tuesday, October 27, 2009 36
42. Iran elections: Discussions in the US and Iran on the same day
The mystery of Soylent Green: information where you can use it
37
Tuesday, October 27, 2009 37
43. every day; legalize is no more the
th lk tio n o
cr o ro
op eo s ab n f
illegal immigrants news for Obama!
ta lec io
m es cy t
de pr a u
E pin
in the healthcare captured October
on si ,
n
ra ,
st on
O
tio
context on 12.
September 18.
Find resources related to
social perceptions
Twitris: Twitter through The fourth estate
perspective
space,time,theme
News and
Wikipedia articles
to put extracted
descriptors in
context Integrate user observations with
news on a particular day;
Correlate citizen journalism with
the fourth estate; On September
18, Obama was talking about
Illegal immigrants in the context of
health care;
semantics for thematic aggregation Little statistics from Tiwtris (unit: tweets)
a tweet
ck and checkl new events on twitris #twitris" Healthcare ( Aug 19 - Oct 20) : 721 K (US Only)
of a tweet; # (hashtags) user generated meta; @- refer to
Obama (Oct 8 - 20): 312 K (US Only)
es (Twitter, news services, Wikipedia, and other Web
Come see & play with Twitris @ the
H1N1 (Oct 5 - 20) : 232 K (US Only)
Iran Election (June 5 - Oct 20) : 2.8 m (Worldwide)
Concept Cloud,
News and related
International Semantic Web Challenge
`
twi
tris
inte
articles 140 rnals
at ISWC ’09
cha in le
Twitris rac s
Parallel crawling to scale
ters s tha
Context Data processing pipeline to streamline n
+ Selected Twitter, geocode services, data analytics,
Term to handle heterogeneity
Live resource aggregation
Near real time: Processing upto a day
ogle DBpedia before
widget widget Spatio-temporally weighted text analytics
Data Processing
TFIDF Spatio, Temporal, Extracting
Twitris DB based Thematic storylines Cavetas and Future work
descriptor descriptor around
extraction extraction descriptors 1. Handle Twitter constructs such as hashtags,
retweets, mentions and replies better
2. Different viz widgets such as time series to
http://twitris.knoesis.org
show changing perceptions from a place for an
Data Collection event and demographic based visualizations.
S S
h
Geocode Lookup
. h
Data Dumper
.
3. Sentiment analysis
a .
.
a .
. 4. Robust computing approaches (Cloud, Hadoop)
r . r .
e
Geocode Lookup
.
e Data Dumper
.
5. FB Connect for sharing and personalization
d . d .
. .
. .
M Geocode Lookup M Data Dumper
e e
m m
o o
r r
y y
knoesis.org
A tetris like approach to twitter to gather
Twtitris with everyone
aggregated social signals is defined as
38
Tuesday, October 27, 2009 38
45. References
http://knoesis.wright.edu/researchers/meena/pubs.php
[WISE09] Meenakshi Nagarajan, Karthik Gomadam, Amit Sheth, Ajith Ranabahu, Raghava Mutharaju and Ashutosh
Jadhav, Spatio-Temporal-Thematic Analysis of Citizen-Sensor Data - Challenges and Experiences, Tenth
International Conference on Web Information Systems Engineering, Oct 5-7, 2009.
[ISWC09a] Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, Context and Domain
Knowledge Enhanced Entity Spotting in Informal Text, The 8th International Semantic Web Conference, 2009.
[ISWC09b] Twitris, Submission to the International Semantic Web Challenge, collocated with the International
Semantic Web Conference 2009.
[VLDB09] Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, Multimodal Social
Intelligence in a Realtime Dashboard System, Pending Review, VLDB Journal, Special Issue on "Data Management
and Mining on Social Networks and Social Media", 2009.
[WWW2010] Meenakshi Nagarajan, Amir Padovitz, A Measure of Extraction Complexity: a Novel Prior for Improving
Recognition of Cultural Entities, Manuscript in Preparation, for The Nineteenth International World Wide Web
Conference, 2010.
[ICSC08a] Alfredo Alba, Varun Bhagwan, Julia Grace, Daniel Gruhl, Kevin Haas, Meenakshi Nagarajan, Jan Pieper,
Christine Robson, Nachiketa Sahoo. Applications of Voting Theory to Information Mashups, Second IEEE
International Conference on Semantic Computing, ICSC 2008.
[ICSC08b] Meenakshi Nagarajan, Cartic Ramakrishnan, Amit Sheth, “Text Analytics for Semantic Computing - the
good, the bad and the ugly”, Second IEEE International Conference on Semantic Computing Santa Clara, CA, USA,
2008.
[WI09] Meenakshi Nagarajan, Kamal Baid, Amit P. Sheth, and Shaojun Wang, Monetizing User Activity on Social
Networks - Challenges and Experiences, 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Sep
15-18 2009.
[ICWSM09] Meenakshi Nagarajan, Marti A. Hearst. An Examination of Language Use in Online Dating Personals, 3rd
Int'l AAAI Conference on Weblogs and Social Media, ICWSM 2009
[IC09] Amit Sheth, Meenakshi Nagarajan. Semantics-Empowered Social Computing IEEE Internet Computing 13(1),
2009.
40
Tuesday, October 27, 2009 40