Twitris

Browsing real-time data by space,
time and theme
http://twitris.knoesis.org

Motivation, Goals
Mumbai Terror Attack 2008
Citizen sensor observations (Flickr, Twitter, blogs, ...)
No matter where you looked, tapping into a
cultural perception was impossible

We wanted to know what people in India
were saying vs. those in Pakistan or the
U.S.A

Spatio-Temporal-Thematic Slices of
Real-time Data

Around NEWS-WORTHY EVENTS
Using space and time as cues for extracting
social perceptions (behind signals)
Summarizing hundreds and thousands of
real-time observations

The Health Care Reform Debate in the U.S.
Temporal navigation
Spatial markers

Find resources related to social perceptions

Browsing Real-time Data in Context
News and Wikipedia articles to put extracted descriptors in context
SOYLENT GREEN and the HEALTH CARE REFORM

✓ Exploit spatial and temporal semantics for thematic aggregation

Core of Twitris
n-gram summaries - Spatio-temporal-thematic
event descriptors

Architecture
Step 1: Gathering event-relevant tweets

Because tweets are not
pre-categorized

Skip if I run out of time ..

Topical Tweets
Gathering event-specific tweets: Iran Election

1. Pick trending hashtags from Twitter: #iranelection, #iran, ...
2. Use Google Insights to expand the hashtag list
3. Issue a Twitter Search (API) query every 30 seconds for every hashtag and keyword (1500 tweets per query)
4. Obtain other hashtags in the crawled tweets, and check for topic drifts
5. Repeat from Step 3 and babysit!
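
The crawl loop in steps 1 to 5 can be sketched as follows. This is a minimal sketch, not the Twitris crawler: `search` is a hypothetical callable standing in for the Twitter Search API, `approve` is a hypothetical hook for the manual topic-drift check, and the hashtag regex and `max_rounds` cap are assumptions. Only the 30-second interval comes from the slide.

```python
import re
import time
from typing import Callable, Iterable

HASHTAG = re.compile(r"#\w+")

def crawl(seeds: Iterable[str],
          search: Callable[[str], list],
          approve: Callable[[str], bool] = lambda tag: True,
          max_rounds: int = 3,
          interval_s: float = 30.0,
          sleep: Callable[[float], None] = time.sleep):
    """Return (tweets, hashtags) gathered by repeated hashtag search.

    `search` is a hypothetical stand-in for a Twitter Search API call.
    """
    hashtags = {s.lower() for s in seeds}      # steps 1-2: seed hashtag list
    tweets = set()
    for _ in range(max_rounds):
        discovered = set()
        for tag in sorted(hashtags):           # step 3: one query per hashtag
            for text in search(tag):
                tweets.add(text)
                # step 4: collect other hashtags seen in the crawled tweets
                discovered.update(h.lower() for h in HASHTAG.findall(text))
        # Topic-drift check: only approved new hashtags join the list.
        new = {h for h in discovered - hashtags if approve(h)}
        if not new:
            break
        hashtags |= new
        sleep(interval_s)                      # step 5: repeat from step 3
    return tweets, hashtags
```

Passing `sleep=lambda s: None` makes the sketch run instantly against canned data; `approve` is where a human reviewer (the "babysit" step) would reject hashtags that drift off topic.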

Architecture
Step 1: Gathering event-relevant tweets
Step 2: Spatial, temporal metadata of tweets

Geo-Coordinates of Tweets
Location a tweet originates from
Location it mentions
Approximation: Poster location on Twitter
profile

Location: Dayton, OH (Google geocoder service, GeoDB)
Location: “best place in the world” (fail!)
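
The profile-location approximation can be sketched as a lookup that either resolves free text to coordinates or fails. This is an illustrative sketch only: the in-memory `GAZETTEER` is a hypothetical stand-in for a geocoder service, and its coordinates are approximate.

```python
from typing import Optional

# Tiny in-memory gazetteer standing in for a geocoder service (assumption).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "dayton, oh": (39.76, -84.19),
    "mumbai": (19.08, 72.88),
}

def resolve_location(profile_location: str) -> Optional[tuple]:
    """Map a free-text profile location to (lat, lon), or None on failure."""
    # Normalize case and whitespace before the lookup.
    key = " ".join(profile_location.lower().split())
    return GAZETTEER.get(key)
```

A tweet whose profile location does not resolve (the "best place in the world" case) gets `None` and cannot be assigned to a spatial cluster.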

Architecture
Step 1: Gathering event-relevant tweets
Step 2: Spatial, temporal metadata of tweets
Step 3: Spatio-temporal clusters

Spatio-Temporal Clusters of Tweets
Because every event is different, and we want to preserve the social perceptions
that generated this data!

Long-running, world-wide events (Iran Election Protest)
clusters by country and week?
Short, world-wide events (Olympics)
clusters by country and day?
Long-running, evolving, local events (Health Care
Reform Debate)
clusters by state and day?
Tunable parameters

Tweets in a Spatio-Temporal Cluster

Spatio-temporal bias dictates the granularity of
processing tweets
Mumbai Terror Attack
Cluster1: Tweets from India, 08/1/08
Cluster2: Tweets from Pakistan, 08/1/08
Cluster n: Tweets from USA, 08/13/08
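
Grouping tweets into spatio-temporal clusters with tunable granularity can be sketched like this. The tweet dictionary layout (`country`, `date`, `text`) and the function names are assumptions made for illustration; the day and week granularities come from the slides.

```python
from collections import defaultdict
from datetime import date

def time_bucket(day: date, granularity: str) -> str:
    """Label a date by day or by ISO week (the tunable time parameter)."""
    if granularity == "day":
        return day.isoformat()
    if granularity == "week":
        year, week, _ = day.isocalendar()
        return f"{year}-W{week:02d}"
    raise ValueError(f"unknown granularity: {granularity}")

def cluster(tweets, space_key="country", granularity="day"):
    """Group tweet texts by (spatial unit, time bucket)."""
    clusters = defaultdict(list)
    for t in tweets:
        key = (t[space_key], time_bucket(t["date"], granularity))
        clusters[key].append(t["text"])
    return dict(clusters)
```

Switching `space_key` (for example to a hypothetical `state` field) and `granularity` covers the "country and week" versus "state and day" choices listed above.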

Architecture
Step 1: Gathering event-relevant tweets
Step 2: Spatial, temporal metadata of tweets
Step 3: Spatio-temporal clusters
Step 4: Thematic descriptors in spatio-temporal cluster

Thematic Descriptors

An event descriptor is an n-gram
1-, 2- and 3-grams

n-gram descriptors
“President Obama in trying to regain control of the health-care debate will likely shift his pitch in September”

1-grams: President, Obama, in, trying, to, regain, ...
2-grams: “President Obama”, “Obama in”, “in
trying”, “trying to”...
3-grams: “President Obama in”, “Obama in trying”;
“in trying to”...

“President” “President Obama” “President Obama in”
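
Extracting 1-, 2- and 3-gram candidates from a tweet can be sketched as below. Whitespace tokenization is an assumption made to keep the example short; the slides do not say how tweets are tokenized.

```python
def ngrams(text: str, max_n: int = 3) -> dict:
    """Return {n: [n-grams]} for n = 1..max_n using whitespace tokens."""
    tokens = text.split()
    return {
        n: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(1, max_n + 1)
    }

grams = ngrams("President Obama in trying to regain control")
```

Here `grams[2]` starts with "President Obama" and `grams[3]` starts with "President Obama in", matching the examples on the slide.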

A descriptor is an n-gram weighted by:

Thematic Importance
redundancy: statistically discriminatory in nature
variability: contextually important
Spatial Importance (local vs. global popularity)
Temporal Importance (always popular vs. currently trending)

Thematic Importance of an n-gram

Exploiting Redundancy
tfidf of n-gram (Lucene Index)
amplify by fraction of nouns in the n-gram
(Stanford Natural Language Parser)
amplify by fraction of non-stop words (‘going to
try’)
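
The redundancy score can be sketched as a base tf-idf value amplified by two fractions. The slides do not say how the amplification is applied, so multiplying by `(1 + fraction)` is an assumption. The `nouns` set stands in for part-of-speech output from a parser, `base_tfidf` stands in for a score read from an index, and the stop-word list is illustrative.

```python
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "is", "and"}

def redundancy_score(ngram: str, base_tfidf: float, nouns: set,
                     stop_words: set = STOP_WORDS) -> float:
    """Amplify tf-idf by noun and non-stop-word fractions (assumed form)."""
    words = ngram.lower().split()
    noun_frac = sum(w in nouns for w in words) / len(words)
    content_frac = sum(w not in stop_words for w in words) / len(words)
    # Assumption: each fraction scales the score multiplicatively.
    return base_tfidf * (1 + noun_frac) * (1 + content_frac)
```

Under this assumed form, an n-gram made of nouns such as "president obama" ends up well above a filler phrase such as "going to try" with the same base tf-idf.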

Exploiting Variability
Big three/Big 3; Ford, GM, Chrysler, General Motors..
Contextually relevant words boost statistical importance

Focus word (fw): “Big Three”
Associated words (aw_i): words co-occurring with fw in the spatio-temporal set of tweets, e.g. Ford, GM, Chrysler, General Motors

Thematic importance of focus word, combining:
tfidf of fw
tfidf of aw_i
association strength of fw and aw_i

Contextual Relevance
Association strength of fw and aw_i depends on contexts

The goal is to strengthen the focus word only with its strongly associated words. Strength of association can be measured using word co-occurrence in language [9]. Borrowing from past success in this area, we measure association strength between the focus word and the associated words using the notion of point-wise mutual information in terms of co-occurrence.

We measure assocstr scores as a function of the point-wise mutual information between the focus word and the contexts of aw_i. Let us call the contexts for aw_i C_{aw_i} = {caw_1, caw_2, ..}, where the caw_k are strong descriptors that collocate with aw_i.

Contexts of associated word aw_i: ‘Ford’
chrysler, GM, big 3
focus, model, release..

$$assocstr(fw, aw_i) = \frac{\sum_{k} pmi(fw, caw_k)}{|C_{aw_i}|}, \quad \forall\, caw_k \in C_{aw_i}$$

Pointwise Mutual Information

$$pmi(fw, caw_k) = \log \frac{p(fw, caw_k)}{p(fw)\, p(caw_k)} = \log \frac{p(caw_k \mid fw)}{p(caw_k)}$$

where p(fw) = n(fw) / N, p(caw_k | fw) = n(caw_k, fw) / n(fw), and n(·) denotes frequency.
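
The association strength and pointwise mutual information definitions can be sketched over a toy set of tweets. Several choices here are assumptions rather than the authors' method: probabilities are estimated per tweet, N is taken to be the number of tweets, each tweet is a set of terms, and a pair that never co-occurs gets a PMI of 0 to avoid taking the log of zero.

```python
import math

def pmi(fw: str, caw: str, tweets: list) -> float:
    """pmi(fw, caw) = log(p(caw | fw) / p(caw)), estimated per tweet."""
    n_total = len(tweets)
    n_fw = sum(fw in t for t in tweets)
    n_caw = sum(caw in t for t in tweets)
    n_both = sum(fw in t and caw in t for t in tweets)
    if n_fw == 0 or n_caw == 0 or n_both == 0:
        return 0.0  # assumption: no evidence of association
    p_caw_given_fw = n_both / n_fw
    p_caw = n_caw / n_total
    return math.log(p_caw_given_fw / p_caw)

def assoc_str(fw: str, contexts: list, tweets: list) -> float:
    """Average PMI between fw and the contexts of an associated word."""
    if not contexts:
        return 0.0
    return sum(pmi(fw, c, tweets) for c in contexts) / len(contexts)

tweets = [
    {"big three", "ford", "gm"},
    {"big three", "chrysler", "gm"},
    {"ford", "focus", "model"},
    {"ford", "focus", "release"},
]
auto_sense = assoc_str("big three", ["gm", "chrysler"], tweets)
car_model_sense = assoc_str("big three", ["focus", "model"], tweets)
```

In this toy data the automaker contexts of "ford" score higher against "big three" than the car-model contexts do, which is the contextual filtering the slide illustrates.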

Fig. 2: (a) Extracted descriptors sorted by TFIDF vs. spatio-tempo... (b) Top 15 extracted descriptors in the US for Mumbai attack event

The thematic weights of the focus word's associations, along with their strengths, are plugged into Eqn 1 to compute the thematic score ngram_i(th) of the n-gram descriptor.

Temporal Importance of a Descriptor
Certain descriptors tend to dominate discussions
“Terrorism” in Mumbai Terror Attack tweets
“Healthcare” in Health Care Reform debate
Allow recent (possibly interesting) ones to surface

To allow possibly interesting descriptors to surface, we discount the thematic score of a descriptor depending on how popular it has been in the recent past. The temporal discount score for an n-gram, a tuneable factor depending on the event, is calculated over a period of time as:

$$ngram_i(te) = temporal_{bias} \cdot \sum_{d=1}^{D} \frac{ngram_i(th)_d}{d}$$

where ngram_i(th)_d is the enhanced thematic score of the descriptor at time step d, and D is the duration for which we wish to apply the dampening factor, for example the recent week.

0-1 bias: less to more importance to recent n-grams
This temporal discount might not always be relevant. For this reason, we also apply a temporal_bias weight ranging from 0 to 1: a weight closer to 1 gives more importance to recent activity, while a weight closer to 0 gives more importance to past activity.
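
The temporal discount can be sketched directly from the formula. Treating the first list element as d = 1 is an assumption, as is the range check on the bias; the function and parameter names are illustrative.

```python
def temporal_discount(past_thematic_scores: list, temporal_bias: float) -> float:
    """temporal_bias * sum over d = 1..D of ngram_i(th)_d / d."""
    if not 0.0 <= temporal_bias <= 1.0:
        raise ValueError("temporal_bias must be in [0, 1]")
    # Assumption: past_thematic_scores[0] is the score for d = 1.
    total = sum(score / d
                for d, score in enumerate(past_thematic_scores, start=1))
    return temporal_bias * total

discount = temporal_discount([6.0, 4.0, 3.0], temporal_bias=0.5)
```

With scores 6, 4 and 3 over three time steps the sum is 6 + 2 + 1 = 9, so a bias of 0.5 gives a discount of 4.5; a bias of 0 switches the discount off.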

Spatial Importance of a Descriptor
Local descriptors are more interesting compared to global ones

We also discount the importance of a descriptor based on its occurrence in other spatio-temporal sets. Descriptors that occur all over the world on a given day are less interesting compared to those that occur only in the spatio-temporal set of interest. We define the spatial discount score for an n-gram as a fraction of spatial partitions (e.g. countries) that had activity surrounding this descriptor.

Spatial discount

$$ngram_i(sp) = \frac{k}{|spatio\text{-}temporal\ sets|} \cdot (1 - spatial_{bias})$$

k / |spatio-temporal sets|: fraction of spatio-temporal clusters the n-gram occurred in
spatial_bias: closer to 0 = global importance

spatial_bias sets the amount of importance given to the global presence of the descriptor. Depending on the event of interest, both these discounting factors can vary for different spatio-temporal sets. For example, when processing tweets for the Mumbai attack, setting spatial_bias to 1 eliminates the spatial signals, since the factor (1 - spatial_bias) becomes 0. While processing tweets from the US, a global bias may be appropriate, given that the event did not originate there. These parameters are set before we begin the processing of observations.

STT Score of an n-gram
Spatio-temporal-thematic score of a descriptor = thematic score - spatio-temporal discounts

The spatial and temporal effects are discounted from the thematic score to give the final spatio-temporal-thematic (STT) weight of the n-gram:

$$w_i = ngram_i(th) - ngram_i(te) - ngram_i(sp)$$

The figure illustrates the effect of our enhanced STT weights on descriptors pertaining to the Mumbai terror attack event.
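
The spatial discount and the final STT weight can be sketched as two small functions. The function and parameter names are illustrative; the arithmetic follows the two formulas as reconstructed from the slides.

```python
def spatial_discount(k: int, num_sets: int, spatial_bias: float) -> float:
    """(k / |spatio-temporal sets|) * (1 - spatial_bias)."""
    if num_sets <= 0:
        raise ValueError("num_sets must be positive")
    return (k / num_sets) * (1.0 - spatial_bias)

def stt_weight(thematic: float, temporal: float, spatial: float) -> float:
    """w_i = ngram_i(th) - ngram_i(te) - ngram_i(sp)."""
    return thematic - temporal - spatial

# A descriptor seen in 3 of 4 spatio-temporal sets, with no spatial bias.
sp = spatial_discount(k=3, num_sets=4, spatial_bias=0.0)
w = stt_weight(thematic=10.0, temporal=4.5, spatial=sp)
```

Raising `spatial_bias` to 1 makes `spatial_discount` return 0, so only the temporal discount is subtracted from the thematic score.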

Higher-order n-grams are picked over lower-order n-grams (if scores are the same)

Top X Descriptor Tag Cloud

Tag size proportional to enhanced STT score
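
Selecting the top X descriptors and sizing their tags can be sketched as below. The alphabetical final tie-break, the `max_pt` font size, and the assumption that scores are positive are illustrative choices, not taken from the slides.

```python
def top_descriptors(scores: dict, x: int) -> list:
    """Top-x n-grams by score; on ties, prefer the higher-order n-gram."""
    return sorted(
        scores.items(),
        key=lambda kv: (-kv[1], -len(kv[0].split()), kv[0]),
    )[:x]

def tag_sizes(top: list, max_pt: float = 36.0) -> dict:
    """Font size proportional to score, assuming positive scores."""
    best = max(score for _, score in top)
    return {ngram: max_pt * score / best for ngram, score in top}

scores = {"president": 2.0, "president obama": 2.0, "obama": 1.0}
top = top_descriptors(scores, 3)
sizes = tag_sizes(top)
```

With equal scores, "president obama" is ranked ahead of "president", and "obama" is drawn at half the size of the top tag.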
