Development of a system that automatically generates (kind of) storylines out of social media aggregated around hashtags, following links being shared.
Extracting Resources that Help Tell Events' Stories
1. Extracting Resources that Help Tell Events' Stories
Raphaël Troncy
raphael.troncy@eurecom.fr
Mor Naaman
mor.naaman@cornell.edu
Carlo Andrea Conte
carloandreaconte@icloud.com
@mrgreenh
2. Inputs (Introduction)
People share a huge amount of diverse content about real-world events
Links to these resources are often associated with #hashtags
3. Outputs (Introduction)
Assuming that real-world events can be associated with particular #hashtags at particular points in time,
resources can be filtered and aggregated into storylines
4. Our Approach (Introduction)
Near real time: ~15-30 min.
- Extracting links from an event's tweets
- Resolving shortened URLs
- Ranking (and filtering when necessary) links
- Collecting data by dereferencing those links
Experimental Results
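The first step listed above, extracting links from an event's tweets, could look roughly like the sketch below. It is only an illustration under stated assumptions: the slides do not say how links are extracted, and the regular expression, the `extract_links` helper, and the sample tweets are all hypothetical.

```python
import re

# Assumed pattern (not from the slides): an http(s) URL runs until the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_links(tweet_text):
    """Return the URLs found in a tweet's text, in order of appearance."""
    links = []
    for match in URL_PATTERN.findall(tweet_text):
        # Drop trailing punctuation that often follows a URL in running text.
        links.append(match.rstrip(".,;:!?)\"'"))
    return links


# Made-up sample tweets, for illustration only.
tweets = [
    "Live from the keynote http://t.co/abc123 #TCDisrupt",
    "No link here #TCDisrupt",
    "Two links: http://bit.ly/x1, https://example.org/post. #TCDisrupt",
]
all_links = [link for text in tweets for link in extract_links(text)]
```

Tweets without links simply contribute nothing to `all_links`, so the later steps only ever see candidate resources.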
7. Seen.co (System Architecture)
- Starting point for our system
- Rich database of #hashtag-defined events
- For every event, raw Twitter data is available
8. Seen.co (System Architecture)
Automatically organizes content shared on Twitter and Instagram about an event.
The resulting narrations highlight the most important moments and let the user quickly see what happened.
9. Seen.co (System Architecture)
Our system aims at gathering together a more diverse set of media, by focusing on the referenced resources rather than on the tweets themselves.
10. Requirements (System Architecture)
- Extracting links from an event's tweets
- Resolving shortened URLs
- Ranking links
- Collecting metadata by dereferencing those links
Near real time: ~15-30 min.
Severe bottlenecks
11. Requirements (System Architecture)
Process Parallelization
- Extracting links from an event's tweets
- Resolving shortened URLs
- Ranking links
- Collecting metadata by dereferencing those links
Near real time: ~15-30 min.
Severe bottlenecks
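Resolving shortened URLs is dominated by network waits, which is one reason parallelization helps with the bottlenecks named above. The following is a minimal sketch, assuming a thread pool and a pluggable resolver. The names `resolve_with_http`, `resolve_all`, and `fake_resolver` are hypothetical, and the slides do not describe the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def resolve_with_http(short_url, timeout=10):
    """Follow redirects and return the final URL (performs a real network request)."""
    request = urllib.request.Request(short_url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def resolve_all(short_urls, resolver=resolve_with_http, max_workers=8):
    """Resolve many URLs concurrently; a URL whose lookup fails maps to None."""
    def safe_resolve(url):
        try:
            return resolver(url)
        except Exception:
            # Broken or removed links should not stop the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        resolved = list(pool.map(safe_resolve, short_urls))
    return dict(zip(short_urls, resolved))


# Offline demonstration with a stand-in resolver (no network access needed).
FAKE_REDIRECTS = {"http://t.co/a": "https://example.org/article"}


def fake_resolver(url):
    return FAKE_REDIRECTS[url]  # raises KeyError for unknown URLs


mapping = resolve_all(["http://t.co/a", "http://t.co/missing"], resolver=fake_resolver)
```

Passing the resolver as a parameter keeps the sketch testable offline; in a real deployment the default HTTP resolver (or a cache lookup in front of it) would be used instead.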
12. Architecture (System Architecture)
[Architecture diagram. Components: Links Dispatcher, Decider, Links Resolutor, Page Scrapers, Server 1, Server 2. Queues: Links Resolutor Queue, Links Metadata Queue. Storage: Content db and WebDB, with the Links Mapping, Pages Metadata, Event Links and Links Appearances collections. Arrows labeled save, lookup and query carry Links Mappings, Links Appearances, Pages Metadata and Event Links.]
21. Queues and Workers (System Architecture)
Queues: Links Resolutor Queue, Links Metadata Queue
Workers: Links Resolutor, Page Scrapers
"Workers" pop jobs from the queues
Flexible setup, easily scalable
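The queue-and-worker pattern described above can be illustrated with a small in-process sketch. This is an assumption-laden stand-in: the slides do not name the queue technology, and `run_workers`, the handler, and the sample jobs are hypothetical. Scaling then amounts to changing the number of workers.

```python
import queue
import threading


def run_workers(jobs, handler, num_workers=4):
    """Put jobs on a queue and let worker threads pop and process them."""
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # no jobs left: this worker stops
            outcome = handler(job)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Stand-in for a page scraper: returns placeholder metadata for each URL.
scraped = run_workers(
    ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    handler=lambda url: {"url": url, "title": "placeholder"},
    num_workers=2,
)
```

Because workers finish in no guaranteed order, the results list should be treated as unordered.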
22. Link Score Processors (LSP) (System Architecture)
[Same architecture diagram as slide 12, with an LSP shown next to the Decider.]
23. Link Score Processors (LSP) (System Architecture)
Responsible for ranking extracted links
[Diagram: Decider with one LSP]
24. Link Score Processors (LSP) (System Architecture)
[Same architecture diagram as slide 12, with three LSPs shown next to the Decider.]
25. Link Score Processors (LSP) (System Architecture)
[Diagram: Decider with three LSPs]
Multiple LSPs can be defined to adopt different score functions
- Velocity-based LSP
- Volume-based LSP
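The slides name the two LSPs but do not define their score functions, so the sketch below is purely an assumption about what they might compute: volume as the total number of times a link appears, and velocity as the peak number of appearances within a sliding time window. The function names, the 15-minute window, and the sample data are hypothetical.

```python
from collections import defaultdict


def volume_scores(appearances):
    """Assumed volume score: how many times each link was shared in total.

    appearances is an iterable of (url, timestamp_in_seconds) pairs.
    """
    scores = defaultdict(int)
    for url, _timestamp in appearances:
        scores[url] += 1
    return dict(scores)


def velocity_scores(appearances, window_seconds=900):
    """Assumed velocity score: peak number of shares inside any time window."""
    times = defaultdict(list)
    for url, timestamp in appearances:
        times[url].append(timestamp)

    scores = {}
    for url, stamps in times.items():
        stamps.sort()
        best, start = 0, 0
        for end in range(len(stamps)):
            # Shrink the window until it spans less than window_seconds.
            while stamps[end] - stamps[start] >= window_seconds:
                start += 1
            best = max(best, end - start + 1)
        scores[url] = best
    return scores


# Made-up data: "a" is shared in a short burst, "b" slowly over a long period.
appearances = [("a", 0), ("a", 60), ("a", 120),
               ("b", 0), ("b", 2000), ("b", 4000), ("b", 6000)]
by_volume = volume_scores(appearances)
by_velocity = velocity_scores(appearances)
```

With this data the two functions rank the links differently: "b" wins on volume, "a" wins on velocity, which is the kind of divergence that motivates having more than one LSP.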
32. Building a Storyline (Experiments)
How to visualize the extracted links in order to tell a story?
Front-end Interface
- The story follows a chronological timeline
- Links are illustrated with metadata extracted from those resources
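A chronological timeline like the one described above can be sketched by sorting link records by the time they were first seen and rendering each with its extracted metadata. The record fields (`first_seen`, `title`, `url`) and the `build_storyline` helper are hypothetical; the slides do not specify the data model or the template.

```python
from datetime import datetime, timezone


def build_storyline(link_records):
    """Order link records chronologically and render one text line per entry."""
    ordered = sorted(link_records, key=lambda record: record["first_seen"])
    lines = []
    for record in ordered:
        # first_seen is assumed to be a Unix timestamp in seconds (UTC).
        stamp = datetime.fromtimestamp(record["first_seen"], tz=timezone.utc).strftime("%H:%M")
        lines.append(f"{stamp}  {record['title']} ({record['url']})")
    return lines


# Made-up records, for illustration only.
records = [
    {"first_seen": 3600, "title": "Recap article", "url": "https://example.org/recap"},
    {"first_seen": 0, "title": "Opening photo", "url": "https://example.org/photo"},
]
storyline = build_storyline(records)
```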
41. Problems (Experiments)
- Spam (e.g. tweets advertising unrelated pages, Twitter bots) easily reaches high volumes
- Content is sometimes removed after publication (broken links)
- Query parameters in URLs cause some duplicate links
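One common way to reduce the duplicates caused by query parameters is to canonicalize each URL before counting it. The slides do not say whether or how the authors handled this, so the sketch below is an assumption: the `TRACKING_PARAMS` set and the `canonicalize` helper are hypothetical, and dropping the fragment is a design choice that may not suit every site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of parameters that do not change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def canonicalize(url):
    """Return a normalized form of url so that trivially different links compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    kept.sort()  # parameter order should not create distinct links
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


first = canonicalize("http://Example.org/post?utm_source=twitter&id=7#top")
second = canonicalize("http://example.org/post?id=7&utm_medium=social")
```

Both sample URLs reduce to the same canonical string, so their appearances would be counted against a single link.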
42. LSP Performances (Experiments)
TechCrunch Disrupt
[Two charts, labeled Volume Based and Velocity Based: Number of Links (axis ticks 1, 10, 100, 1000) against Score Range (1 to 35 in one chart, 1 to 102 in the other), with series Results and True-positives.]
- In this particular case, volume-based results have higher precision
- Links are better classified along the scores range
43. LSP Performances (Experiments)
SFBatkid, President Obama's Vine
[Two charts, labeled Volume Based and Velocity Based: Number of Links (axis ticks 1, 10, 100, 1000, 10000) against Threshold (1 to 887 in one chart, 1 to 2101 in the other), with series Results and True-positives.]
- In the velocity-based results, the top relevant links usually occupy a relatively big slice of the topmost part of the scores range
44. Sources Composition (Experiments)
[Chart: sources composition for TCDisrupt (no filtering)]
- Different classes of events have different sources composition
- This division changes with increasing threshold
- It is possible to define "fingerprints" for different classes of event
Given a specific "fingerprint", outliers can help define official sources
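The idea of a sources composition that changes with the threshold can be made concrete with a small sketch. It assumes that composition means the share of each source domain among the links whose score reaches a threshold; the slides do not give the exact definition, and `source_composition` and the sample links are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def source_composition(scored_links, threshold=0):
    """Share of each source domain among links with score >= threshold.

    scored_links is an iterable of (url, score) pairs.
    """
    domains = Counter(
        urlsplit(url).netloc.lower()
        for url, score in scored_links
        if score >= threshold
    )
    total = sum(domains.values())
    if total == 0:
        return {}
    return {domain: count / total for domain, count in domains.items()}


# Made-up scored links, for illustration only.
links = [
    ("https://instagram.com/p/1", 5),
    ("https://instagram.com/p/2", 1),
    ("https://techcrunch.com/a", 9),
    ("https://youtube.com/watch?v=x", 1),
]
unfiltered = source_composition(links)
filtered = source_composition(links, threshold=5)
```

Comparing such distributions across events is one possible way to express the "fingerprints" mentioned above.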
47. Conclusions
- Development of a system that automatically generates (kind of) storylines out of social media aggregated around hashtags, following links being shared
- Extracting, ranking, filtering links
- Select the content which is the most shared
- Generate the results while the event is happening (with a 15-30 minute delay)
- Implementation of a front-end interface for exploring the stories generated:
  - Fills up a template, following a chronological timeline
- Different genres of events generate different outputs
  - Musical events are illustrated with photos and videos, while breaking news events are described with articles from newspapers and blogs
48. Future work (Conclusions)
- More experiments are needed to confirm the pattern we have described with those three events
- Can we automatically classify an event based on its signature? Can we train a classifier where the features would be tweets, velocity and the source domains composition?
- Implementation of an advanced scoring function taking into account additional features of the user (lists in which the user appears, etc.) and of the tweets (number of favorites, number of retweets, etc.)