Combining Storytelling
and Web Archives
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Group
ws-dl.cs.odu.edu
@WebSciDL
This work is supported in part by IMLS LG-71-15-0077
Old Dominion University ECE Department Colloquium
2015-11-13
IMLS-Funded Research
1. Use small “stories” to summarize much larger
collections of archived web pages
– big → small
2. Generate web archive collections by mining
user-generated stories for seed URIs
– small → big
http://ws-dl.blogspot.com/2015/10/2015-10-07-imls-and-nsf-fund-web.html
The Web is important for
our cultural heritage.
How do we preserve the Web?
3
What are web archives?
[Figure: archived snapshots from 2002, 2005, 2009, and 2011]
Memento: an archived snapshot as
it appeared in the past
[Figure: archived snapshots from 2002, 2005, 2009, and 2011]
What are some web archives?
• The entire web
• On-demand free services
• National libraries
• Theme-based collections
6
For more information: https://en.wikipedia.org/wiki/Web_archiving
http://mementoweb.org/
Archive-It, a subscription-based service,
hosts curated web collections
• > 3,000 collections
• ~340 institutions
• > 10B archived pages
[Figure: annotated collection page. Callouts:]
• Collection title
• Collection categorization based on the curator
• Seed URI
• Metadata about the collection
• Text search box
• The group that the resource belongs to
• List of the seed URIs
• Timespan of the resource and the number of times it has been captured
What is the problem with the archived
collections?
9
There is more than one collection about
“Egyptian Revolution”
10
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
Collection understanding and
collection summarization are not
supported currently
Not easy to answer “what’s in that collection?”
11
Our Early Attempts at Collection
Understanding Tried To Include
Everything…
15
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
Timelines
Treemaps
Treemaps
(1000s of Seeds X 1000s of Mementos) +
Dimension of Time ==
Conventional Viz Methods Not Applicable
19
Idea:
Storytelling
20
Stories in Literature
Story elements: setting, characters, sequence, exposition, conflict,
climax, resolution
21
Once upon a time
http://www.learner.org/interactives/story/
Stories in social media
22
“It's hard to define a story, but I know it when I see it” (Alexander, 2008)
A looser notion than the conventional definition in literature
“storytelling” is becoming a popular
technique in social media
23
There is interest in
choosing k items from N
where k << N
24
I have 1000s of posts, images, friends’
posts, etc. in Facebook
25
Facebook reduces this to a 1-minute
video (Facebook Look Back)
• https://www.youtube.com/watch?v=AWoBAjovHT4
26
Interest from Twitter in storytelling
27
1 Second Everyday
28
What are the limitations of
storytelling services?
29
30
The Egyptian Revolution on Storify
31
Bookmarking, not preserving!
Social media can go off-topic
or disappear
32
A year after publication, about 11% of content
shared on social media will be gone (SalahEldeen, 2012)*
* SalahEldeen, H.M., Nelson, M.L.: Losing my revolution: How many resources shared on social media have been lost? TPDL’12.
Despite these limitations, how do we
combine storytelling & archives?
33
We sample k mementos from N pages of
the collection to create a summary story
[Figure: summary stories (S1, S2, S3, ...) sampled from Collection X, Collection Y, and Collection Z]
Use an interface people already know how to use
to summarize collections
Archived collections
Storytelling services
Archived enriched stories
Research Question
Can we (semi-)automatically identify,
evaluate, and select candidate archived
web pages from archived collections for
generating stories that summarize the
collection?
36
The archived collection has
two dimensions
37
Time
URI
The Two Dimensions
Applied to Stories
Fixed Page – Fixed Time: differences in GeoIP, mobile, etc. (Kelly, 2013)
Fixed Page – Sliding Time: evolution of a single page (or domain) through time
Sliding Page – Fixed Time: different perspectives on a point in time
Sliding Page – Sliding Time: broadest possible coverage of a collection
[Figure: 2×2 grid with axes Time (same / different) and URI (same / different)]
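The four story types can be read as two optional filters over a collection's mementos: one on the page and one on the capture time. The sketch below is only an illustration of that reading, not the selection method used in this work; the `Memento` tuple, the `select_mementos` function, and the one-day `tolerance` are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Memento(NamedTuple):
    uri: str             # original URI of the archived page
    captured: datetime   # when the archive captured it


def select_mementos(mementos, page=None, time=None, tolerance=timedelta(days=1)):
    """Filter mementos along the two dimensions.

    page given, time given  -> fixed page, fixed time
    page given, time None   -> fixed page, sliding time
    page None,  time given  -> sliding page, fixed time
    page None,  time None   -> sliding page, sliding time
    """
    chosen = [
        m for m in mementos
        if (page is None or m.uri == page)
        and (time is None or abs(m.captured - time) <= tolerance)
    ]
    # Present the story in chronological order.
    return sorted(chosen, key=lambda m: m.captured)
```

For example, `select_mementos(collection, time=datetime(2011, 2, 11))` would gather different pages captured around a single day (sliding page, fixed time). Choosing the best k mementos from the result is a separate step that this sketch does not attempt.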
Fixed Page, Fixed Time
[Figure: timeline t1 to t6 showing four captures of the same page R1]
Fixed Page, Fixed Time
40
A desktop Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
An Android Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Richard Schneider and Frank McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Fixed Page, Sliding Time
[Figure: timeline t1 to t6 showing six captures of the same page R]
[Figure: mementos captured on Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, and Feb 11]
Sliding Page, Fixed Time
[Figure: timeline t1 to t6 showing pages R1, R2, R3, and R4]
Feb. 11, 2011
Mubarak resigns
44
Sliding Page, Sliding Time
[Figure: timeline t1 to t6 showing captures of pages R1, R2, R1, R3, R4, and R2]
[Figure: mementos captured on Jan 25, Jan 27, Jan 31, Feb 2, Feb 4, Feb 7, Feb 10, Feb 11, and Feb 11]
Handcrafted Stories in Storify
47
Sliding Page, Sliding Time; Sliding Page, Fixed Time
http://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
http://storify.com/yasmina_anwar/the-egyptian-revolution-story-from-archive-it-hete
Other Interfaces Possible…
48
Replaying Story of Egyptian Revolution
http://www.dipity.com/YasminAlnoamany/Jan-25-Egyptian-Revolution/
Yasmin’s Research Status
1. Establishing a baseline
a. For archived collections (Visualizing Digital Collections at Archive-It, JCDL 2012)
b. For human stories (Characteristics of Social Media Stories, TPDL 2015)
2. Calculate the aboutness of the collections
3. Find off-topic URIs (Detecting Off-Topic Pages in Web Archives, TPDL 2015)
4. Identify the best k mementos that summarize the
collection for desired story type
5. Evaluate & visualize the generated stories
49
What about using stories to seed web
archive collections?
50
Creating collections is hard.
It requires a combination of awareness,
domain knowledge, crowdsourcing, etc.
51
Ukrainian Themed Collections, Prior to February 2014
Ukrainian Conflict Collection Started February 2014
Crowdsourcing Collection of Seed URIs
What about social media, especially
curated social media?
55
https://storify.com/TheInkBee/riots-persist-in-kyiv
(late January, 2014)
https://storify.com/jesseoriordan/ukraine-fight-for-their-rights
(late January, 2014)
Compelling narratives are created by
knowledgeable (and biased) users arranging news
stories, images, social media, commentary, etc.
Intuitively, this is a “good story”.
How do we quantify this?
Characteristics of
Social Media Stories
Yasmin AlNoamany, Michele C. Weigle, and
Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Group
http://ws-dl.cs.odu.edu/, @WebSciDL
59
TPDL 2015
16.09.2015, Poznań, Poland
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-stories.pdf
“Storytelling” is becoming a
popular technique in social media
60
Story creation interface on Storify
61
Research Question
62
We’d like to automatically generate stories,
but we don’t yet know what constitutes a
“good” story.
Q: What are the structural characteristics
of popular (i.e., receiving the most views)
human-generated stories?
What is the length of a story
(the number of resources per story)?
• This story has
31 resources
63
What are the types of resources
that compose a story?
• This story has
– 19 quotes
– 8 images
– 4 videos
64
What are the most frequently
used domains?
• 90% twitter.com
• 7% instagram.com
• 3% facebook.com
65
The popularity of the most used
domains in the stories
66
mr_sedivy.tripod.com is ranked
3,084,064 by Alexa global rank
Twitter.com is ranked 9 by Alexa
global rank
What is the timespan (editing
time) of the stories?
• This story covers ~4
hours
(from 3:12 PM – 7:27 PM)
• Is there a relation
between the
timespan and the
features of the
story?
67
http://storify.com/yahoonews/boston-marathon-bombing
What is the “HTTP 404” rate
of the resources?
68
Can we find these missing
resources in the archives?
69
http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705523
What differentiates a
popular story?
70
[Figure: two stories compared, one with 19,795 views and one with 64 views]
Data set construction
1. Queried the Storify Search API with the top 1000
English keywords as seen by Yahoo
2. Downloaded 145,682 stories in JSON in Feb.
2015
3. Considered stories authored in 2014 or earlier,
resulting in 37,486 stories
4. Eliminated stories with zero or one
element or with zero views, resulting in:
a. 14,568 unique stories
b. 10,199 unique users
c. 1,251,160 web and text elements
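Steps 3 and 4 above amount to a simple filter over the downloaded JSON. The following is a minimal sketch under stated assumptions: the field names `id`, `created`, `views`, and `elements` are hypothetical stand-ins, not the actual Storify JSON schema, and `created` is assumed to be an ISO 8601 timestamp.

```python
from datetime import datetime


def filter_stories(stories, cutoff_year=2014):
    """Keep unique stories authored in or before cutoff_year
    that have more than one element and at least one view."""
    seen_ids = set()
    kept = []
    for story in stories:
        # Hypothetical field: ISO 8601 creation timestamp.
        created = datetime.fromisoformat(story["created"])
        if created.year > cutoff_year:
            continue  # step 3: authored in 2014 or earlier
        if len(story["elements"]) <= 1 or story["views"] == 0:
            continue  # step 4: drop tiny or never-viewed stories
        if story["id"] in seen_ids:
            continue  # keep each story once
        seen_ids.add(story["id"])
        kept.append(story)
    return kept
```

Counting distinct authors and summing `len(story["elements"])` over the result would give the analogues of items b and c.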
71
Characteristics of human-generated stories
Features          Views       Web elements  Text elements  Subscribers  Timespan (hours)
25th percentile   14          10            0              0            0.18
50th percentile   51          23            1              4            3
75th percentile   268         69            9              21           120
90th percentile   1,949       210           19             85           1,747
Maximum           11,284,896  2,216         559            1,726,143    36,111
44% of the stories have no text elements at all
What is the Average Timespan
for Stories?
Interval       Percentage  Median web elements  Median text elements  Median views (normalized by time)
0-60 seconds   14.0%       15                   0                     23
1-60 minutes   26.7%       19                   0                     53
1-24 hours     23.4%       25                   5                     110
1-7 days       13.5%       26                   7                     78
1-4 weeks      8.4%        26                   9                     80
1-12 months    10.9%       38                   2                     129
1-4 years      3.1%        56                   15                    156
There is a nearly linear relation between the timespan of a story and its
number of elements
The story with the longest timespan in our data set covers more than 4 years
and has more than 13,000 views. It had only 33 web elements and 51 total
elements.
The distribution of the elements
(1,251,160)
Resource  Proportion
links     70.8%
images    18.4%
text      8.1%
video     2.0%
quotes    0.7%
Text elements are relatively rare, meaning that few users choose to annotate the
web elements in their story.
What are the most frequently
used domains?
78
The most frequent domains
in the stories
• Domain Canonicalization
– www.cnn.com → cnn.com
• Dereference all the shortened URIs
– For example: t.co, bit.ly
• 25,947 unique domains
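The two normalization steps above can be sketched as follows. This is an illustration under assumptions, not the code used for the study: `canonical_domain`, `domain_counts`, and the `SHORTENERS` set are hypothetical names, and the `resolve` callback stands in for whatever HTTP client follows redirects (it is injected so the sketch itself makes no network requests).

```python
from collections import Counter
from urllib.parse import urlsplit

# Example shortener hosts named on the slide; a real list would be longer.
SHORTENERS = {"t.co", "bit.ly"}


def canonical_domain(uri):
    """Lowercase the host and strip a leading 'www.'."""
    if "://" not in uri:
        uri = "http://" + uri  # allow bare hosts such as 'www.cnn.com'
    host = urlsplit(uri).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_counts(uris, resolve=lambda uri: uri):
    """Count canonical domains, dereferencing shortened URIs first.

    resolve: hypothetical callback returning the final URI after redirects.
    """
    counts = Counter()
    for uri in uris:
        domain = canonical_domain(uri)
        if domain in SHORTENERS:
            domain = canonical_domain(resolve(uri))
        counts[domain] += 1
    return counts
```

Only the `www.` prefix is stripped, which keeps hosts such as plus.google.com distinct from their parent domain.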
79
The 20 most frequent domains
in the stories represent ~92%
Host              Frequency  Percentage  Alexa global rank (as of 2015-03)  Category
twitter.com       943,859    82.05%      8                                  Social media
instagram.com     45,188     3.93%       25                                 Photos
youtube.com       22,076     1.92%       3                                  Videos
facebook.com      13,930     1.21%       2                                  Social media
flickr.com        7,317      0.64%       126                                Photos
patch.com         5,783      0.50%       2,096                              News
plus.google.com   3,413      0.30%       1                                  Social media
tumblr.com        3,066      0.27%       31                                 Blogs
blogspot.com      1,857      0.16%       18                                 Blogs
imgur.com         1,756      0.15%       36                                 Photos
coolpile.com      1,706      0.15%       149,281                            Entertainment
wordpress.com     1,615      0.14%       33                                 Blogs
giphy.com         1,055      0.09%       1,604                              Photos
bbc.com           966        0.08%       156                                News
lastampa.it       927        0.08%       2,440                              News
pinterest.com     892        0.08%       32                                 Photos
softandapps.info  861        0.07%       160,980                            News
photobucket.com   768        0.07%       341                                Photos
nytimes.com       744        0.06%       97                                 News
soundcloud.com    736        0.06%       167                                Audio
The embedded resources
of the tweets
84
https://pbs.twimg.com/media/BfnnY-0CcAA56oy.png
Domain         Percentage  Category
twimg.com      46.17%      Images
instagram.com  4.28%       Images
youtube.com    2.82%       Videos
linkis.com     2.04%       Media sharing
facebook.com   1.40%       Social Media
wordpress.com  0.61%       Blogs
vine.co        0.53%       Videos
blogspot.com   0.52%       Blogs
storify.com    0.49%       Social Network
bbc.com        0.44%       News
• We sampled 5% of the tweets (47,512 tweets),
which contained 15,217 embedded resources
• The unique URIs: 14,616
• 46% are photos from
twitter.com
• 0.49% of the embedded resources
point to other stories
in Storify
Is there a correlation between
Alexa global rank and rank
within Storify?
88
Most of the time, the highly ranked
resources correlate with the most used
resources in human-generated stories
n            10      15      25      50      100
Kendall's τ  0.1555  0.4476  0.3372  0.3194  0.2485
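For readers unfamiliar with the statistic, here is a minimal sketch of Kendall's τ between two rankings of the same n domains, for example rank within Storify versus Alexa global rank. It implements the simple tau-a variant (no tie correction), which is an assumption; the talk does not say which variant or tool produced the values above. The function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Usage: Storify rank vs. Alexa rank for the five most frequent hosts.
storify_rank = [1, 2, 3, 4, 5]
alexa_rank = [8, 25, 3, 2, 126]
tau = kendall_tau(storify_rank, alexa_rank)
```

Values near 1 mean the two rankings agree, values near 0 mean little association, and values near -1 mean they are reversed.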
This is in contrast to the usage
of resources in Pinterest [1]
[1] Zhong, C., Shah, S., Sundaravadivelan, K., Sastry, N.: Sharing the Loves: Understanding the How
and Why of Online Content Curation. ICWSM 2013
Image source: http://workmonk.in/testblog/wp-content/uploads/2014/01/pinterest.jpg
How many of the resources in
these stories disappear
every year?
Can we find these missing
resources in the archives?
91
The existence of the resources
• We checked the live web and public web archives
for 265,181 URIs
– 202,452 URIs from story web elements
– 47,512 randomly sampled tweet URIs
– 15,217 URIs of embedded resources in those tweets
• The unique URIs are 253,978
• We examined the results of the five most
frequent domains in the stories
– twitter.com, instagram.com, youtube.com,
facebook.com, flickr.com
92
The decay rate of the resources
in the stories
93
40.8% of the stories
contain missing
resources with an
average value of
10.3% per story
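A per-story decay rate of this kind can be computed from the HTTP status of each resource. The sketch below rests on two assumptions that the slides do not spell out: a resource counts as missing when its request fails or returns a 4xx/5xx status, and the average is taken over all stories rather than only the affected ones. The names `decay_rates` and `summarize` are hypothetical.

```python
def decay_rates(status_by_story):
    """Map story id -> fraction of its resources that are missing.

    status_by_story: dict of story id -> list of HTTP status codes,
    with None standing for a request that got no response.
    """
    rates = {}
    for story_id, statuses in status_by_story.items():
        if not statuses:
            continue  # nothing to measure for an empty story
        missing = sum(1 for s in statuses if s is None or s >= 400)
        rates[story_id] = missing / len(statuses)
    return rates


def summarize(rates):
    """Return (share of stories with any missing resource, mean decay rate)."""
    affected = sum(1 for r in rates.values() if r > 0)
    share_affected = affected / len(rates)
    mean_rate = sum(rates.values()) / len(rates)
    return share_affected, mean_rate
```

Soft 404s (pages that return 200 but no longer show the original content) would need a separate check and are not covered here.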
The decay rate of the resources
in the stories
94
There is a nearly
linear decay rate of
resources through
time. This matches
the finding by
SalahEldeen [2].
[2] SalahEldeen, H.M., Nelson, M.L.: Losing my revolution: How many resources shared on social media have been lost? TPDL’12.
Existence on the live web and
in the archives
95
Of all the unique URIs, 11.8% are missing on the live web.
• Social media is not as well archived as the regular web
• Facebook uses robots.txt to block web archiving by the Internet Archive
What differentiates a
popular story?
97
Specifying the popular and the
unpopular stories
• The stories are divided based on the number
of their views, then normalized by the amount
of time they were available on the web.
• Popular stories
– the top 25% of stories that have the most views
– 3,642 stories
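The split described above can be sketched in a few lines. The field names `views` and `days_online` are hypothetical, and `days_online` is assumed to be positive. The slide defines only the popular group, so the sketch returns everything else as `rest` without claiming that this is how the unpopular group was chosen.

```python
def split_by_popularity(stories, top_fraction=0.25):
    """Rank stories by views per day online and cut off the top fraction."""
    ranked = sorted(
        stories,
        key=lambda s: s["views"] / s["days_online"],  # time-normalized views
        reverse=True,
    )
    cutoff = int(len(ranked) * top_fraction)
    return ranked[:cutoff], ranked[cutoff:]
```

With 14,568 stories and a fraction of 0.25, the cutoff is 3,642, matching the count on the slide.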
98
The distributions for the features
of the stories
99
• Based on the Kruskal-Wallis test at the p ≤ 0.05 significance level, the popular and the
unpopular stories differ in most of the features
• Popular stories tend to have:
– more web elements (medians of 28 vs. 21)
– a longer timespan (5 hours vs. 2 hours) than the unpopular stories
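As a reminder of what the test computes, here is a minimal pure-Python sketch of the Kruskal-Wallis H statistic with the standard tie correction. It is an illustration only, not the analysis code behind the slide, and the function name `kruskal_wallis_h` is hypothetical. For two groups, H is compared against the chi-square distribution with one degree of freedom, whose critical value at the 0.05 level is about 3.841.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic for two or more non-empty samples."""
    # Pool all values, remembering which group each came from.
    pooled = sorted((value, g) for g, grp in enumerate(groups) for value in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h = 12 / (n * (n + 1)) * sum(
        r * r / len(grp) for r, grp in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return h / correction
```

Calling it with the per-story web-element counts of the two groups, for instance, yields the H value for that feature.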
Do Popular Stories have a Lower
Decay Rate?
100
The 75th percentile of decay rate per popular story is 10% of the resources,
while it is 15% in the unpopular stories
The distributions for the elements
of the stories
101
Conclusions
• We analyzed 14,568 stories from Storify comprising 1,251,160
elements.
• Popular stories have a min/median/max value of 2/28/1950
elements, with the unpopular stories having 2/21/2216
• Popular stories have a median of 12 multimedia resources (the
unpopular stories have a median of 7)
• Of the popular stories, 38% receive continuing edits (as
opposed to 35%)
• In popular stories, only 11% of web elements are missing on the
live web (as opposed to 13%)
• The percentage of missing resources is proportional to
the age of the stories.
102

Combining Storytelling and Web Archives

  • 1.
    Combining Storytelling and WebArchives Yasmin AlNoamany Michele C. Weigle Michael L. Nelson Old Dominion University Web Science and Digital Libraries Group ws-dl.cs.odu.edu @WebSciDL This work is supported in part by IMLS LG-71-15-0077 Old Dominion University ECE Department Colloquium 2015-11-13
  • 2.
    IMLS-Funded Research 1. Usesmall “stories” to summarize much larger collections of archived web pages – big  small 2. Generate web archive collections by mining user-generated stories for seed URIs • small  big http://ws-dl.blogspot.com/2015/10/2015-10-07-imls-and-nsf-fund-web.html
  • 3.
    The Web isimportant for our cultural heritage. How to preserve the Web??!!! 3
  • 4.
    What are webarchives? 2002 2005 2009 2011 4
  • 5.
    Memento: an archivedsnapshot as it appeared in the past 2002 2005 2009 2011 5
  • 6.
    What are someweb archives? • The entire web • On-demand free services • National libraries • Theme-based collections 6 For more information: https://en.wikipedia.org/wiki/Web_archiving http://mementoweb.org/
  • 7.
    Archive-It, a subscription-basedservice, hosts curated web collections 7 > 3,000 collections ~340 institutions > 10B archived pages
  • 8.
    8 Collection title Collection categorization based on the curator SeedURI Metadata about the collection Text search box The group that the resource belongs to List of the seed URIs Timespan of the resource and the number of times it has been captured
  • 9.
    What is theproblem with the archived collections? 9
  • 10.
    There is morethan one collection about “Egyptian Revolution” 10 • “2010-2011 Arab Spring” https://archive-it.org/collections/3101 • “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349 • “Egypt Revolution and Politics” https://archive-it.org/collections/2358
  • 11.
    Collection understanding and collectionsummarization are not supported currently Not easy to answer “what’s in that collection?” 11
  • 15.
    Our Early Attemptsat Collection Understanding Tried To Include Everything… 15 http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  • 16.
  • 17.
  • 18.
  • 19.
    (1000s of SeedsX 1000s of Mementos) + Dimension of Time == Conventional Viz Methods Not Applicable 19
  • 20.
  • 21.
    Stories in Literature Storyelements: setting, characters, sequence, exposition, conflict, climax, resolution 21 Once upon a time http://www.learner.org/interactives/story/
  • 22.
    Stories in socialmedia 22 “It's hard to define a story, but I know it when I see it” (Alexander, 2008) A loose context of the conventional definition in Literature
  • 23.
    “storytelling” is becominga popular technique in social media 23
  • 24.
    There is interestin choosing k items from N where k << N 24
  • 25.
    I have 1000sof posts, images, friends’ posts, etc. in Facebook 25
  • 26.
    Facebook reduces thisto 1 minute video (Facebook Look Back) • https://www.youtube.com/watch?v=AWoBAjo vHT4 26
  • 27.
    Interest from Twitterin storytelling 27
  • 28.
  • 29.
    What are thelimitations of storytelling services? 29
  • 30.
  • 31.
  • 32.
    Social media cango off-topic or disappear 32 A year after publication, about 11% of content shared on social media will be gone (SalahEldeen, 2012)* * SalahEldeen, H.M., Nelson, M.L.: Losing my revolution: How many resources shared on social media have been lost? TPDL’12.
  • 33.
    Despite these limitations,how do we combine storytelling & archives? 33
  • 34.
    We sample kmementos from N pages of the collection to create a summary story S 1 S 2 S 3 S 4 S 2 S 1 S 3 Collection Y S 3 S 2 S 1 Collection Z Collection X
  • 35.
    Use interface peoplealready know how to use to summarize collections 35 Archived collectionsStorytelling services Archived enriched stories
  • 36.
    Research Question Can we(semi-)automatically identify, evaluate, and select candidate archived web pages from archived collections for generating stories that summarize the collection? 36
  • 37.
    The archived collectionhas two dimensions 37 Time URI
  • 38.
    The Two Dimensions Appliedto Stories Fixed Page – Fixed Time: differences in GeoIP, mobile, etc. (Kelly, 2013) Fixed Page – Sliding Time: evolution of a single page (or domain) through time Sliding Page – Fixed Time: different perspectives on a point in time Sliding Page – Sliding Time: broadest possible coverage of a collection 38 same Time different URI same different
  • 39.
    Fixed Pages, FixedTime R1 R1 R1 R1 t1 t3t2 t5t4 t6 39
  • 40.
    Fixed Page, FixedTime 40 A desktop Chrome user-agent http://www.cnn.com/2014/02/24/world/africa/egypt- politics/index.html?hpt=wo_c2 Andriod Chrome user-agent http://www.cnn.com/2014/02/24/world/africa/egypt- politics/index.html?hpt=wo_c2 Richard Schneider and Frank McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
  • 41.
    Fixed Page, SlidingTime R R R R R R t1 t3t2 t5t4 t6 41
  • 42.
    Feb 1 Feb1 Feb 2 Feb 4 Feb 5 Feb 7 Feb 9 Feb 11 Feb 11
  • 43.
    Sliding Page, FixedTime R1 R2 R3 R4 t1 t3t2 t5t4 t6 43
  • 44.
  • 45.
    Sliding Page, SlidingTime R1 R2 R1 R3 R4 R2 t1 t3t2 t5t4 t6 45
  • 46.
    Jan 27 Jan31 Feb 7Feb 4 Feb 11 Feb 11 Feb 2 Jan 25 Feb 10
  • 47.
    Handcrafted Stories inStorify 47 Sliding Page, Sliding TimeSliding Page, Fixed Time http://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it- collection http://storify.com/yasmina_anwar/the-egyptian-revolution-story- from-archive-it-hete
  • 48.
    Other Interfaces Possible… 48 ReplayingStory of Egyptian Revolution http://www.dipity.com/YasminAlnoamany/Jan-25-Egyptian-Revolution/
  • 49.
    Yasmin’s Research Status 1.Establishing a baseline a. For archived collections (Visualizing Digital Collections at Archive-It, JCDL 2012) b. For human stories (Characteristics of Social Media Stories, TPDL 2015) 2. Calculate the aboutness of the collections 3. Find off-topic URIs (Detecting Off-Topic Pages in Web Archives, TPDL 2015) 4. Identify the best k mementos that summarize the collection for desired story type 5. Evaluate & visualize the generated stories 49
  • 50.
    What about usingstories to seed web archive collections? 50
  • 51.
    Creating collections ishard. Requires combination of awareness, domain knowledge, crowdsourcing, etc. 51
  • 52.
    Ukrainian Themed Collections,Prior to February 2014
  • 53.
    Ukrainian Conflict CollectionStarted February 2014
  • 54.
  • 55.
    What about socialmedia, especially curated social media? 55
  • 56.
  • 57.
  • 58.
    https://storify.com/jesseoriordan/ukraine-fight-for-their-rights (late January, 2014) Compellingnarratives are created by knowledgeable (and biased) users arranging news stories, images, social media, commentary, etc. Intuitively, this is a “good story”. How do we quantify this?
  • 59.
    Characteristics of Social MediaStories Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson Old Dominion University Web Science and Digital Libraries Group http://ws-dl.cs.odu.edu/, @WebSciDL 59 TPDL 2015 16.09.2015, Poznań, Poland http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-stories.pdf
  • 60.
    “Storytelling” is becominga popular technique in social media 60
  • 61.
  • 62.
    Research Question 62 We’d liketo automatically generate stories, but we don’t yet know what constitutes a “good” story. Q: What are the structural characteristics of popular (i.e., receiving the most views) human-generated stories?
  • 63.
    What is thelength of a story (the number of resources per story)? • This story has 31 resources 63 1 3 2
  • 64.
    What are thetypes of resources that compose a story? • This story has – 19 quotes – 8 images – 4 videos 64 Quotes Video
  • 65.
    What are themost frequently used domains? • 90% twitter.com • 7% instagram.com • 3% facebook.com 65 Twitter.com Twitter.com Twitter.com
  • 66.
    The popularity ofthe most used domains in the stories 66 mr_sedivy.tripod.com is ranked 3,084,064 by Alexa global rank Twitter.com is ranked 9 by Alexa global rank
  • 67.
    What is thetimespan (editing time) of the stories? • This story covers ~4 hours (from 3:12 PM – 7:27 PM) • Is there a relation between the timespan and the features of the story? 67 http://storify.com/yahoonews/boston-marathon-bombing
  • 68.
    What is the“HTTP 404” rate of the resources? 68 68
  • 69.
    Can we findthese missing resources in the archives? 69 http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705523
  • 70.
    What differentiates a popularstory? 70 19,795 views 64 views
  • 71.
    Data set construction 1.Queried Storify Search API with the top 1000 English keywords as seen by Yahoo 2. Downloaded 145,682 stories in JSON on Feb. 2015 3. Considered stories authored in 2014 or earlier, resulting in 37,486 stories 4. Eliminated stories with only zero or one elements or zero views, resulting in: a. 14,568 unique stories b. 10,199 unique users c. 1,251,160 web and text elements 71
  • 72.
    Characteristics of human- generatedstories Features Views Web elements Text elements Subscribers Timespan in hours 25th percentile 14 10 0 0 0.18 50th percentile 51 23 1 4 3 75th percentile 268 69 9 21 120 90th percentile 1949 210 19 85 1747 Maximum 11,284,896 2,216 559 1,726,143 36,111 72
  • 73.
    Characteristics of human- generatedstories Features Views Web elements Text elements Subscribers Timespan in hours 25th percentile 14 10 0 0 0.18 50th percentile 51 23 1 4 3 75th percentile 268 69 9 21 120 90th percentile 1949 210 19 85 1747 Maximum 11,284,896 2,216 559 1,726,143 36,111 73 44 % of the stories have no text elements at all
  • 74.
    What is theAverage Timespan for Stories? 74
  • 75.
    What is theAverage Timespan for Stories? Intervals Percentage Median web elements Median text elements Median views (normalized by time) 0-60 seconds 14.0% 15 0 23 1-60 minutes 26.7% 19 0 53 1-24 hours 23.4% 25 5 110 1-7 days 13.5% 26 7 78 1-4 weeks 8.4% 26 9 80 1-12 months 10.9% 38 2 129 1-4 years 3.1% 56 15 156 75 There is nearly linear relation between the time length of the story and the number of elements
  • 76.
    What is theAverage Timespan for Stories? 76 The story with the longest timespan in our data set covers more than 4 years and with more than 13,000 views. It had only 33 web elements and 51 total elements. Intervals Percentage Median web elements Median text elements Median views (normalized by time) 0-60 seconds 14.0% 15 0 23 1-60 minutes 26.7% 19 0 53 1-24 hours 23.4% 25 5 110 1-7 days 13.5% 26 7 78 1-4 weeks 8.4% 26 9 80 1-12 months 10.9% 38 2 129 1-4 years 3.1% 56 15 156
  • 77.
    The distribution ofthe elements (1,251,160) Resource Proportion links 70.8% images 18.4% text 8.1% video 2.0% quotes 0.7% 77 Text elements are relatively rare, meaning that few users choose to annotate the web elements in their story.
  • 78.
    What are themost frequently used domains? 78
  • 79.
    The most frequentdomains in the stories • Domain Canonicalization – www.cnn.com  cnn.com • Dereference all the shortened URIs – For example: t.co, bit.ly • 25,947 unique domains 79
  • 80.
    The most frequentdomains in the stories represent ~92% Host Frequency Percentage Alexa Global Rank as of 2015-03 Category twitter.com 943,859 82.05% 8 Social media instagram.com 45,188 3.93% 25 Photos youtube.com 22,076 1.92% 3 Videos facebook.com 13,930 1.21% 2 Social media flickr.com 7,317 0.64% 126 Photos patch.com 5,783 0.50% 2,096 News plus.google.com 3,413 0.30% 1 Social media tumblr.com 3,066 0.27% 31 Blogs blogspot.com 1,857 0.16% 18 Blogs imgur.com 1,756 0.15% 36 Photos coolpile.com 1,706 0.15% 149,281 Entertainment wordpress.com 1,615 0.14% 33 Blogs giphy.com 1,055 0.09% 1,604 Photos bbc.com 966 0.08% 156 News lastampa.it 927 0.08% 2,440 News pinterest.com 892 0.08% 32 Photos softandapps.info 861 0.07% 160,980 News photobucket.com 768 0.07% 341 Photos nytimes.com 744 0.06% 97 News soundcloud.com 736 0.06% 167 Audio 80
  • 84.
    The embedded resources of the tweets
    https://pbs.twimg.com/media/BfnnY-0CcAA56oy.png
  • 85.
    The embedded resources of the tweets

    | Domain | Percentage | Category |
    | --- | --- | --- |
    | twimg.com | 46.17% | Images |
    | instagram.com | 4.28% | Images |
    | youtube.com | 2.82% | Videos |
    | linkis.com | 2.04% | Media sharing |
    | facebook.com | 1.40% | Social Media |
    | wordpress.com | 0.61% | Blogs |
    | vine.co | 0.53% | Videos |
    | blogspot.com | 0.52% | Blogs |
    | storify.com | 0.49% | Social Network |
    | bbc.com | 0.44% | News |

    • We sampled 5% of the tweets (47,512 tweets), which contain 15,217 embedded resource URIs
    • The unique URIs: 14,616
  • 86.
    The embedded resources of the tweets
    • 46% are photos from twitter.com
  • 87.
    The embedded resources of the tweets
    • 0.49% of the stories point to other stories in Storify
  • 88.
    Is there a correlation between Alexa global rank and rank within Storify?
  • 89.
    Most of the time, the highly ranked resources correlate with the most-used resources in human-generated stories

    | n | 10 | 15 | 25 | 50 | 100 |
    | --- | --- | --- | --- | --- | --- |
    | Kendall's τ | 0.1555 | 0.4476 | 0.3372 | 0.3194 | 0.2485 |
  • 90.
    Most of the time, the highly ranked resources correlate with the most-used resources in human-generated stories
    This is in contrast to the usage of resources in Pinterest [1]
    [1] Zhong, C., Shah, S., Sundaravadivelan, K., Sastry, N.: Sharing the Loves: Understanding the How and Why of Online Content Curation. ICWSM 2013
    Image source: http://workmonk.in/testblog/wp-content/uploads/2014/01/pinterest.jpg
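
The rank correlation reported on these slides can be illustrated with a minimal sketch. It computes Kendall's tau (the simple tau-a form, which assumes no tied ranks) between a domain's rank within the stories and its Alexa global rank, using the ten most frequent domains from the domain table earlier in the deck. The function name `kendall_tau` and the choice of tau-a are assumptions for illustration; the slides do not say which variant or tool was used.

```python
from itertools import combinations

# (domain, Alexa global rank), listed in order of decreasing frequency in the
# stories. Values are copied from the ten most frequent domains in the deck.
TOP_DOMAINS = [
    ("twitter.com", 8),
    ("instagram.com", 25),
    ("youtube.com", 3),
    ("facebook.com", 2),
    ("flickr.com", 126),
    ("patch.com", 2096),
    ("plus.google.com", 1),
    ("tumblr.com", 31),
    ("blogspot.com", 18),
    ("imgur.com", 36),
]


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Illustrative helper; assumes equal-length inputs without tied values.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


storify_ranks = list(range(1, len(TOP_DOMAINS) + 1))  # 1 = most frequent
alexa_ranks = [rank for _, rank in TOP_DOMAINS]
tau_top10 = kendall_tau(storify_ranks, alexa_ranks)
print(f"Kendall's tau for the top 10 domains: {tau_top10:.4f}")
```

For these ten domains there are 26 concordant and 19 discordant pairs, so the sketch yields 7/45 (about 0.156), which is in line with the n = 10 entry in the table.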
  • 91.
    How many of the resources in these stories disappear every year? Can we find these missing resources in the archives?
  • 92.
    The existence of the resources
    • We checked the live web and public web archives for 265,181 URIs
      – 202,452 URIs from story web elements
      – 47,512 randomly sampled tweet URIs
      – 15,217 URIs of embedded resources in those tweets
    • The unique URIs are 253,978
    • We examined the results of the five most frequent domains in the stories
      – twitter.com, instagram.com, youtube.com, facebook.com, flickr.com
  • 93.
    The decay rate of the resources in the stories
    40.8% of the stories contain missing resources, with an average of 10.3% of resources missing per story
  • 94.
    The decay rate of the resources in the stories
    There is a nearly linear decay rate of resources through time. This matches the finding by SalahEldeen [2].
    [2] SalahEldeen, H.M., Nelson, M.L.: Losing my revolution: How many resources shared on social media have been lost? TPDL’12.
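
The two decay figures on these slides (the share of stories with missing resources, and the average share of missing resources per story) can be sketched as follows. Everything in the block is hypothetical toy data, and two points are assumptions: a resource counts as missing when it returns HTTP 404, and the per-story average is taken over all stories. The slides do not state whether the 10.3% average covers all stories or only the affected ones.

```python
from statistics import mean

# Hypothetical toy input: story id -> resource URIs, and URI -> HTTP status
# code observed on the live web. None of these values come from the study.
stories = {
    "story-a": ["http://example.com/1", "http://example.com/2"],
    "story-b": ["http://example.com/3", "http://example.com/4",
                "http://example.com/5", "http://example.com/6"],
    "story-c": ["http://example.com/7"],
    "story-d": ["http://example.com/8", "http://example.com/9"],
}
status = {f"http://example.com/{i}": 200 for i in range(1, 10)}
status["http://example.com/2"] = 404
status["http://example.com/5"] = 404


def is_missing(code):
    """Assumption: a resource is missing when the live web returns HTTP 404."""
    return code == 404


def missing_fraction(uris, status_by_uri):
    """Fraction of one story's resources that are missing on the live web."""
    return sum(is_missing(status_by_uri[u]) for u in uris) / len(uris)


per_story = {sid: missing_fraction(uris, status) for sid, uris in stories.items()}

# Share of stories that contain at least one missing resource.
share_with_missing = sum(f > 0 for f in per_story.values()) / len(per_story)

# Average fraction of missing resources per story (over all stories).
mean_missing = mean(per_story.values())

print(f"stories with missing resources: {share_with_missing:.1%}")
print(f"average missing per story: {mean_missing:.1%}")
```

Grouping `per_story` by story age would be the natural next step for plotting decay over time.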
  • 95.
    Existence on the live web and in the archives
    Of all the unique URIs, 11.8% are missing on the live web.
  • 96.
    Existence on the live web and in the archives
    • Social media is not as well archived as the regular web
    • Facebook uses robots.txt to block web archiving by the Internet Archive
  • 97.
  • 98.
    Specifying the popular and the unpopular stories
    • The stories are divided based on their number of views, normalized by the amount of time they were available on the web.
    • Popular stories
      – the top 25% of stories that have the most views
      – 3,642 stories
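
The popularity split described above can be sketched as below. The field names, the observation date, and the use of days as the unit of normalization are assumptions for illustration; the slide only says that views are normalized by time available on the web and that the top 25% are treated as popular.

```python
from datetime import date, timedelta

# Hypothetical observation date and toy stories; not data from the study.
observed = date(2015, 3, 1)
stories = [
    {"id": "a", "views": 1000, "published": observed - timedelta(days=100)},
    {"id": "b", "views": 300, "published": observed - timedelta(days=10)},
    {"id": "c", "views": 50, "published": observed - timedelta(days=50)},
    {"id": "d", "views": 200, "published": observed - timedelta(days=40)},
]


def views_per_day(story, observed_on):
    """Views normalized by the number of days the story has been online."""
    days_online = max((observed_on - story["published"]).days, 1)
    return story["views"] / days_online


def top_quartile(all_stories, observed_on, share=0.25):
    """Return the top `share` of stories ranked by normalized views."""
    ranked = sorted(all_stories,
                    key=lambda s: views_per_day(s, observed_on),
                    reverse=True)
    return ranked[:int(len(ranked) * share)]


popular = top_quartile(stories, observed)
print([s["id"] for s in popular])
```

In the toy data, story "a" has the most raw views, but story "b" ranks first after normalization (30 views per day versus 10). Applying the same 25% cut to 14,568 stories gives 3,642, which matches the count on the slide.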
  • 99.
    The distributions for the features of the stories
    • Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories are different in terms of most of the features
    • Popular stories tend to have, relative to the unpopular stories:
      • more web elements (medians of 28 vs. 21)
      • longer timespan (5 hours vs. 2 hours)
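
As a rough illustration of the test named above, here is a minimal pure-Python sketch of the Kruskal-Wallis H statistic for two groups, with the usual tie correction and a chi-square approximation for the p-value (one degree of freedom). The helper names and sample values are hypothetical; in practice a library routine such as `scipy.stats.kruskal` would normally be used, and the slides do not say which tool the authors relied on.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two samples; p uses a chi-square with 1 df."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = (12 / (n * (n + 1))
         * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
         - 3 * (n + 1))
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error
    p = math.erfc(math.sqrt(h / 2))  # chi-square survival function, 1 df
    return h, p


# Hypothetical numbers of web elements per story in each group.
popular_elements = [28, 35, 40, 22, 31, 57, 29]
unpopular_elements = [21, 18, 25, 12, 30, 19, 23]
h_stat, p_value = kruskal_wallis_two_groups(popular_elements, unpopular_elements)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

A feature would be reported as differing between the two groups when the resulting p-value is at or below 0.05, the threshold given on the slide.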
  • 100.
    Do Popular Stories Have a Lower Decay Rate?
    The 75th percentile of decay rate per popular story is 10% of the resources, while it is 15% in the unpopular stories
  • 101.
    The distributions for the elements of the stories
  • 102.
    Conclusions
    • We analyzed 14,568 stories from Storify comprising 1,251,160 elements.
    • Popular stories have a min/median/max of 2/28/1950 elements, while the unpopular stories have 2/21/2216
    • Popular stories have a median of 12 multimedia resources (the unpopular stories have a median of 7)
    • Of the popular stories, 38% receive continuing edits (as opposed to 35% of the unpopular stories)
    • In popular stories, only 11% of web elements are missing on the live web (as opposed to 13% in the unpopular stories)
    • The percentage of missing resources is proportional to the age of the stories.

Editor's Notes

  • #2 What we mean by storytelling here is using visualizations to put a set of web pages from web archives in a narrative structure, ordered by time.
  • #5 Institutions that keep taking periodic snapshots of pages in order to archive the entire Web.
  • #7 There are different kinds of archives depending on their purpose. For example, there are web archives that archive the whole web, such as the Internet Archive (IA), the largest web archive; others are domain-specific or collection-based archives. In all of the above cases, the goal is to summarize or abstract a *large* collection with a few representative selections.
  • #8 First deployed in 2006, Archive-It is a subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content. 
  • #9 Lori created the collections and entered metadata about them: description, title, etc. This is collection-level metadata, but it doesn’t help a lot. Archive-It provides faceted browsing and search services on the resulting collection.
  • #11 There are about 3 or 4 collections about the Egyptian Revolution in Archive-It. If I want to know about the Egyptian Revolution, which collection should I browse? A collection has two dimensions: the URIs, and the copies of these URIs. A historian faced with more than one collection will not know where to start.
  • #22 Every story is made up of a set of events. We use “story” in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc. What we mean by storytelling here is using visualizations to put a set of web pages from web archives in a narrative structure, ordered by time.
  • #23 The definition of a story in social media is much looser and more relaxed. Storytelling may be seen as the set of cultural practices for representing events chronologically.
  • #24 Because of the sheer volume of information on the web, “storytelling” is becoming a popular technique in social media for selecting representative tweets, videos, web pages, etc. and arranging them in chronological order to support a particular narrative or “story”. We use “story” in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc.
  • #26 Facebook Timeline is one of the most popular sites that people use. I have thousands of posts, images, friends’ posts, and so on in Facebook. The idea is to reduce this to a 1-minute video.
  • #27 Highlights of the years on Facebook
  • #28 https://stories.twitter.com/index_en.html
  • #29 From the videos we take every day, this app allows selecting and cutting 1 second from each video and combining these seconds together, so you can review your life and get a new perspective on how you lived day to day.
  • #31 Storify is a storytelling service that lets the user create stories or timelines using social media and web pages. Storify was launched in September 2010, and has been open to the public since April 2011. “Storytelling” is best typified by the company Storify. http://storify.com/nzherald/mu http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705546
  • #32 The problem is that Storify operates as a bookmarking service; it doesn’t preserve the links. You have no clue what the person is saying about the link. http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705523 http://storify.com/nzherald/mu
  • #33 https://twitter.com/LamyaeA/statuses/36095702761218048 http://twitpic.com/3yo7e9
  • #35 So if this is the web, the archived collections are subsets of the web, and we will sample from these collections to create a story.
  • #36 So what we want to do is create persistent stories, then visualize them using a storytelling tool that users already know, such as Storify. So we will integrate the storytelling services and the archived collections to generate archive-enriched stories.
  • #37 The main research questions here are: How do we identify the set of pages that best summarizes the collection? How do we integrate web archives into the live web for replaying news stories as they were captured?
  • #39 There are four basic story types. We have pages and time; both can be sliding or fixed. Let me walk you through an example of each type…
  • #41 http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2 http://america.aljazeera.com/ Personalized Web resources offer different representations based on the user-agent string and other values in the HTTP request headers, GeoIP, and other environmental factors. Currently, web archives don’t support browsing different representations. This means Web crawlers capturing content for archives may receive representations based on the crawl environment which will differ from the representations returned to the interactive users.
  • #43 http://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ http://wayback.archive-it.org/2358/*/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
  • #44 Here
  • #45 Here is Feb. 11 from different news sites: https://wayback.archive-it.org/2358/20110211074248/http://www.globalpost.com/dispatch/egypt/110210/mubarak-resign-obama-egypt https://wayback.archive-it.org/2358/20110211191445/http://www.cnn.com/ https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045 https://wayback.archive-it.org/2358/20110211192142/http://www.modernegypt.info/ https://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ https://wayback.archive-it.org/2358/20110211191423/http://www.arabist.net/ https://wayback.archive-it.org/2358/20110211194239/http://www.globalpost.com/dispatch/egypt/110211/mubarak-quits-resigns-egypt-cairo
  • #47 And here I want to get the broadest coverage possible for the Egyptian revolution.
  • #48 http://storify.com/yasmina_anwar/the-egyptian-revolution-story-from-archive-it-hete
  • #61 Because of the sheer volume of information on the web, “storytelling” is becoming a popular technique in social media for selecting representative tweets, videos, web pages, etc. and arranging them in chronological order to support a particular narrative or “story”.
  • #62 Storify is a social networking service launched in 2010 that allows users to create a “story” of their own choosing, consisting of manually chosen web resources, arranged with a visually attractive interface, clustered together with a single URI and suitable for sharing. It provides a graphical interface for selecting URIs of web resources and arranging the resulting snippets and previews
  • #63 Our research question is: What are the structural characteristics of popular (i.e., receiving the most views) human-generated stories? We answer the following questions:
  • #67 http://storify.com/RuBegonia/richardgrenell-tweets-about-cnn-as-story-unfolds http://storify.com/OwenCobb/do-you-hear-the-people-type-1
  • #69 The loss rate (a resource is lost when its web page gives HTTP 404) of the resources that compose Storify stories
  • #71 the top 25% of views, normalized by time available on the web
  • #73 Due to the large range of values, we believe the median is a better indicator of “typical values” than the mean.
  • #74 Due to the large range of values, we believe the median is a better indicator of “typical values” than the mean.
  • #76 The first two intervals represent the stories that were created and modified, then published with no continuing edits. We see that the majority of the stories in the data set were created and edited in the span of one day. 14% of Storify users update their stories over a long period of time, with the longest timespan in our data set covering more than four years and with more than 13,000 views. Curiously, it had only 33 web elements and 51 total elements. Although the story with the longest timespan did not have the largest number of elements, from Table 5 we can see that, based on the median number of elements in each interval, there is a nearly linear relation between the time length of the story and the number of elements.
  • #77 The first two intervals represent the stories that were created and modified, then published with no continuing edits. We see that the majority of the stories in the data set were created and edited in the span of one day. 14% of Storify users update their stories over a long period of time.
  • #78 Using Storify-defined categories reflected in the UI the 1,251,160 elements consist of:
  • #81 These are the most frequent domains used in the stories. We manually categorized these domains in a more fine-grained manner than Storify provides with its “links, images, text, videos, quotes” descriptions.
  • #82 Twitter is the most popular domain
  • #83 Note that plus.google.com has rank one because Alexa does not differentiate plus.google.com from google.com.
  • #84 Although the top 25 list of domains appearing in the stories is dominated by globally popular web sites (e.g., Twitter, Instagram, Youtube, Facebook), the long-tailed distribution results in the presence of many globally lesser-known sites.
  • #85 https://storify.com/acranger/facebook-turns-10 The embedded resources of twitter.com: since Twitter is the most popular domain (> 82% of web elements), we investigate whether the tweets have embedded resources of their own.
  • #86 46% are photos from twitter.com
  • #88 Note that some Storify stories (0.49%) point to other stories in Storify.
  • #90 Most of the time, the highly ranked real-world resources, such as twitter.com, are correspondingly the most used in human-generated stories. The table shows Kendall’s Tau correlation coefficient for the most frequent n domains. Statistically significant (p < 0.05) correlations are bolded. The highest correlation is 0.45 for the list of 15 domains.
  • #91 In Pinterest, users pin photos or videos of interest to create theme-based image/video collections such as hobbies, fashion, events. The subject areas most used by Pinterest users are food and drinks, décor and design, and apparel and accessories [5]. Most of the pins on Pinterest come from blogs or are uploaded by users. In Storify, people tend to use social media and web resources to create their narratives about events or something of interest.
  • #94 The percentage of the missing resources in the stories over time: a nearly linear decay rate of resources through time.
  • #95 The percentage of the missing resources in the stories over time.
  • #96 The left two columns of Table 6 contain the results of checking the status of the web pages on the live web. The table also contains the results of the five most frequent domains and all other URIs. We also conclude that the decay rate of social media content is lower than the decay rate of the regular web content and websites. Social media is not as well archived as the regular web.
  • #97 Of all the unique URIs, 11.8% are missing on the live web. The table also contains the results of the five most frequent domains and all other URIs.
  • #100 Popular stories tend to have more web elements (medians of 28 vs. 21), a longer timespan (5 hours vs. 2 hours), and longer editing time intervals than the unpopular stories.
  • #101 It shows that the resources of the popular stories tend to stay longer than the resources of the unpopular ones.
  • #102 The distribution of the images in popular stories is higher than the distribution for the images in the unpopular stories. The median number of images in popular stories is 10, while it is 5 in the unpopular stories. Although the unpopular stories tend to use links more than the popular stories, the median number of links in popular stories (20 links) is higher than in the unpopular stories (16 links).