With web archives, journalists find evidence and information to back up their stories, historians store information for later users, and social scientists can study the actions of humans during specific time periods. These different groups gain value not only from creating their own collections but from using the collections of others. Web archive collections store the content that would otherwise be lost. As users, we currently have no efficient way of understanding what is in each collection without manually reviewing all of its items. Web archives intentionally consist of different versions of the same document. With these multiple versions, we can watch the evolution of a single resource over time, following the changes to an organization or how the public learns the details of an unfolding news story. As aggregations of archived web pages, or mementos, these collections become resources unto themselves. While past work has used mementos for studying how web resources change over time or evaluated the changes to various industries, there is still theoretical work to be done in improving the usability of web archive collections. Our goal is to help collection creators and the public at large to make better use of these collections through improvements to collection understanding. We build upon the work of AlNoamany by using visualizations from social media storytelling. Our goal is to produce a story for each web archive collection. Each story consists of representative mementos selected from the web archive collection that are then individually visualized as surrogates (e.g., screenshots, cards containing a summary of the page). This solution has the benefit of using visualization paradigms familiar to users. In this work, we provide background on the problem, analyze previous work in this area, and highlight our preliminary work before providing a plan for future research.
Improving Understanding of Web Archive Collections Through Storytelling - PhD Candidacy Exam
1. @shawnmjones @WebSciDL
Improving Understanding of
Web Archive Collections
Through Storytelling
PhD Candidacy Exam for: Shawn M. Jones
Committee:
Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna
Thanks to:
4. @shawnmjones @WebSciDL
Let’s say: you find a bag
There are thousands of different items inside.
Can you use the contents of this bag?
How quickly can you make this decision?
4
5. @shawnmjones @WebSciDL
Now let’s say: there are thousands of bags
Which one might contain something useful for
you?
Do any?
How do you know?
How do you decrease your chances of wasting
your time?
5
7. @shawnmjones @WebSciDL
Researchers create their own web archive collections
7
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Tucson Shootings, 2008 Olympics, University of Utah
8. @shawnmjones @WebSciDL
Web archive collections have many versions of the same
page
8
2013
2015
2018
University of Utah Office of Admissions
from the University of Utah Web Archive Collection
4/1/2015
3/5/2015
Tumblr Black Lives Matter Blog
from the #blacklivesmatter Collection
2/12/2015
9. @shawnmjones @WebSciDL
Different versions allow us to see an unfolding news
story
9
Memento from
April 19, 2013 17:12
Searching for suspects,
City on lockdown
Memento from
April 19, 2013 17:59
Officer Donahue in hospital,
Lockdown loosened,
Will the Red Sox game be cancelled?
Memento from
April 24, 2013 2:24
Suspect Found,
Officer Collier lost life,
Obama speaks
11. @shawnmjones @WebSciDL
Archive-It allows curators to easily create collections
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive
collections. Curators can supply live web resources as seeds and establish crawling schedules of those
seeds to create mementos.
11
12. @shawnmjones @WebSciDL
… and these collections are used by other researchers
12
The collection curator is not the only user of the
collection!
These collections live a life after their curator
has stopped adding to them.
13. @shawnmjones @WebSciDL
How do we tell the difference between collections?
What is the difference between these two Archive-It collections about the South Louisiana Flood of
2016?
Which one should a researcher use?
13
14. @shawnmjones @WebSciDL 14
31 Archive-It
collections match the
search query
“human rights”
How are they different
from each other?
Which one is best for my
needs?
17. @shawnmjones @WebSciDL
But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare
metadata fields to understand the differences
between collections.
17
132,599 seeds
no metadata
9 seeds
with metadata
Paradox:
More seeds = more effort
More seeds = greater user need for metadata
18. @shawnmjones @WebSciDL
Reviewing mementos manually is costly
This collection has 132,599 seeds, many
with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require
reviewing 100,000+ documents to
understand the collection
18
21. @shawnmjones @WebSciDL
The problem, summarized
There are multiple collections
about the same concept.
The metadata for each collection is
non-existent, or inconsistently
applied.
Many collections have
1000s of seeds with multiple
mementos.
There are more than 8000
collections.
Human review of these
mementos for collection
understanding is an expensive
proposition.
21
22. @shawnmjones @WebSciDL
Our proposal: a visualization made of representative
mementos
Our visualization is a summary that will act like an abstract
Pirolli and Card’s Information Foraging Theory:
• maximize the value of the information gained from our summaries
• minimize the cost of interacting with the collection
• ensure that our representative mementos have good information scent, meaning they contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a social media story of ~28 surrogates
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
24. @shawnmjones @WebSciDL
Surrogates provide a visual summary of the content
behind a URI…
24
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI represented by a
browser thumbnail surrogate:
The same URI represented by a
social card surrogate:
25. @shawnmjones @WebSciDL
Social media storytelling uses surrogates to provide a
“summary of summaries”
25
2 resources are shown from this Wakelet story
6 resources are shown from this Storify story
Each surrogate summarizes a
web resource.
Each story groups the
surrogates, summarizing the
topic.
We want to use this technique
to summarize web archive
collections because users are
already familiar with this
visualization paradigm.
27. @shawnmjones @WebSciDL
Web surrogates provide a visual summary of a web
resource drawn from the content of the resource
27
Browser Thumbnail (example from UK Web Archive)
Text snippet (example from Bing)
Social Card (example from Facebook)
Text + Thumbnail (example from Internet Archive)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
28. @shawnmjones @WebSciDL
Our research questions
RQ1: What types of web archive
collections exist?
RQ2: What surrogates work best for
understanding collections of
mementos?
RQ3: How do we select
representative mementos for the
different semantic types of
collections?
RQ4: How well do stories produced
by different summarization algorithms
work for collection understanding?
28
RQ1: Collection Types
RQ2: Surrogate Types
RQ3: Selecting Mementos
RQ4: Evaluating Stories
29. @shawnmjones @WebSciDL
RQ2: What surrogates work best for web resources?
29
Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
• Li (2008): social cards > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet and text snippet > thumbnail in performance
• Woodruff (2001): thumbnails > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
• Capra (2013): social cards > text snippets in performance (barely statistically significant)
• Al Maqbali (2010): text + thumbnail ~= social card, text snippet ~= social card, and text + thumbnail ~= text snippet in performance
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
30. @shawnmjones @WebSciDL
RQ3: How might we select representative mementos?
• Luhn (1958): automatic abstracts
• Silva (2014): word graphs from Luhn’s algorithm
• DUC Datasets (2001-2007)
• Napoles (2012): Gigaword
• Lin (2004): ROUGE metrics
• Grusky (2018): NEWSROOM
• Existing reference summaries were built from news articles.
• Existing reference summaries were not built from web archives.
• Mihalcea (2004): TextRank
• Dolan (2004): clustering news articles; Lede3 preferred by evaluators
• Radev (1998): automatic news briefs
• Xie (2008): MMR for meeting summaries
• Sipos (2008): scholarly corpus over time
• Zhang (2010)/Li (2011): aspects of disasters
• Hong (2014): word weighting
30
31. @shawnmjones @WebSciDL
RQ3: How might we select representative mementos?
– Related Concepts
Scatter-Gather (Cutting 1992)
• allows a user to explore a collection by drilling through topic clusters until they reach individual documents
• we instead seek to provide a representative sample that a user can quickly glance at
Recommender Systems
• predict the preference of a user based on past behavior, demographic profile, or behavior of the user’s friends
• we want to provide a summary without any knowledge of the user
Zero-Query Systems
• predict the information a user will need based on time, location, environment, user interests, and other factors
• again, we want to provide a summary with no knowledge of the user
31
Image reference:
Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92). Copenhagen, Denmark, pp. 318-329. https://doi.org/10.1145/133160.133214
32. @shawnmjones @WebSciDL
How have others explored collections?
32
Conta Me Histórias
ArchiveSpark
Archives Unleashed Cloud
Existing solutions allow users to query and develop statistics on collections.
Users must have some ideas of a topic or concept a priori.
33. @shawnmjones @WebSciDL
How have others visualized collections for
understanding?
33
Other attempts at
visualizing Archive-It
collections tried to
visualize everything.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
K. Padia, Y. AlNoamany, and M. C. Weigle. 2012. Visualizing digital collections at archive-it. In
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ‘12) 15 – 18.
DOI:10.1145/2232817.2232821
34. @shawnmjones @WebSciDL
How have others told stories with web
archive collections?
34
AlNoamany told stories via the storytelling platform Storify
She showed that test participants could not detect the difference between her automated stories
and stories generated by human curators
Characteristics of human-generated stories
Characteristics of Archive-It collections
• Exclude duplicates
• Exclude off-topic pages
• Exclude non-English language pages
• Dynamically slice the collection
• Cluster the pages in each slice
• Select high-quality pages from each cluster
• Order pages by time
• Visualize
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived
Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318.
DOI:10.1145/3091478.3091508
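The steps on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions, not AlNoamany's implementation: the `Memento` record, its `content_hash`, `cluster`, and `quality` fields, and the equal-sized time slices are all hypothetical stand-ins (the original slices the collection dynamically and computes clusters and quality itself).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memento:
    uri: str
    captured: datetime
    content_hash: str  # hypothetical digest used to spot duplicates
    on_topic: bool     # assumed to be decided by an earlier off-topic check
    language: str
    cluster: int       # hypothetical cluster label, assumed precomputed
    quality: float     # hypothetical quality score


def build_story(mementos, slices=2):
    """Filter, slice by time, pick one page per cluster, and order by time."""
    # Exclude duplicates, off-topic pages, and non-English pages.
    seen, kept = set(), []
    for m in sorted(mementos, key=lambda m: m.captured):
        if m.content_hash in seen or not m.on_topic or m.language != "en":
            continue
        seen.add(m.content_hash)
        kept.append(m)
    if not kept:
        return []

    # Slice the collection by time (equal-sized slices are an assumption).
    size = -(-len(kept) // slices)  # ceiling division
    story = []
    for start in range(0, len(kept), size):
        best_per_cluster = {}
        for m in kept[start:start + size]:
            best = best_per_cluster.get(m.cluster)
            # Select the highest-quality page from each cluster in the slice.
            if best is None or m.quality > best.quality:
                best_per_cluster[m.cluster] = m
        story.extend(best_per_cluster.values())

    # Order pages by time before visualizing them as surrogates.
    return sorted(story, key=lambda m: m.captured)
```

Each returned memento would then be rendered as a surrogate to form the story.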
35. @shawnmjones @WebSciDL
How have others told stories with web
archive collections?
35
AlNoamany told stories via the storytelling platform Storify – which is no longer in service
She showed that test participants could not detect the difference between her automated stories
and stories generated by human curators
S. M. Jones. “Storify Will Be Gone Soon, So How Do We Preserve The Stories?”
http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html 2017.
36. @shawnmjones @WebSciDL
How have others told stories with web
archive collections?
AlNoamany told stories via the storytelling platform Storify – which is no longer in service
She showed that test participants could not detect the difference between her automated stories and
stories generated by human curators
Did not evaluate if the resulting summaries were effective tools for collection understanding
Focused on summarizing collections about events
There are other types of Archive-It collections
38. @shawnmjones @WebSciDL
Outline
1. Motivation
2. Research Questions
3. Preliminary Work
1. RQ1: What types of web archive collections exist?
2. Partial RQ2: What surrogates work best for understanding collections of mementos?
1. How effective are existing curation platforms at producing mementos?
2. Preliminary user surrogate study
3. Partial RQ3: How do we select representative mementos for the different semantic types of
collections?
1. The Off-Topic Memento Toolkit (OTMT)
4. Proposed Research
38
39. @shawnmjones @WebSciDL
As collection users, we view Archive-It collections
from outside…
39
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
40. @shawnmjones @WebSciDL
As collection users, what structural features can we
view from outside?
40
Using only structural features is
advantageous because it saves one
from having to download a collection’s
content.
These structural features give us
different insight than can be provided by
text analysis or metadata.
81,014 seeds
486,227 seed mementos
Structural features shown here:
• number of seeds
• number of mementos
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
41. @shawnmjones @WebSciDL
Was the collection built from web sites belonging to one
domain or many?
41
Many domains One domain
Structural feature discussed
here:
• domain diversity
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
42. @shawnmjones @WebSciDL
Were most of the web pages in the collection top-level
pages or specific articles deeper in a web site?
42
Top-level pages Deeper links
Structural feature discussed
here:
• path depth diversity
• most frequent path depth
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
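A minimal sketch of how domain diversity and path depth could be computed from seed URIs alone, without downloading content. The definitions used here (unique hostnames divided by seed count, and the number of non-empty path segments) are assumptions for illustration; the cited paper may normalize these features differently. The `example.org` URIs are made up.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_diversity(seed_uris):
    """Unique hostnames divided by the number of seeds (assumed definition)."""
    hosts = {urlparse(u).hostname for u in seed_uris}
    return len(hosts) / len(seed_uris)


def path_depth(uri):
    """Number of non-empty path segments; 0 means a top-level page."""
    return len([s for s in urlparse(uri).path.split("/") if s])


def most_frequent_path_depth(seed_uris):
    """The path depth shared by the largest number of seeds."""
    return Counter(path_depth(u) for u in seed_uris).most_common(1)[0][0]


seeds = [
    "https://www.utah.edu/",
    "https://www.utah.edu/admissions/apply",
    "https://example.org/news/2016/flood.html",
    "https://example.org/news/2016/recovery.html",
]
print(domain_diversity(seeds), most_frequent_path_depth(seeds))
```

A value near 1.0 for `domain_diversity` would suggest a collection built from many sites, while a low value suggests one organization archiving its own pages.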
43. @shawnmjones @WebSciDL
Growth curves provide some understanding of collection
curation behavior
43
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
[Figure: example growth curves labeled with positive and negative skew]
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
44. @shawnmjones @WebSciDL
Does most of the collection exist earlier or later in its
life?
44
This collection was created in
March 2010.
Most of its mementos come from
2016 – 2018.
Most of this collection exists later in
its life.
Structural feature discussed here:
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
45. @shawnmjones @WebSciDL
When did the curator select and archive a collection’s
contents?
45
This collection was created in
March 2006.
Some of the seeds were selected
in 2006.
Many of the seeds were selected
all along its life.
It has mementos as recent as
July 2018.
Structural feature discussed here:
• area under the seed growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
46. @shawnmjones @WebSciDL
Did the curator create a collection intended to archive new versions of
the same web pages repeatedly?
46
This collection was created
in June 2014.
The seeds were selected
toward the beginning of its
life.
Mementos were captured all
during its life.
Structural feature discussed here:
• area under the seed growth curve
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
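The area under a growth curve can be sketched as follows. This assumes a normalized step curve, where x is the fraction of the collection's lifespan that has elapsed and y is the fraction of seeds (or seed mementos) present by then; that normalization and the function name are illustrative assumptions, not the paper's exact procedure.

```python
from datetime import datetime


def growth_curve_auc(event_times, start, end):
    """Area under a normalized cumulative growth curve (step function).

    event_times: when each seed was added or each seed memento was captured.
    start, end: the beginning and end of the collection's lifespan.
    """
    lifespan = (end - start).total_seconds()
    xs = sorted((t - start).total_seconds() / lifespan for t in event_times)
    n = len(xs)
    area, prev_x = 0.0, 0.0
    for i, x in enumerate(xs):
        # Between prev_x and x, exactly i of the n items are present.
        area += (x - prev_x) * (i / n)
        prev_x = x
    # After the last event, all items are present until the lifespan ends.
    area += 1.0 - prev_x
    return area


start, end = datetime(2020, 1, 1), datetime(2020, 1, 11)
early = [datetime(2020, 1, 1), datetime(2020, 1, 6)]
late = [datetime(2020, 1, 10), datetime(2020, 1, 11)]
print(growth_curve_auc(early, start, end), growth_curve_auc(late, start, end))
```

Under this sketch, an area above 0.5 means most of the collection exists early in its life, and an area below 0.5 means most of it exists later.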
47. @shawnmjones @WebSciDL
We discovered four semantic categories in
Archive-It collections…
47
Self-Archiving, Subject-based, Time Bounded – Expected, Time Bounded – Spontaneous
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
51. @shawnmjones @WebSciDL 51
We discovered four semantic categories in Archive-It collections…
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
53. @shawnmjones @WebSciDL
We can bridge the structural to the descriptive…
53
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
Using the structural features mentioned previously, we can predict these
semantic categories with a Random Forest classifier with F1 = 0.720
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
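For readers unfamiliar with the F1 score quoted above, it is the harmonic mean of precision and recall. The sketch below computes it per category from true and predicted labels. The slide does not say how the reported 0.720 was averaged across the four categories, so no averaging is shown, and the toy labels are invented for illustration.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} where F1 = 2 * precision * recall / (precision + recall)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Invented toy labels, not data from the study.
truth = ["self-archiving", "self-archiving", "subject-based", "subject-based"]
predicted = ["self-archiving", "subject-based", "subject-based", "subject-based"]
print(f1_per_class(truth, predicted))
```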
55. @shawnmjones @WebSciDL
We have identified different types of Archive-It
collections
55
RQ1: Collection Types ✅
RQ2: Surrogate Types
RQ3: Selecting Mementos
RQ4: Evaluating Stories
We can take these features into account to address the other research questions.
So, let’s tell some stories on social media!
Self-Archiving, Subject-based, Time Bounded – Expected, Time Bounded – Spontaneous
Not so fast…
57. @shawnmjones @WebSciDL
Existing platforms do not reliably produce surrogates
for mementos…
57
If we cannot rely upon the
service to generate a surrogate
for a memento, our system must
then do the work to create our
own surrogates.
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
58. @shawnmjones @WebSciDL
Some services have stories, but not long-term storytelling
58
Facebook stories (image ref: https://techcrunch.com/2018/04/05/facebook-stories-default/)
Snapchat stories (image ref: https://techcrunch.com/2013/10/03/snapchat-gets-its-own-timeline-with-snapchat-stories-24-hour-photo-video-tales/)
Instagram stories (image ref: https://buffer.com/library/instagram-stories)
These platforms delete the user’s stories 24 hours after they are posted.
This form of social media storytelling is the opposite of what we are looking for.
We want the stories to be artifacts themselves.
59. @shawnmjones @WebSciDL
Some services’ longevity is in doubt…
59
RIP: Google+ 2019
RIP: Tumblr (soon?)
RIP: Storify 2018
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
60. @shawnmjones @WebSciDL
Existing surrogate services create a confusing
experience for mementos
60
Who published these resources?
Archive-It?
CNN?
Is the story author sharing fake news?
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
embed.rocks surrogate
embed.ly surrogate
61. @shawnmjones @WebSciDL
Neither social media services nor surrogate services were
reliable for storytelling, so we created MementoEmbed…
61
Information in the
MementoEmbed social
card surrogate is
separated to avoid
issues of confusion
about attribution.
MementoEmbed is
archive-aware. It can
locate information
about the memento
that is not available in
other surrogates.
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
62. @shawnmjones @WebSciDL
MementoEmbed provides us with a tool for evaluating
surrogates, a step on the road to answering RQ2…
62
RQ2:
Surrogate Types
RQ3:
Selecting
Mementos
RQ4:
Evaluating
Stories
RQ1:
Collection Types
✅
☑️ ☑️
64. @shawnmjones @WebSciDL
Using stories built from curator-selected mementos, we
shared stories with MT participants…
64
Surrogate types shown:
• Archive-It like
• Social Card
• Browser thumbnails
• Social Card With Thumbnail as Image (sc/t)
• Social Card With Thumbnail to Right (sc+t)
• Social Card with Thumbnail on Hover (sc^t)
• 4 stories of 15-17 mementos selected by human Archive-It
curators from their collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• 120 MT participants
• Given 30 seconds to view each story
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
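The study design above implies 4 stories × 6 surrogate types = 24 story-surrogate combinations. The sketch below enumerates them and assigns participants round-robin. Equal allocation (5 of the 120 participants per combination) is an assumption for illustration; the slide does not state how participants were distributed, and the story names are placeholders.

```python
from collections import Counter
from itertools import cycle, product

STORIES = ["story-1", "story-2", "story-3", "story-4"]  # placeholder names
SURROGATES = [
    "Archive-It-like", "browser thumbnail", "social card",
    "sc/t", "sc+t", "sc^t",
]


def conditions():
    """Every story paired with every surrogate type."""
    return list(product(STORIES, SURROGATES))


def assign(participant_ids):
    """Round-robin assignment of participants to conditions (assumed scheme)."""
    return dict(zip(participant_ids, cycle(conditions())))


assignment = assign(range(120))
per_condition = Counter(assignment.values())
print(len(conditions()), set(per_condition.values()))
```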
65. @shawnmjones @WebSciDL
And then we asked them which 2 of 6 mementos
came from the same collection…
65
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This is similar to the Sentence Verification Task from reading comprehension studies.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
66. @shawnmjones @WebSciDL
Response times per surrogate had interesting means, but
the differences were not statistically significant at p < 0.05
66
p = 0.190
p = 0.202
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
67. @shawnmjones @WebSciDL
Correct answers per surrogate indicate that social
cards probably outperform the Archive-It surrogate
67
[Bar chart: Correct Answers Per Surrogate, showing median and mean (axis 0 to 2.5) for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t]
p = 0.0569
p = 0.0770
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
68. @shawnmjones @WebSciDL
Whenever thumbnails are present, more users interact
with them
68
We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted
with a thumbnail, regardless of surrogate.
For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the
surrogate.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
69. @shawnmjones @WebSciDL
We have some results indicating that social cards
perform better, but there is more to answering RQ2…
69
RQ2:
Surrogate Types
RQ3:
Selecting
Mementos
RQ4:
Evaluating
Stories
RQ1:
Collection Types
✅
☑️ ☑️
71. @shawnmjones @WebSciDL
Identifying off-topic mementos is key to choosing
representative mementos
71
• Collections have a theme
• Seeds are selected to support that theme
• Mementos are versions of seeds
• Some of these versions are off-topic
• Identifying these off-topic mementos is key to summarization
Examples of off-topic mementos shown: Hacked, Moved on from topic, Web Page Gone, Account Suspension
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
72. @shawnmjones @WebSciDL
The Off-Topic Memento Toolkit (OTMT) compares a seed’s first
memento with the seed’s other mementos via different
measures…
Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count | 0.0 | -1.0 | No | bytecount
Word Count | 0.0 | -1.0 | Yes | wordcount
Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
Simhash of raw memento | 0 | 64 | No | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
72
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
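To make the score ranges in the table concrete, here is a sketch of four of the measures using textbook definitions. It is not the OTMT's code: the whitespace tokenizer stands in for whatever preprocessing the toolkit performs, the word count score is assumed to be a relative change from the first memento, and the cosine example uses raw term frequencies rather than TF-IDF or LSI vectors. Inputs are assumed to have a non-empty first memento.

```python
import math
from collections import Counter


def tokens(text):
    """Stand-in preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


def word_count_score(first, other):
    """Relative change in word count: 0.0 = same length, -1.0 = all words gone."""
    n_first = len(tokens(first))
    return (len(tokens(other)) - n_first) / n_first


def jaccard_distance(first, other):
    """0.0 = identical term sets, 1.0 = no terms in common."""
    a, b = set(tokens(first)), set(tokens(other))
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first, other):
    """0.0 = identical term sets, 1.0 = no terms in common."""
    a, b = set(tokens(first)), set(tokens(other))
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))


def cosine_similarity(first, other):
    """1.0 = same term distribution, 0.0 = nothing shared (raw TF vectors)."""
    ta, tb = Counter(tokens(first)), Counter(tokens(other))
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0


first = "flood relief efforts continue"
later = "account suspended"
print(word_count_score(first, later), jaccard_distance(first, later))
```

A later memento would be flagged as off-topic when its score crosses a chosen threshold relative to the seed's first memento.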
73. @shawnmjones @WebSciDL
After repeating AlNoamany’s experiment, Word Count had the best F1 score for identifying off-topic mementos…
We reused AlNoamany’s labeled dataset.
She did not try:
• Sørensen-Dice
• Simhash of raw content
• Simhash of TF
• Gensim LSI
Our word count accuracy came out ahead of AlNoamany’s.
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,”
International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
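To make the Word Count result concrete, the sketch below scores a later memento by its relative change in word count and evaluates off-topic predictions with F1. The relative-change formula is an assumption chosen because it matches the range in the measures table (0.0 for no change, -1.0 when all words are gone); the threshold of -0.5 and the function names are hypothetical and are not values from the experiment.

```python
def word_count_score(first_text, later_text):
    """Relative change in word count compared with the first memento."""
    first = len(first_text.split())
    later = len(later_text.split())
    if first == 0:
        return 0.0  # nothing to compare against
    return (later - first) / first


def is_off_topic(first_text, later_text, threshold=-0.5):
    """Flag a memento whose word count dropped past a (hypothetical) threshold."""
    return word_count_score(first_text, later_text) <= threshold


def f1_score(predicted, actual):
    """F1 over parallel lists of booleans, where True means off-topic."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With a labeled dataset, one would sweep the threshold and keep the value that maximizes `f1_score`.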
74. @shawnmjones @WebSciDL
Finding off-topic mementos is one of the first steps to addressing RQ3…
[Figure: research roadmap. RQ1: Collection Types ✅; RQ2: Surrogate Types ☑️; RQ3: Selecting Mementos ☑️; RQ4: Evaluating Stories]
76. @shawnmjones @WebSciDL
This work requires a flexible framework: Dark and Stormy Archives (DSA) 2.0
[Figure: DSA 2.0 architecture. Input: a Web Archive Collection (thousands of HTML documents). Output: a Story (< 30 representative mementos visualized as surrogates). Components shown: OTMT, Hypercane, Raintale, MementoEmbed, and Archive-It Utilities, connected by "calls" and "provides input to" arrows and grouped as tools for selecting representative mementos and tools for visualizing mementos as a story. Four ✅ marks appear in the diagram.]
S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
77. @shawnmjones @WebSciDL
Evaluation of RQ2: What surrogates work best for understanding collections of mementos?
How well do users perform with different types of surrogates?
1. Select 5 collections from each semantic category
2. Select the earliest memento of each of the first 20 seeds from each collection; this is the number of surrogates a user views if they open an Archive-It story and page down once
3. Present the participant with a story of 20 surrogates, varying the surrogate between participants
4. Ask them to address a user task
Variations:
• For step #3, vary the time for participants to view the story
  • participants view for 5, 10, 20, 30 seconds
  • may surface the ability to “glance” and understand
  • some surrogates consist only of title, URI, etc.
  • may determine which surrogate elements perform best
• For step #4, ask the participant to:
  • determine if the collection behind the story is suited for a task, similar to traditional IR research
  • identify which items likely belong to the same collection
• Instead of steps 3 and 4, ask former participants which surrogate they prefer for a given task
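Step 2 of the procedure above (the earliest memento of each of the first 20 seeds) can be sketched as follows. The input structure, a list of seeds in collection order with each seed's mementos as `(datetime, URI-M)` pairs, is an assumption made for illustration, and `earliest_mementos` is a hypothetical helper rather than part of any DSA tool.

```python
from datetime import datetime


def earliest_mementos(seed_mementos, seed_limit=20):
    """Pick the earliest memento for each of the first `seed_limit` seeds.

    seed_mementos: list of (seed_uri, [(memento_datetime, memento_uri), ...])
    in the order the seeds appear in the collection.
    Seeds with no mementos are skipped, so fewer than `seed_limit`
    URI-Ms may be returned.
    """
    selected = []
    for seed_uri, mementos in seed_mementos[:seed_limit]:
        if not mementos:
            continue
        earliest = min(mementos, key=lambda m: m[0])
        selected.append(earliest[1])
    return selected


if __name__ == "__main__":
    # Illustrative data only; these URIs are placeholders.
    collection = [
        ("http://a.example/", [(datetime(2015, 3, 1), "urim-a2"),
                               (datetime(2014, 1, 1), "urim-a1")]),
        ("http://b.example/", []),
        ("http://c.example/", [(datetime(2016, 5, 5), "urim-c1")]),
    ]
    print(earliest_mementos(collection))
```

The resulting list is what would be rendered as a story, once per surrogate type under study.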
78. @shawnmjones @WebSciDL
Evaluation of RQ2: What surrogates work best for understanding collections of mementos?
What information is available to users of the existing Archive-It story?
Discover patterns in metadata usage that may indicate the semantic type of collection.
How well do our stories compare to the existing metadata? (Our Story Content vs. Existing Metadata For Seeds)
How well do our stories cover the content of the underlying collection? (Collection Content vs. Our Story Content)
How well does the Archive-It story cover the underlying collection? (Collection Content vs. Archive-It Story Content)
How well do surrogates cover the content of their mementos? (Memento Content vs. Surrogate Content)
Similarity metrics will be used for evaluating coverage.
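One way a similarity metric could serve as a coverage score is sketched below: the text of a story and the text of its collection are each reduced to term-frequency vectors and compared with cosine similarity. The slides do not say which metrics will be used, so this choice, the simple tokenizer, and the `coverage` function name are all assumptions.

```python
import math
import re
from collections import Counter


def term_frequencies(texts):
    """Count lowercase word tokens across a list of documents."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two term-frequency Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coverage(story_texts, collection_texts):
    """Score how closely the story's vocabulary tracks the collection's."""
    return cosine_similarity(
        term_frequencies(story_texts),
        term_frequencies(collection_texts),
    )
```

The same function applies to each pairing on the slide, for example surrogate text against memento text, by changing what is passed as the two arguments.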
79. @shawnmjones @WebSciDL
Evaluation of RQ3: How do we select representative mementos for different semantic types of collections?
We will develop different algorithms and compare their output with several metrics to determine which algorithms provide the best “aboutness” for the collection.
[Figure: chart comparing DSA 1.0, Algorithm 2, Algorithm 3, and Algorithm 4 on a 0 to 10 scale across Existing Metadata, Content Coverage, Temporal Spread, Source Diversity, Compression, and Performance]
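The slide names the comparison metrics but does not define them. As a rough illustration only, the sketch below gives one hypothetical definition each for three of them: compression as the fraction of the collection kept, source diversity as distinct hostnames per selected memento, and temporal spread as the share of the collection's time span that the story covers. These definitions and function names are assumptions, not the measures the research will use.

```python
from datetime import datetime
from urllib.parse import urlsplit


def compression(selected_count, collection_count):
    """Fraction of the collection that the story keeps (hypothetical definition)."""
    if collection_count == 0:
        return 0.0
    return selected_count / collection_count


def source_diversity(original_uris):
    """Distinct hostnames divided by number of selected mementos (hypothetical definition)."""
    if not original_uris:
        return 0.0
    hosts = {urlsplit(uri).hostname for uri in original_uris}
    return len(hosts) / len(original_uris)


def temporal_spread(story_datetimes, collection_datetimes):
    """Story time span as a share of the collection time span (hypothetical definition)."""
    if not story_datetimes or not collection_datetimes:
        return 0.0
    collection_span = (max(collection_datetimes) - min(collection_datetimes)).total_seconds()
    if collection_span == 0:
        return 0.0
    story_span = (max(story_datetimes) - min(story_datetimes)).total_seconds()
    return story_span / collection_span
```

Each selection algorithm's output would be passed through the same functions so that the resulting scores can be compared side by side.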
80. @shawnmjones @WebSciDL
RQ4: How well do stories produced by different summarization algorithms work for collection understanding?
How well do our generated stories compare to the existing Archive-It interface?
Do study participants understand key concepts of the collection represented by the story?
Using the stories, can participants tell the difference between similar collections?
Can participants compare stories and tell which are similar?
Does the addition of existing metadata improve the participant’s performance?
Does the layout of the surrogates improve the participant’s performance?
[Figure: research roadmap. RQ1: Collection Types ✅; RQ2: Surrogate Types ☑️; RQ3: Selecting Mementos ☑️; RQ4: Evaluating Stories]
81. @shawnmjones @WebSciDL
We plan to have completed this research in 2021…
[Figure: publication timeline listing DTMH 2017, JCDL 2018, iPRES 2018 (twice), CIKM 2019, ECIR 2020, WWW 2020, JCDL 2020, CIKM 2020, and WebSci 2021]
82. @shawnmjones @WebSciDL
Our methods are not just for Archive-It
Our methods will be applicable to web archive collections created on other platforms, like Rhizome’s Webrecorder.
83. @shawnmjones @WebSciDL
Motivation Summary
Collection understanding is a problem with web archive collections:
• inconsistent metadata
• 1000s of mementos
• 1000s of collections
• costly for human review
We intend to produce a visualization that serves as an abstract to assist in collection understanding.
Prior work in this area:
• did not evaluate how well this method works for collection understanding
• only focused on collections about events
• relied upon Storify as a visualization medium
84. @shawnmjones @WebSciDL
Contributions
Existing work:
• Derived semantic categories of web archive collections in Archive-It
  • Categories can be predicted by using structural features
  • Most collections are not about events
• MementoEmbed: surrogates for the past web
  • Social cards probably provide better understanding of collections
• Off-Topic Memento Toolkit: identifying off-topic mementos
Future work:
• Evaluate algorithms for surfacing a representative sample from a document collection
• Evaluate different surrogate types via user evaluation
• Show which surrogate-sample combinations work best for collection understanding via user evaluation
85. @shawnmjones @WebSciDL
Improving Understanding of Web Archive Collections Through Storytelling
PhD Candidacy Exam for: Shawn M. Jones
Committee:
Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna
Thanks to: