1. @shawnmjones @WebSciDL
Storytelling With Web
Archives
Giving visitors a taste of a huge collection
Thanks to:
Shawn M. Jones
Web Science and Digital Libraries Research Group
Old Dominion University
RE-70-18-0005-18
2. @shawnmjones @WebSciDL
Storytelling With Web
Archives Has Multiple Use
Cases…
1. Promotion of the collection
Storytelling allows a curator to promote a
collection, making others aware of it
2. Exploring aspects of the
collection
Users can explore a collection and
expose specific sides of a news story or
focus on specific people or places
3. Summarization
Web archive collections are too large for
manual review – we need a summary to
understand what they contain
3. @shawnmjones @WebSciDL
Lesson Goals
Introduce social media storytelling
Identify the 2 actions of conducting storytelling with web archives
Provide an overview of AlNoamany’s Algorithm, used for generating the
resources for a story
Highlight tools from the Dark and Stormy Archives project for producing stories
4. @shawnmjones @WebSciDL
Storytelling in literature consists of elements…
Story elements: setting, characters, sequence, exposition, conflict,
climax, resolution
Wikipedia contributors. (2019, August 16). Dramatic structure. In Wikipedia, The Free Encyclopedia. Retrieved
15:56, August 16, 2019, from https://en.wikipedia.org/w/index.php?title=Dramatic_structure&oldid=911098220
Annenberg Foundation. (2017). Interactives: Elements of a Story. In Annenberg Learner: Teacher resources and
professional development across the curriculum. Retrieved 15:59, August 16, 2019, from
http://www.learner.org/interactives/story/
6. @shawnmjones @WebSciDL
Social cards are summaries of web resources
URLs can be difficult for people to comprehend
Social cards provide a title, small text snippet, and striking image from the web
page behind a URL
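As a minimal sketch of the data a social card carries, the fields below mirror the three items named above (title, snippet, image) plus the URL being summarized. The class name, the `render_text` layout, and the 60-character snippet limit are assumptions made for illustration, not part of any card service.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class SocialCard:
    """Summary of one web resource: the fields a card typically shows."""
    url: str        # the (possibly very long) URL the card stands in for
    title: str      # page title
    snippet: str    # small text snippet from the page
    image_url: str  # striking image chosen from the page

    def render_text(self, width: int = 60) -> str:
        """Plain-text stand-in for a card; the layout is hypothetical."""
        domain = urlparse(self.url).netloc
        snippet = self.snippet
        if len(snippet) > width:
            # Truncate long snippets so every card has the same shape.
            snippet = snippet[: width - 3] + "..."
        return f"{self.title}\n{snippet}\n[image: {self.image_url}]\n{domain}"
```

Because every card places the same fields in the same order, cards rendered this way can be compared at a glance.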
Long URL:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
vs.
The same URL represented as a social card:
7. @shawnmjones @WebSciDL
Social media storytelling uses groups of social cards to
provide a “summary of summaries”
2 resources are shown in this Wakelet story
6 resources are shown in this Storify story
Each social card summarizes a
web resource.
Each story groups the social
cards, summarizing the topic.
Social cards contain the same
information in the same place on
each card, allowing for easy
comparison.
We want to use this technique to
summarize web archive collections
because users are already
familiar with this visualization
paradigm.
8. @shawnmjones @WebSciDL
AlNoamany analyzed popular
social media stories…
AlNoamany discovered that
popular social media stories
Contain around 28 elements
Contain mostly social cards
This means that they are mostly links
to other content
Because they are mostly links,
popular stories help users by
reducing a topic down to a small
number of items
– a summary
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Characteristics of social media stories: What
makes a good story?,” International Journal on Digital Libraries, vol. 17, no. 3, pp. 239–256,
2016. https://doi.org/10.1007/s00799-016-0185-3.
10. @shawnmjones @WebSciDL
Web archive collections consist of mementos
– different versions of the same page over time
2013
2015
2018
University of Utah Office of Admissions
from the University of Utah Web Archive Collection
4/1/2015
3/5/2015
Tumblr Black Lives Matter Blog
from the #blacklivesmatter Collection
2/12/2015
11. @shawnmjones @WebSciDL
Mementos are the documents in web archive collections
Mementos are the versions
of pages from the time of
the crawl.
The mementos are the
documents in our
collections.
Unlike most document
collections, web archives
consist of many different
versions of the same
document.
For summarization, web
archive collections require
different handling than other
types of document
collections.
12. @shawnmjones @WebSciDL
Building stories from web archives
consists of two main actions…
1. Select a small subset of mementos
from the web archive collection
2. Visualize that subset via a social
media storytelling tool
14. @shawnmjones @WebSciDL
Action 1: Sample k ≈ 28 mementos from N mementos
of the collection to create a summary story
Web sites may group some content, but curators organize some of
this content by theme into collections, which we can reduce
to stories.
15. @shawnmjones @WebSciDL
Selecting our k ≈ 28 mementos manually requires exploring our
web archive collection and deciding on our story…
What people, places, ideas are in the
collection?
Decide on the story
What story do we want to tell?
What events are the story centered on?
Do we want to address the 5Ws and how: who, what,
when, where, why, and how?
Select the k mementos:
Use the web archive collection search engine to
find the people, places, etc. that reflect the story
you wish to tell
Record the URLs of these mementos
We choose k ≈ 28 of the collection’s N mementos,
where N is often in the thousands
M. Praetzellis. (2018). Browse and search on archive-it.org. In Archive-It Help Center. Retrieved 16:05, August 16,
2019, from https://support.archive-it.org/hc/en-us/articles/208002196-Browse-and-search-on-archive-it-org
16. @shawnmjones @WebSciDL
We can select our k ≈ 28 mementos automatically
using AlNoamany’s Algorithm…
We developed the Off-Topic Memento
Toolkit (OTMT) to execute this process.
The OTMT is part of the Dark and Stormy
Archives project.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived Collections. In
Proceedings of the 2017 ACM on Web Science Conference, 309–318. http://doi.org/10.1145/3091478.3091508
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on
Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Dark and Stormy Archives. https://oduwsdl.github.io/dsa/
Figure: flowchart of AlNoamany’s Algorithm.
Inputs: characteristics of human-generated stories; characteristics of Archive-It collections.
Steps, in the order shown: exclude duplicates; exclude off-topic pages; exclude non-English language pages; dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize.
Parts of this algorithm are useful for
manually reviewing web archive
collections, too.
17. @shawnmjones @WebSciDL
Step 1: Identifying Off-topic Mementos
Hacked
Moved on from topic
Collections have a theme.
Seeds are selected to
support that theme.
Mementos are versions of
seeds.
Some of these versions are
off-topic.
Identifying these off-topic
mementos is key to
summarization.
Web Page Gone
Account Suspension
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,”
International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
18. @shawnmjones @WebSciDL
First We Identify and Exclude Off-Topic Mementos,
But Why?
We want to identify and exclude (not remove) off-topic mementos because they do
not make for good summaries
Things happen to web pages that make them go off-topic.
Red: off-topic, Green: on-topic
Mementos are observations of seeds at different points in time
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,”
International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
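One way to picture off-topic detection is to compare the text of each memento of a seed with the text of that seed's first memento and flag versions that have drifted too far. The sketch below assumes that approach, using cosine similarity over term counts; the function names and the 0.15 threshold are illustrative assumptions, not the toolkit's actual interface or defaults. Mementos are labeled rather than deleted, matching the "exclude (not remove)" point above.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphanumeric terms in a memento's text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_mementos(texts, threshold=0.15):
    """Label each memento of one seed relative to the seed's first memento.

    `texts` is in time order. The threshold is an assumed value.
    """
    if not texts:
        return []
    baseline = term_counts(texts[0])
    labels = []
    for text in texts:
        similarity = cosine_similarity(baseline, term_counts(text))
        labels.append("on-topic" if similarity >= threshold else "off-topic")
    return labels
```

For example, a seed whose later memento reads "this account has been suspended" shares no terms with its first memento, so that version is labeled off-topic while the earlier versions stay on-topic.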
19. @shawnmjones @WebSciDL
Step 2: remove duplicate
mementos
Remember: A memento is an
observation at a particular
point in time
Sometimes the web page did
not change
These duplicates are extras
that we do not need in our
story
Thumbnails of duplicate mementos, grouped by color.
Mementos outlined in red are the same, green are the same, etc.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived
Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318.
http://doi.org/10.1145/3091478.3091508
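Duplicate detection can be sketched with Simhash fingerprints, which a later slide names as the scores produced in this step. The construction below (64-bit fingerprints built from MD5 token hashes) and the `max_distance` parameter are assumptions for illustration; the cited paper may compute and compare fingerprints differently.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an assumed value


def simhash(text):
    """Compute a 64-bit Simhash fingerprint of a memento's text."""
    weights = [0] * BITS
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def exclude_duplicates(mementos, max_distance=0):
    """Keep the earliest memento of each group of (near-)identical versions.

    `mementos` is a time-ordered list of (memento_url, text) pairs.
    """
    kept = []
    seen = []
    for urim, text in mementos:
        fingerprint = simhash(text)
        if any(hamming_distance(fingerprint, s) <= max_distance for s in seen):
            continue  # same content as an earlier memento
        kept.append(urim)
        seen.append(fingerprint)
    return kept
```

Raising `max_distance` above 0 would also treat near-identical versions as duplicates, at the risk of dropping small but meaningful changes.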
20. @shawnmjones @WebSciDL
Step 3: only consider pages using the language of our
story
We typically want to tell stories
with a single language
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived
Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318.
http://doi.org/10.1145/3091478.3091508
21. @shawnmjones @WebSciDL
Step 4: slice the collection so we cover the spread across
time
To ensure we account for the spread across
time, we slice the collection dynamically and
distribute the mementos equally across the slices.
For N mementos:
If |N| <= 28, then the number of slices is |N|
If |N| > 28, then the number of slices is:
This way the size of the story grows slowly as
needed for large collections
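The formula for the |N| > 28 case is not present in the text above. The sketch below assumes a logarithmic rule, ⌈28 + log10 |N|⌉, chosen only because it matches the description of a story that "grows slowly"; consult the cited paper for the actual formula. The function names are hypothetical.

```python
import math


def number_of_slices(n):
    """Number of time slices for a collection of n mementos."""
    if n <= 28:
        return n
    # ASSUMPTION: logarithmic growth beyond 28; not confirmed by the slide text.
    return math.ceil(28 + math.log10(n))


def slice_by_time(mementos):
    """Split mementos into time-ordered slices of nearly equal size.

    `mementos` is a list of (timestamp, memento_url) pairs.
    """
    ordered = sorted(mementos)
    n = len(ordered)
    k = number_of_slices(n)
    if k == 0:
        return []
    base, extra = divmod(n, k)
    slices = []
    start = 0
    for i in range(k):
        # The first `extra` slices take one additional memento.
        size = base + (1 if i < extra else 0)
        slices.append(ordered[start:start + size])
        start += size
    return slices
```

Under this assumed rule, a collection of 500 mementos yields 31 slices of 16 or 17 mementos each, so one pick per slice keeps the story close to 28 items.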
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived
Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318.
http://doi.org/10.1145/3091478.3091508
22. @shawnmjones @WebSciDL
Step 5: cluster each slice for novelty
To ensure we find novel mementos, we
reuse the Simhash scores from the
deduplication step
Each cluster is built from the distance
between these Simhash scores using the
DBSCAN algorithm
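To make the clustering step concrete, here is a small, dependency-free DBSCAN over integer fingerprints using Hamming distance. It is a teaching sketch: in practice a library implementation would likely be used, and the `eps` and `min_samples` values in the usage note are assumed, not taken from the cited paper.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def dbscan(fingerprints, eps, min_samples):
    """Cluster fingerprints; returns one label per item, -1 meaning noise."""
    n = len(fingerprints)
    labels = [None] * n

    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for j in range(n)
                if hamming_distance(fingerprints[i], fingerprints[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

With `eps=1` and `min_samples=2`, the fingerprints `0b0000`, `0b0001`, and `0b0011` chain into one cluster, while a fingerprint far from all others is labeled noise and can be treated as its own novel page.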
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived
Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318.
http://doi.org/10.1145/3091478.3091508
23. @shawnmjones @WebSciDL
Before moving to step 6, we must understand
Memento Damage…
Sometimes, when crawling, a web
archive does not acquire all of the
images, stylesheets, or JavaScript to
render a page
This lack of resources is called damage
Note that calculating memento damage
takes a long time, so this next step will
take a while
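The idea of damage can be illustrated as the share of a page's embedded resources that the archive failed to capture. This is only a simplified sketch with assumed, equal per-type weights; the cited work describes the actual measurement, and the function and parameter names here are hypothetical.

```python
def memento_damage(resources, weights=None):
    """Estimate damage as the weighted fraction of missing embedded resources.

    `resources` is a list of (resource_type, was_archived) pairs.
    Returns 0.0 for an intact memento and 1.0 when everything is missing.
    """
    # ASSUMPTION: every resource type matters equally.
    weights = weights or {"image": 1.0, "stylesheet": 1.0, "javascript": 1.0}
    total = sum(weights.get(kind, 1.0) for kind, _ in resources)
    if total == 0:
        return 0.0
    missing = sum(weights.get(kind, 1.0)
                  for kind, archived in resources if not archived)
    return missing / total
```

Part of why a real calculation is slow is that the status of every embedded resource has to be checked for every candidate memento.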
J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson, “Not all mementos are created
equal: measuring the impact of missing resources,” International Journal on Digital Libraries, vol. 16, no. 3-4,
2015. https://doi.org/10.1007/s00799-015-0150-6.
E. Siregar, “Deploying the Memento-Damage Service,” https://ws-dl.blogspot.com/2017/11/2017-11-22-deploying-memento-damage.html, 2017.
24. @shawnmjones @WebSciDL
Step 6: select high-quality mementos
We favor pages with the following
features:
News over social media, because social
media posts produce poorer cards
Longer URLs with deeper paths, because
they contain more unique information and
thus produce better cards
Low memento damage
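The three preferences above can be combined into a sort key for choosing one memento per cluster. The precedence used here (non-social-media first, then lower damage, then deeper path) and the list of social media domains are assumptions made for this sketch; the slide does not say how the features are weighed against each other.

```python
from urllib.parse import urlparse

# ASSUMPTION: a small illustrative list, not an authoritative one.
SOCIAL_MEDIA_DOMAINS = {"twitter.com", "facebook.com", "tumblr.com", "instagram.com"}


def path_depth(url):
    """Count non-empty path segments in a URL."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])


def quality_key(candidate):
    """Sort key for a candidate dict with 'original_url' and 'damage' entries.

    Larger tuples are better.
    """
    host = urlparse(candidate["original_url"]).netloc.lower()
    is_social = any(host == d or host.endswith("." + d) for d in SOCIAL_MEDIA_DOMAINS)
    return (not is_social, -candidate["damage"], path_depth(candidate["original_url"]))


def select_best(cluster):
    """Pick the highest-quality candidate from one cluster."""
    return max(cluster, key=quality_key)
```

Given a tweet, a news site's home page, and a deep article URL from the same news site with equal damage, this key selects the deep article URL.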
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived
Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318.
http://doi.org/10.1145/3091478.3091508
27. @shawnmjones @WebSciDL
Storify shut down on May 16, 2018
https://storify.com/
Originally we visualized
stories from web archive
collections by using
Storify.
Because Storify shut
down in 2018, we need
to visualize the stories
with alternative tools.
28. @shawnmjones @WebSciDL
Creating a Web Archive Summary Story on Facebook
1. Create a Facebook post with the title
of your story as the text
2. Create a comment
3. Take the first URL from your story
resources and place it in the
comment
4. Wait for the card to appear
5. Repeat for each additional URL, in
order
29. @shawnmjones @WebSciDL
Problems with Summary Stories on Facebook
Facebook does not always generate
a card for a link.
You can fix this by editing the comment
and inserting your own text and image.
If a logged-in Facebook user clicks
on a card, they may not get to the
memento.
Facebook adds extra “stuff” to the end of
a URL for logged-in users, and web
archives may consider this “stuff” to be
part of the URL, so your viewer will get a
404.
Comments do not always appear in
the order you inserted them.
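To illustrate the second problem, the sketch below removes an appended tracking parameter from a memento URL before it is resolved. It assumes the extra "stuff" is a query parameter such as `fbclid`; the archive host in the usage note is a placeholder, and a story author cannot apply this fix on the viewer's behalf, so it only shows what a cleaned URL would look like.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url, unwanted=("fbclid",)):
    """Return `url` without the named query parameters."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in unwanted]
    # Rebuild the URL with only the parameters we kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `strip_params("https://archive.example.org/20150401000000/http://example.com/page?id=7&fbclid=abc123")` returns the same URL ending in `?id=7`, which is the form the archive actually holds.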
30. @shawnmjones @WebSciDL
Existing platforms do not reliably produce social cards
for mementos…
If we cannot rely upon the
service to generate a social
card for a memento, our system
must then do the work to create
our own.
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
31. @shawnmjones @WebSciDL
Some services have stories, but not long-term
storytelling
Facebook stories
Image ref:
https://techcrunch.com/2018/04/05/facebook-stories-default/
Image ref:
https://techcrunch.com/2013/10/03/snapchat-gets-its-own-timeline-with-snapchat-stories-24-hour-photo-video-tales/
Snapchat stories
Image ref:
https://buffer.com/library/instagram-stories
Instagram stories
These platforms delete the user’s stories 24 hours after they are posted.
This form of social media storytelling is the opposite of what we are looking for.
We want the stories to be artifacts themselves.
32. @shawnmjones @WebSciDL
Existing card services create a confusing experience
for mementos
Who published these resources?
Archive-It?
CNN?
Is the story author sharing fake news?
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
embed.rocks social card
embed.ly social card
33. @shawnmjones @WebSciDL
Neither social media services nor card services were reliable
for storytelling, so we created MementoEmbed…
Information in the
MementoEmbed social
card is separated to
avoid issues of
confusion about
attribution.
MementoEmbed is
archive-aware. It can
locate information
about the memento
that is not available in
other cards.
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
34. @shawnmjones @WebSciDL
MementoEmbed just produces cards and information
for mementos, so we created Raintale to tell stories…
Raintale uses MementoEmbed to generate raw
HTML stories or publish them to services like
Twitter.
Raintale takes a text file consisting of the URLs
for the story.
This file could be generated by the Off-Topic Memento
Toolkit
This file could also be generated by you! You can
manually select memento URLs to insert into the story.
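A hand-built story file could be prepared as sketched below: order the selected memento URLs by capture time, then write them to a text file. This assumes memento URLs that embed a 14-digit capture timestamp in their path, as Wayback-style archives do, and a one-URL-per-line file layout; both are assumptions to check against the Raintale documentation. The function names and the example file name are hypothetical.

```python
import re

# Matches a 14-digit capture timestamp (YYYYMMDDHHMMSS) as a path segment.
TIMESTAMP = re.compile(r"/(\d{14})/")


def memento_timestamp(urim):
    """Extract the capture timestamp from a memento URL."""
    match = TIMESTAMP.search(urim)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {urim!r}")
    return match.group(1)


def order_by_time(urims):
    """Return memento URLs sorted from oldest to newest capture."""
    return sorted(urims, key=memento_timestamp)


def write_story_file(urims, path):
    """Write the time-ordered memento URLs to `path`, one per line (assumed)."""
    with open(path, "w", encoding="utf-8") as handle:
        for urim in order_by_time(urims):
            handle.write(urim + "\n")
```

Calling `write_story_file(selected_urls, "story-mementos.txt")` would produce a file that could then be handed to the storytelling tool.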
S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
35. @shawnmjones @WebSciDL
Raintale uses templates to let you customize the look
and destination media of your story
Extracted components as
Bootstrap cards used by the
fictional “My Archive”
MementoEmbed cards
in Blogger
Extracted
components via
Twitter Thread
Extracted components
via MediaWiki
Pages
S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
36. @shawnmjones @WebSciDL
Summary
We introduced social media storytelling
Storytelling with web archives consists of 2 actions:
1. selecting k mementos from the collection of N mementos where k << N
2. visualizing our selection of k mementos via a social media story
Per action 1, we covered:
the challenges of manually selecting mementos for stories
the steps of AlNoamany’s Algorithm for automatically selecting mementos
Per action 2, we highlighted
the need for Archive-Aware cards via MementoEmbed
the Raintale storytelling tool for web archives
Thus, we covered how to conduct storytelling with web archives
37. @shawnmjones @WebSciDL
For More Information on Dark and Stormy Archives
Dark and Stormy Archives Project: https://oduwsdl.github.io/dsa/
Laboratory exercises:
https://github.com/oduwsdl/dsa/tree/master/tutorials/CEDWARC-2019
Produce the mementos for your story with the
Off-Topic Memento Toolkit (OTMT):
Distribution Page: https://pypi.org/project/otmt/
Report Issues: https://github.com/oduwsdl/off-topic-memento-toolkit/issues
Create social cards of mementos with MementoEmbed:
Documentation: https://mementoembed.readthedocs.io/en/latest/
Report Issues: https://github.com/oduwsdl/MementoEmbed/issues
Generate and publish your story with Raintale:
Website: https://oduwsdl.github.io/raintale/
Documentation: https://raintale.readthedocs.io/en/latest/
Report Issues: https://github.com/oduwsdl/raintale/issues
38. @shawnmjones @WebSciDL
Storytelling With Web
Archives
An overview of building a small story from a large collection
Thanks to:
Shawn M. Jones
Web Science and Digital Libraries Research Group
Old Dominion University
RE-70-18-0005-18