With web archives, journalists find evidence and information to back up their stories, historians store information for later users, and social scientists can study the actions of humans during specific time periods. These different groups gain value not only from creating their own collections but from using the collections of others. Web archive collections store the content that would otherwise be lost. As users, we currently have no efficient way of understanding what is in each collection without manually reviewing all of its items. Web archives intentionally consist of different versions of the same document. With these multiple versions, we can watch the evolution of a single resource over time, following the changes to an organization or how the public learns the details of an unfolding news story. As aggregations of archived web pages, or mementos, these collections become resources unto themselves. While past work has used mementos for studying how web resources change over time or evaluated the changes to various industries, there is still theoretical work to be done in improving the usability of web archive collections. Our goal is to help collection creators and the public at large to make better use of these collections through improvements to collection understanding. We build upon the work of AlNoamany by using visualizations from social media storytelling. Our goal is to produce a story for each web archive collection. Each story consists of representative mementos selected from the web archive collection that are then individually visualized as surrogates (e.g., screenshots, cards containing a summary of the page). This solution has the benefit of using visualization paradigms familiar to users. In this work, we provide background on the problem, analyze previous work in this area, and highlight our preliminary work before providing a plan for future research.
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Shawn Jones
Presented at ACM CIKM 2019. Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search engine results and social media links are represented as surrogates, small easily digestible summaries of the underlying page. Search engines and social media have a different focus, and hence produce different surrogates than web archives. Search engine surrogates help a user answer the question "Will this link meet my information need?" Social media surrogates help a user decide "Should I click on this?" Our use case is subtly different. We hypothesize that groups of surrogates together are useful for summarizing a collection. We want to help users answer the question "What does the underlying collection contain?" But which surrogate should we use? With Mechanical Turk participants, we evaluate six different surrogate types against each other. We find that the type of surrogate does not influence the time to complete the task we presented to the participants. Of particular interest are social cards, surrogates typically found on social media, and browser thumbnails, screen captures of web pages rendered in a browser. At p=0.0569 and p=0.0770, respectively, we find that social cards and social cards paired side-by-side with browser thumbnails probably provide better collection understanding than the surrogates currently used by the popular Archive-It web archiving platform. We measure user interactions with each surrogate and find that users interact with social cards less than with other types. The results of this study have implications for our web archive summarization work, live web curation platforms, social media, and more.
I presented this at iPres 2018. It consists of an analysis of some structural features found in Archive-It collections. We also categorize Archive-It collections into 4 different semantic categories and then use the structural features to predict these categories with a Random Forest Classifier.
Where Can We Post Stories Summarizing Web Archive Collections
Shawn Jones
This is a presentation of social media storytelling tools that were covered in a blog post written for the Web Science and Digital Libraries research group: http://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html
I presented this paper at iPres 2018. Here, we introduce the Off-Topic Memento Toolkit, used to detect versions of web pages that have drifted off topic from the general topic of a collection.
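A minimal sketch of the idea, assuming that a memento counts as off-topic when its text is too dissimilar to the first memento of the same page. The Jaccard measure, the 0.1 threshold, the sample texts, and all function names are illustrative assumptions, not the toolkit's actual interface.

```python
import re

def tokens(text):
    """Lowercase word set for a crude bag-of-words comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def flag_off_topic(memento_texts, threshold=0.1):
    """Return indices of mementos whose similarity to the first one is below threshold."""
    baseline = tokens(memento_texts[0])
    return [
        i for i, text in enumerate(memento_texts[1:], start=1)
        if jaccard_similarity(baseline, tokens(text)) < threshold
    ]

# Hypothetical text extracted from three captures of the same page.
timemap = [
    "Hurricane relief efforts continue along the coast",
    "Coast hurricane relief efforts expand to inland towns",
    "This domain is for sale. Contact the registrar.",
]
off_topic = flag_off_topic(timemap)
```

Here the third capture shares only one word with the first and falls below the threshold, while the second capture stays on topic. A real collection would need a threshold tuned against labeled examples.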
Improving Collection Understanding in Web Archives
Shawn Jones
We propose using visualization of representative mementos to aid in collection understanding of web archive collections, as inspired by AlNoamany's work.
Combining Social Media Storytelling With Web Archives
Shawn Jones
(This was a guest presentation for CS6604 - Digital Libraries - Fall 2019 - taught by Edward A. Fox)
Web archive collections consist of 1000s of documents. Manually making sense of collections at this scale is difficult. We propose using social media storytelling to aid in summarizing web archive collections. We discuss AlNoamany's Algorithm for generating a representative sample from these collections and highlight how to use the Dark and Stormy Archives toolkit.
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Group
ws-dl.cs.odu.edu
@WebSciDL
This work is supported in part by IMLS LG-71-15-0077
Old Dominion University ECE Department Colloquium
2015-11-13
Summarizing archival collections using storytelling techniques
Michael Nelson
Summarizing archival collections using storytelling techniques
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole
Los Angeles, CA, 2016-10-14
Storytelling for Summarizing Collections in Web Archives
Michael Nelson
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Group
@WebSciDL
This work is supported in part by IMLS LG-71-15-0077
CNI Spring 2016
2016-04-05
Improving Collection Understanding For Web Archives With Storytelling: Shinin...
Shawn Jones
Collections are the tools that people use to make sense of an ever-increasing number of archived web pages. As collections themselves grow, we need tools to make sense of them. Tools that work on the general web, like search engines, are not a good fit for these collections because search engines do not currently represent multiple document versions well. Web archive collections themselves are vast, some containing hundreds of thousands of documents. There are also thousands of collections, many of which cover the same topic. Few collections include standardized metadata. Too many documents from too many collections with not enough metadata makes collection understanding an expensive proposition.
This dissertation establishes a five-process model to assist with web archive collection understanding. This model aims to automatically produce a social media story -- a visualization paradigm with which most web users are already familiar. Each social media story contains surrogates which are summaries of individual documents. These surrogates, when collected together, summarize the overall topic of the story. After applying our storytelling model, they summarize the topic of a web archive collection.
We develop and test a framework to select the best exemplars that represent a collection. We establish that algorithms produced from these primitives select exemplars that are otherwise undiscoverable using conventional search engine methods. We generate story metadata to improve the information scent of a story so users can understand it better. After an analysis showing that existing platforms perform poorly for web archives and a user study establishing the best surrogate type, we generate document metadata for the exemplars with machine learning. We then visualize the story and document metadata together and distribute it to satisfy the information needs of multiple personas who benefit from our model.
Our tools serve as a reference implementation of our Dark and Stormy Archives storytelling model. Hypercane selects exemplars and generates story metadata. MementoEmbed generates document metadata. Raintale visualizes and distributes the story based on the story metadata and the document metadata of these exemplars. By providing understanding at a glance, our stories save users the time and effort of reading thousands of documents and, most importantly, help them understand web archive collections.
The Power of Sharing Linked Data - ELAG 2014 Workshop
Richard Wallis
Presentation to set the scene and stimulate discussion in the Workshop "The Power of Sharing Linked Data" at ELAG 2014 - Bath University, UK June 10/11 2014
Are museums a dial that only goes to 5?
Michael Edson
For Social Media Week, Washington, D.C., "Defining and measuring social media success in museums and arts organizations." http://socialmediaweek.org/blog/event/are-you-remarkable-defining-and-measuring-social-media-success-in-museums-and-arts-organizations/#.US4XyOtARCQ
This presentation provides an accessible introduction to Linked Open Data (LOD) and how LOD is modelled and made available online. The presenters will discuss several LOD projects created by libraries and archives in order to illustrate the benefits of applying LOD principles and practices. They will also demonstrate easy ways to leverage the power of LOD for archival organizations and their digital collections, with concrete examples involving WikiData, Omeka S, and the SNAC (Social Networks and Archival Context) Project.
Society of Georgia Archivists 2018 Annual Meeting
Speakers:
Josh Hogan, Atlanta University Center Robert W. Woodruff Library
Cliff Landis, Atlanta University Center Robert W. Woodruff Library
"Libraries always remind me that there are good things in this world."
Print -
Print Resources. University and college libraries tend to have more recent and detailed materials, most of which are print resources, than community or other lending libraries. ... Print resources are books, journals, newspapers, and other documents containing relevant information.
# E Print / Digital / NON Print
An information explosion has been with us for several decades. ... Nonbook materials consist of periodicals, newspapers, pamphlets, maps, photographs, pictures, posters, slides, film strips, motion pictures, video tapes, cassettes, microfilms and microfiches, computer disks, etc.
What is a book? What implications do new digital formats and communications media have for our answer to this question? Kudos enables authors to connect books to related materials in all media, to expand their appeal and discoverability. Slides from a presentation given to the Faculty of Humanities and Social Sciences at the University of Liverpool, for Academic Book Week 2015 (10th November 2015).
Slides from a talk given by Stacy Allison-Cassin and William Denton, of York University, at the Ontario Library Association 2009 Super Conference, 29 January 2009.
Available under a Creative Commons license.
http://hdl.handle.net/10315/2501
Dr. Michael Nelson is a professor of computer science at Old Dominion University. Prior to joining ODU, he worked at NASA Langley Research Center from 1991 to 2002. He is a co-editor of the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), OAI-ORE (Open Archives Initiative Object Reuse and Exchange), Memento and ResourceSync specifications. His research interests include repository-object interaction and alternative approaches to digital preservation.
The Royal Society of Chemistry hosts one of the world's richest collections of online chemistry data that is free-to-access for the community. ChemSpider presently hosts over 30 million unique chemical compounds together with associated data, accessible via a number of search techniques. With almost 50,000 unique users per day from around the world, the site offers scientists the ability to investigate the world of small molecules via property searches, analytical data and predictive models. The challenges associated with providing a similar platform for “materials” are manifold but, if they could be addressed, would offer a valuable service to the materials community. This presentation will provide an overview of how ChemSpider was built, our efforts to expand the capabilities to a more encompassing data repository and some of the challenges faced to embrace the diverse world of materials informatics and online data access.
Emerging Technologies for Libraries and Librarians, 2013
Jennifer Baxmeyer
Slides from a presentation given to students in Professor Andrew P. Jackson's "Organization and Management: Public Libraries" class in the Graduate School of Library and Information Studies at Queens College in Queens, NY.
This is the updated Social Work Research slideshow (Feb 19, 2014) which includes databases and how to search them; how to use the online catalog effectively for research; how to find online books on social work through the online catalog. Questions? llord@ku.edu
Digital Transformation and Data - the Wikimedia Residency at the University o...
Ewan McAndrew
Digital Transformation and Data — The Wikimedia Residency at the University of Edinburgh
This presentation took place at SCURL’s ‘Libraries, Literacies & Learning’ event 23 March 2018.
Abstract Images Have Different Levels of Retrievability Per Reverse Image Sea...
Shawn Jones
Much computer vision research has focused on natural images, but technical documents typically consist of abstract images, such as charts, drawings, diagrams, and schematics. How well do general web search engines discover abstract images? Recent advancements in computer vision and machine learning have led to the rise of reverse image search engines. Where conventional search engines accept a text query and return a set of document results, including images, a reverse image search accepts an image as a query and returns a set of images as results. This paper evaluates how well common reverse image search engines discover abstract images. We conducted an experiment leveraging images from Wikimedia Commons, a website known to be well indexed by Baidu, Bing, Google, and Yandex. We measure how difficult an image is to find again (retrievability), what percentage of images returned are relevant (precision), and the average number of results a visitor must review before finding the submitted image (mean reciprocal rank). When trying to discover the same image again among similar images, Yandex performs best. When searching for pages containing a specific image, Google and Yandex outperform the others when discovering photographs with precision scores ranging from 0.8191 to 0.8297, respectively. In both of these cases, Google and Yandex perform better with natural images than with abstract ones achieving a difference in retrievability as high as 54% between images in these categories. These results affect anyone applying common web search engines to search for technical documents that use abstract images.
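Two of the measures named above can be sketched as follows. Precision is the fraction of returned results that are relevant. Mean reciprocal rank is conventionally computed as the average of 1/rank of the submitted image across queries, so a higher score means the image appears nearer the top. The image identifiers and result lists are hypothetical, and retrievability is omitted because the abstract does not give its formula.

```python
def precision(results, relevant):
    """Fraction of returned results that are relevant."""
    if not results:
        return 0.0
    return sum(1 for r in results if r in relevant) / len(results)

def reciprocal_rank(results, target):
    """1/rank of the target in the result list, or 0.0 if it is absent."""
    for rank, r in enumerate(results, start=1):
        if r == target:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (results, target) pairs."""
    return sum(reciprocal_rank(res, tgt) for res, tgt in queries) / len(queries)

# Hypothetical reverse image search results for three submitted images.
queries = [
    (["img_a", "img_x", "img_y"], "img_a"),  # found at rank 1
    (["img_x", "img_b", "img_y"], "img_b"),  # found at rank 2
    (["img_x", "img_y", "img_z"], "img_c"),  # not found
]
mrr = mean_reciprocal_rank(queries)  # (1 + 0.5 + 0) / 3
p = precision(["img_a", "img_x", "img_y", "img_b"], {"img_a", "img_b"})  # 2 of 4
```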
DIRA 2022 Poster -- Abstract Images Have Different Levels of Retrievability P...
Shawn Jones
Much computer vision research has focused on natural images, but technical documents typically consist of abstract images, such as charts, drawings, diagrams, and schematics. How well do general web search engines discover abstract images? Recent advancements in computer vision and machine learning have led to the rise of reverse image search engines. Where conventional search engines accept a text query and return a set of document results, including images, a reverse image search accepts an image as a query and returns a set of images as results. This paper evaluates how well common reverse image search engines discover abstract images. We conducted an experiment leveraging images from Wikimedia Commons, a website known to be well indexed by Baidu, Bing, Google, and Yandex. We measure how difficult an image is to find again (retrievability), what percentage of images returned are relevant (precision), and the average number of results a visitor must review before finding the submitted image (mean reciprocal rank). When trying to discover the same image again among similar images, Yandex performs best. When searching for pages containing a specific image, Google and Yandex outperform the others when discovering photographs with precision scores ranging from 0.8191 to 0.8297, respectively. In both of these cases, Google and Yandex perform better with natural images than with abstract ones achieving a difference in retrievability as high as 54% between images in these categories. These results affect anyone applying common web search engines to search for technical documents that use abstract images.
It’s All About The Cards: Sharing on Social Media Encouraged HTML Metadata G...
Shawn Jones
In a perfect world, all articles consistently contain sufficient metadata to describe the resource. We know this is not the reality, so we are motivated to investigate the evolution of the metadata that is present when authors and publishers supply their own. Because applying metadata takes time, we recognize that each news article author has a limited metadata budget with which to spend their time and effort. How are they spending this budget? What are the top metadata categories in use? How did they grow over time? What purpose do they serve? We also recognize that not all metadata fields are used equally. What is the growth of individual fields over time? Which fields experienced the fastest adoption? In this paper, we review 227,726 HTML news articles from 29 outlets captured by the Internet Archive between 1998 and 2016. Upon reviewing the metadata fields in each article, we discovered that 2010 began a metadata renaissance as publishers embraced metadata for improved search engine ranking, search engine tracking, social media tracking, and social media sharing. When analyzing individual fields, we find that one application of metadata stands out above all others: social cards -- the cards generated by platforms like Twitter when one shares a URL. Once a metadata standard was established for cards in 2010, its fields were adopted by 20% of articles in the first year and reached more than 95% adoption by 2016. This rate of adoption surpasses efforts like schema.org and Dublin Core by a fair margin. When confronted with these results on how news publishers spend their metadata budget, we must conclude that it is all about the cards.
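As a rough illustration of what reviewing metadata fields can look like, the sketch below collects social card fields from an HTML page using only the Python standard library. It assumes that Open Graph (`og:`) and Twitter card (`twitter:`) `<meta>` fields are the ones of interest; the paper's own field categories may differ, and the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser

class CardMetaParser(HTMLParser):
    """Collect <meta> fields whose key starts with og: or twitter:."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph uses the property attribute; Twitter cards use name.
        key = attr.get("property") or attr.get("name") or ""
        if key.startswith(("og:", "twitter:")) and attr.get("content"):
            self.fields[key] = attr["content"]

def card_fields(html):
    """Return a dict of social card metadata found in the HTML text."""
    parser = CardMetaParser()
    parser.feed(html)
    return parser.fields

sample = """
<html><head>
<title>Example story</title>
<meta name="description" content="Plain description">
<meta property="og:title" content="Example story">
<meta property="og:image" content="https://example.com/lead.jpg">
<meta name="twitter:card" content="summary_large_image">
</head><body></body></html>
"""
fields = card_fields(sample)
```

Counting which keys appear per article and per capture year would then give adoption curves of the kind the paper describes.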
Automatically Selecting Striking Images for Social Cards
Shawn Jones
To allow previewing a web page, social media platforms have developed social cards: visualizations consisting of vital information about the underlying resource. At a minimum, social cards often include features such as the web resource's title, text summary, striking image, and domain name. News and scholarly articles on the web are frequently subject to social card creation when being shared on social media. However, we noticed that not all web resources offer sufficient metadata elements to enable appealing social cards. For example, the COVID-19 emergency has made it clear that scholarly articles, in particular, are at an aesthetic disadvantage in social media platforms when compared to their often more flashy disinformation rivals. Also, social cards are often not generated correctly for archived web resources, including pages that lack or predate standards for specifying striking images. With these observations, we are motivated to quantify the levels of inclusion of required metadata in web resources, its evolution over time for archived resources, and create and evaluate an algorithm to automatically select a striking image for social cards. We find that more than 40% of archived news articles sampled from the NEWSROOM dataset and 22% of scholarly articles sampled from the PubMed Central dataset fail to supply striking images. We demonstrate that we can automatically predict the striking image with a Precision@1 of 0.83 for news articles from NEWSROOM and 0.78 for scholarly articles from the open access journal PLOS ONE.
A presentation of the work I had done with the Research Library Prototyping Team at Los Alamos National Laboratory given to the local chapter of the Special Libraries Association in New Mexico.
Avoiding Spoilers On MediaWiki Fan Sites Using Memento
Shawn Jones
A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if readers are behind in their viewing they run the risk of encountering "spoilers" -- information that gives away key plot points before the time the show's writers intended. Enterprising readers might browse the wiki in a web archive so as to view the page prior to a specific episode date and thereby avoid spoilers. Unfortunately, due to how web archives choose the "best" page, it is still possible to see spoilers (especially in sparse archives).
In this presentation we highlight the issues with avoiding spoilers using Memento. We show that for a sample of fan wiki pages there is as much as a 66% chance of encountering a spoiler. We also find, using logs from the Internet Archive, that 19% of actual requests to the Wayback Machine for wikia.com end in spoilers. We suggest a different heuristic for use with wikis and unveil the Memento MediaWiki Extension as a solution.
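One way to picture the problem, assuming the archive returns the capture closest in time to the requested datetime even when that capture is later than the request. The alternative shown, choosing the most recent capture at or before the requested datetime, is an illustrative assumption about the kind of heuristic the talk proposes. Function names and dates are hypothetical.

```python
from datetime import datetime

def closest_capture(captures, requested):
    """Pick the capture nearest in time to the request, before or after it."""
    return min(captures, key=lambda c: abs(c - requested))

def latest_capture_not_after(captures, requested):
    """Pick the most recent capture at or before the request, or None."""
    earlier = [c for c in captures if c <= requested]
    return max(earlier) if earlier else None

# A sparse archive: only two captures of the wiki page exist.
captures = [datetime(2013, 1, 1), datetime(2013, 3, 10)]
episode_air_date = datetime(2013, 3, 1)  # reader wants the page as of this date

nearest = closest_capture(captures, episode_air_date)
safe = latest_capture_not_after(captures, episode_air_date)
```

With these dates, the nearest capture is the one from March 10, which postdates the episode and may contain spoilers, while the second function returns the January 1 capture.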
Reconstructing the past with media wiki
Shawn Jones
The Internet Archive attempts to reconstruct web pages via snapshots (Mementos) that are taken of pages at various points in time. Many pages change more frequently than the Internet Archive can capture them, meaning that some revisions of a given web page are lost forever. Mediawiki, however, has all past revisions of a given page, and also its associated external resources. This inspired the development of the Memento Mediawiki Extension as an improvement over the Internet Archive's "drive by" method of digital preservation where Mediawiki sites are involved.
While working on the Memento Mediawiki Extension, effort was put into reconstructing past revisions of each Wiki page. The existing software reconstructs the page text as per RFC 7089, but does not try to pull in past versions of images, JavaScript, CSS, and other external resources, because Mediawiki, as it exists, makes it difficult or impossible to load these resources at page generation time.
This curated talk will explore the problems of page reconstruction on the main web and detail the issues within the Mediawiki code that currently prevent and/or make it difficult to reconstruct the page in its totality as it looked at that revision.
Improving Understanding of Web Archive Collections Through Storytelling - PhD Candidacy Exam
1. @shawnmjones @WebSciDL
Improving Understanding of
Web Archive Collections
Through Storytelling
PhD Candidacy Exam for: Shawn M. Jones
Committee:
Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna
4. @shawnmjones @WebSciDL
Let’s say: you find a bag
There are thousands of different items inside.
Can you use the contents of this bag?
How quickly can you make this decision?
5. @shawnmjones @WebSciDL
Now let’s say: there are thousands of bags
Which one might contain something useful for you?
Do any?
How do you know?
How do you decrease your chances of wasting your time?
7. @shawnmjones @WebSciDL
Researchers create their own web archive collections
Archived web pages, or mementos, are used by journalists, sociologists, and historians.
Collections shown: Tucson Shootings, 2008 Olympics, University of Utah
8. @shawnmjones @WebSciDL
Web archive collections have many versions of the same page
2013
2015
2018
University of Utah Office of Admissions
from the University of Utah Web Archive Collection
4/1/2015
3/5/2015
Tumblr Black Lives Matter Blog
from the #blacklivesmatter Collection
2/12/2015
9. @shawnmjones @WebSciDL
Different versions allow us to see an unfolding news story
Memento from April 19, 2013 17:12: Searching for suspects, City on lockdown
Memento from April 19, 2013 17:59: Officer Donahue in hospital, Lockdown loosened, Will the Red Sox game be cancelled?
Memento from April 24, 2013 2:24: Suspect found, Officer Collier lost life, Obama speaks
11. @shawnmjones @WebSciDL
Archive-It allows curators to easily create collections
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos.
12. @shawnmjones @WebSciDL
… and these collections are used by other researchers
The collection curator is not the only user of the collection!
These collections live a life after their curator has stopped adding to them.
13. @shawnmjones @WebSciDL
How do we tell the difference between collections?
What is the difference between these two Archive-It collections about the South Louisiana Flood of 2016?
Which one should a researcher use?
14. @shawnmjones @WebSciDL
31 Archive-It collections match the search query “human rights”
How are they different from each other?
Which one is best for my needs?
17. @shawnmjones @WebSciDL
But, alas, the metadata does not help
Because metadata is optional, it is not always present.
Metadata on Archive-It collections:
• many different curators
• different organizations
• different content standards
• different rules of interpretation
• it is inconsistently applied
This means that a user cannot reliably compare metadata fields to understand the differences between collections.
One collection shown: 132,599 seeds, no metadata
Another collection shown: 9 seeds, with metadata
Paradox:
More seeds = more effort
More seeds = greater user need for metadata
18. @shawnmjones @WebSciDL
Reviewing mementos manually is costly
This collection has 132,599 seeds, many with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require reviewing 100,000+ documents to understand the collection
21. @shawnmjones @WebSciDL
The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent, or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 8000 collections.
Human review of these mementos for collection understanding is an expensive proposition.
22. @shawnmjones @WebSciDL
Our proposal: a visualization made of representative mementos
Our visualization is a summary that will act like an abstract
Pirolli and Card’s Information Foraging Theory:
• maximize the value of the information gained from our summaries
• minimize the cost of interacting with the collection
• ensure that our representative mementos have good information scent: they contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a social media story of ~28 surrogates
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
24. @shawnmjones @WebSciDL
Surrogates provide a visual summary of the content behind a URI…
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
[Figures: the same URI represented by a browser thumbnail surrogate and by a social card surrogate]
25. @shawnmjones @WebSciDL
Social media storytelling uses surrogates to provide a “summary of summaries”
[Figures: 2 resources are shown from this Wakelet story; 6 resources are shown from this Storify story]
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
27. @shawnmjones @WebSciDL
Web surrogates provide a visual summary of a web resource drawn from the content of the resource
• Browser Thumbnail (example from UK Web Archive)
• Text snippet (example from Bing)
• Social Card (example from Facebook)
• Text + Thumbnail (example from Internet Archive)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
28. @shawnmjones @WebSciDL
Our research questions
RQ1 (Collection Types): What types of web archive collections exist?
RQ2 (Surrogate Types): What surrogates work best for understanding collections of mementos?
RQ3 (Selecting Mementos): How do we select representative mementos for the different semantic types of collections?
RQ4 (Evaluating Stories): How well do stories produced by different summarization algorithms work for collection understanding?
29. @shawnmjones @WebSciDL
RQ2: What surrogates work best for web resources?
Studies on visualizing web resources have focused primarily on determining search engine result relevance and not collection understanding.
• Li (2008): social cards > text snippets in performance
• Dziadosz (2002): text + thumbnail > text snippet; text snippet > thumbnail in performance
• Woodruff (2001): thumbnails > text snippets in performance
• Teevan (2009): text snippets > thumbnails in performance
• Aula (2010): text snippets ~= thumbnails in performance
• Loumakis (2011): text snippets ~= social cards in performance; social cards > text snippets in information scent and user preference
• Capra (2013): social cards > text snippets in performance (barely statistically significant)
• Al Maqbali (2010): text + thumbnail ~= social card; text snippet ~= social card; text + thumbnail ~= text snippet in performance
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
30. @shawnmjones @WebSciDL
RQ3: How might we select representative mementos?
• Luhn (1958): automatic abstracts
• Silva (2014): word graphs from Luhn’s algorithm
• DUC Datasets (2001-2007)
• Napoles (2012): Gigaword
• Lin (2014): ROUGE metrics
• Grusky (2018): NEWSROOM
Existing reference summaries were built from news articles.
Existing reference summaries were not built from web archives.
• Mihalcea (2004): TextRank
• Dolan (2004): clustering news articles; Lede3 preferred by evaluators
• Xie (2008): MMR for meeting summaries
• Radev (1998): automatic news briefs
• Sipos (2008): scholarly corpus over time
• Zhang (2010)/Li (2011): aspects of disasters
• Hong (2014): word weighting
31. @shawnmjones @WebSciDL
RQ3: How might we select representative mementos? – Related Concepts
Scatter-Gather (Cutting 1992)
• allows a user to explore a collection by drilling through topic clusters until they reach individual documents
• we seek to provide a representative sample that a user can quickly glance at
Recommender Systems
• predict the preference of a user based on past behavior, demographic profile, or behavior of the user’s friends
• we want to provide a summary without any knowledge of the user
Zero-Query Systems
• predict the information a user will need based on time, location, environment, user interests, and other factors
• again, we want to provide a summary with no knowledge of the user
Image reference:
Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '92). Copenhagen, Denmark, pp. 318-329. https://doi.org/10.1145/133160.133214
32. @shawnmjones @WebSciDL
How have others explored collections?
Conta Me Histórias
ArchiveSpark
Archives Unleashed Cloud
Existing solutions allow users to query and develop statistics on collections.
Users must have some idea of a topic or concept a priori.
33. @shawnmjones @WebSciDL
How have others visualized collections for understanding?
Other attempts at visualizing Archive-It collections tried to visualize everything.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
K. Padia, Y. AlNoamany, and M. C. Weigle. 2012. Visualizing digital collections at archive-it. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (JCDL '12), 15–18. DOI:10.1145/2232817.2232821
34. @shawnmjones @WebSciDL
How have others told stories with web archive collections?
AlNoamany told stories via the storytelling platform Storify
She showed that test participants could not detect the difference between her automated stories and stories generated by human curators
Based on: characteristics of human-generated stories; characteristics of Archive-It collections
1. Exclude duplicates
2. Exclude off-topic pages
3. Exclude non-English language pages
4. Dynamically slice the collection
5. Cluster the pages in each slice
6. Select high-quality pages from each cluster
7. Order pages by time
8. Visualize
Y. AlNoamany, M. C. Weigle, and M. L. Nelson. 2017. Generating Stories From Archived Collections. In Proceedings of the 2017 ACM on Web Science Conference, 309–318. DOI:10.1145/3091478.3091508
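The selection steps listed on this slide can be illustrated with a small Python sketch. This is a toy under stated assumptions, not AlNoamany's implementation: the memento fields (`text`, `lang`, `off_topic`, `cluster`, `quality`) are hypothetical precomputed values, the number of time slices is a fixed parameter rather than chosen dynamically, and the final visualization step is left out.

```python
from collections import defaultdict


def select_story_mementos(mementos, num_slices=2):
    """Toy version of the slide's steps; every field name is hypothetical."""
    # Exclude duplicates (here: identical extracted text).
    seen, unique = set(), []
    for m in mementos:
        if m["text"] not in seen:
            seen.add(m["text"])
            unique.append(m)

    # Exclude off-topic pages and non-English pages (flags assumed precomputed).
    kept = [m for m in unique if not m["off_topic"] and m["lang"] == "en"]
    if not kept:
        return []

    # Slice the collection by time (fixed slice count: an assumption).
    kept.sort(key=lambda m: m["datetime"])
    size = -(-len(kept) // num_slices)  # ceiling division
    slices = [kept[i:i + size] for i in range(0, len(kept), size)]

    # Group each slice by its (assumed precomputed) cluster label and
    # select the highest-quality memento from each cluster.
    story = []
    for time_slice in slices:
        clusters = defaultdict(list)
        for m in time_slice:
            clusters[m["cluster"]].append(m)
        for members in clusters.values():
            story.append(max(members, key=lambda m: m["quality"]))

    # Order the selected mementos by time.
    story.sort(key=lambda m: m["datetime"])
    return story


def make(urim, dt, text, lang="en", off_topic=False, cluster="c1", quality=0.5):
    """Build a hypothetical memento record for the example."""
    return {"urim": urim, "datetime": dt, "text": text, "lang": lang,
            "off_topic": off_topic, "cluster": cluster, "quality": quality}


example = [
    make("a", "2013-04-19T17:12", "lockdown", quality=0.9),
    make("b", "2013-04-19T17:30", "lockdown"),            # duplicate text
    make("c", "2013-04-19T17:40", "noticias", lang="es"),  # non-English
    make("d", "2013-04-19T17:50", "ads", off_topic=True),  # off-topic
    make("e", "2013-04-19T17:59", "hospital", quality=0.5),
    make("f", "2013-04-24T02:24", "suspect found", cluster="c2", quality=0.7),
]
story_urims = [m["urim"] for m in select_story_mementos(example)]
```

Timestamps are ISO 8601 strings so that plain string sorting matches chronological order; a real system would need actual duplicate, off-topic, language, clustering, and quality measures in place of the precomputed flags.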
35. @shawnmjones @WebSciDL
How have others told stories with web archive collections?
AlNoamany told stories via the storytelling platform Storify – which is no longer in service
S. M. Jones. “Storify Will Be Gone Soon, So How Do We Preserve The Stories?” http://ws-dl.blogspot.com/2017/12/2017-12-14-storify-will-be-gone-soon-so.html, 2017.
36. @shawnmjones @WebSciDL
How have others told stories with web archive collections?
AlNoamany’s work:
• Did not evaluate if the resulting summaries were effective tools for collection understanding
• Focused on summarizing collections about events
• There are other types of Archive-It collections
38. @shawnmjones @WebSciDL
Outline
1. Motivation
2. Research Questions
3. Preliminary Work
   1. RQ1: What types of web archive collections exist?
   2. Partial RQ2: What surrogates work best for understanding collections of mementos?
      1. How effective are existing curation platforms at producing mementos?
      2. Preliminary user surrogate study
   3. Partial RQ3: How do we select representative mementos for the different semantic types of collections?
      1. The Off-Topic Memento Toolkit (OTMT)
4. Proposed Research
39. @shawnmjones @WebSciDL
As collection users, we view Archive-It collections from outside…
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
40. @shawnmjones @WebSciDL
As collection users, what structural features can we view from outside?
Using only structural features is advantageous because it saves one from having to download a collection’s content.
These structural features give us different insight than can be provided by text analysis or metadata.
Example collection: 81,014 seeds, 486,227 seed mementos
Structural features shown here:
• number of seeds
• number of mementos
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
41. @shawnmjones @WebSciDL
Was the collection built from web sites belonging to one domain or many?
[Figure: example collections with many domains and with one domain]
Structural feature discussed here:
• domain diversity
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
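To make the idea of domain diversity concrete, here is a minimal Python sketch. It assumes a simple ratio of distinct hostnames to seeds, which may differ from the exact measure used in the cited paper, and it uses the hostname as a rough stand-in for the registered domain. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def hostname(uri):
    """Return the lower-cased hostname of a URI, or '' if it has none."""
    return (urlsplit(uri).hostname or "").lower()


def domain_diversity(seed_uris):
    """Assumed measure: distinct hostnames divided by number of seeds.

    Values near 1.0 suggest seeds drawn from many sites; values near
    1/len(seed_uris) suggest a collection built from a single site.
    """
    hosts = [h for h in (hostname(u) for u in seed_uris) if h]
    if not hosts:
        return 0.0
    return len(set(hosts)) / len(hosts)


one_domain = [
    "https://www.utah.edu/",
    "https://www.utah.edu/admissions/",
    "https://www.utah.edu/news/",
    "https://www.utah.edu/about/",
]
many_domains = [
    "https://www.cnn.com/",
    "https://www.bbc.com/news",
    "https://www.npr.org/",
    "https://www.reuters.com/",
]
```

Only the seed URIs are needed for this calculation, which is consistent with the point made earlier that structural features avoid downloading a collection's content.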
42. @shawnmjones @WebSciDL
Were most of the web pages in the collection top-level pages or specific articles deeper in a web site?
[Figure: example collections with top-level pages and with deeper links]
Structural feature discussed here:
• path depth diversity
• most frequent path depth
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
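The path depth features can be sketched in a few lines of Python. This sketch assumes that path depth is the number of non-empty path segments in a seed URI and that diversity is the ratio of distinct depths to seeds; both definitions are illustrative assumptions rather than the cited paper's exact formulas, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(uri):
    """Count the non-empty path segments of a URI (assumed definition)."""
    return len([seg for seg in urlsplit(uri).path.split("/") if seg])


def most_frequent_path_depth(uris):
    """Return the most common path depth, or None for an empty list."""
    counts = Counter(path_depth(u) for u in uris)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def path_depth_diversity(uris):
    """Assumed measure: distinct path depths divided by number of URIs."""
    if not uris:
        return 0.0
    return len({path_depth(u) for u in uris}) / len(uris)


seeds = [
    "https://example.com/",
    "https://example.org/",
    "https://example.net/news/2016/flood.html",
    "https://example.edu/",
]
```

A most frequent depth of 0 would point to a collection of top-level pages, while larger values would point to specific articles deeper in a site.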
43. @shawnmjones @WebSciDL
Growth curves provide some understanding of collection curation behavior
• Skew of the collection’s holdings
• Indicates temporality of collection
• Skew of the curatorial involvement with the collection
• When seeds were added
• When interest was lost or regained
[Figure: example growth curves with positive and negative skew]
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
44. @shawnmjones @WebSciDL
Does most of the collection exist earlier or later in its life?
This collection was created in March 2010.
Most of its mementos come from 2016 – 2018.
Most of this collection exists later in its life.
Structural feature discussed here:
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
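One way to turn a growth curve into a single number is to normalize both axes to the range 0 to 1 and compute the area under the resulting step curve. The Python sketch below does this under assumptions of our own: the x-axis is the fraction of the collection's lifespan elapsed, the y-axis is the fraction of seed mementos captured so far, and the function name and normalization are hypothetical rather than taken from the cited paper.

```python
from datetime import datetime


def growth_curve_area(capture_times, start, end):
    """Area under a normalized cumulative growth curve (assumed definition).

    Each capture at normalized time x raises the step curve by 1/n from
    x until the end of the lifespan, so it contributes (1 - x) / n.
    """
    lifespan = (end - start).total_seconds()
    if lifespan <= 0 or not capture_times:
        raise ValueError("need a positive lifespan and at least one capture")
    n = len(capture_times)
    area = 0.0
    for t in capture_times:
        x = (t - start).total_seconds() / lifespan
        x = min(max(x, 0.0), 1.0)  # clamp captures outside the lifespan
        area += (1.0 - x) / n
    return area


start = datetime(2010, 1, 1)
end = datetime(2010, 1, 11)
early = growth_curve_area([start, start], start, end)
late = growth_curve_area([end, end], start, end)
middle = growth_curve_area([datetime(2010, 1, 6)], start, end)
```

Under this definition an area above 0.5 means most mementos were captured early in the collection's life, and an area below 0.5 means most were captured late, as in the collection described on this slide.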
45. @shawnmjones @WebSciDL
When did the curator select and archive a collection’s contents?
This collection was created in March 2006.
Some of the seeds were selected in 2006.
Many of the seeds were selected all along its life.
It has mementos as recent as July 2018.
Structural feature discussed here:
• area under the seed growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
46. @shawnmjones @WebSciDL
Did the curator create a collection intended to archive new versions of the same web pages repeatedly?
This collection was created in June 2014.
The seeds were selected toward the beginning of its life.
Mementos were captured all during its life.
Structural feature discussed here:
• area under the seed growth curve
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
47. @shawnmjones @WebSciDL
We discovered four semantic categories in Archive-It collections…
• Self-Archiving
• Subject-based
• Time Bounded – Expected
• Time Bounded – Spontaneous
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
52. @shawnmjones @WebSciDL
We discovered four semantic categories in Archive-It collections…
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
53. @shawnmjones @WebSciDL
We can bridge the structural to the descriptive…
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections
Some evaluated by AlNoamany
Using the structural features mentioned previously, we can predict these semantic categories with a Random Forest classifier with F1 = 0.720
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
55. @shawnmjones @WebSciDL
We have identified different types of Archive-It collections
[Progress diagram: RQ1: Collection Types ✅; RQ2: Surrogate Types; RQ3: Selecting Mementos; RQ4: Evaluating Stories]
Self-Archiving, Subject-based, Time Bounded – Expected, Time Bounded – Spontaneous
We can take these features into account to address the other research questions.
So, let’s tell some stories on social media!
Not so fast…
57. @shawnmjones @WebSciDL
Existing platforms do not reliably produce surrogates for mementos…
If we cannot rely upon the service to generate a surrogate for a memento, our system must then do the work to create our own surrogates.
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
58. @shawnmjones @WebSciDL
Some services have stories, but not long term storytelling?
• Facebook stories (image ref: https://techcrunch.com/2018/04/05/facebook-stories-default/)
• Snapchat stories (image ref: https://techcrunch.com/2013/10/03/snapchat-gets-its-own-timeline-with-snapchat-stories-24-hour-photo-video-tales/)
• Instagram stories (image ref: https://buffer.com/library/instagram-stories)
These platforms delete the user’s stories 24 hours after they are posted.
This form of social media storytelling is the opposite of what we are looking for.
We want the stories to be artifacts themselves.
59. @shawnmjones @WebSciDL
Some services’ longevity is in doubt…
RIP: Google+ 2019
RIP: Tumblr (soon?)
RIP: Storify 2018
S. M. Jones. “Where Can We Post Stories Summarizing Web Archive Collections?” https://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html, 2017.
60. @shawnmjones @WebSciDL
Existing surrogate services create a confusing
experience for mementos
Who published these resources?
Archive-It?
CNN?
Is the story author sharing fake news?
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
embed.rocks surrogate
embed.ly surrogate
61. @shawnmjones @WebSciDL
Neither social media services nor surrogate services were
reliable for storytelling, so we created MementoEmbed…
Information in the MementoEmbed social card surrogate is separated to avoid confusion about attribution.
MementoEmbed is archive-aware. It can locate information about the memento that is not available in other surrogates.
S. M. Jones. “A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages.” https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
62. @shawnmjones @WebSciDL
MementoEmbed provides us with a tool for evaluating
surrogates, a step on the road to answering RQ2…
RQ2:
Surrogate Types
RQ3:
Selecting
Mementos
RQ4:
Evaluating
Stories
RQ1:
Collection Types
✅
☑️ ☑️
63. @shawnmjones @WebSciDL
Outline
1. Motivation
2. Research Questions
3. Preliminary Work
1. RQ1: What types of web archive collections exist?
2. Partial RQ2: What surrogates work best for understanding collections of mementos?
1. How effective are live web curation platforms at producing mementos?
2. Preliminary user surrogate study
3. Partial RQ3: How do we select representative mementos for the different semantic types of
collections?
1. The Off-Topic Memento Toolkit (OTMT)
4. Proposed Research
64. @shawnmjones @WebSciDL
Using stories built from curator-selected mementos, we
shared stories with MT participants…
Surrogate types shown:
• Archive-It like
• Browser thumbnails
• Social Card
• Social Card With Thumbnail as Image (sc/t)
• Social Card With Thumbnail to Right (sc+t)
• Social Card with Thumbnail on Hover (sc^t)
• 4 stories of 15-17 mementos selected by human Archive-It
curators from their collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• 120 MT participants
• Given 30 seconds to view each story
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
65. @shawnmjones @WebSciDL
And then we asked them which 2 of 6 mementos
came from the same collection…
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This is similar to the Sentence Verification Task from reading comprehension studies.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
66. @shawnmjones @WebSciDL
Response times per surrogate had interesting means, but
the differences were not statistically significant at p < 0.05
p = 0.190
p = 0.202
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
67. @shawnmjones @WebSciDL
Correct answers per surrogate indicate that social
cards probably outperform the Archive-It surrogate
[Bar chart: Correct Answers Per Surrogate, showing median and mean on a scale of 0 to 2.5 for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t]
p = 0.0569
p = 0.0770
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
68. @shawnmjones @WebSciDL
Whenever thumbnails are present, more users interact
with them
We could not detect if participants were zooming in to view thumbnails, but most hovered when confronted
with a thumbnail, regardless of surrogate.
For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the
surrogate.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web
Archive Collections,” Tech. Rep. 1905.11342, Old Dominion University, 2019. https://arxiv.org/abs/1905.11342.
69. @shawnmjones @WebSciDL
We have some results indicating that social cards
perform better, but there is more to answering RQ2…
RQ2:
Surrogate Types
RQ3:
Selecting
Mementos
RQ4:
Evaluating
Stories
RQ1:
Collection Types
✅
☑️ ☑️
[Bar chart: Correct Answers Per Surrogate (median and mean) for the six surrogate types]
70. @shawnmjones @WebSciDL
Outline
1. Motivation
2. Research Questions
3. Preliminary Work
1. RQ1: What types of web archive collections exist?
2. Partial RQ2: What surrogates work best for understanding collections of mementos?
3. Partial RQ3: How do we select representative mementos for the different semantic
types of collections?
1. The Off-Topic Memento Toolkit (OTMT)
4. Proposed Research
71. @shawnmjones @WebSciDL
Identifying off-topic mementos is key to choosing
representative mementos
Collections have a theme.
Seeds are selected to support that theme.
Mementos are versions of seeds.
Some of these versions are off-topic.
Identifying these off-topic mementos is key to summarization.
Examples of off-topic mementos:
• Hacked
• Moved on from topic
• Web Page Gone
• Account Suspension
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
72. @shawnmjones @WebSciDL
The Off-Topic Memento Toolkit (OTMT) compares a seed’s first
memento with the seed’s other mementos via different
measures…
Measure                                 Fully Equivalent Score  Fully Dissimilar Score  Preprocessing Performed  OTMT -tm keyword
Byte Count                              0.0                     -1.0                    No                       bytecount
Word Count                              0.0                     -1.0                    Yes                      wordcount
Jaccard Distance                        0.0                     1.0                     Yes                      jaccard
Sørensen-Dice                           0.0                     1.0                     Yes                      sorensen
Simhash of Term Frequencies             0                       64                      Yes                      simhash-tf
Simhash of raw memento                  0                       64                      No                       simhash-raw
Cosine Similarity of TF-IDF Vectors     1.0                     0                       Yes                      cosine
Cosine Similarity of LSI Vectors        1.0                     0                       Yes                      gensim_lsi
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
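The comparison of a seed's first memento with a later one can be sketched as below. This is a minimal illustration and not the OTMT implementation: the tokenizer, the function names, and the relative-change formula for word count are assumptions chosen so that the scores match the "fully equivalent" and "fully dissimilar" values in the table.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for real preprocessing)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(first, other):
    """0.0 when the term sets are identical, 1.0 when they share no terms."""
    a, b = tokenize(first), tokenize(other)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first, other):
    """Same range as Jaccard, but shared terms are weighted twice."""
    a, b = tokenize(first), tokenize(other)
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))


def word_count_change(first, other):
    """Assumed formula: relative change in word count versus the first memento.

    0.0 means the same number of words; -1.0 means the later memento is empty.
    A later memento that grew would give a positive value.
    """
    n_first = len(first.split())
    if n_first == 0:
        raise ValueError("first memento has no words to compare against")
    return (len(other.split()) - n_first) / n_first


if __name__ == "__main__":
    first = "hurricane relief efforts continue in the region"
    later = "this account has been suspended"
    print(jaccard_distance(first, later))
    print(sorensen_dice_distance(first, later))
    print(word_count_change(first, later))
```

A memento whose score crosses a chosen threshold for a measure would then be flagged as off-topic; the threshold itself has to be tuned against labeled data.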
73. @shawnmjones @WebSciDL
After repeating AlNoamany’s experiment, Word Count had
the best F1 score for identifying off-topic mementos…
We reused
AlNoamany’s labeled
dataset.
She did not try:
• Sørensen-Dice
• Simhash of raw
content
• Simhash of TF
• Gensim LSI
Our word count
accuracy came out
ahead of
AlNoamany’s.
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,”
International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
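For readers unfamiliar with the F1 score used to rank the measures, the sketch below shows how it could be computed for one measure against a labeled dataset. The scores, labels, and threshold are made-up illustration values, not results from the experiment, and the function names are hypothetical.

```python
def classify_off_topic(scores, threshold):
    """Flag a memento as off-topic when its score is at or below the threshold.

    The direction of the comparison suits measures such as word count,
    where fully dissimilar is -1.0; other measures would flip it.
    """
    return [score <= threshold for score in scores]


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for the off-topic class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical word count scores and human labels (True = off-topic).
    scores = [-0.05, -0.9, 0.1, -0.6]
    labels = [False, True, False, False]
    predicted = classify_off_topic(scores, threshold=-0.5)
    print(f1_score(predicted, labels))
```

Repeating this over a range of thresholds for each measure, and keeping the best F1 per measure, is one way to arrive at a comparison like the one on this slide.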
74. @shawnmjones @WebSciDL
Finding off-topic mementos is one of the first steps to
addressing RQ3…
RQ2:
Surrogate Types
RQ3:
Selecting
Mementos
RQ4:
Evaluating
Stories
RQ1:
Collection Types
✅
☑️ ☑️
76. @shawnmjones @WebSciDL
This work requires a flexible
framework –
Dark and Stormy Archives
(DSA) 2.0
[Diagram: DSA 2.0 components OTMT, Hypercane, Raintale, MementoEmbed, and Archive-It Utilities, connected by arrows labeled “calls” and “provides input to”; four items are marked ✅.
Input: Web Archive Collection (thousands of HTML documents).
Output: Story (< 30 representative mementos, visualized as surrogates).]
S. M. Jones. “Raintale – A Storytelling Tool for Web Archives.” https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
Tools for selecting
representative
mementos
Tools for visualizing
mementos as a
story
77. @shawnmjones @WebSciDL
Evaluation of RQ2: What surrogates work best for
understanding collections of mementos?
How well do users perform with
different types of surrogates?
1. Select 5 collections from each
semantic category
2. Select the earliest memento of each
of the first 20 seeds from each
collection – this is the number of
surrogates a user views if they
open an Archive-It story and page
down once
3. Present the participant with a story
of 20 surrogates, varying the
surrogate between participants
4. Ask them to address a user task
Variations:
• For step #3, vary the time for participants to view the story
• participants view for 5, 10, 20, 30 seconds
• may surface the ability to “glance” and understand
• some surrogates consist only of title, URI, etc.
• may determine which surrogate elements perform
best
• For step #4, ask the participant to:
• determine if the collection behind the story is suited for a
task – similar to traditional IR research
• identify which items likely belong to the same collection
• Instead of steps 3 and 4 – ask former participants which
surrogate they prefer for a given task
78. @shawnmjones @WebSciDL
Evaluation of RQ2: What surrogates work best for
understanding collections of mementos?
What information is available to users
of the existing Archive-It story?
Discover patterns in metadata usage that may indicate
the semantic type of collection.
How well do our stories compare to the
existing metadata?
How well do our stories cover the
content of the underlying collection?
How well does the Archive-It story
cover the underlying collection?
How well do surrogates cover the
content of their mementos?
Comparisons pictured:
• Collection Content vs. Our Story Content
• Collection Content vs. Archive-It Story Content
• Memento Content vs. Surrogate Content
• Our Story Content vs. Existing Metadata For Seeds
Similarity metrics will
be used for evaluating
coverage.
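As one illustration of a similarity metric for coverage, the sketch below compares the text of a story with the text of its underlying collection using cosine similarity over raw term frequencies. The choice of metric, the tokenizer, and the sample strings are assumptions for illustration; the slides do not state which metrics will be used.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric terms in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """1.0 for identical term distributions, 0.0 when no terms are shared."""
    tf_a, tf_b = term_frequencies(text_a), term_frequencies(text_b)
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical text standing in for collection and story content.
    collection_text = "earthquake relief donations earthquake survivors aid"
    story_text = "earthquake survivors receive aid"
    print(cosine_similarity(collection_text, story_text))
```

The same function could be applied to each pair in the comparisons on this slide, for example memento text against the text shown in its surrogate.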
79. @shawnmjones @WebSciDL
Evaluation of RQ3: How do we select representative
mementos for different semantic types of collections?
We will develop different algorithms and compare their output
with several metrics to determine which algorithms provide the
best “aboutness” for the collection.
[Chart: DSA 1.0, Algorithm 2, Algorithm 3, and Algorithm 4 compared on a scale of 0 to 10 across the labels Existing Metadata, Content Coverage, Temporal Spread, Source Diversity, Compression, Performance]
80. @shawnmjones @WebSciDL
RQ4: How well do stories produced by different summarization
algorithms work for collection understanding?
How well do our generated stories compare to the
existing Archive-It interface?
Do study participants understand key concepts of the
collection represented by the story?
Using the stories, can participants tell the difference
between similar collections?
Can participants compare stories and tell which are
similar?
Does the addition of existing metadata improve the
participant’s performance?
Does the layout of the surrogates improve the
participant’s performance?
RQ2:
Surrogate Types
RQ3:
Selecting
Mementos
RQ4:
Evaluating
Stories
RQ1:
Collection Types
✅
☑️ ☑️
81. @shawnmjones @WebSciDL
We plan to have completed this research in 2021…
iPres 2018
iPres 2018
CIKM 2019
ECIR 2020
WWW 2020
CIKM 2020
WebSci 2021
JCDL 2020
JCDL 2018
DTMH 2017
82. @shawnmjones @WebSciDL
Our methods are not just for Archive-It
Our methods will be applicable to web archive collections created on
other platforms, like Rhizome’s Webrecorder.
83. @shawnmjones @WebSciDL
Motivation Summary
Collection understanding is a problem
with web archive collections
inconsistent metadata
1000s of mementos
1000s of collections
costly for human review
We intend to produce a visualization that
serves as an abstract to assist in
collection understanding
Prior work in this area:
did not evaluate how well this method works
for collection understanding
only focused on collections about events
relied upon Storify as a visualization medium
84. @shawnmjones @WebSciDL
Contributions
Existing work:
Derived semantic categories of web archive collections in
Archive-It
Categories can be predicted by using structural features
Most collections are not about events
MementoEmbed – surrogates for the past web
Social cards probably provide better understanding of
collections
Off-Topic Memento Toolkit – Identifying off-topic mementos
Future work:
Evaluate algorithms for surfacing a representative sample
from a document collection
Evaluate different surrogate types via user evaluation
Show which surrogate-sample combinations work best for
collection understanding via user evaluation
85. @shawnmjones @WebSciDL
Improving Understanding of
Web Archive Collections
Through Storytelling
PhD Candidacy Exam for: Shawn M. Jones
Committee:
Michael L. Nelson, Michele C. Weigle, Jian Wu, Sampath Jayarathna
Thanks to: