The document provides an overview of browser-based digital preservation including:
- The current state of digital preservation, which relies on web crawlers and archives such as the Internet Archive. This approach falls short for pages that are unpopular, behind authentication, or built with complex JavaScript.
- The requirements for new software to directly capture and preserve web pages from within the browser in order to address the limitations of current archival approaches.
- A proposed system called "WARCreate" that would leverage the Chrome extension API to capture web pages and resources and generate WARC files for preservation while maintaining the original browsing context.
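The WARC output that tools like WARCreate produce can be sketched in a few lines. The function name and example URL below are illustrative, and this is a minimal sketch of a single record; real tools also emit warcinfo and request records and compute payload digests:

```python
import uuid
from datetime import datetime, timezone

def write_warc_response(target_uri, http_payload):
    """Serialize a single WARC 1.0 'response' record for a captured page.

    Minimal sketch only: a full WARC file would also contain warcinfo,
    request, and metadata records.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: {record_id}\r\n"
        f"WARC-Date: {warc_date}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    )
    # Every WARC record is terminated by two CRLF sequences.
    return headers.encode("utf-8") + http_payload + b"\r\n\r\n"

payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = write_warc_response("http://example.com/", payload)
```

Capturing the payload from inside the browser, rather than re-crawling, is what preserves the original browsing context (cookies, authenticated state, JavaScript-rendered DOM).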
The document discusses potential summer research projects in the areas of digital preservation, web archiving, and information visualization. It describes the Web Sciences and Digital Libraries research group's work in web archiving and efforts to make archived web content more accessible and usable. Potential summer projects outlined include developing a visualization of health data from the Blue Button initiative, visualizing aggregate health data using public datasets, and exploring tools for analyzing large collections of academic documents.
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work..., by Anna Perricci
This is the main slide deck for a workshop at iPRES 2018 on human scale web collecting. A primary focus of the presentation was the use of Webrecorder.io, a free, open source web archiving tool available to all.
On the Change in Archivability of Websites Over Time, by Michael Nelson
On the Change in Archivability of Websites Over Time
Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson
Old Dominion University
TPDL 2013, Valletta, Malta, September 23, 2013
Samvera and IIIF: Opportunities and Challenges presented by Jon Dunn (Indiana), Simeon Warner (Cornell), Hannah Frost (Stanford), Adam Wead (Penn. State), and Trey Pendragon (Princeton). The document discusses the opportunities and challenges of integrating Samvera with the International Image Interoperability Framework (IIIF). It provides an overview of IIIF APIs and specifications. It also highlights several institutions' experiences with and plans for IIIF, including using it for audiovisual content, experiments with the Presentation API, and consuming and creating IIIF manifests in digital collections. Key challenges mentioned include tool support for IIIF 3.0, authentication,
Who Will Archive the Archives? Thoughts About the Future of Web Archiving, by Michael Nelson
The document discusses the challenges of archiving the web at scale. It notes that while much of the web has been archived, temporal drift and gaps remain issues. Memento provides a framework for accessing content across multiple archives. To fully archive the web, more copies stored in diverse archives are needed due to the risk of any single archive becoming unavailable.
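The Memento negotiation mentioned here can be sketched as follows. `timegate_headers` and `parse_link_mementos` are hypothetical helper names, and the Link-header parsing is deliberately naive (it ignores quoted commas beyond this example):

```python
from email.utils import format_datetime
from datetime import datetime, timezone

def timegate_headers(dt):
    """Headers for Memento datetime negotiation (RFC 7089) against a TimeGate."""
    return {"Accept-Datetime": format_datetime(dt.astimezone(timezone.utc), usegmt=True)}

def parse_link_mementos(link_header):
    """Pull memento URLs out of an HTTP Link header (naive comma split)."""
    mementos = []
    for part in link_header.split(","):
        part = part.strip()
        if not part.startswith("<"):
            continue  # fragment of a quoted datetime, not a new link-value
        url, _, params = part.partition(">")
        if 'rel="' in params and "memento" in params:
            mementos.append(url.lstrip("<"))
    return mementos

hdrs = timegate_headers(datetime(2013, 9, 23, tzinfo=timezone.utc))
link = ('<http://web.archive.org/web/2013/http://example.com/>; '
        'rel="memento"; datetime="Mon, 23 Sep 2013 00:00:00 GMT", '
        '<http://example.com/>; rel="original"')
found = parse_link_mementos(link)
```

An aggregator applies the same negotiation across many archives and merges the resulting TimeMaps, which is how content spread over diverse archives becomes reachable through one interface.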
Connect With Your Users: Communicate Using Social Software Tools, by RobFav
NELA presentation delivered at the 113th Vermont Library Conference, May 15, 2007. The presentation explores how libraries are using Blogs, Wikis, and RSS.
Web archiving challenges and opportunities, by Ahmed AlSum
The document discusses challenges and opportunities in web archiving. It outlines the key stages in the web archiving lifecycle including selection of content, harvesting techniques, storage formats and infrastructure, ways to provide access, and the role of community. Specific challenges are discussed such as representing dynamic and social media content, optimizing storage solutions, and addressing limitations of current access interfaces. Opportunities exist in focusing collection efforts on underrepresented regions, leveraging existing archived data, and developing innovative services and tools to support researchers.
Using Web Archives to Enrich the Live Web Experience Through Storytelling discusses how storytelling can be used to automatically summarize archived web collections. The presentation describes research that established baselines for characteristics of human-generated stories and developed frameworks to detect off-topic pages, select representative pages, and generate stories from archived collections. User studies found that stories generated by the automatic Dark and Stormy Archives framework were indistinguishable from human-generated stories and both were better than randomly generated stories. The research aims to make archived collections more understandable and accessible through an intuitive storytelling interface.
Web Archiving Activities of ODU’s Web Science and Digital Library Research G..., by Michael Nelson
Michael L. Nelson
@phonedude_mln
Michele C. Weigle
@weiglemc
National Symposium on Web Archiving Interoperability
2017-02-21
Many projects joint with LANL
Funding from NSF, IMLS, NEH, and AMF
Interoperability and Its Role In Standardization, Plus A ResourceSync Overview, by Peter Murray
This document summarizes Peter Murray's presentation on interoperability and its role in standardization. The presentation covered four levels of interoperability: technical, syntactic, semantic, and organizational. It also discussed the ResourceSync specification, which provides a framework for synchronizing web resources between a source and destination using sitemaps. The specification builds on existing sitemap standards and is currently in beta with the goal of finalizing version 1.0 in fall 2013. Implementation tools are being developed and public feedback is being solicited to improve the specification.
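A ResourceSync resource list is essentially a sitemap with extra markup in the ResourceSync namespace. A minimal sketch using Python's standard library (the helper name and example URL are illustrative):

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

def resource_list(resources):
    """Build a minimal ResourceSync Resource List: a sitemap <urlset>
    carrying an <rs:md> element that declares its capability."""
    ET.register_namespace("", SM)
    ET.register_namespace("rs", RS)
    urlset = ET.Element(f"{{{SM}}}urlset")
    ET.SubElement(urlset, f"{{{RS}}}md", capability="resourcelist")
    for loc, lastmod in resources:
        url = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = loc
        ET.SubElement(url, f"{{{SM}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

xml = resource_list([("http://example.com/res1", "2013-01-03T09:00:00Z")])
```

Because the format extends ordinary sitemaps, a destination that knows nothing about ResourceSync can still read the document as a plain list of URLs.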
This document discusses Library 2.0 and related concepts. It begins by defining Library 2.0 as applying Web 2.0 tools to library services to meet user needs caused by the effects of Web 2.0. Web 2.0 is described as facilitating user participation and collaboration. Key differences between Library 1.0 and Library 2.0 are outlined, with Library 2.0 being more user-centered, participatory, and flexible. Examples of Web 2.0 tools for libraries like wikis, blogs and RSS feeds are provided along with potential benefits and use cases.
LITA Preconference: Getting Started with Drupal (handout), by Rachel Vacek
This document provides an overview of popular modules for the content management system Drupal, focusing on modules useful for libraries. It discusses modules for administration, content management, performance, navigation, user management, and library-specific functions. Popular modules are highlighted for tasks like custom fields, views, panels, web forms, images, editors, spam prevention, taxonomy, scheduling, groups, analytics, events, authentication, searching catalogs and databases. Resources for learning Drupal like books, tutorials, communities and publications are also listed.
The document discusses blogs and their use in education. It explains that blogs are online diaries that are regularly updated with information, links, reflections, and conversations. It provides examples of educational blogs for teachers to explore ideas, lesson plans, and pedagogy. It recommends that teachers start a personal blog to understand how blogging can be used as a teaching and learning tool. RSS feeds are also discussed as a way to automatically track frequently updated blogs and content in one place.
This document provides an overview of library resources and services for students at UIC. It discusses citation management tools like RefWorks, accessing library resources and databases off-campus, obtaining full-text articles through interlibrary loan, and setting up course reserves. Contact and research guide information is provided for getting help from librarians on topics like citation management, remote access to library resources, and setting up course reserves.
This document discusses using Web 2.0 tools in an educational environment. It begins by comparing Web 1.0 and Web 2.0, noting that Web 2.0 encourages sharing, user-generated content, and mobile access over desktop applications. The document then provides many examples of how schools and libraries can use Web 2.0 tools, including blogs, wikis, social networking, photo sharing, and more. It acknowledges challenges but emphasizes that websites should be flexible and encourage collaboration.
The document discusses client-side storage options for web applications, including cookies, Web Storage, IndexedDB, and File APIs. It provides details on each technology, including examples, limitations, and browser support. It emphasizes that IndexedDB and client-side storage can provide benefits like faster loading, reduced network usage, and the ability to work offline. The document also lists several sites that provide more information and tools for exploring these web storage technologies.
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool, by Michael Nelson
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Justin F. Brunelle
Michael L. Nelson
Lyudmila Balakireva
Robert Sanderson
Herbert Van de Sompel
TPDL 2013, September 24, 2013
This document discusses various strategies and resources for archiving internet content for research purposes. It describes several existing large-scale web archives like the Internet Archive and Common Crawl, as well as national and institutional archives. It also outlines how researchers can collect targeted web archives using open-source tools or subscription-based services.
Readying Web Archives to Consume and Leverage Web Bundles, by Sawood Alam
Potential utilization of the emerging Web technology, Web Bundles, in Web archiving, presented at the IIPC WAC 2021 in Session 8 by Sawood Alam.
Recording: https://youtu.be/lQX9v9V0FRQ
Tanvi Wadekar completed a 100-hour IT training course and project on the World Wide Web (WWW). The document defines the WWW as an information system accessed via the internet that allows for the exchange of hypertext documents and other digital resources. It discusses the history of the WWW, invented by Tim Berners-Lee in 1989, and its key components such as browsers, servers, caches, and protocols. A typical WWW interaction involves connecting to a server via HTTP, requesting an HTML page, and receiving a response before closing the connection. Common elements of the WWW are discussed, including web pages, bookmarks, directories, sites, and URLs.
This document provides an overview of a major seminar on knowledge discovery from web logs. It discusses how analyzing vast amounts of web site traversal data stored in web logs can reveal useful knowledge about user behavior that can be applied to improve web service performance. Specific techniques covered include mining web logs to build path profiles that predict future page visits, using these predictions to prefetch web documents for faster loading, and clustering web pages to create more intuitive user interfaces. The document lists several applications of web log mining and its advantages.
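The path-profile idea can be sketched as a first-order transition model over logged sessions. The session data and function names below are illustrative; a real miner would first sessionize raw log lines by client and timestamp:

```python
from collections import defaultdict

def build_path_profile(sessions):
    """Count page-to-page transitions across user sessions
    (a first-order path profile mined from web logs)."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(profile, page):
    """Predict the most likely next page; a prefetcher could
    fetch this document before the user requests it."""
    followers = profile.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/specs"],
    ["/home", "/about"],
    ["/home", "/products", "/cart"],
]
profile = build_path_profile(sessions)
```

Prefetching the predicted page trades a little extra bandwidth for faster perceived loading, which is exactly the application the seminar describes.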
The document provides an overview of the evolution and trends of web technologies and applications. It discusses the key stages in the evolution of the web from pre-web to modern mobile web. These stages include the early/simple web using static HTML, the dynamic web enabled by server-side processing, the web as a platform supported by mature frameworks, and advances like Web 2.0 and responsive design for mobile. It also covers fundamental technologies, components, servers, processing capabilities, and trends that have shaped the landscape of web development.
This document provides an overview of distributed web-based systems and the World Wide Web. It discusses traditional client-server web architectures as well as more advanced multi-tiered architectures. Key aspects of the web covered include HTTP, web servers, caching, and content distribution networks. The document is attributed to multiple authors and universities and has been modified by the presenting author.
In today’s systems, the time it takes to bring data to the end user can be very long, especially under heavy load. An application can often improve performance by using an appropriate caching system. There are many caching levels you can use in an application today: CDN, in-memory/local cache, distributed cache, output cache, browser cache, and HTML cache.
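One of these levels, an in-memory/local cache, can be sketched with per-entry time-to-live. The class name and timings are illustrative; a distributed cache applies the same idea across processes or machines:

```python
import time

class TTLCache:
    """A tiny in-memory (local) cache with a per-entry time-to-live.
    `now` parameters allow injecting a clock for deterministic tests."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (value, now + self.ttl)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if now >= expires:
            del self._store[key]  # lazily evict expired entries on read
            return None
        return value

cache = TTLCache(ttl_seconds=60)
cache.set("page:/home", "<html>...</html>", now=0)
```

A cache hit skips the slow backend entirely; the TTL bounds how stale a served response can be.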
Filling in the Blanks: Capturing Dynamically Generated Content, by Justin Brunelle
JCDL 2012 Doctoral Consortium presentation by Justin F. Brunelle. Covers the problem Web 2.0 creates for preservation, and proposes a solution for client-side capture of content.
Static Site Generators - Developing Websites in Low-resource Condition, by IWMW
Paul Walk discusses static site generators as an alternative to content management systems for publishing websites. Static site generators allow content to be authored in simple text files using formats like Markdown and compiled into static HTML and CSS that can be hosted on basic web servers. They provide benefits like minimal infrastructure needs, easy preservation of content, and increased security compared to systems that rely on databases. However, they may not be as user-friendly for content authoring. In general, static site generators are best suited for smaller, simpler websites that don't require advanced user access controls or dynamic functionality.
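The compile step at the heart of a static site generator can be sketched for a tiny Markdown subset (headings and paragraphs only). This is a toy; real generators handle links, templates, assets, and much more:

```python
import re

def md_to_html(markdown):
    """Convert a tiny Markdown subset (# headings, paragraphs) to HTML:
    the per-file compile step a static site generator performs."""
    html_parts = []
    for block in markdown.strip().split("\n\n"):
        block = block.strip()
        m = re.match(r"(#{1,6})\s+(.*)", block)
        if m:
            level = len(m.group(1))  # number of '#' marks the heading level
            html_parts.append(f"<h{level}>{m.group(2)}</h{level}>")
        else:
            html_parts.append(f"<p>{block}</p>")
    return "\n".join(html_parts)

page = md_to_html("# About\n\nThis site is compiled ahead of time.")
```

Because the output is plain HTML with no runtime database, the result is trivially cacheable, easy to preserve, and has a small attack surface, which is precisely the trade-off the talk describes.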
This document provides tools and best practices for creating coastal web maps. It discusses options for data storage, including hosting locally or in the cloud. Common data formats like Shapefiles, GeoJSON and GeoTIFF are recommended. It also covers web mapping services like WMS, WFS and WMTS based on OGC standards. Popular client-side development options are presented, such as APIs from Google Maps and ArcGIS, and open-source libraries like OpenLayers and Leaflet. Resources for further information are provided.
This document discusses using Varnish as a caching solution for Drupal websites. It begins by introducing Varnish and explaining that it is a reverse proxy cache server that can improve Drupal performance. It then describes what content, like anonymous pages and static assets, can be cached in Varnish and what content should not be cached, like logged-in pages. It also discusses how to configure Varnish and the Drupal Varnish module to integrate caching. The document then covers additional Varnish features like caching authenticated users, purging caches, and tools for monitoring caches.
This document provides an overview of web services, including RESTful and SOAP-based services. It discusses key concepts such as APIs, URIs, HTTP methods, XML/JSON data formats. For RESTful services, it covers the main design principles of being stateless, using explicit HTTP methods, and having directory-like URIs. For SOAP-based services, it describes the roles of SOAP, WSDL, and UDDI in defining and discovering services. The document also provides examples and comparisons of RESTful and SOAP-based approaches.
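The REST principles of explicit HTTP methods and directory-like URIs can be sketched with a tiny route table. The route names, handlers, and data below are illustrative, not any particular framework's API:

```python
import json

ROUTES = {}

def route(method, path):
    """Register a handler for an (HTTP method, URI path) pair:
    explicit methods and directory-like URIs in miniature."""
    def decorator(fn):
        ROUTES[(method, path)] = fn
        return fn
    return decorator

@route("GET", "/books")
def list_books():
    # Stateless: the response depends only on the request, not on
    # any per-client session held by the server.
    return 200, json.dumps(["Moby-Dick", "Walden"])

def dispatch(method, path):
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, json.dumps({"error": "not found"})
    return handler()

status, body = dispatch("GET", "/books")
```

A SOAP service would instead wrap the same operation in an XML envelope described by a WSDL document; the REST style leans on HTTP itself as the application protocol.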
The document discusses web browser architecture and processes, active browser pages, and caching support. It describes the evolution of web browsers from Tim Berners-Lee's initial browser to popular browsers today. It explains how browsers work using a client-server model and protocols like HTTP. The document outlines the typical architecture of a web browser, including components like the user interface, rendering engine, and networking. It defines different types of web pages, including static, dynamic, and active pages that are driven by JavaScript. Finally, it covers caching at the site, browser, server, and micro levels to improve page load speeds.
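Browser-level caching hinges on conditional requests: the browser revalidates its cached copy and the server answers 304 if nothing changed. A sketch of the server-side If-Modified-Since check (function name, dates, and body are illustrative):

```python
from email.utils import parsedate_to_datetime

def respond(last_modified, if_modified_since):
    """Decide between 200 (send the full body) and 304 (tell the
    browser its cached copy is still valid)."""
    if if_modified_since is not None:
        cached = parsedate_to_datetime(if_modified_since)
        current = parsedate_to_datetime(last_modified)
        if current <= cached:
            return 304, b""  # nothing newer: reuse the browser's cache
    return 200, b"<html>fresh body</html>"

status, body = respond(
    last_modified="Tue, 01 Jan 2019 00:00:00 GMT",
    if_modified_since="Tue, 01 Jan 2019 00:00:00 GMT",
)
```

A 304 response carries no body, so revalidation costs one small round trip instead of a full page transfer.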
WWW: The World Wide Web (WWW) is a repository of information linked together from points all over the world.
The WWW has a unique combination of flexibility, portability, and user-friendly features that distinguish it from other services provided by the Internet.
This document provides an overview of existing web archives and their use for research. It discusses several major web archives including the Internet Archive, Common Crawl, Pandora Archive, and national archives. For each, it describes their size and collection strategies, as well as positives and negatives for research use. The talk concludes with examples of how existing web archives in Australia are being used for research.
This document summarizes various tools for web archiving, including tools for capturing websites and individual web pages (HTTrack, Heritrix, Wget, WARCreate), replaying archived websites (Wayback Machine, MementoFox), managing workflows (Web Curator Tool, NetarchiveSuite, CINCH), hosted services (Archive-It, Web Archiving Service), and file utilities (HTTrack2Arc, warc-tools, WAT Utilities).
The document discusses various web-based attacks such as SQL injection, cross-site scripting (XSS), and cross-site request forgery (CSRF). It provides an overview of these attacks, including how they work and examples. It also covers related topics like the HTTP protocol, URLs, cookies, and the OWASP Top 10 list of most critical web application security risks.
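The SQL injection risk and its standard mitigation, parameterized queries, can be shown side by side with sqlite3. The table, data, and function names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'user')")

def find_user_unsafe(name):
    # VULNERABLE: attacker-controlled input is concatenated into the SQL text,
    # so input can rewrite the query itself.
    return conn.execute(f"SELECT role FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name):
    # SAFE: the '?' placeholder keeps input as data, never as SQL syntax.
    return conn.execute("SELECT role FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
leaked = find_user_unsafe(payload)  # the OR clause matches every row
safe = find_user_safe(payload)      # no user literally has that name
```

XSS and CSRF follow the same pattern at different layers: untrusted input crossing a syntax boundary (HTML, or a state-changing request) without being neutralized.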
This document provides a checklist and guidance for basic web application security testing in quality assurance. It outlines 10 areas to focus testing on: 1) information disclosure, 2) SSL/TLS, 3) slow HTTP denial of service attacks, 4) HTTP host header attacks, 5) login page over HTTPS, 6) same site scripting, 7) secure headers, 8) cross domain policy, 9) session management, and 10) URL validation. For each area, it describes the security weakness, examples of attacks, and tools that can be used to test for those weaknesses. The document is intended to help integrate an attacker perspective into QA test plans and deliver risk-based security testing.
Similar to Browser-Based Digital Preservation
Aggregating Private and Public Web Archives Using the Mementity Framework, by Mat Kelly
This document outlines Mat Kelly's PhD dissertation defense. The defense will address aggregating private and public web archives using the Mementity framework. Kelly will defend his dissertation to a committee chaired by Michele Weigle on May 7, 2019. The dissertation addresses challenges around capturing and replaying private content from the web, including content behind authentication or that requires special handling when aggregated. It proposes research questions around difficult to archive content types, comparing browser and crawler capabilities, issues with authenticated content, signaling content that needs special handling, and access controls for private archives.
A Framework for Aggregating Public and Private Web ArchivesMat Kelly
This document proposes a framework for aggregating private and public web archives. It discusses the current state of memento aggregation and outlines ways to make timemaps and aggregation more expressive. This includes adding attributes to timemaps to provide more information about mementos without requiring full dereferencing, such as status codes, content digests, and indicators of private versus public captures. The framework aims to provide a more comprehensive view of the archived web by incorporating both personal and non-aggregated archives.
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesMat Kelly
Mat Kelly presented a framework for aggregating personal, private, and institutional web archives while maintaining access control. The framework includes separate timemaps for different types of captures that could be aggregated while restricting access to private captures. Kelly sought input on use cases around access control for private web archives and mechanisms for protecting archived web pages. The presentation explored challenges in replaying private archives alongside public ones from institutions and how the framework could address these issues.
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...Mat Kelly
This document proposes a framework for aggregating private and public web archives. It introduces two new entities: the Memento Meta Aggregator (MMA) and the Private Web Archive Adapter (PWAA). The MMA allows for dynamic inclusion of archives and recursive construction of archive sets. The PWAA regulates access to private web archives by authenticating requests and relaying results. This framework enables private archives to be included in aggregations while preserving privacy through access control and authentication via the PWAA.
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Mat Kelly
The document describes a system for generating thumbnail summaries of large collections of web archive mementos. The system uses SimHash to identify sufficiently unique mementos based on similarities and differences in HTML markup. It calculates Hamming distance between memento SimHashes to select a subset for the summary that limits redundancy while preserving important captures. The visualizations generated by the system provide an overview of a website's evolution over time using 3-6 representative thumbnails.
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryMat Kelly
The document proposes using ResourceSync, BitTorrent, and WebRTC to facilitate the a posteriori replication of satellite imagery published on NASA web servers. It describes using a crawler to discover imagery resources and produce metadata, which is then used by adapter software to invoke a BitTorrent-based distribution of image payloads to users. The approach was constructed as a proof-of-concept to distribute data and mitigate reliance on NASA servers as the single source. Evaluation showed it was effective but temporally expensive, and future work could better integrate ResourceSync and utilize the YAML metadata.
This document introduces the Archival Acid Test, which evaluates how well web archiving tools archive modern webpages that use advanced HTML, JavaScript, and other web technologies. The test is divided into basic tests, JavaScript tests, and advanced features tests to assess different areas. Results show that archiving tools perform well on basic tests but struggle with dynamic content, asynchronous JavaScript, iframes, and other complex features. The goal of the Archival Acid Test is to create a standardized, publicly available way to evaluate how completely archiving tools archive modern webpages and identify areas for improvement.
Archive What I See Now - Archive-It Partner Meeting 2013 2013Mat Kelly
This document summarizes a presentation about enabling individual web archiving. It discusses tools like WARCreate and WAIL that allow users to archive web pages from their browser in WARC format. Issues addressed include timely capture of breaking news, preserving original context like user profiles, and uploading personal archives to institutional archives. Goals of the Archive What I See Now project are to port WARCreate to Firefox, add capabilities to upload WARCs, and implement sequential archiving of linked resources.
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemMat Kelly
This document describes a graph-based visualization system for navigating and predicting box office performance. The system represents movie data as interconnected nodes in a graph layout. Selecting different nodes allows navigation between the movie context and related contexts like actors. Node size and position encode attributes relevant to box office predictions. The system preprocesses and caches external data to make complex predictions accessible through an interactive visual interface.
The document introduces WARCreate and WAIL, tools that make web archiving easier. WARCreate allows users to archive web pages they see in their browser directly as WARC files, preserving context. WAIL packages existing tools like Heritrix and Wayback into a graphical user interface, allowing one-click archiving. Together these tools aim to make web archiving more accessible to personal archivists while still producing outputs compatible with institutional tools and standards.
Making Enterprise-Level Archive Tools Accessible for Personal Web ArchivingMat Kelly
The document describes a set of tools that make enterprise-level web archiving accessible for personal use. The tools include a crawler (Heritrix), a web archive player (Wayback Machine), and an archive inspector (WARC-Proxy) that are installed locally on a personal machine. The interface provides one-click options to set up crawls, view archived pages in the local Wayback installation, and check archive status. It aims to support personal web archiving through a graphical user interface that allows customizing crawls, starting/stopping services, and works with existing WARC files from other tools on Windows, MacOS, and Linux systems.
An Extensible Framework for Creating Personal Web Archives of Content Behind ...Mat Kelly
The document is a thesis that aims to develop an extensible framework for creating personal web archives of content behind authentication barriers. It discusses problems with current tools for personal web archiving, such as them breaking when sites change hierarchies and producing suboptimal archives. The thesis seeks to remedy such issues, preserve more social media content, and make archiving outputs more optimal. It utilizes tools like Archive Facebook and WARCreate to generate navigable archives in a format compatible with replay systems like Wayback Machine.
If Twitter is the "first draft of history", then we should be doing a better job of preserving it. For the one-year anniversary of the Egyptian revolution (2012) we revisited a sample of the shared social media content and found nearly 11% missing from the current web, and only 20% available in public web archives. Spurred by this, we sampled tweets for five other culturally important events from 2009-2012 and found similar rates for archiving and loss.
WARCreate - Create Wayback-Consumable WARC Files from Any WebpageMat Kelly
The Internet Archive's Wayback Machine is the most common way that typical users interact with web archives. The Internet Archive uses the Heritrix web crawler to transform pages on the publicly available web into Web ARChive (WARC) files, which can then be accessed using the Wayback Machine. Because Heritrix can only access the publicly available web, many personal pages (e.g., password-protected pages, social media pages) cannot be easily archived into the standard WARC format. We have created a Google Chrome extension, WARCreate, that allows a user to create a WARC file from any webpage. Using this tool, content that might have otherwise been lost in time can be archived in a standard format by any user. This tool provides a way for casual users to easily create archives of personal online content. This is one of the first steps in resolving issues of long-term storage, maintenance, and access of personal digital assets that have emotional, intellectual, and historical value to individuals.
NDIIPP/NDSA 2011 - YouTube Link RestorationMat Kelly
Creating Persistent Links to YouTube Music Videos
The document discusses the problem of links to YouTube videos becoming invalid when videos are removed. It proposes introducing a resolver service that redirects links to alternative copies of videos when the original link returns a 404 error. This service would also retrieve and publish metadata about videos to external websites to help find available copies when the initial link is broken. The goal is to create persistent links to YouTube music videos even if the specific video is removed from YouTube.
Archive Facebook is an add-on for Mozilla Firefox that allows users to create stand-alone archives of the content on their Facebook account. It preserves the look and feel of Facebook, unlike Facebook's native downloading option. The add-on lets users choose what specific types of content to archive, rather than limiting it to what Facebook allows. This ensures the archive is a true snapshot of the user's Facebook data and history. The add-on provides an easy-to-use interface to navigate and access archived content.
2. What This Will Be About
• Saving Things on Live Web to Web Archives
– State of the Art
– Barriers
• Dynamics of Digital Preservation Components
– Means of Preservation (e.g., crawlers)
– Means of Replay (e.g., Wayback)
• Outstanding and Unsolved Issues in DigPres
Browser-Based Digital Preservation
http://ws-dl.cs.odu.edu
3. The Speaker, Mat Kelly:
• 3rd Year PhD student
• MS Degree:
• BS Degree:
• Previously a researcher
at BMW
• Currently programmer @ NASA Langley
• Research Assistant @ ODU WS-DL Research Lab
Research Lab homepage: http://ws-dl.cs.odu.edu
My homepage: http://www.cs.odu.edu/~mkelly
4. Quick Primer: The Web
• What It Is
– Client-Server Messaging + Payload
– HTML, Images, JavaScripts, Multimedia, and more!
• Where It Resides
– Live web: remote servers, distributed content
5. Quick Primer: HTTP
http://mysite.com/~jsmith/myPage.html
SCHEME/PROTOCOL | domain | resource path on server
Client Requests Resource
• user agent (browser) sends request to Server
Resource Exists? Server Responds With:
• YES! 200 OK + Resource
• NO, it’s moved! 300-code + URI
• NO, it’s gone! 404
Embedded Content in Response? (e.g., images, JavaScript)
Each embedded resource is requested the same way:
• YES! 200 OK + Resource
• NO, it’s moved! 300-code + URI
• NO, it’s gone! 404
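The URL anatomy above can be checked with the WHATWG URL parser built into modern browsers and Node.js; a minimal sketch using the slide's example URI:

```javascript
// Decompose a URI into the parts labeled above:
// scheme/protocol, domain, and resource path on the server.
const url = new URL("http://mysite.com/~jsmith/myPage.html");

console.log(url.protocol); // scheme/protocol: "http:"
console.log(url.hostname); // domain: "mysite.com"
console.log(url.pathname); // resource path: "/~jsmith/myPage.html"
```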
6. How The Web Is Used
• Web browser accesses website via URI
• Browser requests site’s contents from server
• Browser displays site’s contents
– Usually requiring multiple requests for embedded
resources
BEHIND THE SCENES:
GET / HTTP/1.1
Host: mysite.com
HTTP Response Header:
HTTP/1.1 200 OK
Content-Type: text/html

HTTP Response Entity:
<!DOCTYPE html>
<html>
<head> …
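The header/entity split shown above matters later for capture and replay; a small illustrative helper (not from any particular tool) that separates the two at the first blank line:

```javascript
// Split a raw HTTP response into its header and entity (body).
// The two parts are separated by the first blank line (CRLF CRLF).
function splitResponse(raw) {
  const i = raw.indexOf("\r\n\r\n");
  if (i === -1) return { header: raw, entity: "" };
  return { header: raw.slice(0, i), entity: raw.slice(i + 4) };
}

const raw =
  "HTTP/1.1 200 OK\r\n" +
  "Content-Type: text/html\r\n" +
  "\r\n" +
  "<!DOCTYPE html>\n<html>\n<head> ...";

const { header, entity } = splitResponse(raw);
// header holds the status line and headers; entity holds the HTML.
```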
7. Returning to a Page on the Live Web
• Lookup by URI
• But Site’s Gone
• Lookup in Web Archives
GET / HTTP/1.1
Host: mysite.com
HTTP/1.1 404 NOT FOUND
Content-Type: text/html
8. Returning to a Page on the Live Web
• Lookup by URI at Internet Archive
✓ SUCCESS
Likely Preserved If…
• Popular
• Surface (e.g., no auth)
• Simple (no fancy JS)
USING THE ARCHIVES
10. Sites Not On Surface Web are Likely
Not Archived
• Sites can restrict being crawled
– Whose data is it anyway?
• Robots.txt can limit
– Crawler (by user-agent)
– Paths/files/whole site
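For illustration, a hypothetical robots.txt showing both kinds of restriction described above (the crawler name and paths are made up):

```
# Block one crawler entirely (by user-agent)
User-agent: SomeCrawler
Disallow: /

# Block specific paths for all other crawlers
User-agent: *
Disallow: /private/
Disallow: /drafts/secret.html
```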
11. Sites Not On Surface Web are Likely
Not Archived
• Not everything ought
to be in public archives
• …but we might still
want to preserve these
pages.
– e.g., our FB news feed
13. Recall:
Likely a Web Page Is Preserved If…
• Popular
• Surface (e.g., no authentication required)
• Simple (no fancy JavaScript)
14. From This, We Can Surmise:
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
15. Why and What Can Be Done
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
Internet Archive chooses What is Preserved
16. Why and What Can Be Done
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
Preserve from the Browser Instead of
Delegating by URI
17. Why and What Can Be Done
A page will be UNLIKELY or INSUFFICIENTLY
preserved if:
• Not Popular
• On the Deep Web or Behind Authentication
• Contains Fancy Effects or Asynchronously
Loaded Resources
Leverage the Browser’s Native JavaScript
rendering engine
18. State of the Art: Heritrix
• Internet Archive’s archival crawler
– Open source, also deployed @archive.org
• Command-line driven
• Web-based GUI
• Allows list of URIs to be crawled, frequency, crawl depth, etc.
– On user’s own Heritrix deployment
• Generates Web ARChive (WARC) files
19. The Web ARChive (WARC) File Format
• ISO 28500 standard
• Multiple text-based
record types in a file
• Supports arbitrary
data types on web
• Preserves HTTP
information for
replaying content
WARC info Record:
WARC/1.0
WARC-Type: warcinfo
WARC-Filename: test.warc
Content-Length: 42

Format: WARC File Format 1.0

WARC request Record:
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://mysite.
WARC-Date: 2014-05-10T12:11:03Z
WARC-Concurrent-To: <urn:uuid:b
WARC-Record-ID: <urn:uuid:f8e09
Content-Length: 115

GET / HTTP/1.1
User-Agent: Mozilla/5.0

WARC response Record:
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://mysite.
WARC-Concurrent-To: <urn:uuid:b
WARC-Record-ID: <urn:uuid:f93ea
Content-Length: 11262

HTTP/1.1 200 OK
<!DOCTYPE html><html>
<head>
<body>...
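Records like the ones above can be assembled mechanically. A sketch in JavaScript with illustrative field values, not WARCreate's actual record writer; per ISO 28500, Content-Length counts the bytes of the record payload:

```javascript
// Sketch: build a minimal WARC response record as a string.
// Assumes the full HTTP response has already been captured as text.
function makeResponseRecord(targetUri, httpResponse) {
  // Content-Length counts the bytes of the record payload,
  // i.e., the captured HTTP response.
  const payloadLength = new TextEncoder().encode(httpResponse).length;
  const warcDate = new Date().toISOString().slice(0, 19) + "Z";
  return [
    "WARC/1.0",
    "WARC-Type: response",
    "WARC-Target-URI: " + targetUri,
    "WARC-Date: " + warcDate,
    "Content-Type: application/http; msgtype=response",
    "Content-Length: " + payloadLength,
    "",                 // blank line separates header from payload
    httpResponse,
  ].join("\r\n");
}

const record = makeResponseRecord(
  "http://mysite.com/",
  "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<!DOCTYPE html>..."
);
```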
20. WARC Anatomy
• WARC Records each have a header and a payload
• One warcinfo record per WARC file
• warcmetadata records are optional
– Describe files
– Can have multiple per file
• Request/Response records document HTTP transactions at archive time

MyArchive.warc:
warcinfo record header
warcinfo record payload
warcmetadata record header
warcmetadata record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcresponse record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcresponse record payload
21. Reading WARCs
• “Replaying” content as it once was versus
viewing a static web page
• Wayback Machine
– Deployed at Internet Archive (archive.org)
– Open Source software
• You can deploy your own Wayback!
– Indexes WARCs, allows content to be accessible from web interface
22. But the Issue Remains…
A crawler changes the capture context from the browser
23. High Level Requirements
WARC Creation Software
• Intuitive to use
– cf. Heritrix and Wayback, which are difficult to set up
• Comply with WARC ISO standard
• Capture JavaScript
• Capture content behind authentication
• Allow a user to execute through the browser
– On demand, without need to delegate via URI
24. WARCreate
“Create WARC files from any webpage”
1. On Page Load
1. Generate warcinfo record
2. Save HTTP Requests, Responses
• Content as Strings
2. On Secondary Content Loaded
Repeat 1.2. and Concatenate
3. On Generate WARC command
1. Create Blob from String
2. Provide Blob to FileSaver library
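The three steps above can be sketched as follows. Function names are illustrative, not WARCreate's internals; saveAs comes from the FileSaver library named in step 3:

```javascript
// Sketch of the capture-then-save flow. Record strings accumulate
// as content loads; "Generate WARC" concatenates them into one
// Blob and hands it to FileSaver's saveAs().
const records = [];

// Steps 1.2 and 2: save each captured record string, concatenating
// as secondary content loads.
function onCaptured(recordString) {
  records.push(recordString);
}

// Step 3: String -> Blob -> file on disk via FileSaver.
function generateWarc(filename) {
  const warcString = records.join("\r\n\r\n") + "\r\n\r\n";
  const blob = new Blob([warcString], { type: "application/warc" });
  saveAs(blob, filename); // FileSaver library
  return warcString;
}
```

In the extension, generateWarc would run when the user issues the Generate WARC command.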
26. Payloads Captured
• Target information can be
captured with webRequest
• File/Session descriptors
generated at archive time
• Payload headers generated
once payload captured
• Templating system used
warcinfo record header
warcinfo record payload
warcmetadata record header
warcmetadata record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcresponse record payload
warcrequest record header
warcrequest record payload
warcresponse record header
warcresponse record payload
MyArchive.warc
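A sketch of the webRequest capture mentioned above, assuming the extension holds the webRequest permission; headersToRaw is a hypothetical helper that rebuilds raw header lines from the {name, value} pairs the API delivers:

```javascript
// Hypothetical helper: rebuild raw header text from webRequest's
// array of {name, value} header objects.
function headersToRaw(statusLine, headers) {
  return statusLine + "\r\n" +
    headers.map((h) => h.name + ": " + h.value).join("\r\n");
}

// In the extension's background page (guarded so the helper above
// can also run outside Chrome):
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onHeadersReceived.addListener(
    (details) => {
      const raw = headersToRaw(details.statusLine, details.responseHeaders);
      // ...store raw text for the eventual warcresponse record...
    },
    { urls: ["<all_urls>"] },
    ["responseHeaders"]
  );
}
```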
27. A Real warcinfo
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2014-06-30T22:56:06Z
WARC-Filename: 20140630180606900.warc
WARC-Record-ID: <urn:uuid:7fc954f0-e291-279f-5b6b-5b97aeb437bd>
Content-Type: application/warc-fields
Content-Length: 458
software: WARCreate/0.2014.6.2 http://warcreate.com
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
isPartOf: basic
description: Crawl initiated from the WARCreate Google Chrome extension
robots: ignore
http-header-user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153
Safari/537.36
http-header-from: warcreate@matkelly.com
• Template
• Generated @ archive time
• Self-descriptive
28. A Real warcmetadata
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://matkelly.com/
WARC-Date: 2014-06-30T22:56:06Z
WARC-Concurrent-To: <urn:uuid:dddc4ba2-c1e1-459b-8d0d-a98a20b87e96>
WARC-Record-ID: <urn:uuid:6fef2a49-a9ba-4b40-9f4a-5ca5db1fd5c6>
Content-Type: application/warc-fields
Content-Length: 1938
outlink: http://matkelly.com/_images/logo.png E =EMBED_MISC
outlink: http://matkelly.com/resume/ L a/@href
• Template
• Generated @ archive time
• Self-descriptive
29. A Real warcrequest
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://matkelly.com/
WARC-Date: 2014-06-30T22:56:06Z
WARC-Concurrent-To: <urn:uuid:ba53adfd-4bd0-9a53-9396-b0dd77e39353>
WARC-Record-ID: <urn:uuid:f8e095f3-7b77-f2e6-f592-ce0b63a32f36>
Content-Type: application/http; msgtype=request
Content-Length: 317
GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153
Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,de-DE;q=0.6
• Template
• Generated @ archive time
• Self-descriptive
31. The File System Problem
• WARC strings must be saved to file system
• JS provides access to HTML5 localStorage
– Limited to sandbox, 5 MB
• NEW Chrome extension API allows unlimited MB
– chrome.storage instead of localStorage
• HTML5 W3C saveAs() support is limited
• FileSaver.js provides polyfill!
– Limited to ~ 360 MB/file
https://github.com/eligrey/FileSaver.js/
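One way around the per-file cap above is to split the accumulated WARC string at record boundaries so each saved file stays under a byte budget. A hypothetical sketch; the splitting strategy is illustrative, not WARCreate's:

```javascript
// Split an accumulated WARC string into file-sized pieces, each
// under maxBytes (e.g., the polyfill's ~360 MB per-file limit).
// Splits only at record boundaries; a single record larger than
// maxBytes still becomes its own (oversized) piece.
function splitWarc(warcString, maxBytes) {
  const enc = new TextEncoder();
  // Record boundaries: CRLF CRLF followed by a new WARC version line.
  const records = warcString.split(/\r\n\r\n(?=WARC\/1\.0)/);
  const parts = [];
  let current = "";
  for (const rec of records) {
    const candidate = current ? current + "\r\n\r\n" + rec : rec;
    if (current && enc.encode(candidate).length > maxBytes) {
      parts.push(current); // current piece is full; start a new one
      current = rec;
    } else {
      current = candidate;
    }
  }
  if (current) parts.push(current);
  return parts;
}
```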
32. Capture Scope
• Not limited to robots.txt
• Native JS Support
(from browser)
• User accessible
– No URI delegation
33. User Dynamics
• Simple operation:
navigate, click a button
• Files downloaded with
smart defaults to HDD
34. Caveats of Personal Web Archiving
• We may not want our personal info in archives
• Your http://facebook.com and mine are vastly different
– …but same URI!
• Even if we capture it, it might not be replayable
– Limited by replay system
35. Recent Work/Outstanding Issues
• Direct upload of WARC to server
• Comprehensive Site Archiving (e.g., all your )
• Annotation of Web Pages
• Grouping WARCs into archival collections
• Periodic, automated execution
• Portable JS WARC interaction library
• Much more!
36. But…Isn’t This Just Coding
and Not Research?
• WARC work in the JS space is non-existent!
• Initial implementations have already
uncovered limitations
• JS tech has evolved, opened new doors for
functionality
• WARCreate is the state of the art for using
JavaScript/browsers FOR archiving
37. Browser-Based Digital Preservation
WARCreate as Research
• Publications
– Mat Kelly and Michele C. Weigle, "WARCreate - Create Wayback-Consumable
WARC Files from Any Webpage," In Proceedings of the ACM/IEEE Joint
Conference on Digital Libraries (JCDL). Washington, DC, June 2012, pp. 437-438
– Mat Kelly, Michele C. Weigle and Michael L. Nelson. "WARCreate - Create
Wayback-Consumable WARC Files from Any Webpage," Digital Preservation
2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC.
– Mat Kelly, Michael L. Nelson and Michele C. Weigle. "WARCreate and WAIL:
WARC, Wayback and Heritrix Made Easy," Digital Preservation 2013,
Workshops and Sessions: Web Archiving; 2013 Jul 24; Alexandria, VA
• Academic Funding
– Archive What I See Now: Bringing Institutional Web Archiving Tools to the
Individual Researcher, National Endowment for the Humanities (NEH), Digital
Humanities Start-Up Grant, HD-51670-13, May 2013 - Dec 2014, $57,892
• Recognition
– Future Steward Innovation Award Recipient, National Digital Stewardship
Alliance (NDSA) / Library of Congress, July 2012
38. Where I Have Traveled as an ODU Researcher
Washington, DC Atlanta, GA College Park, MD
Salt Lake City, UT San Francisco, CA London, England
(Sept 2014)
…