Presented by Michele C. Weigle, June 4, 2015
Columbia University Web Archiving Collaboration: New Tools and Models
Work by Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson
1. Tools for Managing Seed URIs (Detecting Off-Topic Pages)
Old Dominion University
Web Science and Digital Libraries Group
http://ws-dl.cs.odu.edu/, @WebSciDL
Web Archiving Collaboration: New Tools and Models
June 4-5, 2015
Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson
Funded by Columbia University Libraries Web Archiving Incentive program
7. Pages can go off-topic through time
May 13, 2012: The page started as on-topic.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
8. Pages can go off-topic through time
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
9. Pages can go off-topic through time
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
10. Pages can go off-topic through time
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
11. Pages can go off-topic through time
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
12. Over 60% of archived versions of hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
13. Social media pages can go off-topic
Dec. 22, 2011: Facebook page was relevant to the Occupy collection.
http://wayback.archive-it.org/2950/*/http://www.facebook.com/MayorJeanQuan
14. Social media pages can go off-topic
Dec. 22, 2011: Facebook page was relevant to the Occupy collection.
Aug. 10, 2012: URI redirects to www.facebook.com.
http://wayback.archive-it.org/2950/*/http://www.facebook.com/MayorJeanQuan
17. We identified 5 classes of TimeMaps
1. Always On: wayback.archive-it.org/2950/*/http://occupypsl.org
2. Step Function On: wayback.archive-it.org/2950/*/http://occupygso.tumblr.com
3. Step Function Off: wayback.archive-it.org/2950/*/http://occupyashland.com
4. Oscillating: wayback.archive-it.org/2950/*/http://www.indyows.org
5. Always Off: wayback.archive-it.org/2950/*/http://occupy605.com
20. A web page goes off-topic and on-topic many times (Oscillating)
On-topic: Egyptian Revolution coverage
Off-topic: news about Iraq
Off-topic: news about Syria
On-topic: Egypt news
Off-topic: Palestine
http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
21. A web page goes off-topic and on-topic many times (Oscillating)
On-topic: Egyptian Revolution coverage
Off-topic: news about Iraq
Off-topic: news about Syria
Off-topic: news about Syria
On-topic: Egypt news
Off-topic: Palestine
http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
22. Most TimeMaps are Always On
1. Always On: wayback.archive-it.org/2950/*/http://occupypsl.org
2. Step Function On: wayback.archive-it.org/2950/*/http://occupygso.tumblr.com
3. Step Function Off: wayback.archive-it.org/2950/*/http://occupyashland.com
4. Oscillating: wayback.archive-it.org/2950/*/http://www.indyows.org
5. Always Off: wayback.archive-it.org/2950/*/http://occupy605.com
Percentage labels on the slide: 0-2%, 6-15%, ~0%, 74%, 8-11%
24. From Archive-It collection to terms
1. Obtain the seed URIs from the front-end interface of Archive-It
2. Obtain the TimeMap of the seed URIs from the CDX file*
3. Extract the HTML of the mementos from the WARC files*
4. Extract the text of the page using the Boilerpipe library
5. Extract terms from the page, using scikit-learn to tokenize, remove stop words, and apply stemming
*locally hosted at ODU
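Step 5 can be sketched roughly as follows. This is a minimal standard-library stand-in, not the authors' code: the slide names scikit-learn for this step, while the tiny STOP_WORDS list and the naive_stem helper below are illustrative assumptions that only handle lowercase ASCII words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "was"}

def naive_stem(token):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_terms(text):
    """Tokenize, drop stop words, stem, and count the remaining terms."""
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)

terms = extract_terms("Protesters are protesting in the streets of Cairo")
print(terms.most_common())
```

The resulting term counts are the input that the similarity measures operate on.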
25. We investigated 6 similarity metrics
• Textual Content
– cosine similarity of TF-IDF
– intersection of the 20 most frequent terms
– Jaccard similarity coefficient
• Semantics
– Web-based kernel function using a search engine (SE)
• Structural
– the change in number of words
– the change in content length
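The textual and structural measures can be sketched in a few lines. This is a hedged illustration: the smoothed IDF formula, the division by k in the top-terms overlap, and all function names are assumptions, since the slides do not give the exact definitions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (assumption) so shared terms keep a non-zero weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    """Jaccard coefficient of two term collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def top_terms_overlap(a, b, k=20):
    """Fraction of the k most frequent terms shared by both term lists."""
    top_a = {t for t, _ in Counter(a).most_common(k)}
    top_b = {t for t, _ in Counter(b).most_common(k)}
    return len(top_a & top_b) / k

def relative_change(first, later):
    """Signed change relative to the first value: -0.85 means an 85% drop."""
    return (later - first) / first if first else 0.0
```

Each function compares a candidate memento against the first memento of the TimeMap; relative_change applies equally to word counts and content lengths.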
29. Semantics of the Text
Web-based kernel function using the search engine (SE)
Feb. 2011 / July 2013
Tahrir, Egypt, army / Cairo, Morsi, protests (no term-wise overlap)
Egypt, Tahrir, president, protests, army, Cairo / Egypt, protests, Morsi, Cairo, president
Method Similarity
SE-Kernel 0.7
Technique inspired by Sahami and Heilman, WWW 2006
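A rough sketch of the expansion idea, following the speaker notes (top 5 terms as the query, terms from the top 10 snippets, Jaccard on the combined set). The `search` callable is hypothetical and is injected by the caller; no real search-engine API is assumed, and the scoring used in the paper may differ from this simplification.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase ASCII word tokens; no stop-word removal in this sketch."""
    return re.findall(r"[a-z]+", text.lower())

def expand_terms(page_terms, search, n_query=5, n_snippets=10):
    """Combine a page's terms with terms from search-result snippets.

    `search` is a hypothetical callable: query string -> list of snippet strings.
    """
    query = " ".join(t for t, _ in Counter(page_terms).most_common(n_query))
    expanded = set(page_terms)
    for snippet in search(query)[:n_snippets]:
        expanded.update(tokenize(snippet))
    return expanded

def se_kernel_similarity(first_terms, candidate_terms, search):
    """Jaccard between the expanded first-memento terms and the candidate terms."""
    expanded_first = expand_terms(first_terms, search)
    candidate = set(candidate_terms)
    union = expanded_first | candidate
    return len(expanded_first & candidate) / len(union) if union else 0.0

def fake_search(query):
    # Stub standing in for a real search engine (illustrative data only).
    return ["Protests in Cairo as Morsi addresses Egypt"]

score = se_kernel_similarity(["tahrir", "egypt", "army"],
                             ["cairo", "morsi", "protests"], fake_search)
```

With the stub, two term sets that share no words still get a non-zero score, which is the point of the expansion.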
32. We built a gold standard data set to evaluate the methods
33. We manually labeled 15,760 mementos
Egypt Revolution and Politics
URI-Rs: 136
URI-Ms: 6,886
Off-topic URI-Ms: 384
Occupy Movement
URI-Rs: 255
URI-Ms: 6,570
Off-topic URI-Ms: 458
Columbia Univ. Human Rights collection
URI-Rs: 198
URI-Ms: 2,304
Off-topic URI-Ms: 94
34. Example of manually labeled set
Future work: convert to annotated/extended TimeMap format
id date URI label
9 20120124014240 http://wayback.archive-it.org/2950/20120124014240/http://occupysarasota.com/ 1
9 20120131014118 http://wayback.archive-it.org/2950/20120131014118/http://occupysarasota.com/ 1
9 20120207014119 http://wayback.archive-it.org/2950/20120207014119/http://occupysarasota.com/ 1
9 20120501041141 http://wayback.archive-it.org/2950/20120501041141/http://occupysarasota.com/ 0
9 20120508032644 http://wayback.archive-it.org/2950/20120508032644/http://occupysarasota.com/ 0
9 20120515034720 http://wayback.archive-it.org/2950/20120515034720/http://occupysarasota.com/ 0
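Rows in this layout can be read with a few lines of Python. A minimal sketch, assuming whitespace-separated columns and that label 1 means on-topic and 0 means off-topic (consistent with the earliest mementos carrying 1, but not stated on the slide); the LabeledMemento type is illustrative.

```python
from datetime import datetime
from typing import NamedTuple

class LabeledMemento(NamedTuple):
    row_id: int         # "id" column; its exact meaning is not stated on the slide
    captured: datetime  # parsed from the 14-digit capture timestamp
    uri_m: str
    label: int          # assumed: 1 = on-topic, 0 = off-topic

def parse_row(line: str) -> LabeledMemento:
    row_id, stamp, uri_m, label = line.split()
    return LabeledMemento(int(row_id),
                          datetime.strptime(stamp, "%Y%m%d%H%M%S"),
                          uri_m, int(label))

SAMPLE = """\
9 20120207014119 http://wayback.archive-it.org/2950/20120207014119/http://occupysarasota.com/ 1
9 20120501041141 http://wayback.archive-it.org/2950/20120501041141/http://occupysarasota.com/ 0
"""
rows = [parse_row(line) for line in SAMPLE.splitlines()]
first_off_topic = next(r for r in rows if r.label == 0)
```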
35. Evaluated 6 methods at 21 thresholds
• Assumed first memento was on-topic
• Combined two methods ('OR') to find best combination method
– 15 combinations
– 6,615 tests (15 combinations x 21 thresholds x 21 thresholds)
• Averaged the results at each threshold over the three collections
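The test count follows from simple enumeration, sketched below. The method names and the evenly spaced threshold grid are placeholders (the slide gives only the counts, not the actual threshold values).

```python
from itertools import combinations, product

# Placeholder names for the six methods.
methods = ["cosine", "jaccard", "top20_intersection",
           "se_kernel", "word_count", "content_length"]
# Hypothetical grid of 21 thresholds; the real values are not given on the slide.
thresholds = [i / 20 for i in range(21)]

pairs = list(combinations(methods, 2))
tests = [(a, ta, b, tb)
         for (a, b) in pairs
         for ta, tb in product(thresholds, repeat=2)]

print(len(pairs), len(tests))  # 15 6615
```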
36. Evaluated based on 5 metrics
• False positives (FP)
– on-topic labeled as off-topic
• False negatives (FN)
– off-topic labeled as on-topic
• Accuracy (ACC)
– proportion of correct classifications
– (TP + TN)/(TP + FP + FN + TN)
• F1 score
– harmonic mean of precision and recall
– 2TP/(2TP + FP + FN)
• AUC
– area under the ROC curve
– ROC: plots true positive rate vs. false positive rate
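The two formulas translate directly into code. In this sketch "positive" means off-topic, matching the FP/FN definitions on the slide; the counts in the example are made up for illustration and are not results from the study.

```python
def accuracy(tp, fp, fn, tn):
    """(TP + TN) / (TP + FP + FN + TN)"""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0

def f1_score(tp, fp, fn):
    """2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Illustrative counts only (not from the evaluation).
tp, fp, fn, tn = 80, 10, 20, 890
precision = tp / (tp + fp)
recall = tp / (tp + fn)
harmonic_mean = 2 * precision * recall / (precision + recall)

print(accuracy(tp, fp, fn, tn))  # 0.97
```

The harmonic_mean value equals f1_score(tp, fp, fn), which shows that the two ways of writing F1 agree.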
39. Applied best method to 11 Archive-It
collections
• Cosine|Word Count with 0.10|-0.85 thresholds
• Collection Characteristics
– governmental, event-based, theme-based
– time spans of 1 week - 7 years
– 35 - 1459 URI-Rs
– 118 - 10,283 URI-Ms
40. Average precision of 0.92 on 11 Archive-It collections
ID    Collection                 URI-Rs  URI-Ms  Off-topic URI-Ms  Affected URI-Rs  TP   FP  P
2893  Global Food Crisis         65      3063    22                7                22   0   1.000
1084  Government in Alaska       68      506     16                4                16   0   1.000
2966  Virginia Tech Shootings    239     1670    24                2                24   0   1.000
2017  Wikileaks 2010 Document    35      2360    107               8                107  0   1.000
2323  Jasmine Revolution 2011    231     4076    114               31               107  7   0.939
1827  IT Historical Resource     1459    10,283  59                34               45   14  0.763
1475  Human Rights Document      147     1530    54                20               39   15  0.722
1826  Maryland State Document    69      184     0                 0                -    -   -
694   April 16 Archive           35      118     0                 0                -    -   -
2535  Brazilian School Shooting  476     1092    0                 0                -    -   -
2823  Russia Plane Crash         65      447     0                 0                -    -   -
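The P column and the 0.92 in the title can be rechecked from the TP and FP counts. A small sketch, assuming P = TP/(TP + FP) and that the average is taken over the seven collections where anything was flagged (the four collections with no flagged mementos have no defined precision).

```python
# Collection ID -> (TP, FP), copied from the table above.
counts = {
    2893: (22, 0), 1084: (16, 0), 2966: (24, 0), 2017: (107, 0),
    2323: (107, 7), 1827: (45, 14), 1475: (39, 15),
}

precision = {cid: tp / (tp + fp) for cid, (tp, fp) in counts.items()}
average = sum(precision.values()) / len(precision)

print(round(precision[2323], 3), round(average, 2))  # 0.939 0.92
```

In every row TP + FP equals the "Off-topic URI-Ms" value, which suggests that column counts the mementos the method flagged.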
41. Summary
• We investigated six methods for measuring similarity between mementos in a TimeMap:
– cosine similarity of TF-IDF
– Jaccard similarity
– intersection of the 20 most frequent terms
– Web-based kernel function
– change in number of words
– change in content length
• We tested the approaches on a gold standard data set from three Archive-It collections
• We evaluated best approach on 11 diverse Archive-It collections
42. Findings
• Combining cosine similarity at threshold 0.10 and change in size using word count at threshold -0.85 gives the best performance
• Cosine similarity at threshold = 0.15 is the best single method
• Using the combined method, we achieved 0.92 average precision on 11 Archive-It collections
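The combined rule can be written as a single predicate. A minimal sketch based on the speaker notes (cosine similarity to the first memento below 0.10, OR word count down by more than 85% relative to the first memento); the function and parameter names are illustrative, not taken from the released tool.

```python
def is_off_topic(cosine_sim, first_words, candidate_words,
                 cosine_threshold=0.10, word_change_threshold=-0.85):
    """Flag a candidate memento by comparing it with the first memento."""
    if first_words:
        word_change = (candidate_words - first_words) / first_words
    else:
        word_change = 0.0  # no baseline to compare against
    return cosine_sim < cosine_threshold or word_change < word_change_threshold

print(is_off_topic(0.05, 1000, 900))  # True: cosine similarity too low
print(is_off_topic(0.50, 1000, 100))  # True: word count dropped by 90%
print(is_off_topic(0.50, 1000, 900))  # False
```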
43. Tool for detecting off-topic pages
• A Python command-line tool for suggesting off-topic pages in web archives
– Cosine Similarity
– default threshold is 0.15
– operates on live TimeMaps
Available at
https://github.com/yasmina85/OffTopic-Detection
44. Detecting off-topic pages in an Archive-It collection (Maryland State Docs)
% python detect_off_topic.py -i 1826 -th 0.15
extracting seed list
…
http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org
…
50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org
…
Downloading 4 mementos out of 306
Downloading 14 mementos out of 306
…
Detecting off-topic mementos
Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html/
This was run live after we did the evaluation, so now there are off-topic mementos
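The TimeMap URIs in the log follow a visible pattern, and a link-format TimeMap can be split into memento entries with a small regular expression. This is a hedged sketch, not the tool's code: the SAMPLE text is illustrative, and the regex assumes each memento entry lists rel before datetime, which real TimeMaps need not do.

```python
import re

def timemap_uri(collection_id, uri_r):
    """Build the link-format TimeMap URI pattern seen in the log above."""
    return f"http://wayback.archive-it.org/{collection_id}/timemap/link/{uri_r}"

# Matches entries whose rel value contains the word "memento".
MEMENTO_RE = re.compile(
    r'<([^>]+)>;\s*rel="([^"]*\bmemento\b[^"]*)";\s*datetime="([^"]+)"')

def parse_mementos(timemap_text):
    """Return (URI-M, datetime string) pairs from link-format text."""
    return [(m.group(1), m.group(3)) for m in MEMENTO_RE.finditer(timemap_text)]

# Illustrative TimeMap text (not fetched from the archive).
SAMPLE = '''<http://occupysarasota.com/>; rel="original",
<http://wayback.archive-it.org/2950/20120124014240/http://occupysarasota.com/>; rel="first memento"; datetime="Tue, 24 Jan 2012 01:42:40 GMT",
<http://wayback.archive-it.org/2950/20120131014118/http://occupysarasota.com/>; rel="memento"; datetime="Tue, 31 Jan 2012 01:41:18 GMT"'''

mementos = parse_mementos(SAMPLE)
```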
45. Detecting off-topic pages in a single TimeMap
% python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/
Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
…
Downloading 270 mementos out of 270
…
Extracting text from the html
…
Detecting off-topic mementos
Similarity memento_uri
0.0509170839413 http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
0.0 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
0.0368021561791 http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
0.12899637517 http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
…
46. We're continuing work on this
• Enhancements to the detection tool
– add the other similarity methods (WordCount first)
– allow input of local CDX and WARC files
• Investigate characteristics of collections and TimeMaps that affect choosing thresholds
• Detect off-topic seeds (URI-Rs) in a collection
– determine collection aboutness
47. Tools for Managing Seed URIs
(Detecting Off-Topic Pages)
Old Dominion University
Web Science and Digital Libraries Group
http://ws-dl.cs.odu.edu/, @WebSciDL
Web Archiving Collaboration: New Tools and Models
June 4-5, 2015
Yasmin AlNoamany, Michele C. Weigle,
Michael L. Nelson
Python Tool: https://github.com/yasmina85/OffTopic-Detection
Editor's Notes
First deployed in 2006, Archive-It is a subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content.
We go to this effort to collect good seeds for our collections
We specify archiving period and depth
But once the crawl is running, we don't know what really happens to our seeds
We can tell what types of data we're gathering or if it's 404, but what if the content changes significantly?
Is there any way other than manual inspection to detect off-topic mementos?
For each seed, the user specifies how often it is crawled (the frequency is tunable by the user) and to what depth (e.g., follow the pages linked to from the seeds two levels out).
The Heritrix crawler at Archive-It then recrawls these seeds at the specified frequency and depth.
While the crawler captures each seed periodically at the times that Lori specified, there is no tool to detect when the page goes off-topic.
The textual content:
cosine similarity
intersection of the most frequent terms
Jaccard coefficient
The semantics of the text:
Web based kernel function using the search engine (SE)
Structural methods:
the change in number of words
the change in content length
Top 5 terms
Extract terms from Top 10 snippets
Combine original page terms with snippet terms
Compute Jaccard coefficient for similarity
Note: the paper says that the SE expansion is done only for the first memento. The terms from the snippets are combined with the original terms, and the combined set is compared against the candidate memento's terms.
Cosine similarity at threshold = 0.15 is the best single method
If cosine similarity between candidate memento and first memento < 0.15, then candidate memento is marked as 'off-topic'
If cosine similarity between candidate memento and first memento < 0.10 OR word count between candidate memento and first memento has decreased by more than 85%, then candidate memento is marked as 'off-topic'
We have shown that the tool reaches 98% accuracy (ACC). Next, we evaluate the tool on other Archive-It collections for which we do not know the answer.