SlideShare a Scribd company logo
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in
Web Archives
Yasmin AlNoamany, Michele C. Weigle,
and Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Group
http://ws-dl.cs.odu.edu/, @WebSciDL
1
Internet Archive Aug. 3, 2015
Detecting Off-Topic Pages in Web Archives
Pages go off-topic through time!
2
Detecting Off-Topic Pages in Web Archives
Over 70% of hamdeensabahy.com
archived versions (269) are off-topic
May 13, 2012: The page started as
ontopic.
May 24, 2012: Off-topic due to a
database error.
Mar. 21, 2013: Not working because of
financial problems.
May. 21, 2013: On-topic again June 5, 2014: The site has been hacked Oct. 10, 2014: The domain has expired.
3
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
Detecting Off-Topic Pages in Web Archives
Social media pages go off-topic
Dec. 22, 2011: Facebook page was relevant to
the Occupy collection
Aug. 10, 2012: URI redirects to www.facebook.com
4
http://wayback.archive-it.org/2950/*/http://www.facebook.com/MayorJeanQuan
Detecting Off-Topic Pages in Web Archives
TimeMap Behavior
5
Detecting Off-Topic Pages in Web Archives
TimeMap behavior
6
1. Always On
2. Step Function On
3. Step Function Off
4. Oscillating
5. Always Off
1. wayback.archive-it.org/2950/*/http://occupypsl.org
2. wayback.archive-it.org/2950/*/http://occupygso.tumblr.com
3. wayback.archive-it.org/2950/*/http://occupyashland.com
4. wayback.archive-it.org/2950/*/http://www.indyows.org
5. wayback.archive-it.org/2950/*/http://occupy605.com
Detecting Off-Topic Pages in Web Archives
TimeMap behavior
7
1. Always On
2. Step Function On
3. Step Function Off
4. Oscillating
5. Always Off
1. wayback.archive-it.org/2950/*/http://occupypsl.org
2. wayback.archive-it.org/2950/*/http://occupygso.tumblr.com
3. wayback.archive-it.org/2950/*/http://occupyashland.com
4. wayback.archive-it.org/2950/*/http://www.indyows.org
5. wayback.archive-it.org/2950/*/http://occupy605.com
0-2%
6-15%
~0%
74%
8-11%
Detecting Off-Topic Pages in Web Archives
Methods for detecting off-topic pages
8
Detecting Off-Topic Pages in Web Archives
The textual content
cosine similarity, intersecting the most frequent terms,
Jaccard similarity
9
Method Similarity
cosine 0.7
TF-Intersection 0.6
Jaccard 0.5
Method Similarity
cosine 0.0
TF-Intersection 0.0
Jaccard 0.0
Detecting Off-Topic Pages in Web Archives
The semantics of the text
Web based kernel function using the search engine (SE)
10
Egypt, Tahrir, president, protests, army, Cairo Egypt, protests, Morsi, Cairo, president
Feb. 2011 July 2013
Tahrir, Egypt, army Cairo, Morsi, protestsNo term-wise overlap
Method Similarity
SE-Kernel 0.7
Detecting Off-Topic Pages in Web Archives
Structural methods
no. of words, content-length
11
100 109
100 5
Method Similarity
(% change)
WordCount 0.09
Method Similarity
(% change)
WordCount -0.95
Detecting Off-Topic Pages in Web Archives
We manually labeled 15,760 mementos
for evaluating the methods
Egypt Revolution and Politics
URI-Ms: 6,570
Off-topic URI-Ms: 458
Occupy Movement
URI-Ms: 6,886
Off-topic URI-Ms: 384
Human Rights collection
URI-Ms: 2,304
Off-topic URI-Ms: 94
12
Detecting Off-Topic Pages in Web Archives
The results of evaluating the
similarity approaches
13
Detecting Off-Topic Pages in Web Archives
The best three combined methods
averaged on three collections
14
Detecting Off-Topic Pages in Web Archives
Finding off-topic pages in other
Archive-It collections
15
Detecting Off-Topic Pages in Web Archives
The average precision of evaluating
the 11 collections of Archive-It is 0.92
16
Detecting Off-Topic Pages in Web Archives
Detecting off-topic pages
in web archives
• A command-line tool for suggesting off-topic
pages in web archives
• Publically available at:
https://github.com/yasmina85/OffTopic-
Detection
17
Detecting Off-Topic Pages in Web Archives
Detecting off-topic pages in
1826 collection
python detect_off_topic.py -i 1826 -th 0.15
extracting seed list
…
http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org
…
50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-
it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-
it.org/1826/timemap/link/http://casademaryland.org
…
Downloading 4 mementos out of 306
Downloading 14 mementos out of 306
…
Detecting off-topic mementos
Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html/
18
Detecting Off-Topic Pages in Web Archives
Detecting off-topic pages for
http://hamdeensabahy.com/
Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/
…
Downloading 4 mementos out of 270
…
Extracting text from the html
…
Detecting off-topic mementos
Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/
-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/
-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/
19
python detect_off_topic.py -t https://wayback.archive-
it.org/2358/timemap/link/http://hamdeensabahy.com/ -m wcount -th -0.85
Detecting Off-Topic Pages in Web Archives
What next
• Feedback
• Collaboration
20
Detecting Off-Topic Pages in Web Archives
Extra Slides
21
Detecting Off-Topic Pages in Web Archives
On-topic: Egyptian
Revolution coverage
Off-topic: the domain
registration is lost
Web page goes off-topic
(Step Function On)
http://wayback.archive-it.org/2358/*/http://www.7amla.net
22
Detecting Off-Topic Pages in Web Archives
On-topic: Egyptian
Revolution coverage
Off-topic: the domain
registration is lost
23
Web page goes off-topic
(Step Function On)
http://wayback.archive-it.org/2358/*/http://www.7amla.net
Detecting Off-Topic Pages in Web Archives
Web pages go off-topic and
on-topic many times (Oscillating)
http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
c
On-topic: Egyptian
Revolution coverage
Off-topic: news
about Iraq
Off-topic:
news about Syria
On-topic:
Egypt news
Off-topic:
Palestine
24
24
Detecting Off-Topic Pages in Web Archives
On-topic: Egyptian
Revolution coverage
Off-topic: news
about Iraq
Off-topic:
news about Syria
Off-topic: news
about Syria
On-topic:
Egypt news
Off-topic:
Palestine
25
Web pages go off-topic and
on-topic many times (Oscillating)
http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
Detecting Off-Topic Pages in Web Archives
Preprocessing
1. Obtain the seed URIs from the front-end
interface of Archive-It
2. Obtain the TimeMap of the seed URIs from the
CDX file*
3. Extract the HTML of the mementos from the
WARC files*
4. Extract the text of the page using the Boilerpipe
library
5. Extract terms from the page, using scikit-learn to
tokenize, remove stop words, and apply
stemming
26
*locally hosted at ODU

More Related Content

What's hot

Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
Michael Nelson
 
Storytelling With Web Archives
Storytelling With Web ArchivesStorytelling With Web Archives
Storytelling With Web Archives
Shawn Jones
 
Characteristics of Social Media Stories
Characteristics of Social Media StoriesCharacteristics of Social Media Stories
Characteristics of Social Media Stories
Yasmin AlNoamany, PhD
 
Where Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsWhere Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive Collections
Shawn Jones
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
Michael Nelson
 
Improving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesImproving Collection Understanding in Web Archives
Improving Collection Understanding in Web Archives
Shawn Jones
 
Combining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesCombining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web Archives
Shawn Jones
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
Justin Brunelle
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich  the Live Web Experience Through StorytellingUsing Web Archives to Enrich  the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Yasmin AlNoamany, PhD
 
Information Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItInformation Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-It
Michele Weigle
 
Telling Stories with Web Archives
Telling Stories with Web ArchivesTelling Stories with Web Archives
Telling Stories with Web Archives
Michele Weigle
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Alexander Nwala
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Michael Nelson
 
Let's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library SystemLet's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library System
WiLS
 
Open the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School LibrariesOpen the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School LibrariesJoyce Kasman Valenza
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Yasmin AlNoamany, PhD
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Michael Nelson
 
The Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 WorkshopThe Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 Workshop
Richard Wallis
 
Linked data - A radical change?
Linked data - A radical change?Linked data - A radical change?
Linked data - A radical change?
Richard Wallis
 

What's hot (20)

Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Storytelling With Web Archives
Storytelling With Web ArchivesStorytelling With Web Archives
Storytelling With Web Archives
 
Characteristics of Social Media Stories
Characteristics of Social Media StoriesCharacteristics of Social Media Stories
Characteristics of Social Media Stories
 
Where Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsWhere Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive Collections
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
Improving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesImproving Collection Understanding in Web Archives
Improving Collection Understanding in Web Archives
 
Combining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesCombining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web Archives
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich  the Live Web Experience Through StorytellingUsing Web Archives to Enrich  the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
Information Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItInformation Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-It
 
Telling Stories with Web Archives
Telling Stories with Web ArchivesTelling Stories with Web Archives
Telling Stories with Web Archives
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Let's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library SystemLet's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library System
 
Open the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School LibrariesOpen the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School Libraries
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
The Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 WorkshopThe Power of Sharing Linked Data - ELAG 2014 Workshop
The Power of Sharing Linked Data - ELAG 2014 Workshop
 
Linked data - A radical change?
Linked data - A radical change?Linked data - A radical change?
Linked data - A radical change?
 

Viewers also liked

Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
Sawood Alam
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
Sawood Alam
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
Sawood Alam
 
JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?
LulwahMA
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Sawood Alam
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Yasmin AlNoamany, PhD
 
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesImpact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Myriam Traub
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
Sawood Alam
 
Quantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.isQuantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.is
maturban
 
Web Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives UnleashedWeb Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives Unleashed
mwe400
 
Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0
Justin Littman
 
Twitter Analysis: Fake News
Twitter Analysis: Fake  NewsTwitter Analysis: Fake  News
Twitter Analysis: Fake News
Erika Siregar
 
Good News/ Bad News
Good News/ Bad NewsGood News/ Bad News
Good News/ Bad News
LulwahMA
 
PID Signposting Pattern
PID Signposting PatternPID Signposting Pattern
PID Signposting Pattern
Herbert Van de Sompel
 
Event Pro Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)
Event Pro   Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)Event Pro   Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)
Event Pro Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)
birvin
 
Is There A You In Team Feb 25 2009 At The University Of Waterloo
Is There A You In Team    Feb 25 2009 At The University Of WaterlooIs There A You In Team    Feb 25 2009 At The University Of Waterloo
Is There A You In Team Feb 25 2009 At The University Of Waterloojimlove
 
WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...
WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...
WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...
Dean Bubley
 
The Changing World of Business Blogging
The Changing World of Business BloggingThe Changing World of Business Blogging
The Changing World of Business Blogging
Marketwired
 

Viewers also liked (20)

Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
 
JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital ArchivesImpact Analysis of OCR Quality on Research Tasks in Digital Archives
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
Quantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.isQuantifying Orphaned Annotations in Hypothes.is
Quantifying Orphaned Annotations in Hypothes.is
 
Web Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives UnleashedWeb Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives Unleashed
 
Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0Social Feed Manager presentation at Archives Unleashed 3.0
Social Feed Manager presentation at Archives Unleashed 3.0
 
Twitter Analysis: Fake News
Twitter Analysis: Fake  NewsTwitter Analysis: Fake  News
Twitter Analysis: Fake News
 
Good News/ Bad News
Good News/ Bad NewsGood News/ Bad News
Good News/ Bad News
 
PID Signposting Pattern
PID Signposting PatternPID Signposting Pattern
PID Signposting Pattern
 
Parts Of Speech
Parts Of SpeechParts Of Speech
Parts Of Speech
 
Ppp den danske model
Ppp den danske modelPpp den danske model
Ppp den danske model
 
Event Pro Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)
Event Pro   Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)Event Pro   Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)
Event Pro Mobile Business Solutions Presentation 11.19.08 (Compatability Mode)
 
Is There A You In Team Feb 25 2009 At The University Of Waterloo
Is There A You In Team    Feb 25 2009 At The University Of WaterlooIs There A You In Team    Feb 25 2009 At The University Of Waterloo
Is There A You In Team Feb 25 2009 At The University Of Waterloo
 
WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...
WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...
WebRTC Asia Forum - What is it & why is it important? Dean Bubley, Disruptive...
 
The Changing World of Business Blogging
The Changing World of Business BloggingThe Changing World of Business Blogging
The Changing World of Business Blogging
 

Similar to Detecting Off-Topic Pages in Web Archives

Detecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARCDetecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARC
Michele Weigle
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
Michael Nelson
 
The methods and practices of Linked Open Data
The methods and practices of Linked Open DataThe methods and practices of Linked Open Data
The methods and practices of Linked Open Data
Dongpo Deng
 
Linked Open Data for Archives
Linked Open Data for ArchivesLinked Open Data for Archives
Linked Open Data for Archives
Cliff Landis
 
Representing the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makersRepresenting the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makersjudell
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Michael Nelson
 
Supporting Account-based Queries for Archived Instagram Posts
Supporting Account-based Queries for Archived Instagram PostsSupporting Account-based Queries for Archived Instagram Posts
Supporting Account-based Queries for Archived Instagram Posts
Himarsha Jayanetti
 
Socialmediaandweb2.0
Socialmediaandweb2.0Socialmediaandweb2.0
Socialmediaandweb2.0
shwetanema
 
Digital Presence & Online Outreach
Digital Presence & Online Outreach Digital Presence & Online Outreach
Digital Presence & Online Outreach
Polly Farrington
 
Mapping the Dutch Blogosphere
Mapping the Dutch BlogosphereMapping the Dutch Blogosphere
Mapping the Dutch Blogosphere
Digital Methods Initiative
 
Your Library on the Go: Catching Up with Your Mobile Patrons
Your Library on the Go: Catching Up with Your Mobile PatronsYour Library on the Go: Catching Up with Your Mobile Patrons
Your Library on the Go: Catching Up with Your Mobile Patrons
techbrarian
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
Richard Wallis
 
Online Outreach & Marketing
Online Outreach & MarketingOnline Outreach & Marketing
Online Outreach & Marketing
Polly Farrington
 
Consuming Linked Data by Humans - WWW2010
Consuming Linked Data by Humans - WWW2010Consuming Linked Data by Humans - WWW2010
Consuming Linked Data by Humans - WWW2010
Juan Sequeda
 
Google Tools for Schools
Google Tools for SchoolsGoogle Tools for Schools
Google Tools for Schools
Polly Farrington
 
Technology Tips for Legal Aid Advocates
Technology Tips for Legal Aid AdvocatesTechnology Tips for Legal Aid Advocates
Technology Tips for Legal Aid Advocates
Kate Bladow
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
Michael Nelson
 
Wdie 8 Sem2
Wdie 8 Sem2Wdie 8 Sem2
Wdie 8 Sem2
Emma Duke-Williams
 
Wdie 8 Sem2
Wdie 8 Sem2Wdie 8 Sem2
Wdie 8 Sem2
Emma Duke-Williams
 

Similar to Detecting Off-Topic Pages in Web Archives (20)

Detecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARCDetecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARC
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
The methods and practices of Linked Open Data
The methods and practices of Linked Open DataThe methods and practices of Linked Open Data
The methods and practices of Linked Open Data
 
Linked Open Data for Archives
Linked Open Data for ArchivesLinked Open Data for Archives
Linked Open Data for Archives
 
Representing the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makersRepresenting the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makers
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Supporting Account-based Queries for Archived Instagram Posts
Supporting Account-based Queries for Archived Instagram PostsSupporting Account-based Queries for Archived Instagram Posts
Supporting Account-based Queries for Archived Instagram Posts
 
Socialmediaandweb2.0
Socialmediaandweb2.0Socialmediaandweb2.0
Socialmediaandweb2.0
 
Digital Presence & Online Outreach
Digital Presence & Online Outreach Digital Presence & Online Outreach
Digital Presence & Online Outreach
 
Go beyondgoogle
Go beyondgoogleGo beyondgoogle
Go beyondgoogle
 
Mapping the Dutch Blogosphere
Mapping the Dutch BlogosphereMapping the Dutch Blogosphere
Mapping the Dutch Blogosphere
 
Your Library on the Go: Catching Up with Your Mobile Patrons
Your Library on the Go: Catching Up with Your Mobile PatronsYour Library on the Go: Catching Up with Your Mobile Patrons
Your Library on the Go: Catching Up with Your Mobile Patrons
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
 
Online Outreach & Marketing
Online Outreach & MarketingOnline Outreach & Marketing
Online Outreach & Marketing
 
Consuming Linked Data by Humans - WWW2010
Consuming Linked Data by Humans - WWW2010Consuming Linked Data by Humans - WWW2010
Consuming Linked Data by Humans - WWW2010
 
Google Tools for Schools
Google Tools for SchoolsGoogle Tools for Schools
Google Tools for Schools
 
Technology Tips for Legal Aid Advocates
Technology Tips for Legal Aid AdvocatesTechnology Tips for Legal Aid Advocates
Technology Tips for Legal Aid Advocates
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
 
Wdie 8 Sem2
Wdie 8 Sem2Wdie 8 Sem2
Wdie 8 Sem2
 
Wdie 8 Sem2
Wdie 8 Sem2Wdie 8 Sem2
Wdie 8 Sem2
 

More from Yasmin AlNoamany, PhD

A Guide for Reproducible Research
A Guide for Reproducible ResearchA Guide for Reproducible Research
A Guide for Reproducible Research
Yasmin AlNoamany, PhD
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
Yasmin AlNoamany, PhD
 
Data curation vanderbilt
Data curation vanderbiltData curation vanderbilt
Data curation vanderbilt
Yasmin AlNoamany, PhD
 
Generating stories from Archive-It collections
Generating stories from Archive-It collectionsGenerating stories from Archive-It collections
Generating stories from Archive-It collections
Yasmin AlNoamany, PhD
 
User Access Patterns in Web Archives
User Access Patterns in Web ArchivesUser Access Patterns in Web Archives
User Access Patterns in Web ArchivesYasmin AlNoamany, PhD
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet ArchiveYasmin AlNoamany, PhD
 
Access Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesAccess Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesYasmin AlNoamany, PhD
 
Access Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesAccess Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesYasmin AlNoamany, PhD
 

More from Yasmin AlNoamany, PhD (8)

A Guide for Reproducible Research
A Guide for Reproducible ResearchA Guide for Reproducible Research
A Guide for Reproducible Research
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
 
Data curation vanderbilt
Data curation vanderbiltData curation vanderbilt
Data curation vanderbilt
 
Generating stories from Archive-It collections
Generating stories from Archive-It collectionsGenerating stories from Archive-It collections
Generating stories from Archive-It collections
 
User Access Patterns in Web Archives
User Access Patterns in Web ArchivesUser Access Patterns in Web Archives
User Access Patterns in Web Archives
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
 
Access Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesAccess Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web Archives
 
Access Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web ArchivesAccess Patterns for Robots and Humans in Web Archives
Access Patterns for Robots and Humans in Web Archives
 

Recently uploaded

Elevate Your Brand with Digital Marketing for Fashion Industry
Elevate Your Brand with Digital Marketing for Fashion IndustryElevate Your Brand with Digital Marketing for Fashion Industry
Elevate Your Brand with Digital Marketing for Fashion Industry
Matebiz Pvt. Ltd
 
Waikiki Sunset Catamaran ! MAITAI Catamaran
Waikiki Sunset Catamaran !  MAITAI CatamaranWaikiki Sunset Catamaran !  MAITAI Catamaran
Waikiki Sunset Catamaran ! MAITAI Catamaran
maitaicatamaran
 
Colors of Wall Paint and Their Mentally Properties.pptx
Colors of Wall Paint and Their Mentally Properties.pptxColors of Wall Paint and Their Mentally Properties.pptx
Colors of Wall Paint and Their Mentally Properties.pptx
Brendon Jonathan
 
What Are the Latest Trends in Endpoint Security for 2024?
What Are the Latest Trends in Endpoint Security for 2024?What Are the Latest Trends in Endpoint Security for 2024?
What Are the Latest Trends in Endpoint Security for 2024?
VRS Technologies
 
Hospitality Training for Hotel Industries
Hospitality Training for Hotel IndustriesHospitality Training for Hotel Industries
Hospitality Training for Hotel Industries
VanieTAnggita
 
Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...
Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...
Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...
gitapress3
 
SECUREX UK FOR SECURITY SERVICES AND MOBILE PATROL
SECUREX UK FOR SECURITY SERVICES AND MOBILE PATROLSECUREX UK FOR SECURITY SERVICES AND MOBILE PATROL
SECUREX UK FOR SECURITY SERVICES AND MOBILE PATROL
securexukweb
 
Bulk SMS Service Provider In Mumbai | sms2orbit
Bulk SMS Service Provider In Mumbai | sms2orbitBulk SMS Service Provider In Mumbai | sms2orbit
Bulk SMS Service Provider In Mumbai | sms2orbit
Orbit Messaging Hub
 
The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...
The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...
The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...
Softradix Technologies
 
Inspect Edge & NSPIRE Inspection Application - Streamline Housing Inspections
Inspect Edge & NSPIRE Inspection Application - Streamline Housing InspectionsInspect Edge & NSPIRE Inspection Application - Streamline Housing Inspections
Inspect Edge & NSPIRE Inspection Application - Streamline Housing Inspections
inspectedge1
 
Earthmovers: Top Earth Moving Equipments
Earthmovers: Top Earth Moving EquipmentsEarthmovers: Top Earth Moving Equipments
Earthmovers: Top Earth Moving Equipments
earthmoverinternatio
 
Comprehensive Water Damage Restoration Services
Comprehensive Water Damage Restoration ServicesComprehensive Water Damage Restoration Services
Comprehensive Water Damage Restoration Services
kleenupdisaster
 
Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...
Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...
Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...
RNayak3
 
Maximizing Efficiency with Integrated Water Management Systems
Maximizing Efficiency with Integrated Water Management SystemsMaximizing Efficiency with Integrated Water Management Systems
Maximizing Efficiency with Integrated Water Management Systems
Irri Design Studio
 
All Trophies at Trophy-World Malaysia | Custom Trophies & Plaques Supplier
All Trophies at Trophy-World Malaysia | Custom Trophies & Plaques SupplierAll Trophies at Trophy-World Malaysia | Custom Trophies & Plaques Supplier
All Trophies at Trophy-World Malaysia | Custom Trophies & Plaques Supplier
Trophy-World Malaysia Your #1 Rated Trophy Supplier
 
Office Business Furnishings | Office Equipment
Office Business Furnishings |  Office EquipmentOffice Business Furnishings |  Office Equipment
Office Business Furnishings | Office Equipment
OFWD
 
BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...
BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...
BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...
gitapress3
 
x ray baggage scanner manufacturers in India
x ray baggage scanner manufacturers in Indiax ray baggage scanner manufacturers in India
x ray baggage scanner manufacturers in India
Gujar Industries India Pvt. Ltd
 
BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...
BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...
BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...
gitapress3
 
The Best Premium IPTV Service Frane.docx
The Best Premium IPTV Service Frane.docxThe Best Premium IPTV Service Frane.docx
The Best Premium IPTV Service Frane.docx
Industry Foods UK
 

Recently uploaded (20)

Elevate Your Brand with Digital Marketing for Fashion Industry
Elevate Your Brand with Digital Marketing for Fashion IndustryElevate Your Brand with Digital Marketing for Fashion Industry
Elevate Your Brand with Digital Marketing for Fashion Industry
 
Waikiki Sunset Catamaran ! MAITAI Catamaran
Waikiki Sunset Catamaran !  MAITAI CatamaranWaikiki Sunset Catamaran !  MAITAI Catamaran
Waikiki Sunset Catamaran ! MAITAI Catamaran
 
Colors of Wall Paint and Their Mentally Properties.pptx
Colors of Wall Paint and Their Mentally Properties.pptxColors of Wall Paint and Their Mentally Properties.pptx
Colors of Wall Paint and Their Mentally Properties.pptx
 
What Are the Latest Trends in Endpoint Security for 2024?
What Are the Latest Trends in Endpoint Security for 2024?What Are the Latest Trends in Endpoint Security for 2024?
What Are the Latest Trends in Endpoint Security for 2024?
 
Hospitality Training for Hotel Industries
Hospitality Training for Hotel IndustriesHospitality Training for Hotel Industries
Hospitality Training for Hotel Industries
 
Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...
Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...
Top Best Astrologer +91-9463629203 LoVe Problem SolUtion specialist In InDia ...
 
SECUREX UK FOR SECURITY SERVICES AND MOBILE PATROL
SECUREX UK FOR SECURITY SERVICES AND MOBILE PATROLSECUREX UK FOR SECURITY SERVICES AND MOBILE PATROL
SECUREX UK FOR SECURITY SERVICES AND MOBILE PATROL
 
Bulk SMS Service Provider In Mumbai | sms2orbit
Bulk SMS Service Provider In Mumbai | sms2orbitBulk SMS Service Provider In Mumbai | sms2orbit
Bulk SMS Service Provider In Mumbai | sms2orbit
 
The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...
The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...
The Jamstack Revolution: Building Dynamic Websites with Static Site Generator...
 
Inspect Edge & NSPIRE Inspection Application - Streamline Housing Inspections
Inspect Edge & NSPIRE Inspection Application - Streamline Housing InspectionsInspect Edge & NSPIRE Inspection Application - Streamline Housing Inspections
Inspect Edge & NSPIRE Inspection Application - Streamline Housing Inspections
 
Earthmovers: Top Earth Moving Equipments
Earthmovers: Top Earth Moving EquipmentsEarthmovers: Top Earth Moving Equipments
Earthmovers: Top Earth Moving Equipments
 
Comprehensive Water Damage Restoration Services
Comprehensive Water Damage Restoration ServicesComprehensive Water Damage Restoration Services
Comprehensive Water Damage Restoration Services
 
Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...
Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...
Unlocking Insights: AI-powered Enhanced Due Diligence Strategies for Increase...
 
Maximizing Efficiency with Integrated Water Management Systems
Maximizing Efficiency with Integrated Water Management SystemsMaximizing Efficiency with Integrated Water Management Systems
Maximizing Efficiency with Integrated Water Management Systems
 
All Trophies at Trophy-World Malaysia | Custom Trophies & Plaques Supplier
All Trophies at Trophy-World Malaysia | Custom Trophies & Plaques SupplierAll Trophies at Trophy-World Malaysia | Custom Trophies & Plaques Supplier
All Trophies at Trophy-World Malaysia | Custom Trophies & Plaques Supplier
 
Office Business Furnishings | Office Equipment
Office Business Furnishings |  Office EquipmentOffice Business Furnishings |  Office Equipment
Office Business Furnishings | Office Equipment
 
BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...
BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...
BesT panDit Ji LoVe problem solution 9463629203 UK uSA California New Zealand...
 
x ray baggage scanner manufacturers in India
x ray baggage scanner manufacturers in Indiax ray baggage scanner manufacturers in India
x ray baggage scanner manufacturers in India
 
BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...
BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...
BEst VASHIKARAN SPECIALIST 9463629203 in UK Baba ji Love Marriage problem sol...
 
The Best Premium IPTV Service Frane.docx
The Best Premium IPTV Service Frane.docxThe Best Premium IPTV Service Frane.docx
The Best Premium IPTV Service Frane.docx
 

Detecting Off-Topic Pages in Web Archives

  • 1. Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives Detecting Off-Topic Pages in Web Archives Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson Old Dominion University Web Science and Digital Libraries Group http://ws-dl.cs.odu.edu/, @WebSciDL 1 Internet Archive Aug. 3, 2015
  • 2. Detecting Off-Topic Pages in Web Archives Pages go off-topic through time! 2
  • 3. Detecting Off-Topic Pages in Web Archives Over 70% of hamdeensabahy.com archived versions (269) are off-topic May 13, 2012: The page started as ontopic. May 24, 2012: Off-topic due to a database error. Mar. 21, 2013: Not working because of financial problems. May. 21, 2013: On-topic again June 5, 2014: The site has been hacked Oct. 10, 2014: The domain has expired. 3 http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 4. Detecting Off-Topic Pages in Web Archives Social media pages go off-topic Dec. 22, 2011: Facebook page was relevant to the Occupy collection Aug. 10, 2012: URI redirects to www.facebook.com 4 http://wayback.archive-it.org/2950/*/http://www.facebook.com/MayorJeanQuan
  • 5. Detecting Off-Topic Pages in Web Archives TimeMap Behavior 5
  • 6. Detecting Off-Topic Pages in Web Archives TimeMap behavior 6 1. Always On 2. Step Function On 3. Step Function Off 4. Oscillating 5. Always Off 1. wayback.archive-it.org/2950/*/http://occupypsl.org 2. wayback.archive-it.org/2950/*/http://occupygso.tumblr.com 3. wayback.archive-it.org/2950/*/http://occupyashland.com 4. wayback.archive-it.org/2950/*/http://www.indyows.org 5. wayback.archive-it.org/2950/*/http://occupy605.com
  • 7. Detecting Off-Topic Pages in Web Archives TimeMap behavior 7 1. Always On 2. Step Function On 3. Step Function Off 4. Oscillating 5. Always Off 1. wayback.archive-it.org/2950/*/http://occupypsl.org 2. wayback.archive-it.org/2950/*/http://occupygso.tumblr.com 3. wayback.archive-it.org/2950/*/http://occupyashland.com 4. wayback.archive-it.org/2950/*/http://www.indyows.org 5. wayback.archive-it.org/2950/*/http://occupy605.com 0-2% 6-15% ~0% 74% 8-11%
  • 8. Detecting Off-Topic Pages in Web Archives Methods for detecting off-topic pages 8
  • 9. Detecting Off-Topic Pages in Web Archives The textual content cosine similarity, intersecting the most frequent terms, Jaccard similarity 9 Method Similarity cosine 0.7 TF-Intersection 0.6 Jaccard 0.5 Method Similarity cosine 0.0 TF-Intersection 0.0 Jaccard 0.0
  • 10. Detecting Off-Topic Pages in Web Archives The semantics of the text Web based kernel function using the search engine (SE) 10 Egypt, Tahrir, president, protests, army, Cairo Egypt, protests, Morsi, Cairo, president Feb. 2011 July 2013 Tahrir, Egypt, army Cairo, Morsi, protestsNo term-wise overlap Method Similarity SE-Kernel 0.7
  • 11. Detecting Off-Topic Pages in Web Archives Structural methods no. of words, content-length 11 100 109 100 5 Method Similarity (% change) WordCount 0.09 Method Similarity (% change) WordCount -0.95
  • 12. Detecting Off-Topic Pages in Web Archives We manually labeled 15,760 mementos for evaluating the methods Egypt Revolution and Politics URI-Ms: 6,570 Off-topic URI-Ms: 458 Occupy Movement URI-Ms: 6,886 Off-topic URI-Ms: 384 Human Rights collection URI-Ms: 2,304 Off-topic URI-Ms: 94 12
  • 13. Detecting Off-Topic Pages in Web Archives The results of evaluating the similarity approaches 13
  • 14. Detecting Off-Topic Pages in Web Archives The best three combined methods averaged on three collections 14
  • 15. Detecting Off-Topic Pages in Web Archives Finding off-topic pages in other Archive-It collections 15
  • 16. Detecting Off-Topic Pages in Web Archives The average precision of evaluating the 11 collections of Archive-It is 0.92 16
  • 17. Detecting Off-Topic Pages in Web Archives Detecting off-topic pages in web archives • A command-line tool for suggesting off-topic pages in web archives • Publically available at: https://github.com/yasmina85/OffTopic- Detection 17
  • 18. Detecting Off-Topic Pages in Web Archives Detecting off-topic pages in 1826 collection python detect_off_topic.py -i 1826 -th 0.15 extracting seed list … http://agroecol.umd.edu/Research/index.cfm http://casademaryland.org … 50 URIs are extracted from collection https://archive-it.org/collections/1826 Downloading timemap using uri http://wayback.archive- it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm Downloading timemap using uri http://wayback.archive- it.org/1826/timemap/link/http://casademaryland.org … Downloading 4 mementos out of 306 Downloading 14 mementos out of 306 … Detecting off-topic mementos Similarity memento_uri 0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/ 0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html/ 18
  • 19. Detecting Off-Topic Pages in Web Archives Detecting off-topic pages for http://hamdeensabahy.com/ Downloading 0 mementos out of 270 http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/ http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/ http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/ http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/ http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/ http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/ … Downloading 4 mementos out of 270 … Extracting text from the html … Detecting off-topic mementos Similarity memento_uri -0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/ -0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/ -0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/ -0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/ 19 python detect_off_topic.py -t https://wayback.archive- it.org/2358/timemap/link/http://hamdeensabahy.com/ -m wcount -th -0.85
  • 20. Detecting Off-Topic Pages in Web Archives What next • Feedback • Collaboration 20
  • 21. Detecting Off-Topic Pages in Web Archives Extra Slides 21
  • 22. Detecting Off-Topic Pages in Web Archives On-topic: Egyptian Revolution coverage Off-topic: the domain registration is lost Web page goes off-topic (Step Function On) http://wayback.archive-it.org/2358/*/http://www.7amla.net 22
  • 23. Detecting Off-Topic Pages in Web Archives On-topic: Egyptian Revolution coverage Off-topic: the domain registration is lost 23 Web page goes off-topic (Step Function On) http://wayback.archive-it.org/2358/*/http://www.7amla.net
  • 24. Detecting Off-Topic Pages in Web Archives Web pages go off-topic and on-topic many times (Oscillating) http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/ c On-topic: Egyptian Revolution coverage Off-topic: news about Iraq Off-topic: news about Syria On-topic: Egypt news Off-topic: Palestine 24 24
  • 25. Detecting Off-Topic Pages in Web Archives On-topic: Egyptian Revolution coverage Off-topic: news about Iraq Off-topic: news about Syria Off-topic: news about Syria On-topic: Egypt news Off-topic: Palestine 25 Web pages go off-topic and on-topic many times (Oscillating) http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
  • 26. Detecting Off-Topic Pages in Web Archives Preprocessing 1. Obtain the seed URIs from the front-end interface of Archive-It 2. Obtain the TimeMap of the seed URIs from the CDX file* 3. Extract the HTML of the mementos from the WARC files* 4. Extract the text of the page using the Boilerpipe library 5. Extract terms from the page, using scikit-learn to tokenize, remove stop words, and apply stemming 26 *locally hosted at ODU