Tools for Managing Seed URIs
(Detecting Off-Topic Pages)
Old Dominion University
Web Science and Digital Libraries Group
http://ws-dl.cs.odu.edu/, @WebSciDL
Web Archiving Collaboration: New Tools and Models
June 4-5, 2015
Yasmin AlNoamany, Michele C. Weigle,
Michael L. Nelson
Funded by Columbia University Libraries Web Archiving Incentive program
Archive-It hosts curated web collections
> 3,000 collections
~340 institutions
> 10B archived pages
Curator's view of Archive-It
The collection curator specifies seed URIs
Curators specify the breadth and depth of the crawl
Current tools measure HTTP events, not "aboutness"
Pages can go off-topic through time
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
Over 60% of archived versions of hamdeensabahy.com are off-topic
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
Social media pages can go off-topic
Dec. 22, 2011: Facebook page was relevant to the Occupy collection
Aug. 10, 2012: URI redirects to www.facebook.com
http://wayback.archive-it.org/2950/*/http://www.facebook.com/MayorJeanQuan
Classifying web page behavior over time
A TimeMap is the list of a URI-R's mementos
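To make the TimeMap idea concrete, here is a minimal sketch that reads a link-format TimeMap and lists its mementos in time order. The sample text and its timestamps are made up for illustration, and `parse_timemap` is a hypothetical helper, not part of the tool presented in this talk.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical excerpt of a link-format TimeMap; the timestamps are invented.
SAMPLE_TIMEMAP = '''<http://hamdeensabahy.com/>; rel="original",
<http://wayback.archive-it.org/2358/20120513000000/http://hamdeensabahy.com/>; rel="first memento"; datetime="Sun, 13 May 2012 00:00:00 GMT",
<http://wayback.archive-it.org/2358/20120524000000/http://hamdeensabahy.com/>; rel="memento"; datetime="Thu, 24 May 2012 00:00:00 GMT"'''

# One entry is <URI> followed by ;-separated key="value" attributes.
ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')

def parse_timemap(text):
    """Return (datetime, URI-M) pairs for every memento entry, oldest first."""
    mementos = []
    for uri, attr_text in ENTRY.findall(text):
        attrs = dict(ATTR.findall(attr_text))
        # rel may hold several values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)

mementos = parse_timemap(SAMPLE_TIMEMAP)
for when, uri_m in mementos:
    print(when.date(), uri_m)
```

The entry with rel="original" is the URI-R itself, so it is skipped; only entries whose rel contains "memento" are kept.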
We identified 5 classes of TimeMaps
1. Always On: wayback.archive-it.org/2950/*/http://occupypsl.org
2. Step Function On: wayback.archive-it.org/2950/*/http://occupygso.tumblr.com
3. Step Function Off: wayback.archive-it.org/2950/*/http://occupyashland.com
4. Oscillating: wayback.archive-it.org/2950/*/http://www.indyows.org
5. Always Off: wayback.archive-it.org/2950/*/http://occupy605.com
A web page goes off-topic (Step Function On)
On-topic: Egyptian Revolution coverage
Off-topic: the domain registration is lost
http://wayback.archive-it.org/2358/*/http://www.7amla.net
A web page goes off-topic and on-topic many times (Oscillating)
On-topic: Egyptian Revolution coverage
Off-topic: news about Iraq
Off-topic: news about Syria
Off-topic: news about Syria
On-topic: Egypt news
Off-topic: Palestine
http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
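The five classes can be read as patterns in a chronological sequence of on-topic/off-topic labels. The sketch below is one possible reading, not the authors' definition: it assumes "Step Function On" means a single switch from on-topic to off-topic (as in the 7amla.net example) and that "Step Function Off" is the mirror case, which the slides do not spell out. `classify_timemap` is a hypothetical name.

```python
def classify_timemap(labels):
    """labels: chronological list of booleans, True = on-topic memento."""
    if not labels:
        raise ValueError("empty TimeMap")
    if all(labels):
        return "Always On"
    if not any(labels):
        return "Always Off"
    # Count how many times the label flips between consecutive mementos.
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    if changes == 1:
        # Assumption: the class name describes the single switch away
        # from the starting state.
        return "Step Function On" if labels[0] else "Step Function Off"
    return "Oscillating"

# A page that is on-topic three times, then off-topic three times.
example = classify_timemap([True, True, True, False, False, False])
print(example)
```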
Most TimeMaps are Always On
Percentages shown on the slide for the five classes: 0-2%, 6-15%, ~0%, 74%, 8-11%
Methods for detecting off-topic pages
From Archive-It collection to terms
1. Obtain the seed URIs from the front-end interface of Archive-It
2. Obtain the TimeMap of the seed URIs from the CDX file*
3. Extract the HTML of the mementos from the WARC files*
4. Extract the text of the page using the Boilerpipe library
5. Extract terms from the page, using scikit-learn to tokenize, remove stop words, and apply stemming
*locally hosted at ODU
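Step 5 can be illustrated with a standard-library-only sketch. It is a stand-in, not the pipeline used in the study: the stop list is a tiny illustrative sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer.

```python
import re

# Tiny illustrative stop list; a real run would use a full English list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_terms(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

terms = extract_terms("The protesters are protesting in Tahrir")
print(terms)
```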
We investigated 6 similarity metrics
• Textual Content
– cosine similarity of TF-IDF
– intersection of the 20 most frequent terms
– Jaccard similarity coefficient
• Semantics
– Web-based kernel function using a search engine (SE)
• Structural
– the change in number of words
– the change in content length
Textual Content
cosine similarity, intersecting the most frequent terms, Jaccard similarity
Method Similarity
cosine 0.7
TF-Intersection 0.6
Jaccard 0.5
Method Similarity
cosine 0.0
TF-Intersection 0.0
Jaccard 0.0
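The three textual measures can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the study's code: the IDF uses a smoothed formula, TF-Intersection is taken to be the share of the first memento's top-k terms that also appear in the later memento's top-k, and the term lists are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms shared by all documents keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def tf_intersection(a, b, k=20):
    # Assumption: normalize by the number of top terms in the first memento.
    top_a = {t for t, _ in Counter(a).most_common(k)}
    top_b = {t for t, _ in Counter(b).most_common(k)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0

# Invented term lists: the first memento, a later on-topic one, a later off-topic one.
first = ["egypt", "tahrir", "protest", "cairo"]
later_on = ["egypt", "cairo", "protest", "morsi"]
later_off = ["domain", "expired", "renew"]

vectors = tfidf_vectors([first, later_on, later_off])
print(cosine(vectors[0], vectors[1]), jaccard(first, later_on), tf_intersection(first, later_on))
print(cosine(vectors[0], vectors[2]), jaccard(first, later_off), tf_intersection(first, later_off))
```

Each later memento is compared with the first memento of the TimeMap; with no shared terms all three measures drop to 0.0, matching the second table above.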
Semantics of the Text
Web-based kernel function using the search engine (SE)
Feb. 2011: Tahrir, Egypt, army
July 2013: Cairo, Morsi, protests
No term-wise overlap
After expansion with search engine results:
Feb. 2011: Egypt, Tahrir, president, protests, army, Cairo
July 2013: Egypt, protests, Morsi, Cairo, president
Method Similarity
SE-Kernel 0.7
Technique inspired by Sahami and Heilman, WWW 2006
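The idea can be sketched as "expand each term set, then compare the expansions". The code below is a simplification and not the kernel from the cited paper: `expand` is a hypothetical helper that looks up the two expansions shown on the slide instead of querying a search engine, and the comparison is a plain cosine over unweighted term sets.

```python
import math

# Stand-in for search-engine expansion, using the expansions shown on the slide.
EXPANSIONS = {
    ("Tahrir", "Egypt", "army"): ["Egypt", "Tahrir", "president", "protests", "army", "Cairo"],
    ("Cairo", "Morsi", "protests"): ["Egypt", "protests", "Morsi", "Cairo", "president"],
}

def expand(terms):
    """Hypothetical helper: a real version would query a search engine."""
    return EXPANSIONS[tuple(terms)]

def kernel_similarity(terms_a, terms_b):
    """Cosine similarity between the expanded term sets (binary weights)."""
    a, b = set(expand(terms_a)), set(expand(terms_b))
    return len(a & b) / math.sqrt(len(a) * len(b))

score = kernel_similarity(["Tahrir", "Egypt", "army"], ["Cairo", "Morsi", "protests"])
print(round(score, 2))
```

The original term sets share nothing, but their expansions share four terms (Egypt, protests, Cairo, president), so the score is 4/sqrt(6 x 5), roughly 0.73.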
Structural Methods
no. of words, content-length
Word count 100 -> 109
Method % change
WordCount 0.09
Word count 100 -> 5
Method % change
WordCount -0.95
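The two values above follow from a relative change against the first memento. A minimal sketch is shown below; the function names are hypothetical, and the -0.85 default is the WordCount threshold reported in the evaluation later in this talk. Whether the comparison is strict is an assumption.

```python
def relative_change(first_count, later_count):
    """Change in word count relative to the first memento."""
    if first_count == 0:
        raise ValueError("first memento has no words")
    return (later_count - first_count) / first_count

def is_off_topic_by_size(first_count, later_count, threshold=-0.85):
    # Assumption: a memento is flagged when it shrinks past the threshold.
    return relative_change(first_count, later_count) < threshold

print(relative_change(100, 109), is_off_topic_by_size(100, 109))
print(relative_change(100, 5), is_off_topic_by_size(100, 5))
```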
We built a gold standard data set to evaluate the methods
We manually labeled 15,760 mementos
Egypt Revolution and Politics
URI-Rs: 136
URI-Ms: 6,886
Off-topic URI-Ms: 384
Occupy Movement
URI-Rs: 255
URI-Ms: 6,570
Off-topic URI-Ms: 458
Columbia Univ. Human Rights collection
URI-Rs: 198
URI-Ms: 2,304
Off-topic URI-Ms: 94
Example of manually labeled set
Future work: convert to annotated/extended TimeMap format
id date URI label
9 20120124014240 http://wayback.archive-it.org/2950/20120124014240/http://occupysarasota.com/ 1
9 20120131014118 http://wayback.archive-it.org/2950/20120131014118/http://occupysarasota.com/ 1
9 20120207014119 http://wayback.archive-it.org/2950/20120207014119/http://occupysarasota.com/ 1
9 20120501041141 http://wayback.archive-it.org/2950/20120501041141/http://occupysarasota.com/ 0
9 20120508032644 http://wayback.archive-it.org/2950/20120508032644/http://occupysarasota.com/ 0
9 20120515034720 http://wayback.archive-it.org/2950/20120515034720/http://occupysarasota.com/ 0
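Rows in this format are easy to load for evaluation. The sketch below parses two of the rows shown above; reading label 1 as on-topic and 0 as off-topic is an assumption based on the first memento being labeled 1, and `parse_row` is a hypothetical helper.

```python
from datetime import datetime

# Two rows copied from the example above.
ROWS = """9 20120124014240 http://wayback.archive-it.org/2950/20120124014240/http://occupysarasota.com/ 1
9 20120501041141 http://wayback.archive-it.org/2950/20120501041141/http://occupysarasota.com/ 0"""

def parse_row(line):
    """Split one whitespace-separated row into typed fields."""
    seed_id, stamp, uri_m, label = line.split()
    return {
        "id": int(seed_id),
        "datetime": datetime.strptime(stamp, "%Y%m%d%H%M%S"),  # 14-digit timestamp
        "uri_m": uri_m,
        "on_topic": label == "1",  # assumption: 1 = on-topic, 0 = off-topic
    }

records = [parse_row(line) for line in ROWS.splitlines()]
for record in records:
    print(record["datetime"].isoformat(), record["on_topic"])
```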
Evaluated 6 methods at 21 thresholds
• Assumed first memento was on-topic
• Combined two methods ('OR') to find best combination method
– 15 combinations
– 6,615 tests (15 combinations x 21 thresholds x 21 thresholds)
• Averaged the results at each threshold over the three collections
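The test count can be checked by enumerating the grid. In this sketch the method names are taken from the results table in this talk, while the evenly spaced 0.00 to 1.00 threshold values are an assumption (only the number of thresholds, 21, is given).

```python
from itertools import combinations

METHODS = ["Cosine", "TF-Intersection", "Jaccard", "SEKernel", "WordCount", "Bytes"]

# 21 evenly spaced thresholds; the 0.05 step and the range are assumptions.
THRESHOLDS = [round(i * 0.05, 2) for i in range(21)]

# Every unordered pair of methods, each method with its own threshold.
pairs = list(combinations(METHODS, 2))
tests = [(a, ta, b, tb) for a, b in pairs for ta in THRESHOLDS for tb in THRESHOLDS]

def combined_off_topic(flag_a, flag_b):
    """'OR' combination: off-topic if either method flags the memento."""
    return flag_a or flag_b

print(len(pairs), len(tests))
```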
Evaluated based on 5 metrics
• False positives (FP)
– on-topic labeled as off-topic
• False negatives (FN)
– off-topic labeled as on-topic
• Accuracy (ACC)
– proportion of correct classifications
– (TP + TN)/(TP + FP + FN + TN)
• F1 score
– harmonic mean of precision and recall
– 2TP/(2TP + FP + FN)
• AUC
– area under the ROC curve
– ROC: plots false positive rate vs. true positive rate
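The accuracy and F1 formulas above can be applied directly to a list of gold labels and predictions. In this sketch off-topic is the positive class, matching the FP/FN definitions on the slide; the label lists are invented and AUC is left out.

```python
def confusion_counts(truth, predicted):
    """Both arguments are lists of booleans, True = off-topic (positive class)."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# Invented example: six mementos, three truly off-topic.
truth = [False, False, False, True, True, True]
predicted = [False, False, True, True, True, False]

tp, fp, fn, tn = confusion_counts(truth, predicted)
print(tp, fp, fn, tn, accuracy(tp, fp, fn, tn), f1_score(tp, fp, fn))
```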
Cosine Similarity performed well
Similarity Measure Threshold FP FN FP+FN ACC F1 AUC
Cosine|WordCount 0.10|-0.85 24 10 34 0.987 0.906 0.968
Cosine|SEKernel 0.10|0.00 6 35 40 0.990 0.901 0.934
Cosine 0.15 31 22 53 0.983 0.881 0.961
WordCount|SEKernel -0.80|0.00 14 27 42 0.985 0.818 0.885
WordCount -0.85 6 44 50 0.982 0.806 0.870
SEKernel 0.05 64 83 147 0.965 0.683 0.865
Bytes -0.65 28 133 161 0.962 0.584 0.746
Jaccard 0.05 74 86 159 0.962 0.538 0.809
TF-Intersection 0.00 49 104 153 0.967 0.537 0.740
Finding off-topic pages in other Archive-It collections
Applied best method to 11 Archive-It collections
• Cosine|Word Count with 0.10|-0.85 thresholds
• Collection Characteristics
– governmental, event-based, theme-based
– time spans of 1 week - 7 years
– 35 - 1459 URI-Rs
– 118 - 10,283 URI-Ms
Average precision of 0.92 on 11 Archive-It collections
ID    Collection                 URI-Rs  URI-Ms  Off-topic URI-Ms  Affected URI-Rs  TP   FP  P
2893  Global Food Crisis         65      3063    22                7                22   0   1.000
1084  Government in Alaska       68      506     16                4                16   0   1.000
2966  Virginia Tech Shootings    239     1670    24                2                24   0   1.000
2017  Wikileaks 2010 Document    35      2360    107               8                107  0   1.000
2323  Jasmine Revolution 2011    231     4076    114               31               107  7   0.939
1827  IT Historical Resource     1459    10,283  59                34               45   14  0.763
1475  Human Rights Document      147     1530    54                20               39   15  0.722
1826  Maryland State Document    69      184     0                 0                -    -   -
694   April 16 Archive           35      118     0                 0                -    -   -
2535  Brazilian School Shooting  476     1092    0                 0                -    -   -
2823  Russia Plane Crash         65      447     0                 0                -    -   -
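The 0.92 figure can be reproduced from the TP and FP columns. This sketch assumes the average is taken over the seven collections where off-topic mementos were detected, since precision is undefined for the four rows with no detections.

```python
# (TP, FP) per collection ID, copied from the table above.
DETECTIONS = {
    2893: (22, 0),
    1084: (16, 0),
    2966: (24, 0),
    2017: (107, 0),
    2323: (107, 7),
    1827: (45, 14),
    1475: (39, 15),
}

# Precision = TP / (TP + FP) for each collection with detections.
precision = {cid: tp / (tp + fp) for cid, (tp, fp) in DETECTIONS.items()}
average_precision = sum(precision.values()) / len(precision)

for cid, p in precision.items():
    print(cid, round(p, 3))
print("average:", round(average_precision, 2))
```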
Summary
• We investigated six methods for measuring similarity between mementos in a TimeMap:
– cosine similarity of TF-IDF
– Jaccard similarity
– intersection of the 20 most frequent terms
– Web-based kernel function
– change in number of words
– change in content length
• We tested the approaches on a gold standard data set from three Archive-It collections
• We evaluated the best approach on 11 diverse Archive-It collections
Findings
• Combining cosine similarity at threshold 0.10 and change in size using word count at threshold -0.85 gives the best performance
• Cosine similarity at threshold = 0.15 is the best single method
• Using the combined method, we achieved 0.92 average precision on 11 Archive-It collections
Tool for detecting off-topic pages
• A Python command-line tool for suggesting off-topic pages in web archives
– Cosine Similarity
– default threshold is 0.15
– operates on live TimeMaps
Available at https://github.com/yasmina85/OffTopic-Detection
Detecting off-topic pages in an Archive-It collection (Maryland State Docs)
% python detect_off_topic.py -i 1826 -th 0.15
extracting seed list
...
http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org
...
50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org
...
Downloading 4 mementos out of 306
Downloading 14 mementos out of 306
...
Detecting off-topic mementos
Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html/
This was run live after we did the evaluation, so now there are off-topic mementos
Detecting off-topic pages in a single TimeMap
% python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/
Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
...
Downloading 270 mementos out of 270
...
Extracting text from the html
...
Detecting off-topic mementos
Similarity memento_uri
0.0509170839413 http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
0.0 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
0.0368021561791 http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
0.12899637517 http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
...
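The two runs above suggest a simple flow: build a TimeMap URI for a seed, score each memento against the first one, and report those below the threshold. The sketch below mirrors that flow with hypothetical helper names; the URI pattern is the one visible in the log output, the similarity values are copied from the listing above, and the strict "less than" comparison is an assumption about how the threshold is applied.

```python
def timemap_uri(collection_id, seed_uri):
    """Build the link-format TimeMap URI following the pattern in the log output."""
    return f"http://wayback.archive-it.org/{collection_id}/timemap/link/{seed_uri}"

def flag_off_topic(scored_mementos, threshold=0.15):
    """Return the (similarity, URI-M) pairs whose similarity is below the threshold."""
    return [(sim, uri) for sim, uri in scored_mementos if sim < threshold]

# Similarity values copied from the single-TimeMap run above.
scored = [
    (0.0509170839413, "http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/"),
    (0.0, "http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/"),
    (0.0368021561791, "http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/"),
    (0.12899637517, "http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/"),
]

print(timemap_uri(1826, "http://casademaryland.org"))
flagged = flag_off_topic(scored)
print(len(flagged), "of", len(scored), "listed mementos fall below 0.15")
```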
We're continuing work on this
• Enhancements to the detection tool
– add the other similarity methods (WordCount first)
– allow input of local CDX and WARC files
• Investigate characteristics of collections and TimeMaps that affect choosing thresholds
• Detect off-topic seeds (URI-Rs) in a collection
– determine collection aboutness
Python Tool: https://github.com/yasmina85/OffTopic-Detection

How to Prepare and Give and Academic Presentation
Ā 
My Academic Story via Internet Archive
My Academic Story via Internet ArchiveMy Academic Story via Internet Archive
My Academic Story via Internet Archive
Ā 
A Retasking Framework For Wireless Sensor Networks
A Retasking Framework For Wireless Sensor NetworksA Retasking Framework For Wireless Sensor Networks
A Retasking Framework For Wireless Sensor Networks
Ā 
Strategies for Sensor Data Aggregation in Support of Emergency Response
Strategies for Sensor Data Aggregation in Support of Emergency ResponseStrategies for Sensor Data Aggregation in Support of Emergency Response
Strategies for Sensor Data Aggregation in Support of Emergency Response
Ā 
Energy Harvesting-aware Design for Wireless Nanonetworks
Energy Harvesting-aware Design for Wireless NanonetworksEnergy Harvesting-aware Design for Wireless Nanonetworks
Energy Harvesting-aware Design for Wireless Nanonetworks
Ā 
2015-capwic-gradschool
2015-capwic-gradschool2015-capwic-gradschool
2015-capwic-gradschool
Ā 
2015-odu-ece-tools-for-past-web
2015-odu-ece-tools-for-past-web2015-odu-ece-tools-for-past-web
2015-odu-ece-tools-for-past-web
Ā 
Tools for Managing the Past Web
Tools for Managing the Past WebTools for Managing the Past Web
Tools for Managing the Past Web
Ā 
Archive What I See Now - 2014 NEH ODH Overview
Archive What I See Now - 2014 NEH ODH OverviewArchive What I See Now - 2014 NEH ODH Overview
Archive What I See Now - 2014 NEH ODH Overview
Ā 
Bits of Research
Bits of ResearchBits of Research
Bits of Research
Ā 
"Archive What I See Now" - NEH ODH overview
"Archive What I See Now" - NEH ODH overview"Archive What I See Now" - NEH ODH overview
"Archive What I See Now" - NEH ODH overview
Ā 
TDMA Slot Reservation in Cluster-Based VANETs
TDMA Slot Reservation in Cluster-Based VANETsTDMA Slot Reservation in Cluster-Based VANETs
TDMA Slot Reservation in Cluster-Based VANETs
Ā 
Visualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-ItVisualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-It
Ā 
Information Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItInformation Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-It
Ā 

Recently uploaded

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
Ā 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
Ā 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
Ā 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
Ā 
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
UiPathCommunity
Ā 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
Ā 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
Ā 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
Ā 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
Ā 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
Ā 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
Ā 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
Ā 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
Ā 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
Ā 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
Ā 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Ā 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
Ā 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
Ā 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Ā 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Ā 

Recently uploaded (20)

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Ā 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Ā 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Ā 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Ā 
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Ā 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Ā 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
Ā 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Ā 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Ā 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Ā 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
Ā 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Ā 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Ā 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
Ā 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Ā 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Ā 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Ā 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Ā 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Ā 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Ā 

Detecting Off-Topic Web Pages at #CUWARC

  • 1. Tools for Managing Seed URIs (Detecting Off-Topic Pages) Old Dominion University Web Science and Digital Libraries Group http://ws-dl.cs.odu.edu/, @WebSciDL Web Archiving Collaboration: New Tools and Models June 4-5, 2015 Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson Funded by Columbia University Libraries Web Archiving Incentive program
  • 2. Archive-It hosts curated web collections: > 3,000 collections, ~340 institutions, > 10B archived pages
  • 3. Curator's view of Archive-It
  • 4. The collection curator specifies seed URIs
  • 5. Curators specify the breadth and depth of the crawl
  • 6. Current tools measure HTTP events, not "aboutness"
  • 7. Pages can go off-topic through time. May 13, 2012: The page started as on-topic. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 8. Pages can go off-topic through time. May 13, 2012: The page started as on-topic. May 24, 2012: Off-topic due to a database error. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 9. Pages can go off-topic through time. May 13, 2012: The page started as on-topic. May 24, 2012: Off-topic due to a database error. Mar. 21, 2013: Not working because of financial problems. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 10. Pages can go off-topic through time. May 13, 2012: The page started as on-topic. May 24, 2012: Off-topic due to a database error. Mar. 21, 2013: Not working because of financial problems. May 21, 2013: On-topic again. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 11. Pages can go off-topic through time. May 13, 2012: The page started as on-topic. May 24, 2012: Off-topic due to a database error. Mar. 21, 2013: Not working because of financial problems. May 21, 2013: On-topic again. June 5, 2014: The site has been hacked. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 12. Over 60% of archived versions of hamdeensabahy.com are off-topic. May 13, 2012: The page started as on-topic. May 24, 2012: Off-topic due to a database error. Mar. 21, 2013: Not working because of financial problems. May 21, 2013: On-topic again. June 5, 2014: The site has been hacked. Oct. 10, 2014: The domain has expired. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 13. Social media pages can go off-topic. Dec. 22, 2011: Facebook page was relevant to the Occupy collection. http://wayback.archive-it.org/2950/*/http://www.facebook.com/MayorJeanQuan
  • 14. Social media pages can go off-topic. Dec. 22, 2011: Facebook page was relevant to the Occupy collection. Aug. 10, 2012: URI redirects to www.facebook.com. http://wayback.archive-it.org/2950/*/http://www.facebook.com/MayorJeanQuan
  • 15. Classifying web page behavior over time
  • 16. A TimeMap is the list of a URI-R's mementos
  • 17. We identified 5 classes of TimeMaps
      1. Always On: wayback.archive-it.org/2950/*/http://occupypsl.org
      2. Step Function On: wayback.archive-it.org/2950/*/http://occupygso.tumblr.com
      3. Step Function Off: wayback.archive-it.org/2950/*/http://occupyashland.com
      4. Oscillating: wayback.archive-it.org/2950/*/http://www.indyows.org
      5. Always Off: wayback.archive-it.org/2950/*/http://occupy605.com
  • 18. A web page goes off-topic (Step Function On). On-topic: Egyptian Revolution coverage. Off-topic: the domain registration is lost. http://wayback.archive-it.org/2358/*/http://www.7amla.net
  • 19. A web page goes off-topic (Step Function On). On-topic: Egyptian Revolution coverage. Off-topic: the domain registration is lost. http://wayback.archive-it.org/2358/*/http://www.7amla.net
  • 20. A web page goes off-topic and on-topic many times (Oscillating). On-topic: Egyptian Revolution coverage. Off-topic: news about Iraq. Off-topic: news about Syria. On-topic: Egypt news. Off-topic: Palestine. http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
  • 21. A web page goes off-topic and on-topic many times (Oscillating). On-topic: Egyptian Revolution coverage. Off-topic: news about Iraq. Off-topic: news about Syria. Off-topic: news about Syria. On-topic: Egypt news. Off-topic: Palestine. http://wayback.archive-it.org/2358/*/http://www.bbc.co.uk/news/world/middle_east/
  • 22. Most TimeMaps are Always On
      1. Always On: wayback.archive-it.org/2950/*/http://occupypsl.org
      2. Step Function On: wayback.archive-it.org/2950/*/http://occupygso.tumblr.com
      3. Step Function Off: wayback.archive-it.org/2950/*/http://occupyashland.com
      4. Oscillating: wayback.archive-it.org/2950/*/http://www.indyows.org
      5. Always Off: wayback.archive-it.org/2950/*/http://occupy605.com
      Shares shown on the slide: 0-2%, 6-15%, ~0%, 74%, 8-11%
  • 23. Methods for detecting off-topic pages
  • 24. From Archive-It collection to terms
      1. Obtain the seed URIs from the front-end interface of Archive-It
      2. Obtain the TimeMap of the seed URIs from the CDX file*
      3. Extract the HTML of the mementos from the WARC files*
      4. Extract the text of the page using the Boilerpipe library
      5. Extract terms from the page, using scikit-learn to tokenize, remove stop words, and apply stemming
      *locally hosted at ODU
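
Step 5 above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the slide names scikit-learn, whereas the tiny stop-word list and the crude suffix stripper below are stand-ins chosen only so the example is self-contained, and the names `STOP_WORDS`, `crude_stem`, and `extract_terms` are hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_terms(text):
    """Lowercase, tokenize on alphanumerics, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(extract_terms("The protests are spreading in Cairo"))
```

The resulting term lists are the input that the similarity measures on the following slides would operate on.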
  • 25. We investigated 6 similarity metrics
      • Textual Content
          – cosine similarity of TF-IDF
          – intersection of the 20 most frequent terms
          – Jaccard similarity coefficient
      • Semantics
          – Web-based kernel function using a search engine (SE)
      • Structural
          – the change in number of words
          – the change in content length
  • 26. Textual Content: cosine similarity, intersecting the most frequent terms, Jaccard similarity
        Method           Similarity
        cosine           0.7
        TF-Intersection  0.6
        Jaccard          0.5
  • 27. Textual Content: cosine similarity, intersecting the most frequent terms, Jaccard similarity
        Method           Similarity
        cosine           0.7
        TF-Intersection  0.6
        Jaccard          0.5

        Method           Similarity
        cosine           0.0
        TF-Intersection  0.0
        Jaccard          0.0
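
The three textual measures can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the IDF smoothing mirrors scikit-learn's default formula, the TF-Intersection score is normalized by the size of the first memento's top-k set (the slides do not specify the normalization), and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumption): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a, b):
    """Jaccard coefficient of the two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def tf_intersection(a, b, k=20):
    """Share of a's k most frequent terms that are also among b's top k."""
    top_a = {t for t, _ in Counter(a).most_common(k)}
    top_b = {t for t, _ in Counter(b).most_common(k)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0


if __name__ == "__main__":
    first = ["egypt", "tahrir", "protest", "egypt"]
    expired = ["domain", "expired", "renew"]
    vecs = tfidf_vectors([first, expired])
    print(cosine(vecs[0], vecs[1]), jaccard(first, expired), tf_intersection(first, expired))
```

A memento that shares no terms with the first memento scores 0.0 on all three measures, which matches the second set of values on slide 27.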
  • 28. Semantics of the Text: Web-based kernel function using the search engine (SE)
        Feb. 2011: Tahrir, Egypt, army
        July 2013: Cairo, Morsi, protests
        No term-wise overlap
  • 29. Semantics of the Text: Web-based kernel function using the search engine (SE)
        Feb. 2011: Tahrir, Egypt, army (expanded: Egypt, Tahrir, president, protests, army, Cairo)
        July 2013: Cairo, Morsi, protests (expanded: Egypt, protests, Morsi, Cairo, president)
        No term-wise overlap before expansion
        Method     Similarity
        SE-Kernel  0.7
        Technique inspired by Sahami and Heilman, WWW 2006
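
The editor's notes for these slides outline the procedure: take the top 5 terms of the first memento, query a search engine, pull terms from the top 10 snippets, merge them with the original terms, and compare against the candidate with the Jaccard coefficient. A rough sketch of that outline follows. The `search` argument is a hypothetical caller-supplied function returning snippet strings (no real search API is assumed), and whitespace splitting of snippets is a simplification.

```python
from collections import Counter


def expand_terms(terms, search, top_terms=5, top_snippets=10):
    """Expand a term list with words from search-result snippets.

    `search` is a hypothetical callable: list of query terms -> list of snippet strings.
    """
    query = [t for t, _ in Counter(terms).most_common(top_terms)]
    snippets = search(query)[:top_snippets]
    expanded = set(terms)
    for snippet in snippets:
        expanded.update(snippet.lower().split())
    return expanded


def se_kernel_similarity(first_terms, candidate_terms, search):
    """Jaccard coefficient between the expanded first memento and the candidate."""
    expanded = expand_terms(first_terms, search)
    candidate = set(candidate_terms)
    union = expanded | candidate
    return len(expanded & candidate) / len(union) if union else 0.0


if __name__ == "__main__":
    def fake_search(query):
        # Stand-in for a real search engine call (assumption).
        return ["Cairo protests Morsi"]

    print(se_kernel_similarity(["tahrir", "egypt", "army"],
                               ["cairo", "morsi", "protests"], fake_search))
```

Only the first memento is expanded here, following the note that search-engine expansion was applied to the first memento alone.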
  • 30. Structural Methods: no. of words, content-length
        Word counts: 100 → 109
        WordCount change: 0.09 (a 9% increase)
  • 31. Structural Methods: no. of words, content-length
        Word counts: 100 → 109; WordCount change: 0.09 (a 9% increase)
        Word counts: 100 → 5; WordCount change: -0.95 (a 95% decrease)
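
The structural measures reduce to a relative change against the first memento. A minimal sketch, assuming the change is computed as (candidate - first) / first, which reproduces both values on the slide; the function name is hypothetical.

```python
def relative_change(first, candidate):
    """Relative change of a size measure (word count or byte length)."""
    if first == 0:
        raise ValueError("first memento has size 0; relative change is undefined")
    return (candidate - first) / first


if __name__ == "__main__":
    print(relative_change(100, 109))  # small growth
    print(relative_change(100, 5))    # page shrank to a stub
```

The same function would apply to content length in bytes; only the input counts differ.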
  • 32. We built a gold standard data set to evaluate the methods
  • 33. We manually labeled 15,760 mementos
      • Egypt Revolution and Politics: 136 URI-Rs, 6,886 URI-Ms, 384 off-topic URI-Ms
      • Occupy Movement: 255 URI-Rs, 6,570 URI-Ms, 458 off-topic URI-Ms
      • Columbia Univ. Human Rights collection: 198 URI-Rs, 2,304 URI-Ms, 94 off-topic URI-Ms
  • 34. Example of manually labeled set
        id  date            URI                                                                           label
        9   20120124014240  http://wayback.archive-it.org/2950/20120124014240/http://occupysarasota.com/  1
        9   20120131014118  http://wayback.archive-it.org/2950/20120131014118/http://occupysarasota.com/  1
        9   20120207014119  http://wayback.archive-it.org/2950/20120207014119/http://occupysarasota.com/  1
        9   20120501041141  http://wayback.archive-it.org/2950/20120501041141/http://occupysarasota.com/  0
        9   20120508032644  http://wayback.archive-it.org/2950/20120508032644/http://occupysarasota.com/  0
        9   20120515034720  http://wayback.archive-it.org/2950/20120515034720/http://occupysarasota.com/  0
      Future work: convert to annotated/extended TimeMap format
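
Each memento URI (URI-M) in the labeled set embeds the collection ID, a 14-digit capture timestamp, and the original URI (URI-R). The sketch below splits a URI-M of that shape into its parts; the regular expression is inferred from the examples on this slide, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the URI-Ms shown on the slide (assumption).
URI_M = re.compile(r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$")


def parse_uri_m(uri_m):
    """Return (collection_id, capture_datetime, uri_r) for an Archive-It URI-M."""
    match = URI_M.match(uri_m)
    if match is None:
        raise ValueError("not an Archive-It memento URI: " + uri_m)
    collection, stamp, uri_r = match.groups()
    return int(collection), datetime.strptime(stamp, "%Y%m%d%H%M%S"), uri_r


if __name__ == "__main__":
    print(parse_uri_m(
        "http://wayback.archive-it.org/2950/20120124014240/http://occupysarasota.com/"
    ))
```

Sorting mementos by the parsed datetime gives the chronological order needed to compare each candidate against the first memento.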
  • 35. Evaluated 6 methods at 21 thresholds
      • Assumed first memento was on-topic
      • Combined two methods ('OR') to find best combination method
          – 15 combinations
          – 6,615 tests (15 combinations × 21 thresholds × 21 thresholds)
      • Averaged the results at each threshold over the three collections
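
The test count follows from choosing 2 of the 6 methods (15 pairs) and trying every pairing of 21 thresholds for each pair. A short enumeration confirms the arithmetic; the threshold values themselves are not given on the slide, so indices 0 to 20 stand in for them.

```python
from itertools import combinations, product

METHODS = ["Cosine", "Jaccard", "TF-Intersection", "SEKernel", "WordCount", "Bytes"]
N_THRESHOLDS = 21  # threshold values are not listed on the slide; indices stand in

# Every unordered pair of distinct methods.
pairs = list(combinations(METHODS, 2))

# Every (method A, method B, threshold index A, threshold index B) test.
tests = [
    (a, b, i, j)
    for a, b in pairs
    for i, j in product(range(N_THRESHOLDS), repeat=2)
]

if __name__ == "__main__":
    print(len(pairs), len(tests))
```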
  • 36. Evaluated based on 5 metrics
      • False positives (FP) – on-topic labeled as off-topic
      • False negatives (FN) – off-topic labeled as on-topic
      • Accuracy (ACC) – proportion of correct classifications
          – (TP + TN)/(TP + FP + FN + TN)
      • F1 score – harmonic mean of precision and recall
          – 2TP/(2TP + FP + FN)
      • AUC – area under the ROC curve
          – ROC plots false positive rate vs. true positive rate
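
The accuracy and F1 formulas above translate directly into code. The sketch below treats "off-topic" as the positive class, as the FP and FN definitions on the slide do; the counts in the usage example are made up for illustration, and AUC is left out because it needs scores across thresholds rather than a single confusion matrix.

```python
def accuracy(tp, fp, fn, tn):
    """(TP + TN) / (TP + FP + FN + TN)"""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Illustrative counts only, not results from the study.
    print(accuracy(tp=8, fp=2, fn=2, tn=88), f1_score(tp=8, fp=2, fn=2))
```

Because off-topic mementos are rare in these collections, accuracy stays high even for weak methods, which is why F1 is reported alongside it.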
  • 37. Cosine Similarity performed well
        Similarity Measure  Threshold   FP   FN   FP+FN  ACC    F1     AUC
        Cosine|WordCount    0.10|-0.85  24   10   34     0.987  0.906  0.968
        Cosine|SEKernel     0.10|0.00   6    35   40     0.990  0.901  0.934
        Cosine              0.15        31   22   53     0.983  0.881  0.961
        WordCount|SEKernel  -0.80|0.00  14   27   42     0.985  0.818  0.885
        WordCount           -0.85       6    44   50     0.982  0.806  0.870
        SEKernel            0.05        64   83   147    0.965  0.683  0.865
        Bytes               -0.65       28   133  161    0.962  0.584  0.746
        Jaccard             0.05        74   86   159    0.962  0.538  0.809
        TF-Intersection     0.00        49   104  153    0.967  0.537  0.740
  • 38. Finding off-topic pages in other Archive-It collections
  • 39. Applied best method to 11 Archive-It collections
      • Cosine|WordCount with 0.10|-0.85 thresholds
      • Collection characteristics
          – governmental, event-based, theme-based
          – time spans of 1 week to 7 years
          – 35 to 1,459 URI-Rs
          – 118 to 10,283 URI-Ms
  • 40. Average precision of 0.92 on 11 Archive-It collections
        ID    Collection                 URI-Rs  URI-Ms  Off-topic URI-Ms  Affected URI-Rs  TP   FP  P
        2893  Global Food Crisis         65      3063    22                7                22   0   1.000
        1084  Government in Alaska       68      506     16                4                16   0   1.000
        2966  Virginia Tech Shootings    239     1670    24                2                24   0   1.000
        2017  Wikileaks 2010 Document    35      2360    107               8                107  0   1.000
        2323  Jasmine Revolution 2011    231     4076    114               31               107  7   0.939
        1827  IT Historical Resource     1459    10,283  59                34               45   14  0.763
        1475  Human Rights Document      147     1530    54                20               39   15  0.722
        1826  Maryland State Document    69      184     0                 0                -    -   -
        694   April 16 Archive           35      118     0                 0                -    -   -
        2535  Brazilian School Shooting  476     1092    0                 0                -    -   -
        2823  Russia Plane Crash         65      447     0                 0                -    -   -
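
The 0.92 figure can be reproduced from the TP and FP columns of the table, assuming (as the numbers suggest) that the average is taken over the seven collections where the tool flagged at least one memento; precision is undefined for the four collections with no detections.

```python
# (TP, FP) per collection, copied from the table on slide 40.
RESULTS = {
    "Global Food Crisis": (22, 0),
    "Government in Alaska": (16, 0),
    "Virginia Tech Shootings": (24, 0),
    "Wikileaks 2010 Document": (107, 0),
    "Jasmine Revolution 2011": (107, 7),
    "IT Historical Resource": (45, 14),
    "Human Rights Document": (39, 15),
}


def precision(tp, fp):
    """TP / (TP + FP)"""
    return tp / (tp + fp)


# Unweighted mean over collections with at least one detection (assumption).
average_precision = sum(precision(tp, fp) for tp, fp in RESULTS.values()) / len(RESULTS)

if __name__ == "__main__":
    print(round(average_precision, 2))
```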
  • 41. Summary
      • We investigated six methods for measuring similarity between mementos in a TimeMap:
          – cosine similarity of TF-IDF
          – Jaccard similarity
          – intersection of the 20 most frequent terms
          – Web-based kernel function
          – change in number of words
          – change in content length
      • We tested the approaches on a gold standard data set from three Archive-It collections
      • We evaluated the best approach on 11 diverse Archive-It collections
  • 42. Findings
      • Combining cosine similarity at threshold 0.10 and change in size using word count at threshold -0.85 gives the best performance
      • Cosine similarity at threshold 0.15 is the best single method
      • Using the combined method, we achieved 0.92 average precision on 11 Archive-It collections
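
The two decision rules in the findings can be written as small predicates. This is a sketch of the rule as stated in the slides and editor's notes (a candidate is compared against the first memento and flagged when it falls below a threshold); the function and parameter names are hypothetical.

```python
def off_topic_single(cosine_sim, threshold=0.15):
    """Best single method: flag when cosine similarity to the first memento is below 0.15."""
    return cosine_sim < threshold


def off_topic_combined(cosine_sim, word_count_change,
                       cos_threshold=0.10, wc_threshold=-0.85):
    """Best combined method: cosine below 0.10 OR word count down by more than 85%."""
    return cosine_sim < cos_threshold or word_count_change < wc_threshold


if __name__ == "__main__":
    print(off_topic_combined(0.70, 0.09))   # similar text, similar size
    print(off_topic_combined(0.70, -0.95))  # page shrank to a stub
    print(off_topic_combined(0.05, 0.00))   # text no longer similar
```

The OR combination lets the word-count check catch pages that collapse to an error or parking notice even when a few shared terms keep the cosine score above its threshold.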
  • 43. Tool for detecting off-topic pages
      • A Python command-line tool for suggesting off-topic pages in web archives
          – Cosine Similarity
          – default threshold is 0.15
          – operates on live TimeMaps
      Available at https://github.com/yasmina85/OffTopic-Detection
  • 44. Detecting off-topic pages in an Archive-It collection (Maryland State Docs)
        % python detect_off_topic.py -i 1826 -th 0.15
        extracting seed list …
        http://agroecol.umd.edu/Research/index.cfm
        http://casademaryland.org
        …
        50 URIs are extracted from collection https://archive-it.org/collections/1826
        Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
        Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org
        …
        Downloading 4 mementos out of 306
        Downloading 14 mementos out of 306
        …
        Detecting off-topic mementos
        Similarity memento_uri
        0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
        0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html/
      This was run live after we did the evaluation, so now there are off-topic mementos.
  • 45. Detecting off-topic pages in a single TimeMap
        % python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/
        Downloading 0 mementos out of 270
        http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
        http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
        http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
        …
        Downloading 270 mementos out of 270
        …
        Extracting text from the html
        …
        Detecting off-topic mementos
        Similarity memento_uri
        0.0509170839413 http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
        0.0 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
        0.0368021561791 http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
        0.12899637517 http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
        …
  • 46. We're continuing work on this
      • Enhancements to the detection tool
          – add the other similarity methods (WordCount first)
          – allow input of local CDX and WARC files
      • Investigate characteristics of collections and TimeMaps that affect choosing thresholds
      • Detect off-topic seeds (URI-Rs) in a collection
          – determine collection aboutness
  • 47. Tools for Managing Seed URIs (Detecting Off-Topic Pages) Old Dominion University Web Science and Digital Libraries Group http://ws-dl.cs.odu.edu/, @WebSciDL Web Archiving Collaboration: New Tools and Models June 4-5, 2015 Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson Python Tool: https://github.com/yasmina85/OffTopic-Detection

Editor's Notes

  1. First deployed in 2006, Archive-It is a subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content. We go to this effort to collect good seeds for our collections. We specify the archiving period and depth. But once the crawl is running, we don't know what really happens to our seeds. We can tell what types of data we're gathering or whether a page returns a 404, but what if the content changes significantly? Is there any way other than manual inspection to detect off-topic mementos?
  2. (the frequency is tunable by the user), and to what depth (e.g., follow the pages linked to from the seeds two levels out). The Heritrix crawler at Archive-It then recrawls these seeds at the specified frequency and depth, capturing each seed periodically at the times that Lori specified.
  3. There is no tool to detect when a page goes off-topic.
  4. Add the links here
  5. Add the links here
  6. The textual content: cosine similarity, intersection of the most frequent terms, Jaccard coefficient. The semantics of the text: Web-based kernel function using the search engine (SE). Structural methods: the change in number of words, the change in content length.
  7. The textual content: cosine similarity, intersection of the most frequent terms, Jaccard coefficient. The semantics of the text: Web-based kernel function using the search engine (SE). Structural methods: the change in number of words, the change in content length.
  8. The textual content: cosine similarity, intersection of the most frequent terms, Jaccard coefficient. The semantics of the text: Web-based kernel function using the search engine (SE). Structural methods: the change in number of words, the change in content length.
  9. Take the top 5 terms. Extract terms from the top 10 snippets. Combine the original page terms with the snippet terms. Compute the Jaccard coefficient for similarity. Note: the paper says that SE expansion is done only for the first memento; terms from the snippets are combined with the original terms, and the result is compared against the candidate's terms.
  10. Take the top 5 terms. Extract terms from the top 10 snippets. Combine the original page terms with the snippet terms. Compute the Jaccard coefficient for similarity. Note: SE expansion is done only for the first memento; terms from the snippets are combined with the original terms, and the result is compared against the candidate's terms.
  11. Cosine similarity at threshold 0.15 is the best single method: if the cosine similarity between the candidate memento and the first memento is < 0.15, the candidate memento is marked as 'off-topic'. For the combined method: if the cosine similarity between the candidate memento and the first memento is < 0.10, OR the word count has decreased by more than 85% relative to the first memento, the candidate memento is marked as 'off-topic'.
  12. We have shown 98% ACC of the tool. Next, we evaluate the tool on other Archive-It collections for which we do not know the answer
  13. FP - classified as off-topic, but really on-topic