The Off-Topic Memento Toolkit
Shawn M. Jones, Michele C. Weigle, Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Research Group
@WebSciDL
sjone@cs.odu.edu
@shawnmjones
mweigle@cs.odu.edu
@weiglemc
mln@cs.odu.edu
@phonedude_mln
Thanks to:
@shawnmjones @WebSciDL
Many Curators Use Archive-It To Create Web Archive Collections
Archive-It makes it easy for curators to build collections and supply metadata for them.
When Building A Web Archive Collection…
• Curators select web resources as seeds
• Each version of a seed becomes a memento
• Curators create a web archive collection with a purpose in mind
When Researchers Prepare to Analyze a Web Archive Collection…
Some collections have thousands of seeds.
Remember: each seed has one or more mementos.
The sheer number of mementos to process means that researchers need to quickly identify mementos with low information value.
Off-topic mementos have low information value.
We want to identify, not delete, these mementos for further decision-making.
Identifying them lets us exclude them from selection as exemplars for storytelling.
81,014 seeds
486,227 seed mementos
How Can Mementos Go Off-Topic?
Mementos in a Collection Can Go Off-Topic: For Technical Reasons
http://wayback.archive-it.org/1068/20130306212205/http://bo.amnesty.org/
http://wayback.archive-it.org/1068/20120303011104/http://bo.amnesty.org/
Mementos in a Collection Can Go Off-Topic: Page Gone
http://wayback.archive-it.org/1068/20101221161732/http://www.acdauk.org.uk/
http://wayback.archive-it.org/1068/20110902210644/http://www.acdauk.org.uk/
Mementos in a Collection Can Go Off-Topic: Content Drift – A Change in Languages
http://wayback.archive-it.org/1068/20130306231537/http://ecwronline.org/
http://wayback.archive-it.org/1068/20110129043404/http://ecwronline.org/
Mementos in a Collection Can Go Off-Topic: Server Maintenance
http://wayback.archive-it.org/1068/20111202210620/http://amnestyghana.org/
http://wayback.archive-it.org/1068/20120302232416/http://amnestyghana.org/
Mementos in a Collection Can Go Off-Topic: Account Suspension
http://wayback.archive-it.org/1068/20110317151735/http://amnestymauritius.org/french/news.php
http://wayback.archive-it.org/1068/20111202210625/http://amnestymauritius.org/cgi-sys/suspendedpage.cgi
Mementos in a Collection Can Go Off-Topic: Site Redesign
http://wayback.archive-it.org/1068/20120302224302/http://ombuds.am/main/
http://wayback.archive-it.org/1068/20100510173253/http://ombuds.am/main
Mementos in a Collection Can Go Off-Topic: Change in Site Ownership
http://wayback.archive-it.org/1068/20090210190543/http://www.afapredesa.org/index.php
http://wayback.archive-it.org/1068/20120302210439/http://www.afapredesa.org/
Mementos in a Collection Can Go Off-Topic: The Site Was Hacked
http://wayback.archive-it.org/2950/20120327032244/http://occupyevansville.org/
http://wayback.archive-it.org/2950/20120410032628/http://occupyevansville.org/
Mementos in a Collection Can Go Off-Topic: The Site Moves On From The Topic
http://wayback.archive-it.org/2358/20120803140009/http://www.bbc.co.uk/news/world/middle_east/
http://wayback.archive-it.org/2358/20110202225040/http://www.bbc.co.uk/news/world/middle_east/
Presenting the Off-Topic Memento Toolkit (OTMT):
a tool for identifying these off-topic mementos
The Off-Topic Memento Toolkit (OTMT)
Currently in alpha status, the OTMT:
• Accepts a collection of mementos
• Executes similarity measures on those mementos
• Rates them as on-topic or off-topic
• Identifies, but does not delete, off-topic mementos
https://github.com/oduwsdl/off-topic-memento-toolkit
Background and Related Work
Related Work – Similarity Measures for Documents
OTMT supports these similarity measures:
• Simhash: Charikar (2002), Manku (2007)
• Sørensen-Dice Coefficient: Dice (1945), Sørensen (1948)
• Jaccard Index: Jaccard (1912)
• Cosine Similarity of TF-IDF Vectors
• Cosine Similarity of Latent Semantic Indexing Vectors: Deerwester (1990)
Other cited work: Hajishirzi (2010), Adar (2009), Sivakumar (2015), Řehůřek (2011)
Content Drift in Web Archives: Jones (2016), Zittrain (2014)
Like these studies, we also use these similarity measures on mementos.
Related Work – Other Methods of Off-Topic Detection
• Latent Dirichlet Allocation, Blei (2003): Topic modeling should help us find off-topic documents, but which cluster is off-topic?
• Browser Thumbnails of Mementos, AlSum (2012): It is costly to manually review browser thumbnails to find off-topic mementos.
• Off-Topic Analysis, AlNoamany (2016): We build on AlNoamany’s work to bring you the Off-Topic Memento Toolkit.
Memento Protocol Terminology
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";
datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 04:41:56 GMT"
…
Each seed, or original resource, has a corresponding TimeMap that lists all of that seed’s mementos and their capture times, known as memento-datetimes.
URI-T: a URI for a TimeMap
URI-M: a URI for a memento
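As a rough illustration of this terminology, the sketch below pulls the URI-Ms and memento-datetimes out of a link-format TimeMap shaped like the example above. The helper name `parse_timemap` and its regular expression are assumptions made for illustration; they are not part of the OTMT or of the Memento specification, and a production parser would need to cover more of the link-format grammar.

```python
import re
from datetime import datetime

# Hypothetical pattern: one link-format entry that carries a rel and a datetime.
ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]*)"\s*;\s*datetime="([^"]+)"')

def parse_timemap(text):
    """Return (URI-M, memento-datetime) pairs, oldest first."""
    mementos = []
    for uri, rel, dt in ENTRY.findall(text):
        # rel may hold several values, e.g. "first memento".
        if "memento" in rel.split():
            when = datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((uri, when))
    return sorted(mementos, key=lambda pair: pair[1])

timemap = '''<http://a.example.org>;rel="original",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 01:17:31 GMT"'''

for uri_m, memento_datetime in parse_timemap(timemap):
    print(memento_datetime.isoformat(), uri_m)
```

Entries without a datetime, such as the original resource and the TimeGate, are skipped because they are not mementos.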
Web Archives Augment Their Mementos
Augmentations include banners and rewritten links.
http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
The OTMT Uses Raw Mementos
Raw mementos are free of these augmentations.
Archive-It and the Internet Archive provide access
to raw mementos at special URIs.
The OTMT finds these raw mementos and uses
them in its similarity comparisons.
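A minimal sketch of what such a special URI can look like, assuming the common Wayback convention in which `id_` is appended to the 14-digit timestamp of a URI-M to request the unaugmented capture. The function name `raw_urim` is hypothetical, and the OTMT's own lookup logic may differ.

```python
import re

# Assumption: Wayback-style URI-Ms contain a 14-digit timestamp path segment,
# and appending "id_" to it requests the raw (unaugmented) memento.
TIMESTAMP = re.compile(r"/(\d{14})/")

def raw_urim(urim):
    """Rewrite a Wayback-style URI-M so that it points at the raw memento."""
    return TIMESTAMP.sub(r"/\1id_/", urim, count=1)

print(raw_urim("http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/"))
```

A URI-M that already carries `id_` is left unchanged, because the pattern only matches a timestamp that is immediately followed by a slash.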
The OTMT Performs Preprocessing
<p class="homepage-description">The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.</p>
1. Boilerplate removal
The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.
2. Tokenization
['The', 'Women', '’', 's', 'Initiatives', 'for', 'Gender', 'Justice', 'works', 'globally', 'to', 'ensure', 'justice', 'for', 'women', 'and', 'an', 'independent', 'and', 'effective', 'International', 'Criminal', 'Court', '.']
3. Remove stop words
['Women', '’', 'Initiatives', 'Gender', 'Justice', 'works', 'globally', 'ensure', 'justice', 'women', 'independent', 'effective', 'International', 'Criminal', 'Court']
4. Stemming
['women', '’', 'initi', 'gender', 'justic', 'work', 'global', 'ensur', 'justic', 'women', 'independ', 'effect', 'intern', 'crimin', 'court']
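The tokenization and stop word steps can be sketched with the standard library alone. This is an illustration, not the OTMT's code: the tiny `STOP_WORDS` set and both function names are assumptions chosen so that the slide's example sentence yields the token lists shown, and boilerplate removal and stemming are left to dedicated libraries.

```python
import re

# Illustrative stop word list only; a real run would use a full list.
STOP_WORDS = {"the", "s", "for", "to", "and", "an"}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop stop words (case-insensitive) and sentence-ending periods."""
    return [t for t in tokens if t.lower() not in STOP_WORDS and t != "."]

text = ("The Women’s Initiatives for Gender Justice works globally to ensure "
        "justice for women and an independent and effective International "
        "Criminal Court.")

tokens = tokenize(text)
print(tokens)
print(remove_stop_words(tokens))
```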
We Evaluated the OTMT with a Gold Standard Dataset
• In “Detecting off-topic pages within TimeMaps in Web archives”, AlNoamany performed a study to detect off-topic mementos
• The mementos were manually marked as on-topic or off-topic
• We reuse this dataset in our evaluation
https://github.com/oduwsdl/offtopic-goldstandard-data
TimeMap Measures Supported by OTMT
General algorithm
For each TimeMap in a collection:
1. Get the first memento
2. Preprocess it
3. For each memento in the TimeMap:
   1. Get the memento
   2. Preprocess it
   3. Compute the similarity to the first memento using a given measure
   4. Save the score
   5. A threshold value determines if the memento is on-topic or off-topic
(Figure: the first memento compared with the considered memento)
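The loop above can be sketched as follows. This is a simplified illustration, not the OTMT's implementation: `rate_timemap`, `word_overlap`, the strict comparison against the threshold, and the toy documents are all assumptions, and fetching and preprocessing are omitted.

```python
def rate_timemap(mementos, measure, threshold, higher_means_similar):
    """Score every memento against the first one and label it."""
    first = mementos[0]
    results = []
    for considered in mementos:
        score = measure(first, considered)
        # Some measures grow with similarity, others with distance.
        if higher_means_similar:
            off_topic = score < threshold
        else:
            off_topic = score > threshold
        results.append((score, "off-topic" if off_topic else "on-topic"))
    return results

def word_overlap(a, b):
    """Toy similarity: share of the first document's words found in the other."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a)

timemap = [
    "gender justice international criminal court",
    "gender justice court news",
    "account suspended contact provider",
]

for score, status in rate_timemap(timemap, word_overlap, 0.5, True):
    print(round(score, 2), status)
```

The `higher_means_similar` flag exists because the supported measures disagree on direction: cosine scores fall as documents diverge, while distance scores rise.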
Structural Measures – Byte Count and Word Count
On-topic:
9599 bytes
183 words (after preprocessing)
Off-topic:
401 bytes
22 words (after preprocessing)
Off-topic mementos tend to have fewer bytes and words than on-topic mementos
Scores range from 0 to -1
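The slide does not spell out the formula. One reading that is consistent with the stated range is the relative change in size against the first memento, sketched below. Treat `relative_change` as an assumption for illustration rather than the OTMT's exact computation; note that a memento larger than the first one would produce a positive value under this reading.

```python
def relative_change(first_count, considered_count):
    """Relative change in size, compared with the first memento."""
    return (considered_count - first_count) / first_count

# Byte counts and word counts from the slide's on-topic and off-topic examples.
print(round(relative_change(9599, 401), 3))
print(round(relative_change(183, 22), 3))
```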
Set Operation Measures
Jaccard Distance: 1 minus the size of the intersection over the size of the union
Sørensen-Dice Distance: 1 minus twice the size of the intersection over the combined size of both sets
Scores range from 0 to 1
Words from Doc #1:
['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda', 'democrat', 'republ', 'congo', 'libya']
Words from Doc #2:
['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat', 'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya', 'libya', 'kyrgyzstan']
The intersection is the set of words that appear in both lists.
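To make the two set measures concrete, the sketch below computes both distances for the word lists on this slide. The function names are chosen for illustration, and the distances are taken as 1 minus the usual Jaccard index and Sørensen-Dice coefficient, which matches a score of 0 for identical sets.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of tokens."""
    set_a, set_b = set(a), set(b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)

def sorensen_dice_distance(a, b):
    """1 - 2|A ∩ B| / (|A| + |B|) over the sets of tokens."""
    set_a, set_b = set(a), set(b)
    return 1 - 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

doc1 = ['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda',
        'democrat', 'republ', 'congo', 'libya']
doc2 = ['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat',
        'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya',
        'libya', 'kyrgyzstan']

print(round(jaccard_distance(doc1, doc2), 3))
print(round(sorensen_dice_distance(doc1, doc2), 3))
```

The two documents share 11 distinct words out of 17 in the union, so the Jaccard distance is 6/17; the Sørensen-Dice distance is 6/28 because the sets hold 12 and 16 distinct words.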
Simhash of Term Frequencies
Simhash of Terms and Frequencies from Document #1: 13221438115839111206
('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('republ', 2), ('human', 1), …
Simhash of Terms and Frequencies from Document #2: 13797903006343525414
('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('human', 1), ('right', 1), …
Simhash Distance: 6 bits
Scores range from 0 to 64 bits
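The Simhash distance is the number of bit positions in which two 64-bit hashes differ. A minimal sketch, assuming the hashes are already computed as on this slide (the function name `simhash_distance` is chosen for illustration):

```python
def simhash_distance(hash_a, hash_b):
    """Hamming distance: count of differing bits between two 64-bit hashes."""
    return bin((hash_a ^ hash_b) & 0xFFFFFFFFFFFFFFFF).count("1")

# The two term-frequency Simhash values shown on the slide.
print(simhash_distance(13221438115839111206, 13797903006343525414))  # 6
```

Identical hashes give 0 and hashes that differ in every position give 64, which is where the 0 to 64 range comes from.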
Simhash of Raw Content
Document #1:
The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women and communities most affected by the armed conflict with a focus on countries with situations under investigation by the ICC. The Women’s Initiatives for Gender Justice currently works in Uganda, the Democratic Republic of the Congo and Libya.
Simhash of Document #1: 12358429319379250844
Document #2:
The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women most affected by the conflict situations under investigation by the ICC. The Women’s Initiatives for Gender Justice works in Uganda, the Democratic Republic of the Congo, Sudan, the Central African Republic, Kenya, Libya and Kyrgyzstan.
Simhash of Document #2: 12359555184926328508
Simhash Distance: 6 bits
Scores range from 0 to 64 bits
Cosine Similarities
Take the cosine of the angle between the document vectors.
Cosine of TF-IDF: vectors are formed from each document's term frequencies, weighted by inverse document frequency.
Cosine of Latent Semantic Indexing (LSI): each vector is informed by LSI.
Scores range from 1 (fully equivalent) to 0 (fully dissimilar).
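The cosine step itself can be sketched with plain term-frequency vectors. This is a simplification for illustration: the sketch uses raw counts, whereas the measures above first weight the vectors with TF-IDF or transform them with LSI, and the function name is an assumption.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # A Counter returns 0 for terms it has not seen.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity(['women', 'justic', 'court'], ['women', 'justic', 'court']))
print(cosine_similarity(['women', 'justic', 'court'], ['account', 'suspend']))
```

Documents with the same term distribution score close to 1, and documents with no terms in common score 0.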
Using the OTMT
OTMT Installation Options
1. pip from PyPI (preferred):
pip install otmt
2. Experimental Docker image:
docker pull shawnmjones/otmt
3. Source code:
git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
OTMT Usage
# detect_off_topic -i archiveit=7877 -tm jaccard=0.80,bytecount=-0.50 -o outputfile.json
In this command, -i names the input, -tm names the measures and their thresholds, and -o names the output file.
Input types for -i:
• timemap – followed by 1 or more TimeMap URIs, separated by commas
• warc – followed by 1 or more WARC files, separated by commas
• archiveit – followed by an Archive-It collection ID
TimeMap measures for -tm:
• bytecount
• wordcount
• jaccard
• sorensen
• simhash-tf
• simhash-raw
• cosine
• gensim_lsi
Output types for -ot:
• json
• csv
OTMT Output - JSON
"http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
        "timemap measures": {
            "cosine": {
                "stemmed": true,
                "tokenized": true,
                "removed boilerplate": true,
                "comparison score": 0.10969941307631487,
                "topic status": "off-topic"
            },
            "bytecount": {
                "stemmed": false,
                "tokenized": false,
                "removed boilerplate": false,
                "comparison score": 0.15971409055425445,
                "topic status": "on-topic"
            }
        },
        "overall topic status": "off-topic"
    }, ...
• Outer key: URI-T of the TimeMap
• Inner key: URI-M of the memento
• "timemap measures": measure information
• "stemmed", "tokenized", "removed boilerplate": preprocessing status
• "comparison score": measure score
• "topic status": on-topic or off-topic status by measure
• "overall topic status": on-topic or off-topic status overall
If any one measure scores a memento as off-topic, the memento is considered off-topic.
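A short sketch of how a researcher might read this output. It assumes the file is a complete JSON object shaped like the fragment above; the helper name `off_topic_urims` is hypothetical, and the sample string reuses the keys and values from the slide, trimmed to the fields the sketch needs.

```python
import json

# Sample report shaped like the slide's output; values are copied from it.
sample = """
{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {"comparison score": 0.10969941307631487, "topic status": "off-topic"},
        "bytecount": {"comparison score": 0.15971409055425445, "topic status": "on-topic"}
      },
      "overall topic status": "off-topic"
    }
  }
}
"""

def off_topic_urims(report):
    """Yield (URI-M, measures that flagged it) for every off-topic memento."""
    for urit, mementos in report.items():
        for urim, details in mementos.items():
            if details["overall topic status"] == "off-topic":
                flagged = [name
                           for name, result in details["timemap measures"].items()
                           if result["topic status"] == "off-topic"]
                yield urim, flagged

for urim, flagged in off_topic_urims(json.loads(sample)):
    print(urim, flagged)
```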
Supported Similarity Measures
Measure                                Fully Equivalent Score   Fully Dissimilar Score   Preprocessing Performed   OTMT -tm keyword
Byte Count                             0.0                      -1.0                     No                        bytecount
Word Count                             0.0                      -1.0                     Yes                       wordcount
Jaccard Distance                       0.0                      1.0                      Yes                       jaccard
Sørensen-Dice                          0.0                      1.0                      Yes                       sorensen
Simhash of Term Frequencies            0                        64                       Yes                       simhash-tf
Simhash of raw memento                 0                        64                       No                        simhash-raw
Cosine Similarity of TF-IDF Vectors    1.0                      0                        Yes                       cosine
Cosine Similarity of LSI Vectors       1.0                      0                        Yes                       gensim_lsi
Establishing Reasonable Defaults
Experiment setup
For each measure:
1. Start the threshold at the score of complete dissimilarity
2. Test with the URI-Ms from the gold standard dataset as if that threshold indicated off-topic
3. Compute F1 using the real off-topic status of each memento from the gold standard data
4. Increment the threshold
5. Repeat steps 2–4 until the threshold matches the score of complete equivalence
Example using Byte Count:
1. Start the threshold at -1
2. Test with the URI-Ms from the gold standard dataset as if -1 indicated off-topic
3. Compute F1 using the real off-topic status of each memento from the gold standard data
4. Increment the threshold to -0.99
5. Test with the URI-Ms from the gold standard dataset as if -0.99 indicated off-topic
6. Compute F1 with the real status
7. Increment the threshold to -0.98
8. Repeat until the threshold is 0
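The sweep described above can be sketched for a byte-count-style measure as follows. The function names, the strict "score below threshold means off-topic" rule, and the toy scores are assumptions for illustration; the real experiment ran against the gold standard dataset.

```python
def f1_score(predicted, actual):
    """F1 of the off-topic class from parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def sweep_thresholds(scores, actual_off_topic, start=-1.0, stop=0.0, steps=100):
    """Try each threshold from start to stop; return the best (threshold, F1)."""
    best_threshold, best_f1 = start, 0.0
    for i in range(steps + 1):
        # Derive each threshold from the step index to avoid float drift.
        threshold = round(start + i * (stop - start) / steps, 2)
        predicted = [score < threshold for score in scores]
        f1 = f1_score(predicted, actual_off_topic)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Toy data: two clearly shrunken mementos that are labeled off-topic.
scores = [-0.96, -0.90, -0.10, -0.05, 0.0]
labels = [True, True, False, False, False]
print(sweep_thresholds(scores, labels))
```

When several thresholds tie on F1, this sketch keeps the first one it meets, which is the one closest to complete dissimilarity.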
Our results do not match AlNoamany’s, but the world is not the same as it was in 2015…
                             AlNoamany’s Study    Our Study
Year Conducted               2015                 2017
Boilerplate Removal          Boilerpipe (Java)    jusText
Tokenization and Stemming    Scikit-learn         NLTK
Other changes:
• Download errors
• Gold Standard Dataset updates
Simhash of Term Frequencies
Our Results: [chart]
AlNoamany’s Results: Not tested
Simhash of Raw Memento
Our Results: [chart]
AlNoamany’s Results: Not tested
Sørensen-Dice Distance Results
Our Results: [chart]
AlNoamany’s Results: Not tested
Jaccard Distance Results
Our Results: [chart]
AlNoamany’s Results: Best F1 Score: 0.538, Threshold: 0.95
Cosine Similarity of LSI Vectors
Our Results: [chart]
Note: LSI scores are non-deterministic
AlNoamany’s Results: Not tested
Byte Count Results
Our Results: [chart]
AlNoamany’s Results: Best F1 Score: 0.584, Threshold: -0.65
Cosine Similarity of TF-IDF Vectors
Our Results: [chart]
AlNoamany’s Results: Best F1 Score: 0.881, Threshold: 0.15 (the best score in AlNoamany’s results)
Word Count Results
Our Results: [chart] (the best score in our results)
AlNoamany’s Results: Best F1 Score: 0.806, Threshold: -0.85
Results Summarized – Our Best F1 Score is Word Count
                                       AlNoamany's Results                  Results of this study
Similarity Measure                     Best F1   Accuracy   Threshold       Best F1   Accuracy   Threshold
Word Count                             0.806     0.982      -0.85           0.788     0.971      -0.7
Cosine Similarity of TF-IDF Vectors    0.881     0.983      0.15            0.766     0.965      0.12
Byte Count                             0.584     0.962      -0.65           0.756     0.965      -0.39
Cosine Similarity of LSI Vectors       Not tested                           0.711     0.965      0.12 with 10 topics
Jaccard Distance                       0.538     0.962      0.95            0.651     0.953      0.94
Sørensen-Dice Distance                 Not tested                           0.649     0.953      0.88
Simhash on raw memento content         Not tested                           0.578     0.934      25
Simhash on TF                          Not tested                           0.523     0.942      28
Accuracy and threshold are the values that correspond to the best F1 score.
In our study, word count produced the best F1 score.
In AlNoamany’s study, cosine similarity of TF-IDF vectors produced the best F1 score.
What about using measures together?
AlNoamany found that using cosine similarity of TF-IDF vectors and word count together produced even better results.
Our best F1 score for word count alone was 0.788.
Word count combined with LSI turned out to be slightly better, with the same accuracy.
The success of word count appears to influence the threshold of its partner measure, making that threshold stricter.
The Future of OTMT
Improving the OTMT
• Bug fixes
• Make LSI scores reproducible
• New measures
  • TimeMap measures (compare the first memento with the considered memento):
    • Spamsum of the raw content, used by Andy Jackson at the UKWA
    • Cosine of LDA vectors via Gensim
  • Collection measures:
    1. Develop a collection-wide picture
    2. Compare each memento against that picture
• Control over preprocessing:
  • Options to use a different boilerplate removal method
  • Options to turn off stemming or stop word removal
Conclusion
Motivation – Mementos Can Go Off-Topic
• Collections have a theme
• Seeds are selected to support that theme
• Mementos are versions of seeds
• Some of these versions are off-topic
• Identifying these off-topic mementos is key to some research activities, like summarization
Examples: Web Page Gone, Account Suspension, Hacked, Moved on from topic
OTMT supports different similarity measures, with thresholds established through experimentation
• Byte count
• Word count
• Jaccard distance
• Sørensen-Dice distance
• Simhash of term frequencies
• Simhash of raw memento content
• Cosine similarity of TF-IDF vectors
• Cosine similarity of LSI vectors
Please try out the Off-Topic Memento Toolkit!
1. pip (preferred):
pip install otmt
2. Experimental Docker image:
docker pull shawnmjones/otmt
3. Source code:
git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
https://github.com/oduwsdl/off-topic-memento-toolkit
https://github.com/oduwsdl/offtopic-goldstandard-data

The Off-Topic Memento Toolkit

  • 1.
    The Off-Topic Memento Toolkit ShawnM. Jones Michele C. Weigle Michael L. Nelson Old Dominion University Web Science and Digital Libraries Research Group @WebSciDL sjone@cs.odu.edu @shawnmjones mweigle@cs.odu.edu @weiglemc mln@cs.odu.edu @phonedude_mln Thanks to:
  • 2.
    @shawnmjones @WebSciDL Many CuratorsUse Archive-It To Create Web Archive Collections 2 Archive-It makes it easy for curators to build collections and supply metadata for a collection.
  • 3.
    @shawnmjones @WebSciDL When BuildingA Web Archive Collection…  Curators select web resources as seeds  Each version of a seed becomes a memento 3
  • 4.
    @shawnmjones @WebSciDL When BuildingA Web Archive Collection…  Curators select web resources as seeds  Each version of a seed becomes a memento  They create a web archive collection with a purpose in mind 4
  • 5.
    @shawnmjones @WebSciDL When ResearchersPrepare to Analyze a Web Archive Collection… 5 Some collections have thousands of seeds. Remember: Each seed has one or more mementos. The sheer number of mementos to process means that researchers will need to quickly identify mementos with low information value. Off-topic mementos have low information value. We want to identify, not delete, these for further decision-making. We identify them to not consider them for selection as exemplars for storytelling. 81,014 seeds 486,227 seed mementos
  • 6.
    @shawnmjones @WebSciDL How CanMementos Go Off-Topic? 6
  • 7.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: For Technical Reasons 7 http://wayback.archive-it.org/1068/20130306212205/http://bo.amnesty.org/ http://wayback.archive-it.org/1068/20120303011104/http://bo.amnesty.org/
  • 8.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: Page Gone 8 http://wayback.archive-it.org/1068/20101221161732/http://www.acdauk.org.uk/ http://wayback.archive-it.org/1068/20110902210644/http://www.acdauk.org.uk/
  • 9.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: Content Drift – A Change in Languages 9 http://wayback.archive-it.org/1068/20130306231537/http://ecwronline.org/ http://wayback.archive-it.org/1068/20110129043404/http://ecwronline.org/
  • 10.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: Server Maintenance 10 http://wayback.archive-it.org/1068/20111202210620/http://amnestyghana.org/ http://wayback.archive-it.org/1068/20120302232416/http://amnestyghana.org/
  • 11.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: Account Suspension 11 http://wayback.archive-it.org/1068/20110317151735/http://amnestymauritius.org/french/news.php http://wayback.archive-it.org/1068/20111202210625/http://amnestymauritius.org/cgi-sys/suspendedpage.cgi
  • 12.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: Site Redesign 12 http://wayback.archive-it.org/1068/20120302224302/http://ombuds.am/main/ http://wayback.archive-it.org/1068/20100510173253/http://ombuds.am/main
  • 13.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: Change Site Ownership 13 http://wayback.archive-it.org/1068/20090210190543/http://www.afapredesa.org/index.php http://wayback.archive-it.org/1068/20120302210439/http://www.afapredesa.org/
  • 14.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: The Site Was Hacked 14 http://wayback.archive-it.org/2950/20120327032244/http://occupyevansville.org/ http://wayback.archive-it.org/2950/20120410032628/http://occupyevansville.org/
  • 15.
    @shawnmjones @WebSciDL Mementos ina Collection Can Go Off-Topic: The Site Moves On From The Topic 15 http://wayback.archive-it.org/2358/20120803140009/http://www.bbc.co.uk/news/world/middle_east/ http://wayback.archive-it.org/2358/20110202225040/http://www.bbc.co.uk/news/world/middle_east/
  • 16.
    @shawnmjones @WebSciDL Presenting theOff-Topic Memento Toolkit (OTMT) a tool for identifying these off-topic mementos 16
  • 17.
    @shawnmjones @WebSciDL The Off-TopicMemento Toolkit (OTMT)  Currently in alpha status, the OTMT  Accepts a collection of mementos  Executes similarity measures on those mementos  Rates them as on or off-topic  Identifies, does not delete, off- topic mementos 17 https://github.com/oduwsdl/off-topic-memento-toolkit
  • 18.
  • 19.
    @shawnmjones @WebSciDL Related Work– Similarity Measures for Documents 19 Manku (2007) Sorensen (1948) Dice (1945) Jaccard (1912) Simhash Charikar (2002) Sørensen-Dice Coefficient Jaccard Index Hajishirzi (2010) Cosine Similarity of TF-IDF Vectors Cosine Similarity of Latent Semantic Indexing Vectors Deerweister (1990) Adar (2009) Sivakumar (2015) Řehůřek (2011) Content Drift in Web Archives Jones (2016) Zittrain (2014)
  • 20.
    @shawnmjones @WebSciDL Related Work– Similarity Measures for Documents 20 Manku (2007) Sorensen (1948) Dice (1945) Jaccard (1912) Simhash Charikar (2002) Sørensen-Dice Coefficient Jaccard Index Hajishirzi (2010) Cosine Similarity of TF-IDF Vectors Cosine Similarity of Latent Semantic Indexing Vectors Deerweister (1990) OTMT supports these similarity measures Adar (2009) Sivakumar (2015) Řehůřek (2011) Content Drift in Web Archives Jones (2016) Zittrain (2014)
  • 21.
    @shawnmjones @WebSciDL Related Work– Similarity Measures for Documents 21 Manku (2007) Sorensen (1948) Dice (1945) Jaccard (1912) Simhash Charikar (2002) Sørensen-Dice Coefficient Jaccard Index Hajishirzi (2010) Cosine Similarity of TF-IDF Vectors Cosine Similarity of Latent Semantic Indexing Vectors Deerweister (1990) OTMT supports these similarity measures Adar (2009) Sivakumar (2015) Řehůřek (2011) Like these studies, we also use these similarity measures on mementos Content Drift in Web Archives Jones (2016) Zittrain (2014)
  • 22.
    @shawnmjones @WebSciDL Related Work– Other Methods of Off-Topic Detection 22 AlNoamany (2016) Latent Dirichlet Allocation Blei (2003) Browser Thumbnails of Mementos AlSum (2012) Off-Topic Analysis
  • 23.
    @shawnmjones @WebSciDL Related Work– Other Methods of Off-Topic Detection 23 AlNoamany (2016) Latent Dirichlet Allocation Blei (2003) Browser Thumbnails of Mementos AlSum (2012) Off-Topic Analysis Topic modeling should help us find off-topic documents, but which cluster is off- topic?
  • 24.
    @shawnmjones @WebSciDL Related Work– Other Methods of Off-Topic Detection 24 AlNoamany (2016) Latent Dirichlet Allocation Blei (2003) Browser Thumbnails of Mementos AlSum (2012) Off-Topic Analysis Topic modeling should help us find off-topic documents, but which cluster is off- topic? It is costly to manually review browser thumbnails to find off-topic mementos
  • 25.
    @shawnmjones @WebSciDL Related Work– Other Methods of Off-Topic Detection 25 AlNoamany (2016) Latent Dirichlet Allocation Blei (2003) Browser Thumbnails of Mementos AlSum (2012) Off-Topic Analysis We build on AlNoamany’s work to bring you the Off-Topic Memento Toolkit Topic modeling should help us find off-topic documents, but which cluster is off- topic? It is costly to manually review browser thumbnails to find off-topic mementos
  • 26.
    @shawnmjones @WebSciDL Memento ProtocolTerminology <http://a.example.org>;rel="original", <http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format" ; from="Tue, 20 Jun 2000 18:02:59 GMT" ; until="Wed, 21 Jun 2000 04:41:56 GMT", <http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate", <http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT", <http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT", <http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT", <http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 04:41:56 GMT" … 26 Each seed, or original resource, has a corresponding TimeMap listing all of that seed’s mementos and capture times, their memento- datetimes. URI-T: a URI for a TimeMap URI-M: a URI for a memento
  • 27.
    @shawnmjones @WebSciDL Web ArchivesAugment Their Mementos 27 Banners Rewritten Links http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
  • 28.
    @shawnmjones @WebSciDL The OTMTUses Raw Mementos 28 Raw mementos are free of these augmentations. Archive-It and the Internet Archive provide access to raw mementos at special URIs. The OTMT finds these raw mementos and uses them in its similarity comparisons. http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
  • 29.
    @shawnmjones @WebSciDL The OTMTPerforms Preprocessing 29 <p class=“homepage-description”>The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.</p> ['The', 'Women', '’', 's', 'Initiatives', 'for', 'Gender', 'Justice', 'works', 'globally', 'to', 'ensure', 'justice', 'for', 'women', 'and', 'an', 'independent', 'and', 'effective', 'International', 'Criminal', 'Court', '.'] Tokenization Remove stop words ['Women', '’', 'Initiatives', 'Gender', 'Justice', 'works', 'globally', 'ensure', 'justice', 'women', 'independent', 'effective', 'International', 'Criminal', 'Court'] Stemming ['women', '’', 'initi', 'gender', 'justic', 'work', 'global', 'ensur', 'justic', 'women', 'independ', 'effect', 'intern', 'crimin', 'court'] Boilerplate removal The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.
  • 30.
    @shawnmjones @WebSciDL We Evaluatedthe OTMT with a Gold Standard Dataset  In “Detecting off-topic pages within TimeMaps in Web archives”, AlNoamany performed a study to detect off-topic Mementos  The mementos were manually marked as on or off-topic  We reuse this dataset in our evaluation 30 https://github.com/oduwsdl/offtopic-goldstandard-data
  • 31.
  • 32.
    @shawnmjones @WebSciDL General algorithm For each TimeMap in a collection 1. Get the first memento 2. Preprocess it 3. For each memento in the TimeMap 1. Get the memento 2. Preprocess it 3. Compute the similarity to the first memento using a given measure 4. Save the score 5. A threshold value determines if a memento is on or off-topic 32 First memento Considered memento
  • 33.
    @shawnmjones @WebSciDL Structural Measures– Byte Count and Word Count 33 On-topic: 9599 bytes 183 words (after preprocessing) Off-topic: 401 bytes 22 words (after preprocessing) Off-topic mementos tend to have less bytes/words Scores range from 0 to -1
  • 34.
    @shawnmjones @WebSciDL Set OperationMeasures 34 Jaccard Distance Sørensen-Dice Distance Size of Intersection over size of union Twice the size of intersection over size of both sets Scores range from 0 to 1 ['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda', 'democrat', 'republ', 'congo', 'libya'] ['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat', 'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya', 'libya', 'kyrgyzstan'] Highlighted words are the intersection Words from Doc #1: Words from Doc #2:
  • 35.
    Simhash of Term Frequencies
    Terms and frequencies from Document #1: ('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('republ', 2), ('human', 1), …
    Simhash of terms and frequencies from Document #1: 13221438115839111206
    Terms and frequencies from Document #2: ('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('human', 1), ('right', 1), …
    Simhash of terms and frequencies from Document #2: 13797903006343525414
    Simhash Distance: 6 bits
    Scores range from 0 to 64 bits
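The Simhash distance is the number of bit positions in which two 64-bit fingerprints differ. A minimal sketch using the two fingerprints from this slide follows; the function name is illustrative, and computing the fingerprints themselves is left to a Simhash library.

```python
def simhash_distance(a, b):
    """Count the bit positions where two integer fingerprints differ."""
    return bin(a ^ b).count("1")

doc1_hash = 13221438115839111206
doc2_hash = 13797903006343525414
print(simhash_distance(doc1_hash, doc2_hash))  # 6
```

The XOR leaves a 1 bit wherever the fingerprints disagree, so counting those bits gives the distance, which can never exceed 64 for 64-bit fingerprints.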
  • 36.
    Simhash of Raw Content
    Document #1: The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women and communities most affected by the armed conflict with a focus on countries with situations under investigation by the ICC. The Women’s Initiatives for Gender Justice currently works in Uganda, the Democratic Republic of the Congo and Libya.
    Simhash of Document #1: 12358429319379250844
    Document #2: The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women most affected by the conflict situations under investigation by the ICC. The Women’s Initiatives for Gender Justice works in Uganda, the Democratic Republic of the Congo, Sudan, the Central African Republic, Kenya, Libya and Kyrgyzstan.
    Simhash of Document #2: 12359555184926328508
    Simhash Distance: 6 bits
    Scores range from 0 to 64 bits
  • 37.
    Cosine Similarities
    Take the cosine of the angle between the document vectors.
      • Cosine of TF-IDF: vectors are formed from each document and their term frequencies.
      • Cosine of Latent Semantic Indexing (LSI): each vector is informed by LSI.
    Scores range from 1 to 0.
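The cosine computation itself can be sketched with raw term-frequency vectors. This simplified version is an assumption for illustration: it omits the IDF weighting and the LSI projection named on the slide, both of which need a corpus-level model, and the token lists are made up.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)

first = ['women', 'justic', 'court']
considered = ['women', 'justic', 'sudan']
print(round(cosine_similarity(first, considered), 3))
```

With non-negative term counts the result stays between 0 (no shared terms) and 1 (same direction), which matches the range on the slide.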
  • 38.
  • 39.
    OTMT Installation Options
    1. pip from PyPI (preferred): pip install otmt
    2. Experimental Docker image: docker pull shawnmjones/otmt
    3. Source code: git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
  • 40.
    OTMT Usage
    # detect_off_topic -i archiveit=7877 -tm jaccard=0.80,bytecount=-0.50 -o outputfile.json
    In this command, -i gives the input, -tm gives the measures with their thresholds, and -o gives the output file.
    Input types for -i:
      • timemap – followed by 1 or more TimeMap URIs, separated by commas
      • warc – followed by 1 or more WARC files, separated by commas
      • archiveit – followed by an Archive-It collection ID
    TimeMap measures for -tm:
      • bytecount
      • wordcount
      • jaccard
      • sorensen
      • simhash-tf
      • simhash-raw
      • cosine
      • gensim_lsi
    Output types for -ot:
      • json
      • csv
  • 42.
    OTMT Output - JSON
    "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
      "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
        "timemap measures": {
          "cosine": {
            "stemmed": true,
            "tokenized": true,
            "removed boilerplate": true,
            "comparison score": 0.10969941307631487,
            "topic status": "off-topic"
          },
          "bytecount": {
            "stemmed": false,
            "tokenized": false,
            "removed boilerplate": false,
            "comparison score": 0.15971409055425445,
            "topic status": "on-topic"
          }
        },
        "overall topic status": "off-topic"
      },
      ...
    Reading the output:
      • Outermost key: URI-T of the TimeMap
      • Second-level key: URI-M of the memento
      • "timemap measures": measure information, one entry per measure
      • "stemmed", "tokenized", "removed boilerplate": preprocessing status
      • "comparison score": measure score
      • "topic status": on-topic or off-topic status by measure
      • "overall topic status": on-topic or off-topic status overall
    If one measure scores as off-topic, the memento is considered off-topic.
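A short script can pull the off-topic mementos out of this JSON structure. The sketch below embeds a trimmed copy of the slide's example (preprocessing flags dropped and closing braces added so that it parses); the function name is illustrative, and a real output file would be read from disk instead.

```python
import json

# Trimmed copy of the slide's example, completed so that it is valid JSON.
sample = """
{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {"comparison score": 0.10969941307631487,
                   "topic status": "off-topic"},
        "bytecount": {"comparison score": 0.15971409055425445,
                      "topic status": "on-topic"}
      },
      "overall topic status": "off-topic"
    }
  }
}
"""

def off_topic_mementos(results):
    """Return (URI-M, [measures that flagged it]) for each off-topic memento."""
    found = []
    for uri_t, mementos in results.items():
        for uri_m, record in mementos.items():
            if record["overall topic status"] != "off-topic":
                continue
            flagged_by = [name
                          for name, info in record["timemap measures"].items()
                          if info["topic status"] == "off-topic"]
            found.append((uri_m, flagged_by))
    return found

hits = off_topic_mementos(json.loads(sample))
for uri_m, flagged_by in hits:
    print(uri_m, flagged_by)
```

In this example only the cosine measure flags the memento, which is enough for its overall status to be off-topic.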
  • 43.
    Supported Similarity Measures
    Measure                               Fully Equivalent Score  Fully Dissimilar Score  Preprocessing Performed  OTMT -tm keyword
    Byte Count                            0.0                     -1.0                    No                       bytecount
    Word Count                            0.0                     -1.0                    Yes                      wordcount
    Jaccard Distance                      0.0                     1.0                     Yes                      jaccard
    Sørensen-Dice                         0.0                     1.0                     Yes                      sorensen
    Simhash of Term Frequencies           0                       64                      Yes                      simhash-tf
    Simhash of raw memento                0                       64                      No                       simhash-raw
    Cosine Similarity of TF-IDF Vectors   1.0                     0                       Yes                      cosine
    Cosine Similarity of LSI Vectors      1.0                     0                       Yes                      gensim_lsi
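Because the measures run in different directions, a threshold test has to know which end of the range means "dissimilar". The sketch below encodes the table's score ranges and infers the comparison direction from them. The strict inequalities and the function name are assumptions, not the OTMT's documented behavior.

```python
# (fully equivalent score, fully dissimilar score) per -tm keyword.
SCORE_RANGES = {
    "bytecount": (0.0, -1.0),
    "wordcount": (0.0, -1.0),
    "jaccard": (0.0, 1.0),
    "sorensen": (0.0, 1.0),
    "simhash-tf": (0, 64),
    "simhash-raw": (0, 64),
    "cosine": (1.0, 0.0),
    "gensim_lsi": (1.0, 0.0),
}

def is_off_topic(measure, score, threshold):
    """Flag a score that lies on the dissimilar side of the threshold."""
    equivalent, dissimilar = SCORE_RANGES[measure]
    if dissimilar > equivalent:
        return score > threshold  # distances: larger means less similar
    return score < threshold      # similarities and size changes: smaller

print(is_off_topic("cosine", 0.10969941307631487, 0.12))
print(is_off_topic("bytecount", 0.15971409055425445, -0.39))
```

The two calls reuse the scores from the JSON example together with the best thresholds reported in the results summary; under those assumptions they give off-topic for cosine and on-topic for byte count, agreeing with the statuses in that example.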
  • 44.
  • 45.
    Experiment Setup
    For each measure:
    1. Start the threshold at the score of complete dissimilarity
    2. Test with the URI-Ms from the gold standard data set as if that threshold indicated off-topic
    3. Compute F1 using the real off-topic status of the memento from the gold standard data
    4. Increment the threshold
    5. Repeat 2 – 4 until the threshold matches the complete equivalence score
    Example using Byte Count:
    1. Start the threshold at -1
    2. Test with the URI-Ms from the gold standard data set as if -1 indicated off-topic
    3. Compute F1 using the real off-topic status of the memento from the gold standard data
    4. Increment the threshold to -0.99
    5. Test with the URI-Ms from the gold standard data set as if -0.99 indicated off-topic
    6. Compute F1 with the real status
    7. Increment to -0.98
    8. Repeat until the threshold is 0
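The byte count sweep can be sketched as a loop over thresholds that keeps the best F1 score. The scores and labels below are made-up toy data, not the gold standard, and the strict `score < threshold` test is an assumption about how a threshold "indicates off-topic".

```python
def f1_score(predicted, actual):
    """F1 for the off-topic class, given parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def sweep_byte_count(scores, actual_off_topic, steps=100):
    """Try thresholds from -1 up to 0 and return (best F1, its threshold)."""
    best_f1, best_threshold = 0.0, None
    for i in range(steps + 1):
        threshold = (i - steps) / steps  # -1.00, -0.99, ..., 0.00
        predicted = [score < threshold for score in scores]
        f1 = f1_score(predicted, actual_off_topic)
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold

# Toy data: byte count scores and hand-assigned off-topic labels.
toy_scores = [-0.95, -0.82, -0.10, 0.05, -0.40]
toy_labels = [True, True, False, False, False]
best = sweep_byte_count(toy_scores, toy_labels)
print(best)
```

Deriving each threshold from an integer step avoids the drift that repeated addition of 0.01 would accumulate in floating point.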
  • 46.
    Our results do not match AlNoamany’s, but the world is not the same as it was in 2015…
                                 AlNoamany’s Study   Our Study
    Year Conducted               2015                2017
    Boilerplate Removal          Boilerpipe (Java)   Justext
    Tokenization and Stemming    Scikit-learn        NLTK
    Other changes:
      • Download errors
      • Gold Standard Dataset updates
  • 47.
    Simhash of Term Frequencies
    Our Results: [plot not reproduced in transcript]
    AlNoamany’s Results: Not tested
  • 48.
    Simhash of Raw Memento
    Our Results: [plot not reproduced in transcript]
    AlNoamany’s Results: Not tested
  • 49.
    Sørensen-Dice Distance Results
    Our Results: [plot not reproduced in transcript]
    AlNoamany’s Results: Not tested
  • 50.
    Jaccard Distance Results
    Our Results: [plot not reproduced in transcript]
    AlNoamany’s Results: Best F1 Score: 0.538, Threshold: 0.95
  • 51.
    Cosine Similarity of LSI Vectors
    Our Results: [plot not reproduced in transcript]
    AlNoamany’s Results: Not tested
    Note: LSI scores are non-deterministic
  • 52.
    Byte Count Results
    Our Results: [plot not reproduced in transcript]
    AlNoamany’s Results: Best F1 Score: 0.584, Threshold: -0.65
  • 53.
    Cosine Similarity of TF-IDF Vectors
    Our Results: [plot not reproduced in transcript]
    AlNoamany’s Results: Best F1 Score: 0.881, Threshold: 0.15 (best score in AlNoamany’s results)
  • 54.
    Word Count Results
    Our Results: [plot not reproduced in transcript] (best score in our results)
    AlNoamany’s Results: Best F1 Score: 0.806, Threshold: -0.85
  • 55.
    Results Summarized – Best F1 Score is Word Count
    Accuracy and threshold are the values corresponding to the best F1 score.
    Similarity Measure                   AlNoamany's Results              Results of this study
                                         Best F1  Accuracy  Threshold     Best F1  Accuracy  Threshold
    Word Count                           0.806    0.982     -0.85         0.788    0.971     -0.7
    Cosine Similarity of TF-IDF Vectors  0.881    0.983     0.15          0.766    0.965     0.12
    Byte Count                           0.584    0.962     -0.65         0.756    0.965     -0.39
    Cosine Similarity of LSI Vectors     Not tested                       0.711    0.965     0.12 with 10 topics
    Jaccard Distance                     0.538    0.962     0.95          0.651    0.953     0.94
    Sørensen-Dice Distance               Not tested                       0.649    0.953     0.88
    Simhash on raw memento content       Not tested                       0.578    0.934     25
    Simhash on TF                        Not tested                       0.523    0.942     28
    In our study, word count had the best F1 score (0.788). In AlNoamany’s study, cosine similarity of TF-IDF vectors had the best F1 score (0.881).
  • 56.
    What about using measures together?
    AlNoamany found that using cosine similarity of TF-IDF vectors and word count together produced even better results.
    Our best F1 score for word count alone was 0.788. Word count combined with LSI turned out to be slightly better, with the same accuracy.
    The success of word count appears to influence the threshold of its partner measure, making that threshold stricter.
  • 57.
  • 58.
    Improving the OTMT
      • Bug fixes
          • Make LSI scores reproducible
      • New measures
          • TimeMap measures – compare the first memento with the considered memento:
              • Spamsum of the raw content – used by Andy Jackson at the UKWA
              • Cosine of LDA vectors via Gensim
          • Collection measures
              1. Develop a collection-wide picture
              2. Compare each memento against that picture
      • Control over preprocessing:
          • Options to use a different boilerplate removal method
          • Options to turn off stemming or stop word removal
  • 59.
  • 60.
    Motivation - Mementos Can Go Off-Topic
      • Collections have a theme
      • Seeds are selected to support that theme
      • Mementos are versions of seeds
      • Some of these versions are off-topic
      • Identifying these off-topic mementos is key to some research activities, like summarization
    Examples shown: hacked, moved on from topic, web page gone, account suspension
  • 61.
    OTMT supports different similarity measures with thresholds established based on experimentation
      • Byte count
      • Word count
      • Jaccard distance
      • Sørensen-Dice distance
      • Simhash of term frequencies
      • Simhash of raw memento content
      • Cosine similarity of TF-IDF vectors
      • Cosine similarity of LSI vectors
  • 62.
    Please try out the Off-Topic Memento Toolkit!
    1. pip (preferred): pip install otmt
    2. Experimental Docker image: docker pull shawnmjones/otmt
    3. Source code: git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
    https://github.com/oduwsdl/off-topic-memento-toolkit
    https://github.com/oduwsdl/offtopic-goldstandard-data