Ian Milligan (@ianmilligan1)
Assistant Professor of History
i2millig@uwaterloo.ca
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource
Why?
Historians need to think about computational methods in an era of web archives.
est. HOLDINGS:
INTERNET ARCHIVE: ~10,240 TB
LIBRARY of CONGRESS: ~200 TB
The 80TB Wide Web Scrape
[March - December 2011]
Wayback Machine or WARC files?
Building a .ca sample:
622,365 distinct URLs / 8,512,275 overall URLs = 7.31% in case study
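The share above can be checked directly, and the selection step can be sketched as a simple hostname test. This is only an illustration, assuming the sample is chosen by top-level domain; the helper `is_dot_ca` and the example URLs are hypothetical and are not taken from the talk's scripts.

```python
from urllib.parse import urlsplit

def is_dot_ca(url: str) -> bool:
    """Return True when the URL's hostname ends in the .ca top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".ca")

# Hypothetical URLs, for illustration only.
sample_urls = [
    "http://www.example.ca/index.html",
    "http://example.com/canada.ca/page",   # ".ca" appears only in the path
    "https://news.example.CA/story?id=1",  # hostnames are case-insensitive
    "http://example.org/",
]
ca_urls = [u for u in sample_urls if is_dot_ca(u)]

# Figures from the slide.
distinct_urls = 622_365
overall_urls = 8_512_275
share = 100 * distinct_urls / overall_urls
print(f"{share:.2f}% of overall URLs")
```

Testing the parsed hostname rather than the raw string keeps URLs that merely mention ".ca" in their path out of the sample.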
WARC
Web ARChive file format
ISO 28500:2009
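To make the format concrete: a WARC record is a version line, a block of named header fields, a blank line, and then a content block whose size is given by Content-Length. The sketch below reads one such record using only the Python standard library. It is a minimal illustration, not the WARC-Tools code used in the talk; the function name `read_warc_record` and the sample record are hypothetical, and real WARC files are often gzip-compressed, which this sketch does not handle.

```python
import io

def read_warc_record(stream):
    """Read one WARC record: version line, named header fields, content block."""
    version = stream.readline().decode("utf-8").strip()
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")
    headers = {}
    while True:
        line = stream.readline().decode("utf-8").rstrip("\r\n")
        if line == "":  # a blank line ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    block = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # each record is followed by two CRLFs
    return version, headers, block

# Hypothetical, abbreviated record (a real one also carries WARC-Record-ID).
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://www.example.ca/\r\n",
    b"WARC-Date: 2011-03-01T00:00:00Z\r\n",
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n",
    b"\r\n",
    payload,
    b"\r\n\r\n",
])

version, headers, block = read_warc_record(io.BytesIO(sample))
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```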
filesdump.py available at https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File
WARC-Tools/Lynx (warcfilter.py, warchtmlindex.py and filesdump.py)
Indexing
CDX Files (finding aids)
Full Text Index
Clustering Workbench
Other sorts of text analysis
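The full-text indexing step in the workflow above can be illustrated with a toy inverted index that maps each term to the pages containing it. This is a stand-in sketch only, assuming the page text has already been extracted; it does not reproduce the indexing or clustering tools used in the talk, and the function names and example pages are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical extracted page text, for illustration only.
docs = {
    "http://a.example.ca/": "Hockey scores and hockey news",
    "http://b.example.ca/": "Federal election news",
    "http://c.example.ca/": "Hockey election pool",
}
index = build_index(docs)
hits = search(index, "hockey news")
print(sorted(hits))
```

An index like this answers keyword queries quickly, and grouping the result set by shared terms is one simple way to think about what a clustering layer adds on top of it.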
https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File
WARC-Tools/Lynx (warchtmlindex.py and filesdump.py)
Indexing
The downside is that you still have to know what you’re looking for.
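A CDX-style lookup shows why. The sketch below parses a small index and finds a capture by its exact URL, which only works if the URL is already known. It assumes the common 11-field layout declared in the header line (real CDX files declare their own field order); the readable field names, the helper functions, and the sample lines are hypothetical.

```python
# Readable names for the single-letter field codes in the CDX header line.
FIELD_NAMES = {
    "N": "url_key", "b": "timestamp", "a": "original_url", "m": "mime",
    "s": "status", "k": "digest", "r": "redirect", "M": "meta",
    "S": "size", "V": "offset", "g": "filename",
}

def parse_cdx(lines):
    """Parse space-separated CDX lines using the legend in the header line."""
    lines = iter(lines)
    header = next(lines).split()
    if not header or header[0] != "CDX":
        raise ValueError("missing CDX header line")
    letters = header[1:]
    rows = []
    for line in lines:
        values = line.split()
        if len(values) != len(letters):
            continue  # skip blank or malformed lines
        rows.append({FIELD_NAMES.get(l, l): v for l, v in zip(letters, values)})
    return rows

def lookup(rows, url):
    """Return every capture whose original URL matches exactly."""
    return [row for row in rows if row["original_url"] == url]

# Hypothetical index content, for illustration only.
cdx_text = """ CDX N b a m s k r M S V g
ca,example)/ 20110301000000 http://www.example.ca/ text/html 200 AAAA - - 1234 0 sample.warc.gz
ca,example)/about 20110302000000 http://www.example.ca/about text/html 200 BBBB - - 2345 1234 sample.warc.gz
"""
rows = parse_cdx(cdx_text.splitlines())
hits = lookup(rows, "http://www.example.ca/about")
print(hits[0]["filename"], hits[0]["offset"])
```

The index points to a file and offset for a known address, but it offers no way to ask what a page is about.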
Playing with images?
Ian Milligan
Assistant Professor of History
i2millig@uwaterloo.ca
Thanks (to you all and to funders).
http://ianmilligan.ca/
