Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource

Ian Milligan (@ianmilligan1)
Assistant Professor of History
i2millig@uwaterloo.ca

Why?
Historians need to think about computational methods in an era of web archives.

Estimated holdings:
Internet Archive: ~10,240 TB
Library of Congress: ~200 TB

The 80 TB Wide Web Scrape
[March - December 2011]

Wayback Machine or WARC files?

Building a .ca sample:
622,365 distinct URLs / 8,512,275 overall URLs = 7.31% in case study
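A minimal sketch of how such a sample could be drawn and how the 7.31% figure follows from the two counts. The helper names (`is_ca_host`, `sample_share`) and the example URLs are hypothetical; the talk does not say how the .ca URLs were actually selected, so the hostname test here is only an assumption.

```python
from urllib.parse import urlsplit


def is_ca_host(url: str) -> bool:
    """Return True when the URL's hostname ends in the .ca top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.lower().endswith(".ca")


def sample_share(sample_count: int, total_count: int) -> float:
    """Percentage of the overall URL list that falls in the sample."""
    return 100.0 * sample_count / total_count


# Made-up URLs, for illustration only.
urls = [
    "http://www.example.ca/index.html",
    "http://example.com/about",
    "https://news.example.ca/story?id=1",
    "http://canada.example.org/",
]

# Keep distinct .ca URLs only.
ca_urls = sorted({u for u in urls if is_ca_host(u)})
print(len(ca_urls), "of", len(urls), "URLs are under .ca")

# The counts reported on the slide.
print(f"{sample_share(622_365, 8_512_275):.2f}%")
```

Note that the test looks at the hostname rather than the whole URL, so a path or subdomain that merely contains "canada" is not counted.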

WARC
Web ARChive file format
ISO 28500:2009
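To give a sense of what a WARC record looks like, here is a rough sketch that reads one uncompressed record: a version line, named header fields, a blank line, then a payload of `Content-Length` bytes. The function name `read_warc_record` and the sample record are hypothetical. Real WARC files are usually gzip-compressed and are better handled with dedicated tools such as the WARC-Tools scripts used in this project.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record.

    Returns (version, headers, payload), or None at end of stream.
    """
    line = stream.readline()
    # Skip the blank separator lines that follow the previous record.
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None
    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers.get("Content-Length", "0"))
    payload = stream.read(length)
    return version, headers, payload


# A made-up response record, for illustration only.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://www.example.ca/\r\n"
    b"WARC-Date: 2011-03-01T00:00:00Z\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)

version, headers, body = read_warc_record(io.BytesIO(sample))
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```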

filesdump.py available at https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File
WARC-Tools/Lynx (warcfilter.py, warchtmlindex.py and filesdump.py)
Indexing
CDX Files (finding aids)
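As an illustration of why CDX files work as finding aids, the sketch below parses a small CDX index into one dictionary per capture. It assumes the common space-delimited layout in which a header line beginning with `CDX` lists one legend letter per column (for example `a` for the original URL, `b` for the capture date, `s` for the response code, `g` for the file name). The function name `parse_cdx` and the sample line are hypothetical, and the actual columns depend on how the index was generated.

```python
def parse_cdx(lines):
    """Yield one dict per CDX line, keyed by the legend letters in the header."""
    fields = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.lstrip().startswith("CDX"):
            # Header line: the letters after "CDX" name each column.
            fields = line.split()[1:]
            continue
        if fields is None:
            raise ValueError("CDX header line missing")
        yield dict(zip(fields, line.split(" ")))


# Made-up index content, for illustration only.
sample_lines = [
    " CDX N b a m s k r M S V g",
    "ca,example)/ 20110301000000 http://www.example.ca/ text/html 200 "
    "ABCDEF - - 1234 5678 sample.warc.gz",
]

for record in parse_cdx(sample_lines):
    # Where and when was this URL captured, and in which file?
    print(record["a"], record["b"], record["s"], record["g"])
```

Each entry says which archive file holds a given capture, which is what lets a researcher jump from a URL to the stored record without scanning every WARC file.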

Full Text Index
Clustering Workbench
Other sorts of text analysis
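The idea of a full-text index over the extracted page text can be sketched with a tiny inverted index. This is only an illustration under simple assumptions (lowercased alphanumeric tokens, boolean AND queries), not the indexing software used in the project; `build_index`, `search`, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query token (boolean AND)."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up text dumps standing in for pages extracted from WARC files.
docs = {
    "a.txt": "Ottawa hockey league schedule",
    "b.txt": "Toronto transit schedule",
    "c.txt": "Hockey scores from Toronto",
}

index = build_index(docs)
print(sorted(search(index, "hockey")))
print(sorted(search(index, "toronto schedule")))
```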

https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File
WARC-Tools/Lynx (warchtmlindex.py and filesdump.py)
Indexing

Downside is you still have to know what you’re looking for.

Thanks (to you all and to funders).
http://ianmilligan.ca/
