Internet Content as Research Data
Digital Humanities Australia, March 2012, Canberra
Monica Omodei & Gordon Mohr
Research Examples
• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history
Common Collection Strategies

• Crawl Scope & Focus
  1) Thematic/Topical (elections, events, global warming…)
  2) Resource-specific (video, PDF, etc.)
  3) Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
  4) Exhaustive (end-of-life/closure crawls, national domains)
  5) Frequency-based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
Existing Web Archives

• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research and university library archives
Internet Archive’s Web Archive

Positives
  – Very broad: 175+ billion web instances
  – Historic: started 1996
  – Publicly accessible
  – Time-based URL search
  – API access (see the sketch below)
  – Not constrained by legislation: covered by fair use and fast take-down response
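
As an illustration of the time-based URL search and API access above, here is a minimal sketch that queries the Wayback Machine's public availability endpoint for the capture of a URL closest to a given date. The endpoint and JSON field names follow its public documentation; the example URL and date are arbitrary:

    import json
    import urllib.parse
    import urllib.request

    # Ask the Wayback Machine's availability API for the snapshot of a URL
    # closest to a timestamp (YYYYMMDD...). Endpoint and JSON field names
    # follow the API's public documentation.
    def closest_snapshot(url, timestamp="20120301"):
        api = ("http://archive.org/wayback/available?url=%s&timestamp=%s"
               % (urllib.parse.quote(url), timestamp))
        with urllib.request.urlopen(api) as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap else None

    print(closest_snapshot("nla.gov.au"))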
Internet Archive's Web Archive

Negatives
  – Because of its size, you can't search by keyword
  – Because of its size, fully automated: QA is not possible
Common Use Cases for IA's Web Archive

• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis
Common Crawl

• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon's S3
• Accessible via MapReduce processing in Amazon's EC2 compute cloud
• Wholesale extraction, transformation, and analysis of web data is cheap and easy
• commoncrawl.org/data/accessing-the-data/
Common Crawl

Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing, not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API (see the sketch below)
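
A minimal sketch of such a Requester-Pays fetch using boto3. The bucket name and object key are illustrative assumptions; consult commoncrawl.org for the real layout, which has changed since 2012:

    import boto3

    # Fetch one crawl file from Common Crawl's S3 storage, accepting the
    # data-transfer charges via Requester-Pays. Bucket and key below are
    # illustrative assumptions, not actual paths.
    s3 = boto3.client("s3")
    resp = s3.get_object(
        Bucket="commoncrawl",              # assumed bucket name
        Key="crawl-data/example.warc.gz",  # hypothetical object key
        RequestPayer="requester",          # acknowledge the charges
    )
    with open("example.warc.gz", "wb") as f:
        f.write(resp["Body"].read())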
  	
  
Pandora Archive

• Positives
  – Quality checked
  – Targeted Australian content with a selection policy
  – Historical: started 1996
  – Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
  – Keyword search
  – Publicly accessible
  – You can nominate Australian web sites for inclusion: pandora.nla.gov.au/registration_form.html
Pandora Archive

• Negatives
  – Labour intensive, so small
  – Significant content missed because permission to copy was refused
• The situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking/fixing crawls
Pandora Archive Stats

• Size: 6.32 TB
• Number of files: > 140 million
• Number of 'titles': > 30.5K
• Number of title instances: > 73.5K
.au Domain Annual Snapshots

• Annual crawls since 2005, commissioned from the Internet Archive
• Includes sites on servers located in Australia as well as the .au domain
• Robots.txt respected, except for inline images and stylesheets
• No public access; researcher access protocols are being developed
• Full-text search, tailored to archive search
• Separate .gov crawl publicly accessible soon
Australian Web Domain Crawls

Year           2005         2006         2007         2008         2009         2011
Files          185 million  596 million  516 million  1 billion    765 million  660 million
Hosts crawled  811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
Size (TB)      6.69         19.04        18.47        34.55        24.29        30.71
Internet Memory Foundation Archive

• internetmemory.org/en/
• No keyword search yet; URL lookup only
• A number of European partners
Other National Archives

• List of International Internet Preservation Consortium member archives: netpreserve.org/about/archiveList.php
• Some are whole-domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)
Research Archives

• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas
… and many more

• WebCite - webcitation.org (citation service archive)
Bringing Archives Together

• Common standards and APIs
• Memento project
Create Your Own Archive

• Use a subscription service
• Build your own archive using the open-source crawler Heritrix and the standard .warc file format
• Use web citation services that create archive copies as you bookmark pages
Subscription Services

• archive-it.org (service operated by the non-profit Internet Archive since 2006)
• archivethe.net (service operated by the non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service: cdlib.org/services/uc3/was.html
• OCLC Harvester Service: oclc.org/webharvester/overview/default.htm
Install a Web Archiving System Locally

• An easy-to-deploy web archiving toolkit that meets web archive standards is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers, though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
'Memento': Adding Time to the Web

Protocol and browser add-on (MementoFox)
• Aids discovery and aggregation of page histories (see the sketch below)
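
Under the Memento protocol, a client asks a TimeGate for the capture of a URL nearest a chosen datetime via the Accept-Datetime request header. A minimal sketch; pointing it at the Wayback Machine's gate is our assumption, since any conformant TimeGate behaves the same way:

    import urllib.request

    # Request the memento of a page nearest 1 March 2012 from a TimeGate.
    # The Wayback Machine TimeGate URL form below is an assumption; a
    # conformant gate redirects to the chosen memento and reports its
    # capture time in the Memento-Datetime response header.
    req = urllib.request.Request(
        "http://web.archive.org/web/http://www.nla.gov.au/",
        headers={"Accept-Datetime": "Thu, 01 Mar 2012 00:00:00 GMT"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.geturl())                         # URL of the chosen memento
        print(resp.headers.get("Memento-Datetime"))  # its capture datetime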
  


	
  
Web Data Mining & Analysis – What Is It? Why Do It?

Innovation is increasingly driven by large-scale data analysis:
• You need fast iteration to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• You need to surface the information amongst all that data…
Platform & Toolkit: Overview

• Software
  – Apache Hadoop
  – Apache Pig
• Data/File format
  – WARC
  – CDX
  – WAT (new!)
Apache Hadoop

• HDFS
  – Distributed storage
  – Durable, default 3x replication
  – Scalable: Yahoo! runs 60+ PB of HDFS
• MapReduce
  – Distributed computation
  – You write Java functions (a Streaming sketch in Python follows below)
  – Hadoop distributes work across the cluster
  – Tolerates failures
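
Though the native MapReduce API takes Java functions, Hadoop Streaming lets any executable supply the map and reduce steps. A minimal sketch that counts records per host; the assumption that the first whitespace-delimited field of each input line is a hostname is ours, purely for illustration:

    #!/usr/bin/env python
    # Hadoop Streaming mapper and reducer in one file. Streaming pipes
    # input lines to stdin and collects tab-separated key/value pairs from
    # stdout; the framework sorts by key between the two phases.
    import sys

    def mapper():
        for line in sys.stdin:
            if line.strip():
                print(line.split()[0] + "\t1")  # assume field 0 is a hostname

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            key, n = line.rsplit("\t", 1)
            if key != current and current is not None:
                print(current + "\t" + str(total))
                total = 0
            current = key
            total += int(n)
        if current is not None:
            print(current + "\t" + str(total))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)()

It would be launched with hadoop-streaming, passing the script as both -mapper 'count.py map' and -reducer 'count.py reduce' along with input and output paths.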
File formats and data: WARC
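
The WARC container (written by Heritrix, as noted earlier) can be walked record by record. A minimal reading sketch; the warcio library is our choice and is not named in the talk:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate a (possibly gzipped) WARC file and print the target URL of
    # every HTTP response record. warcio is an assumed reader library;
    # the record types and headers are standard WARC.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))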
File formats and data: CDX

• Index for the Wayback Machine: used to browse a WARC-based archive
• Space-delimited text file
• Only the essential metadata needed by Wayback:
  – URL
  – Content digest
  – Capture timestamp
  – Content-Type
  – HTTP response code
  – etc.
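
A minimal parsing sketch for such lines; the column order below is one common CDX layout and is an assumption, since real CDX files declare their fields in a header line:

    # Split one space-delimited CDX line into named fields. The field
    # order here is an assumed common layout; check the file's " CDX ..."
    # header line for the actual columns.
    FIELDS = ["url_key", "timestamp", "original_url",
              "content_type", "status_code", "digest", "length"]

    def parse_cdx_line(line):
        return dict(zip(FIELDS, line.split()))

    sample = ("au,gov,nla)/ 20120301120000 http://www.nla.gov.au/ "
              "text/html 200 EXAMPLEDIGEST 2153")
    print(parse_cdx_line(sample))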
File formats and data: WAT

• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• For data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work in progress: we want your feedback
File formats and data: WAT

• WAT is WARC ☺
  – WAT records are WARC metadata records
  – The WARC-Refers-To header identifies the original WARC record
• WAT payload is JSON
  – Compact
  – Hierarchical
  – Supported by every programming environment
• File formats & data:
  – CDX: 53 MB
  – WAT: 443 MB
  – WARC: 8,651 MB
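
Since WAT records are WARC metadata records with JSON payloads, WARC tooling reads them too. A minimal sketch; warcio is again our assumed reader, and the "Envelope"/"WARC-Header-Metadata" key path is an assumption about the WAT payload layout:

    import json
    from warcio.archiveiterator import ArchiveIterator

    # Read WAT metadata records and print the original URL recorded in
    # each JSON payload. The key path below is an assumed WAT layout.
    with open("example.warc.wat.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "metadata":
                payload = json.loads(record.content_stream().read())
                env = payload.get("Envelope", {})
                headers = env.get("WARC-Header-Metadata", {})
                print(headers.get("WARC-Target-URI"))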
Some References

• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
Contacts

• webarchive @ nla.gov.au
• secretariat @ internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us

• momodei @ nla.gov.au
• gojomo @ xavvy.com
  
	
  
