Internet Content as Research Data
Australian National University
August 2012, Canberra
Monica Omodei
Research Examples
•    Social networking   •  Political Science
•    Lexicography        •  Media Studies
•    Linguistics         •  Contemporary history
•    Network Science


Data-driven science is migrating from the
natural sciences to humanities and social
science
Talk Structure

•  Existing web archives
•  Web archive use cases
•  Bringing archives together
•  Creating your own archive
•  It's getting harder – challenges
•  Web data mining & analysis
Existing web archives

•  Internet Archive
•  Common Crawl
•  Pandora Archive
•  Internet Memory Foundation Archive
•  Other national archives
•  Research and university library archives
Common Collection Strategies

•  Crawl Scope & Focus
   1)  Thematic/topical (elections, events, global warming…)
   2)  Resource-specific (video, PDF, etc.)
   3)  Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
   4)  Exhaustive (end-of-life and closure crawls, national domains)
   5)  Frequency-based

•  Key inputs: nominations from subject matter experts, prior crawl
   data, registry data, trusted directories, Wikipedia, Twitter
Internet Archive’s Web Archive

Positives
  –  Very broad – 175+ billion web instances
  –  Historic – started 1996
  –  Publicly accessible
  –  Time-based URL search
  –  API access
  –  Not constrained by legislation – covered by
     fair use and fast take-down response
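
The time-based URL search and API access just listed can be scripted. A minimal sketch, assuming Java 11+ and the Internet Archive's publicly documented Wayback "availability" endpoint; the reply is JSON describing the snapshot closest to the requested timestamp:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WaybackLookup {
        public static void main(String[] args) throws Exception {
            // Ask for the capture of example.com closest to 1 Aug 2012
            // (timestamp format is yyyyMMddhhmmss; truncation is allowed).
            String url = "https://archive.org/wayback/available"
                       + "?url=example.com&timestamp=20120801";
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            // The JSON reply names the closest archived snapshot, if any.
            System.out.println(resp.body());
        }
    }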
Internet Archive's Web Archive

Negatives
   –  Because of its size, keyword search is not possible
   –  Because of its size, crawling is fully automated, so QA is
      not possible
Common Crawl

•  Non-profit foundation building an open crawl of the web to seed
   research and innovation
•  Currently 5 billion pages
•  Stored on Amazon's S3
•  Accessible via MapReduce processing in Amazon's EC2 compute cloud
•  Wholesale extraction, transformation, and analysis of web data
   is cheap and easy
Common Crawl

Negatives
•  Not designed for human browsing but for machine access
•  Objective is to support large-scale analysis and text
   mining/indexing – not long-term preservation
•  Some costs are involved for direct extraction of data from S3
   storage using the Requester-Pays API (see the sketch below)
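
To give a flavour of the Requester-Pays access path, here is a minimal sketch using the AWS SDK for Java (v1); the bucket and key names are illustrative, AWS credentials are assumed to be configured, and the transfer is billed to the requester's account:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    public class RequesterPaysFetch {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // The boolean flag marks the request Requester-Pays: the
            // caller, not the bucket owner, pays for the download.
            GetObjectRequest req = new GetObjectRequest(
                    "commoncrawl", "crawl-data/example-segment.warc.gz", true);
            try (S3Object obj = s3.getObject(req)) {
                System.out.println("Fetched object of "
                    + obj.getObjectMetadata().getContentLength() + " bytes");
            }
        }
    }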
  	
  
Pandora Archive

•  Positives
   –  Quality checked
   –  Targeted Australian content with a selection policy
   –  Historical – started 1996
   –  Bibliocentric approach – web sites/publications selected for
      archiving are catalogued (see Trove)
   –  Keyword search
   –  Publicly accessible
   –  You can nominate Australian web sites for inclusion –
      pandora.nla.gov.au/registration_form.html
Pandora Archive

•  Negatives
   –  Labour intensive, thus quite small
   –  Significant content missed because permission to copy was
      refused
•  The situation will improve markedly if Legal Deposit provisions
   are extended to digital publications
•  Broader coverage will be achieved when infrastructure is
   upgraded, reducing labour costs for checking/fixing crawls
Pandora Archive Stats

•  Size – 6.32 TB
•  Number of files > 140 million
•  Number of 'titles' > 30.5K
•  Number of title instances > 73.5K
Which archived sites are popular?

•  Measure: filtered, aggregated web access log data which counts
   accesses to titles
•  Examined the top 30 archived titles (by number of accesses) for
   each year from 2009 to 2012
•  Selected some to examine and speculate as to why they might be
   popular
•  Selected those with consistently high ranking, and ones that
   were very variable between years
Reasons for popularity of archived version

•  Were once popular and are now decommissioned, particularly if
   the domain name continues to exist and redirects to the archive
•  May not be that popular as live sites, but their live site links
   prominently to Pandora as an archive for their content
•  Popular referencing sources cite the archive as well as the live
   site (if it still exists)
Improving visibility and usage of Pandora archive

•  Articles about interesting content on the Australia Web Archives
   blog – http://blogs.nla.gov.au/australias-web-archives/
•  More effort to identify archived sites that are no longer 'live'
•  Market automatic redirect services to web site owners/managers
•  Allow Google to index archive content for 'non-live' sites
   (problematic)
•  Install Twittervane – draws site nominations for archiving based
   on trending Twitter topics
.au Domain Annual Snapshots

•  Annual crawls since 2005, commissioned from Internet Archive
•  Includes sites on servers located in Australia as well as the
   .au domain
•  Robots.txt respected, except for inline images and stylesheets
   (example below)
•  No public access – researcher access protocols are being
   developed
•  Full-text search – suited to searching archives
•  Separate .gov crawl publicly accessible soon
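
For context, robots.txt is the plain-text convention a site uses to tell crawlers what to skip. A hypothetical example (site and paths invented) of the kind of rules the snapshot crawls honour, except where an excluded stylesheet or inline image is needed to render an archived page:

    # http://example.com.au/robots.txt (hypothetical)
    User-agent: *            # applies to all crawlers, including Heritrix
    Disallow: /private/      # never fetched by the crawl
    Disallow: /css/          # excluded, but fetched anyway for page rendering
    Crawl-delay: 10          # non-standard politeness hint, in seconds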
  
Australian web domain crawls

Year            2005         2006         2007         2008        2009         2011
Files           185 million  596 million  516 million  1 billion   765 million  660 million
Hosts crawled   811,523      1,046,038    1,247,614    3,038,658   1,074,645    1,346,549
Size (TB)       6.69         19.04        18.47        34.55       24.29        30.71
Internet Memory Foundation

•  A number of European partners
•  LiWA – Living Web Archives: next-generation Web archiving
   methods and tools
•  LAWA – Longitudinal Analytics of Web Archive Data: experimental
   testbed for large-scale data analytics
•  ARCOMEM (Collect-All ARchives to COmmunity MEMories) –
   leveraging social media for intelligent preservation
•  SCAPE – Scalable Preservation Environments
Other National Archives

•  List of International Internet Preservation Consortium member
   archives – netpreserve.org/about/archiveList.php
•  Some are whole-domain archives, some are selective archives,
   many are both
•  Some have public access; for others you will need to negotiate
   access for research
•  Most archives have been collected using the Heritrix open-source
   crawler and thus use the standard format (WARC, an ISO standard)
Research Archives

•  California Digital Library
•  Harvard University Libraries
•  Columbia University Libraries
•  University of North Texas
…  and many more

•  WebCite – webcitation.org (citation service archive)
Example: Columbia University

•  Member of the IIPC
•  They use the Archive-It service
•  A research library that sees web archiving as fundamental to
   their collecting
•  They complement and coordinate with other web archives
•  Their collecting focus is thematic – e.g. human rights, historic
   preservation, NY religious institutions
•  They also archive web content as part of personal and
   organisational archives (cf. manuscripts collection)
•  Archive their own web site regularly
Bringing Archives Together

•  Common standards and APIs
•  Memento project – adding time to the web
   –  Aggregates CDX files (URL index) from multiple archives
   –  Has a Firefox plug-in which allows time-based browsing
   –  Initiative of Los Alamos Laboratories
   –  See http://www.mementoweb.org/demo/
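
Under the hood, Memento is ordinary HTTP content negotiation in the datetime dimension: the client sends an Accept-Datetime header to a TimeGate, which redirects to the archived copy closest to that time. A minimal sketch, assuming Java 11+ (the aggregator TimeGate URL below is illustrative):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class MementoLookup {
        public static void main(String[] args) throws Exception {
            // Ask a Memento TimeGate for example.com as of 1 Aug 2012.
            HttpRequest req = HttpRequest.newBuilder(URI.create(
                    "http://timetravel.mementoweb.org/timegate/http://example.com/"))
                .header("Accept-Datetime", "Wed, 01 Aug 2012 00:00:00 GMT")
                .GET().build();
            HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER).build();
            HttpResponse<Void> resp =
                client.send(req, HttpResponse.BodyHandlers.discarding());
            // The TimeGate answers with a redirect to the closest memento.
            System.out.println(resp.headers().firstValue("Location").orElse("(none)"));
        }
    }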
  
	
  
Common Use Cases for a web archive

•  Content discovery
•  Nostalgia queries
•  Web site restoration and file recovery
•  Domain name valuation
•  Fall-back for link-rot
•  Prior art analysis and patent/copyright infringement research
•  Legal cases
•  Topic analysis, web trends analysis, popularity analysis,
   network analysis, linguistic analysis
Create your own Archive

•  Use a subscription service
•  Build your own web archiving infrastructure with open source
   software (e.g. Heritrix and Wayback)
•  Use web citation services that create archive copies as you
   bookmark pages
Subscription Services

•  archive-it.org (service operated by the non-profit Internet
   Archive since 2006)
•  archivethe.net (service operated by the non-profit Internet
   Memory Foundation)
•  California Digital Library Web Archiving Service –
   cdlib.org/services/uc3/was.html
•  OCLC Harvester Service –
   oclc.org/webharvester/overview/default.htm
Install web archiving system locally

•  An easy-to-deploy web archiving toolkit is not yet available
•  Institutional web archiving infrastructure is feasible and has
   been established at a number of universities for use by
   researchers – it needs IT systems engineers to set up, though
•  Archives can be deposited with the NLA for long-term
   preservation
Personal Web Archiving

•  WARCreate – recently released free tool which creates
   Wayback-consumable WARC files from any web page
•  Google Chrome extension
•  Enables preservation by users from their desktop
•  Can target content unreachable by crawlers
•  Brings WARC to personal digital archiving
•  What you do with the WARC files is up to you
•  Install suite provided to set up a local Wayback instance and
   Memento TimeGate
Current challenges

•  Database-driven features and functions
•  Complex and varying URI formats and non-standard link
   implementations, e.g. Twitter
•  Dynamically generated, ever-changing URIs
   –  For serving the same resources
•  Rich media – e.g. streamed media with custom apps and
   anti-collection measures
•  Scripted incremental display and page-loading
… more …

•  Scripted HTML forms
•  Multi-sourced embedded material
•  Dynamic authentication, e.g. captchas, cross-site
   authentication, user-sensitive embeds
•  Alternate display based on browser, device, or other parameters
•  Site architecture designed to inhibit crawling and indexing –
   if poorly done, even 'polite' harvesters like Heritrix may
   crash their server
.. but wait, there's more …

•  Server-side scripts and remote procedure calls – the full
   variety of paths through a site is now often hidden in
   remote/opaque server-side code – not a new problem, but it now
   affects 80+% of online resources
•  HTML5 web sockets – effectively codify incremental updates
   without page reloads
•  Mobile publishing
Transactional Web Archiving

•  Useful for institutional archiving
   –  Best for record-keeping purposes – e.g. when challenged in
      court about content on a web site
   –  Can be used to ensure URL persistence, e.g. when a site has
      a make-over – can intercept 404s
   –  No 'gaps', unlike the crawl approach – every change in
      accessed content is archived
   –  However, requires a code snippet to be installed on the web
      server (illustrative sketch below)
   –  Open source software being developed by Los Alamos Labs
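
As a purely hypothetical illustration of such a server-side snippet, a Java servlet filter could offer every served URI to an archiver as it is delivered; the Los Alamos software takes its own approach, and the WARC-writing step is only stubbed out here:

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;

    // Hypothetical transactional-archiving hook: every response the
    // server actually delivers is offered to an archiver as it happens,
    // so the archive has no gaps between crawl visits.
    public class TransactionalArchiveFilter implements Filter {
        @Override public void init(FilterConfig cfg) {}
        @Override public void destroy() {}

        @Override
        public void doFilter(ServletRequest req, ServletResponse res,
                FilterChain chain) throws IOException, ServletException {
            chain.doFilter(req, res);  // let the server render the page first
            String uri = ((HttpServletRequest) req).getRequestURI();
            archive(uri);              // then record that this URI was served
        }

        private void archive(String uri) {
            // Stub: a real implementation would capture the response body
            // (via an HttpServletResponseWrapper) and append it to a WARC.
            System.out.println("archived: " + uri);
        }
    }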
  
Web Data Mining & Analysis – What is it? Why Do It?

Innovation is increasingly driven by large-scale data analysis:
•  Need fast iteration to understand the right questions to ask
•  More minds able to contribute = more value (perceived and real)
   placed on the importance of the data
•  Increased demand for/value of the data = more funding to
   support it
•  Need to surface the information amongst all that data…
Platform & Toolkit: Overview

•  Software
   –  Apache Hadoop
   –  Apache Pig
•  Data/File formats
   –  WARC
   –  CDX
   –  WAT (new!)
Apache Hadoop

•  HDFS
   –  Distributed storage
   –  Durable, default 3x replication
   –  Scalable: Yahoo! 60+ PB HDFS
•  MapReduce
   –  Distributed computation
   –  You write Java functions (sketch below)
   –  Hadoop distributes work across the cluster
   –  Tolerates failures
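
To give a feel for the "Java functions" involved, here is the canonical word-count map/reduce pair (a minimal sketch; the driver boilerplate that wires these classes into a job is omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for each input line, emit (word, 1) pairs.
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                ctx.write(word, ONE);   // Hadoop shuffles these by word
            }
        }
    }

    // Reduce: sum the counts for each word across the whole corpus.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }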
File formats and data: WARC
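
A WARC file is a sequence of records: a block of plain-text headers followed by the captured payload. An illustrative response record (URI, date, record ID, and lengths invented):

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://example.com/
    WARC-Date: 2012-08-01T00:00:00Z
    WARC-Record-ID: <urn:uuid:0bfe52f8-1111-2222-3333-444455556666>
    Content-Type: application/http; msgtype=response
    Content-Length: 2153

    HTTP/1.1 200 OK
    Content-Type: text/html

    <html> ... captured page body ... </html>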
File formats and data: CDX

•  Index used to browse a WARC-based archive
•  Space-delimited text file
•  Only the essential metadata needed by Wayback
   –  URL
   –  Content digest
   –  Capture timestamp
   –  Content-Type
   –  HTTP response code
   –  etc.
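
Because CDX is just space-delimited text, ad hoc analysis needs no special tooling. A minimal sketch that lists all 404 captures (the field order assumed below follows one common CDX layout; real files declare theirs in a header line beginning "CDX"):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class CdxScan {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith(" CDX") || line.startsWith("CDX"))
                        continue;                 // skip the layout header
                    String[] f = line.split(" ");
                    // Assumed layout: urlkey, timestamp, original URL,
                    // MIME type, HTTP status, digest, ... (check the header!)
                    String timestamp = f[1], url = f[2], status = f[4];
                    if (status.equals("404"))
                        System.out.println(timestamp + "  " + url);
                }
            }
        }
    }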
File formats and data: WAT

•  Yet Another Metadata Format! ☺ ☹
•  Not a preservation format
•  For data exchange and analysis
•  Less than full WARC, more than CDX
•  Essential metadata for many types of analysis
•  Avoids barriers to data exchange: copyright, privacy
•  Work-in-progress: we want your feedback
File formats and data: WAT

•  WAT is WARC ☺
   –  WAT records are WARC metadata records
   –  WARC-Refers-To header identifies the original WARC record
•  WAT payload is JSON (example below)
   –  Compact
   –  Hierarchical
   –  Supported by every programming environment
•  Size comparison for the same data:
   –  CDX: 53 MB
   –  WAT: 443 MB
   –  WARC: 8,651 MB
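
Since the payload is ordinary JSON, any JSON library can slice it. A schematic sketch using org.json (the envelope field names follow the WAT structure loosely and should be checked against the spec; the embedded record is drastically simplified):

    import org.json.JSONObject;

    public class WatPeek {
        public static void main(String[] args) {
            // Drastically simplified, illustrative WAT payload.
            String wat = "{\"Envelope\": {"
                + "\"WARC-Header-Metadata\": {\"WARC-Target-URI\": \"http://example.com/\"},"
                + "\"Payload-Metadata\": {\"Actual-Content-Length\": \"2153\"}}}";
            JSONObject envelope = new JSONObject(wat).getJSONObject("Envelope");
            System.out.println(envelope
                .getJSONObject("WARC-Header-Metadata")
                .getString("WARC-Target-URI"));
        }
    }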
Some References

•  http://en.wikipedia.org/wiki/Web_archiving
•  http://netpreserve.org/about/archiveList.php
•  Web Archives: The Future(s) –
   http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
•  http://matkelly.com/warcreate/
•  Common Crawl: http://commoncrawl.org/data/accessing-the-data/
Contacts

•  webarchive @ nla.gov.au
•  secretariat @ internetmemory.org
•  Queries about the Internet Archive web archive:
   http://iawebarchiving.wordpress.com/
•  Queries about the Archive-It service:
   http://www.archive-it.org/contact-us

momodei @ nla.gov.au (until 31 Aug 2012)
or
monica.omodei @ gmail.com
