Internet Content as Research Data
Australian National University
August 2012, Canberra
Monica Omodei
Research Examples
•    Social networking   •  Political Science
•    Lexicography        •  Media Studies
•    Linguistics         •  Contemporary history
•    Network Science


Data-driven science is migrating from the
natural sciences to humanities and social
science
Talk Structure

•  Existing web archives
•  Web archive use cases
•  Bringing archives together
•  Creating your own archive
•  It's getting harder – challenges
•  Web data mining & analysis
Existing web archives

•  Internet Archive
•  Common Crawl
•  Pandora Archive
•  Internet Memory Foundation Archive
•  Other national archives
•  Research and university library archives
Common Collection Strategies

•  Crawl Scope & Focus
   1)  Thematic/topical (elections, events, global warming…)
   2)  Resource-specific (video, PDF, etc.)
   3)  Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
   4)  Exhaustive (end-of-life and closure crawls, national domains)
   5)  Frequency-based

•  Key inputs: nominations from subject matter experts, prior crawl
   data, registry data, trusted directories, Wikipedia, Twitter
Internet Archive’s Web Archive

Positives
  –  Very broad – 175+ billion web instances
  –  Historic – started 1996
  –  Publicly accessible
  –  Time-based URL search
  –  API access
  –  Not constrained by legislation – covered by
     fair use and fast take-down response
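
The time-based URL search and API access just listed can be scripted. A minimal sketch, assuming Java 11+ and the Internet Archive's publicly documented Wayback "availability" endpoint; the reply is JSON describing the snapshot closest to the requested timestamp:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WaybackLookup {
        public static void main(String[] args) throws Exception {
            // Ask for the capture of example.com closest to 1 Aug 2012
            // (timestamp format is yyyyMMddhhmmss; truncation is allowed).
            String url = "https://archive.org/wayback/available"
                       + "?url=example.com&timestamp=20120801";
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            // The JSON reply names the closest archived snapshot, if any.
            System.out.println(resp.body());
        }
    }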
Internet Archive's Web Archive

Negatives
   –  Because of its size, keyword search is not possible
   –  Because of its size, crawling is fully automated, so QA is
      not possible
Common Crawl

•  Non-profit foundation building an open crawl of the web to seed
   research and innovation
•  Currently 5 billion pages
•  Stored on Amazon's S3
•  Accessible via MapReduce processing in Amazon's EC2 compute cloud
•  Wholesale extraction, transformation, and analysis of web data
   is cheap and easy
Common Crawl

Negatives
•  Not designed for human browsing but for machine access
•  Objective is to support large-scale analysis and text
   mining/indexing – not long-term preservation
•  Some costs are involved for direct extraction of data from S3
   storage using the Requester-Pays API (see the sketch below)
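
To give a flavour of the Requester-Pays access path, here is a minimal sketch using the AWS SDK for Java (v1); the bucket and key names are illustrative, AWS credentials are assumed to be configured, and the transfer is billed to the requester's account:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    public class RequesterPaysFetch {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // The boolean flag marks the request Requester-Pays: the
            // caller, not the bucket owner, pays for the download.
            GetObjectRequest req = new GetObjectRequest(
                    "commoncrawl", "crawl-data/example-segment.warc.gz", true);
            try (S3Object obj = s3.getObject(req)) {
                System.out.println("Fetched object of "
                    + obj.getObjectMetadata().getContentLength() + " bytes");
            }
        }
    }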
  	
  
Pandora Archive

•  Positives
   –  Quality checked
   –  Targeted Australian content with a selection policy
   –  Historical – started 1996
   –  Bibliocentric approach – web sites/publications selected for
      archiving are catalogued (see Trove)
   –  Keyword search
   –  Publicly accessible
   –  You can nominate Australian web sites for inclusion –
      pandora.nla.gov.au/registration_form.html
Pandora Archive

•  Negatives
   –  Labour intensive, thus quite small
   –  Significant content missed because permission to copy was
      refused
•  The situation will improve markedly if Legal Deposit provisions
   are extended to digital publications
•  Broader coverage will be achieved when infrastructure is
   upgraded, reducing labour costs for checking/fixing crawls
Pandora Archive Stats

•  Size – 6.32 TB
•  Number of files > 140 million
•  Number of 'titles' > 30.5K
•  Number of title instances > 73.5K
Which archived sites are popular?

•  Measure: filtered, aggregated web access log data which counts
   accesses to titles
•  Examined the top 30 archived titles (by number of accesses) for
   each year from 2009 to 2012
•  Selected some to examine and speculate as to why they might be
   popular
•  Selected those with consistently high ranking, and ones that
   were very variable between years
Reasons for popularity of archived version

•  Were once popular and are now decommissioned, particularly if
   the domain name continues to exist and redirects to the archive
•  May not be that popular as live sites, but their live site links
   prominently to Pandora as an archive for their content
•  Popular referencing sources cite the archive as well as the live
   site (if it still exists)
Improving visibility and usage of Pandora archive

•  Articles about interesting content on the Australia Web Archives
   blog – http://blogs.nla.gov.au/australias-web-archives/
•  More effort to identify archived sites that are no longer 'live'
•  Market automatic redirect services to web site owners/managers
•  Allow Google to index archive content for 'non-live' sites
   (problematic)
•  Install Twittervane – draws site nominations for archiving based
   on trending Twitter topics
.au Domain Annual Snapshots

•  Annual crawls since 2005, commissioned from Internet Archive
•  Includes sites on servers located in Australia as well as the
   .au domain
•  Robots.txt respected, except for inline images and stylesheets
   (example below)
•  No public access – researcher access protocols are being
   developed
•  Full-text search – suited to searching archives
•  Separate .gov crawl publicly accessible soon
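
For context, robots.txt is the plain-text convention a site uses to tell crawlers what to skip. A hypothetical example (site and paths invented) of the kind of rules the snapshot crawls honour, except where an excluded stylesheet or inline image is needed to render an archived page:

    # http://example.com.au/robots.txt (hypothetical)
    User-agent: *            # applies to all crawlers, including Heritrix
    Disallow: /private/      # never fetched by the crawl
    Disallow: /css/          # excluded, but fetched anyway for page rendering
    Crawl-delay: 10          # non-standard politeness hint, in seconds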
  
Australian web domain crawls

Year            2005         2006         2007         2008        2009         2011
Files           185 million  596 million  516 million  1 billion   765 million  660 million
Hosts crawled   811,523      1,046,038    1,247,614    3,038,658   1,074,645    1,346,549
Size (TB)       6.69         19.04        18.47        34.55       24.29        30.71
Internet Memory Foundation

•  A number of European partners
•  LiWA – Living Web Archives: next-generation Web archiving
   methods and tools
•  LAWA – Longitudinal Analytics of Web Archive Data: experimental
   testbed for large-scale data analytics
•  ARCOMEM (Collect-All ARchives to COmmunity MEMories) –
   leveraging social media for intelligent preservation
•  SCAPE – Scalable Preservation Environments
Other National Archives

•  List of International Internet Preservation Consortium member
   archives – netpreserve.org/about/archiveList.php
•  Some are whole-domain archives, some are selective archives,
   many are both
•  Some have public access; for others you will need to negotiate
   access for research
•  Most archives have been collected using the Heritrix open-source
   crawler and thus use the standard format (WARC, an ISO standard)
Research Archives

•  California Digital Library
•  Harvard University Libraries
•  Columbia University Libraries
•  University of North Texas
…  and many more

•  WebCite – webcitation.org (citation service archive)
Example: Columbia University

•  Member of the IIPC
•  They use the Archive-It service
•  A research library that sees web archiving as fundamental to
   their collecting
•  They complement and coordinate with other web archives
•  Their collecting focus is thematic – e.g. human rights, historic
   preservation, NY religious institutions
•  They also archive web content as part of personal and
   organisational archives (cf. manuscripts collection)
•  Archive their own web site regularly
Bringing Archives Together

•  Common standards and APIs
•  Memento project – adding time to the web
   –  Aggregates CDX files (URL index) from multiple archives
   –  Has a Firefox plug-in which allows time-based browsing
   –  Initiative of Los Alamos Laboratories
   –  See http://www.mementoweb.org/demo/
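
Under the hood, Memento is ordinary HTTP content negotiation in the datetime dimension: the client sends an Accept-Datetime header to a TimeGate, which redirects to the archived copy closest to that time. A minimal sketch, assuming Java 11+ (the aggregator TimeGate URL below is illustrative):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class MementoLookup {
        public static void main(String[] args) throws Exception {
            // Ask a Memento TimeGate for example.com as of 1 Aug 2012.
            HttpRequest req = HttpRequest.newBuilder(URI.create(
                    "http://timetravel.mementoweb.org/timegate/http://example.com/"))
                .header("Accept-Datetime", "Wed, 01 Aug 2012 00:00:00 GMT")
                .GET().build();
            HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER).build();
            HttpResponse<Void> resp =
                client.send(req, HttpResponse.BodyHandlers.discarding());
            // The TimeGate answers with a redirect to the closest memento.
            System.out.println(resp.headers().firstValue("Location").orElse("(none)"));
        }
    }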
  
	
  
Common Use Cases for a web archive

•  Content discovery
•  Nostalgia queries
•  Web site restoration and file recovery
•  Domain name valuation
•  Fall-back for link-rot
•  Prior art analysis and patent/copyright infringement research
•  Legal cases
•  Topic analysis, web trends analysis, popularity analysis,
   network analysis, linguistic analysis
Create your own Archive

•  Use a subscription service
•  Build your own web archiving infrastructure with open source
   software (e.g. Heritrix and Wayback)
•  Use web citation services that create archive copies as you
   bookmark pages
Subscription Services

•  archive-it.org (service operated by the non-profit Internet
   Archive since 2006)
•  archivethe.net (service operated by the non-profit Internet
   Memory Foundation)
•  California Digital Library Web Archiving Service –
   cdlib.org/services/uc3/was.html
•  OCLC Harvester Service –
   oclc.org/webharvester/overview/default.htm
Install web archiving system locally

•  An easy-to-deploy web archiving toolkit is not yet available
•  Institutional web archiving infrastructure is feasible and has
   been established at a number of universities for use by
   researchers – it needs IT systems engineers to set up, though
•  Archives can be deposited with the NLA for long-term
   preservation
Personal Web Archiving

•  WARCreate – recently released free tool which creates
   Wayback-consumable WARC files from any web page
•  Google Chrome extension
•  Enables preservation by users from their desktop
•  Can target content unreachable by crawlers
•  Brings WARC to personal digital archiving
•  What you do with the WARC files is up to you
•  Install suite provided to set up a local Wayback instance and
   Memento TimeGate
Current challenges

•  Database-driven features and functions
•  Complex and varying URI formats and non-standard link
   implementations, e.g. Twitter
•  Dynamically generated, ever-changing URIs
   –  For serving the same resources
•  Rich media – e.g. streamed media with custom apps and
   anti-collection measures
•  Scripted incremental display and page-loading
… more …

•  Scripted HTML forms
•  Multi-sourced embedded material
•  Dynamic authentication, e.g. captchas, cross-site
   authentication, user-sensitive embeds
•  Alternate display based on browser, device, or other parameters
•  Site architecture designed to inhibit crawling and indexing –
   if poorly done, even 'polite' harvesters like Heritrix may
   crash their server
.. but wait, there's more …

•  Server-side scripts and remote procedure calls – the full
   variety of paths through a site is now often hidden in
   remote/opaque server-side code – not a new problem, but it now
   affects 80+% of online resources
•  HTML5 web sockets – effectively codify incremental updates
   without page reloads
•  Mobile publishing
Transactional Web Archiving

•  Useful for institutional archiving
   –  Best for record-keeping purposes – e.g. when challenged in
      court about content on a web site
   –  Can be used to ensure URL persistence, e.g. when a site has
      a make-over – can intercept 404s
   –  No 'gaps', unlike the crawl approach – every change in
      accessed content is archived
   –  However, requires a code snippet to be installed on the web
      server (illustrative sketch below)
   –  Open source software being developed by Los Alamos Labs
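
As a purely hypothetical illustration of such a server-side snippet, a Java servlet filter could offer every served URI to an archiver as it is delivered; the Los Alamos software takes its own approach, and the WARC-writing step is only stubbed out here:

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;

    // Hypothetical transactional-archiving hook: every response the
    // server actually delivers is offered to an archiver as it happens,
    // so the archive has no gaps between crawl visits.
    public class TransactionalArchiveFilter implements Filter {
        @Override public void init(FilterConfig cfg) {}
        @Override public void destroy() {}

        @Override
        public void doFilter(ServletRequest req, ServletResponse res,
                FilterChain chain) throws IOException, ServletException {
            chain.doFilter(req, res);  // let the server render the page first
            String uri = ((HttpServletRequest) req).getRequestURI();
            archive(uri);              // then record that this URI was served
        }

        private void archive(String uri) {
            // Stub: a real implementation would capture the response body
            // (via an HttpServletResponseWrapper) and append it to a WARC.
            System.out.println("archived: " + uri);
        }
    }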
  
Web Data Mining & Analysis – What is it? Why Do It?

Innovation is increasingly driven by large-scale data analysis:
•  Need fast iteration to understand the right questions to ask
•  More minds able to contribute = more value (perceived and real)
   placed on the importance of the data
•  Increased demand for/value of the data = more funding to
   support it
•  Need to surface the information amongst all that data…
Platform & Toolkit: Overview

•  Software
   –  Apache Hadoop
   –  Apache Pig
•  Data/File formats
   –  WARC
   –  CDX
   –  WAT (new!)
Apache Hadoop

•  HDFS
   –  Distributed storage
   –  Durable, default 3x replication
   –  Scalable: Yahoo! 60+ PB HDFS
•  MapReduce
   –  Distributed computation
   –  You write Java functions (sketch below)
   –  Hadoop distributes work across the cluster
   –  Tolerates failures
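
To give a feel for the "Java functions" involved, here is the canonical word-count map/reduce pair (a minimal sketch; the driver boilerplate that wires these classes into a job is omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for each input line, emit (word, 1) pairs.
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                ctx.write(word, ONE);   // Hadoop shuffles these by word
            }
        }
    }

    // Reduce: sum the counts for each word across the whole corpus.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }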
File formats and data: WARC
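
A WARC file is a sequence of records: a block of plain-text headers followed by the captured payload. An illustrative response record (URI, date, record ID, and lengths invented):

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://example.com/
    WARC-Date: 2012-08-01T00:00:00Z
    WARC-Record-ID: <urn:uuid:0bfe52f8-1111-2222-3333-444455556666>
    Content-Type: application/http; msgtype=response
    Content-Length: 2153

    HTTP/1.1 200 OK
    Content-Type: text/html

    <html> ... captured page body ... </html>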
File formats and data: CDX

•  Index used to browse a WARC-based archive
•  Space-delimited text file
•  Only the essential metadata needed by Wayback
   –  URL
   –  Content digest
   –  Capture timestamp
   –  Content-Type
   –  HTTP response code
   –  etc.
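
Because CDX is just space-delimited text, ad hoc analysis needs no special tooling. A minimal sketch that lists all 404 captures (the field order assumed below follows one common CDX layout; real files declare theirs in a header line beginning "CDX"):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class CdxScan {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith(" CDX") || line.startsWith("CDX"))
                        continue;                 // skip the layout header
                    String[] f = line.split(" ");
                    // Assumed layout: urlkey, timestamp, original URL,
                    // MIME type, HTTP status, digest, ... (check the header!)
                    String timestamp = f[1], url = f[2], status = f[4];
                    if (status.equals("404"))
                        System.out.println(timestamp + "  " + url);
                }
            }
        }
    }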
File formats and data: WAT

•  Yet Another Metadata Format! ☺ ☹
•  Not a preservation format
•  For data exchange and analysis
•  Less than full WARC, more than CDX
•  Essential metadata for many types of analysis
•  Avoids barriers to data exchange: copyright, privacy
•  Work-in-progress: we want your feedback
File formats and data: WAT

•  WAT is WARC ☺
   –  WAT records are WARC metadata records
   –  WARC-Refers-To header identifies the original WARC record
•  WAT payload is JSON (example below)
   –  Compact
   –  Hierarchical
   –  Supported by every programming environment
•  Size comparison for the same data:
   –  CDX: 53 MB
   –  WAT: 443 MB
   –  WARC: 8,651 MB
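
Since the payload is ordinary JSON, any JSON library can slice it. A schematic sketch using org.json (the envelope field names follow the WAT structure loosely and should be checked against the spec; the embedded record is drastically simplified):

    import org.json.JSONObject;

    public class WatPeek {
        public static void main(String[] args) {
            // Drastically simplified, illustrative WAT payload.
            String wat = "{\"Envelope\": {"
                + "\"WARC-Header-Metadata\": {\"WARC-Target-URI\": \"http://example.com/\"},"
                + "\"Payload-Metadata\": {\"Actual-Content-Length\": \"2153\"}}}";
            JSONObject envelope = new JSONObject(wat).getJSONObject("Envelope");
            System.out.println(envelope
                .getJSONObject("WARC-Header-Metadata")
                .getString("WARC-Target-URI"));
        }
    }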
Some References

•  http://en.wikipedia.org/wiki/Web_archiving
•  http://netpreserve.org/about/archiveList.php
•  Web Archives: The Future(s) –
   http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
•  http://matkelly.com/warcreate/
•  Common Crawl: http://commoncrawl.org/data/accessing-the-data/
Contacts

•  webarchive @ nla.gov.au
•  secretariat @ internetmemory.org
•  Queries about the Internet Archive web archive:
   http://iawebarchiving.wordpress.com/
•  Queries about the Archive-It service:
   http://www.archive-it.org/contact-us

momodei @ nla.gov.au (until 31 Aug 2012)
or
monica.omodei @ gmail.com
