Internet Content as Research Data
Digital Humanities Australia, March 2012, Canberra
Monica Omodei & Gordon Mohr
Research Examples
• Social networking
• Lexicography
• Linguistics
• Network Science
• Political Science
• Media Studies
• Contemporary history
Common Collection Strategies

• Crawl Scope & Focus
  1) Thematic/Topical (elections, events, global warming…)
  2) Resource-specific (video, PDF, etc.)
  3) Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
  4) Exhaustive (end-of-life/closure crawls, national domains)
  5) Frequency-based
• Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
Existing Web Archives

• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research and university library archives
Internet Archive’s Web Archive

Positives
  – Very broad: 175+ billion web instances
  – Historic: started 1996
  – Publicly accessible
  – Time-based URL search
  – API access (see the sketch below)
  – Not constrained by legislation: covered by fair use and fast take-down response
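
As an illustration of the time-based URL search and API access above, here is a minimal sketch that queries the Wayback Machine's public availability endpoint for the capture of a URL closest to a given date. The endpoint and JSON field names follow its public documentation; the example URL and date are arbitrary:

    import json
    import urllib.parse
    import urllib.request

    # Ask the Wayback Machine's availability API for the snapshot of a URL
    # closest to a timestamp (YYYYMMDD...). Endpoint and JSON field names
    # follow the API's public documentation.
    def closest_snapshot(url, timestamp="20120301"):
        api = ("http://archive.org/wayback/available?url=%s&timestamp=%s"
               % (urllib.parse.quote(url), timestamp))
        with urllib.request.urlopen(api) as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap else None

    print(closest_snapshot("nla.gov.au"))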
Internet Archive's Web Archive

Negatives
  – Because of its size, you can't search by keyword
  – Because of its size, fully automated: QA is not possible
Common Use Cases for IA's Web Archive

• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Collaborative R&D
• Prior art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis
Common Crawl

• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon's S3
• Accessible via MapReduce processing in Amazon's EC2 compute cloud
• Wholesale extraction, transformation, and analysis of web data is cheap and easy
• commoncrawl.org/data/accessing-the-data/
Common Crawl

Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing, not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API (see the sketch below)
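
A minimal sketch of such a Requester-Pays fetch using boto3. The bucket name and object key are illustrative assumptions; consult commoncrawl.org for the real layout, which has changed since 2012:

    import boto3

    # Fetch one crawl file from Common Crawl's S3 storage, accepting the
    # data-transfer charges via Requester-Pays. Bucket and key below are
    # illustrative assumptions, not actual paths.
    s3 = boto3.client("s3")
    resp = s3.get_object(
        Bucket="commoncrawl",              # assumed bucket name
        Key="crawl-data/example.warc.gz",  # hypothetical object key
        RequestPayer="requester",          # acknowledge the charges
    )
    with open("example.warc.gz", "wb") as f:
        f.write(resp["Body"].read())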
  	
  
Pandora Archive

• Positives
  – Quality checked
  – Targeted Australian content with a selection policy
  – Historical: started 1996
  – Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
  – Keyword search
  – Publicly accessible
  – You can nominate Australian web sites for inclusion: pandora.nla.gov.au/registration_form.html
Pandora Archive

• Negatives
  – Labour intensive, so small
  – Significant content missed because permission to copy was refused
• The situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking/fixing crawls
Pandora Archive Stats

• Size: 6.32 TB
• Number of files: > 140 million
• Number of 'titles': > 30.5K
• Number of title instances: > 73.5K
.au Domain Annual Snapshots

• Annual crawls since 2005, commissioned from the Internet Archive
• Includes sites on servers located in Australia as well as the .au domain
• Robots.txt respected, except for inline images and stylesheets
• No public access; researcher access protocols are being developed
• Full-text search, tailored to archive search
• Separate .gov crawl publicly accessible soon
Australian Web Domain Crawls

Year           2005         2006         2007         2008         2009         2011
Files          185 million  596 million  516 million  1 billion    765 million  660 million
Hosts crawled  811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
Size (TB)      6.69         19.04        18.47        34.55        24.29        30.71
Internet Memory Foundation Archive

• internetmemory.org/en/
• No keyword search yet; URL lookup only
• A number of European partners
Other National Archives

• List of International Internet Preservation Consortium member archives: netpreserve.org/about/archiveList.php
• Some are whole-domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)
Research Archives

• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas
… and many more

• WebCite - webcitation.org (citation service archive)
Bringing Archives Together

• Common standards and APIs
• Memento project
Create Your Own Archive

• Use a subscription service
• Build your own archive using the open-source crawler Heritrix and the standard .warc file format
• Use web citation services that create archive copies as you bookmark pages
Subscription Services

• archive-it.org (service operated by the non-profit Internet Archive since 2006)
• archivethe.net (service operated by the non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service: cdlib.org/services/uc3/was.html
• OCLC Harvester Service: oclc.org/webharvester/overview/default.htm
Install a Web Archiving System Locally

• An easy-to-deploy web archiving toolkit that meets web archive standards is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers, though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
'Memento': Adding Time to the Web

Protocol and browser add-on (MementoFox)
• Aids discovery and aggregation of page histories (see the sketch below)
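
Under the Memento protocol, a client asks a TimeGate for the capture of a URL nearest a chosen datetime via the Accept-Datetime request header. A minimal sketch; pointing it at the Wayback Machine's gate is our assumption, since any conformant TimeGate behaves the same way:

    import urllib.request

    # Request the memento of a page nearest 1 March 2012 from a TimeGate.
    # The Wayback Machine TimeGate URL form below is an assumption; a
    # conformant gate redirects to the chosen memento and reports its
    # capture time in the Memento-Datetime response header.
    req = urllib.request.Request(
        "http://web.archive.org/web/http://www.nla.gov.au/",
        headers={"Accept-Datetime": "Thu, 01 Mar 2012 00:00:00 GMT"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.geturl())                         # URL of the chosen memento
        print(resp.headers.get("Memento-Datetime"))  # its capture datetime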
  


	
  
Web Data Mining & Analysis – What Is It? Why Do It?

Innovation is increasingly driven by large-scale data analysis:
• You need fast iteration to understand the right questions to ask
• More minds able to contribute = more value (perceived and real) placed on the importance of the data
• Increased demand for/value of the data = more funding to support it
• You need to surface the information amongst all that data…
Platform & Toolkit: Overview

• Software
  – Apache Hadoop
  – Apache Pig
• Data/File format
  – WARC
  – CDX
  – WAT (new!)
Apache Hadoop

• HDFS
  – Distributed storage
  – Durable, default 3x replication
  – Scalable: Yahoo! runs 60+ PB of HDFS
• MapReduce
  – Distributed computation
  – You write Java functions (a Streaming sketch in Python follows below)
  – Hadoop distributes work across the cluster
  – Tolerates failures
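
Though the native MapReduce API takes Java functions, Hadoop Streaming lets any executable supply the map and reduce steps. A minimal sketch that counts records per host; the assumption that the first whitespace-delimited field of each input line is a hostname is ours, purely for illustration:

    #!/usr/bin/env python
    # Hadoop Streaming mapper and reducer in one file. Streaming pipes
    # input lines to stdin and collects tab-separated key/value pairs from
    # stdout; the framework sorts by key between the two phases.
    import sys

    def mapper():
        for line in sys.stdin:
            if line.strip():
                print(line.split()[0] + "\t1")  # assume field 0 is a hostname

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            key, n = line.rsplit("\t", 1)
            if key != current and current is not None:
                print(current + "\t" + str(total))
                total = 0
            current = key
            total += int(n)
        if current is not None:
            print(current + "\t" + str(total))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)()

It would be launched with hadoop-streaming, passing the script as both -mapper 'count.py map' and -reducer 'count.py reduce' along with input and output paths.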
File formats and data: WARC
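
The WARC container (written by Heritrix, as noted earlier) can be walked record by record. A minimal reading sketch; the warcio library is our choice and is not named in the talk:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate a (possibly gzipped) WARC file and print the target URL of
    # every HTTP response record. warcio is an assumed reader library;
    # the record types and headers are standard WARC.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))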
File formats and data: CDX

• Index for the Wayback Machine: used to browse a WARC-based archive
• Space-delimited text file
• Only the essential metadata needed by Wayback:
  – URL
  – Content digest
  – Capture timestamp
  – Content-Type
  – HTTP response code
  – etc.
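
A minimal parsing sketch for such lines; the column order below is one common CDX layout and is an assumption, since real CDX files declare their fields in a header line:

    # Split one space-delimited CDX line into named fields. The field
    # order here is an assumed common layout; check the file's " CDX ..."
    # header line for the actual columns.
    FIELDS = ["url_key", "timestamp", "original_url",
              "content_type", "status_code", "digest", "length"]

    def parse_cdx_line(line):
        return dict(zip(FIELDS, line.split()))

    sample = ("au,gov,nla)/ 20120301120000 http://www.nla.gov.au/ "
              "text/html 200 EXAMPLEDIGEST 2153")
    print(parse_cdx_line(sample))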
File formats and data: WAT

• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• For data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work in progress: we want your feedback
File formats and data: WAT

• WAT is WARC ☺
  – WAT records are WARC metadata records
  – The WARC-Refers-To header identifies the original WARC record
• WAT payload is JSON
  – Compact
  – Hierarchical
  – Supported by every programming environment
• File formats & data:
  – CDX: 53 MB
  – WAT: 443 MB
  – WARC: 8,651 MB
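
Since WAT records are WARC metadata records with JSON payloads, WARC tooling reads them too. A minimal sketch; warcio is again our assumed reader, and the "Envelope"/"WARC-Header-Metadata" key path is an assumption about the WAT payload layout:

    import json
    from warcio.archiveiterator import ArchiveIterator

    # Read WAT metadata records and print the original URL recorded in
    # each JSON payload. The key path below is an assumed WAT layout.
    with open("example.warc.wat.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "metadata":
                payload = json.loads(record.content_stream().read())
                env = payload.get("Envelope", {})
                headers = env.get("WARC-Header-Metadata", {})
                print(headers.get("WARC-Target-URI"))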
Some References

• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
Contacts

• webarchive @ nla.gov.au
• secretariat @ internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us

• momodei @ nla.gov.au
• gojomo @ xavvy.com
  
	
  
