Slides: ANU talk on web archiving, August 2012




  1. Internet Content as Research Data
     Australian National University, Canberra – August 2012
     Monica Omodei
  2. Research Examples
     • Social networking
     • Lexicography
     • Linguistics
     • Network Science
     • Political Science
     • Media Studies
     • Contemporary history
     Data-driven science is migrating from the natural sciences to the humanities and social sciences.
  3. Talk Structure
     • Existing web archives
     • Web archive use cases
     • Bringing archives together
     • Creating your own archive
     • It's getting harder – challenges
     • Web data mining & analysis
  4. Existing web archives
     • Internet Archive
     • Common Crawl
     • Pandora Archive
     • Internet Memory Foundation Archive
     • Other national archives
     • Research and university library archives
  5. Common Collection Strategies
     • Crawl scope & focus:
       1) Thematic/topical (elections, events, global warming…)
       2) Resource-specific (video, PDF, etc.)
       3) Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
       4) Exhaustive (end-of-life/closure crawls, national domains)
       5) Frequency-based
     • Key inputs: nominations from subject-matter experts, prior crawl data, registry data, trusted directories, Wikipedia, Twitter
  6. Internet Archive's Web Archive
     Positives:
     – Very broad: 175+ billion web instances
     – Historic: started 1996
     – Publicly accessible
     – Time-based URL search
     – API access
     – Not constrained by legislation: covered by fair use and a fast take-down response
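Time-based URL search means each capture is addressed by a 14-digit timestamp. A minimal sketch of building such lookups (the replay URL scheme and the availability endpoint are the Internet Archive's public ones; the helper function and its names are illustrative):

```python
from datetime import datetime

# Both URL patterns are the Internet Archive's public ones;
# the helper function below is illustrative.
REPLAY = "https://web.archive.org/web/{ts}/{url}"
AVAILABILITY = "https://archive.org/wayback/available?url={url}&timestamp={ts}"

def wayback_urls(url: str, when: datetime) -> dict:
    """Build the time-based replay URL and the availability-API query
    for the capture of `url` closest to `when`."""
    ts = when.strftime("%Y%m%d%H%M%S")  # Wayback's 14-digit timestamp
    return {"replay": REPLAY.format(ts=ts, url=url),
            "availability": AVAILABILITY.format(ts=ts, url=url)}

print(wayback_urls("http://example.com/", datetime(2012, 8, 1))["replay"])
# https://web.archive.org/web/20120801000000/http://example.com/
```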
  7. Internet Archive's Web Archive
     Negatives:
     – Because of its size, keyword search is not possible
     – Because of its size, crawling is fully automated, so quality assurance is not possible
  8. Common Crawl
     • Non-profit foundation building an open crawl of the web to seed research and innovation
     • Currently 5 billion pages
     • Stored on Amazon's S3
     • Accessible via MapReduce processing in Amazon's EC2 compute cloud
     • Makes wholesale extraction, transformation, and analysis of web data cheap and easy
  9. Common Crawl Negatives
     • Not designed for human browsing but for machine access
     • Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
     • Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API
  10. Pandora Archive
      Positives:
      – Quality checked
      – Targeted Australian content with a selection policy
      – Historical: started 1996
      – Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
      – Keyword search
      – Publicly accessible
      – You can nominate Australian web sites for inclusion – registration_form.html
  11. Pandora Archive
      Negatives:
      – Labour-intensive, thus quite small
      – Significant content missed because permission to copy was refused
      • The situation will improve markedly if Legal Deposit provisions are extended to digital publications
      • Broader coverage will be achieved when infrastructure is upgraded, reducing the labour costs of checking/fixing crawls
  12. Pandora Archive Stats
      • Size: 6.32 TB
      • Number of files: > 140 million
      • Number of 'titles': > 30.5K
      • Number of title instances: > 73.5K
  13. Which archived sites are popular?
      • Measure: filtered, aggregated web access log data counting accesses to titles
      • Examined the top 30 archived titles (by number of accesses) for each year, 2009 to 2012
      • Selected some to examine, and speculated as to why they might be popular
      • Selected titles with consistently high rankings, and ones that varied greatly between years
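The per-year top-30 counting described above can be sketched with a simple aggregation (the log records below are invented placeholders, not Pandora data):

```python
from collections import Counter

# Invented placeholder records: (year, title of the archived site accessed).
access_log = [
    (2009, "Sydney 2000 Olympics"), (2009, "Sydney 2000 Olympics"),
    (2009, "Election 2007 blog"),
    (2010, "Election 2007 blog"), (2010, "Election 2007 blog"),
]

def top_titles(log, year, n=30):
    """Count accesses per archived title for one year; return the n busiest."""
    return Counter(title for y, title in log if y == year).most_common(n)

print(top_titles(access_log, 2009, n=2))
# [('Sydney 2000 Olympics', 2), ('Election 2007 blog', 1)]
```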
  14. Reasons for popularity of archived versions
      • Sites that were once popular and are now decommissioned, particularly if the domain name continues to exist and redirects to the archive
      • Sites that may not be that popular live, but whose live site links prominently to Pandora as an archive for their content
      • Popular referencing sources cite the archive as well as the live site (if it still exists)
  15. Improving visibility and usage of the Pandora archive
      • Articles about interesting content on the Australia Web Archives blog – http://
      • More effort to identify archived sites that are no longer 'live'
      • Market automatic redirect services to web site owners/managers
      • Allow Google to index archive content for 'non-live' sites (problematic)
      • Install Twittervane, which draws site nominations for archiving from trending Twitter topics
  16. .au Domain Annual Snapshots
      • Annual crawls since 2005, commissioned from the Internet Archive
      • Includes sites on servers located in Australia as well as the .au domain
      • Robots.txt respected, except for inline images and stylesheets
      • No public access – researcher access protocols are being developed
      • Full-text search – suited to searching archives
      • A separate .gov crawl will be publicly accessible soon
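The robots.txt handling mentioned above can be illustrated with Python's standard-library parser (the rules and the "heritrix" user-agent string below are made up; the "except inline images and stylesheets" exemption is crawler configuration, not part of the robots check itself):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (offline) instead of fetching one;
# the rules and the "heritrix" user-agent string are made up.
rules = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("heritrix", "http://example.com/page.html"))  # True
print(rp.can_fetch("heritrix", "http://example.com/private/x"))  # False
```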
  17. Australian web domain crawls

      Year        2005      2006        2007        2008        2009        2011
      Files       185 M     596 M       516 M       1 billion   765 M       660 M
      Hosts       811,523   1,046,038   1,247,614   3,038,658   1,074,645   1,346,549
      Size (TB)   6.69      19.04       18.47       34.55       24.29       30.71
  18. Internet Memory Foundation
      • A number of European partners
      • LiWA – Living Web Archives: next-generation web archiving methods and tools
      • LAWA – Longitudinal Analytics of Web Archive Data: experimental testbed for large-scale data analytics
      • ARCOMEM (Collect-All ARchives to COmmunity MEMories): leveraging social media for intelligent preservation
      • SCAPE – Scalable Preservation Environments
  19. Other National Archives
      • List of International Internet Preservation Consortium member archives –
      • Some are whole-domain archives, some are selective archives, many are both
      • Some have public access; for others you will need to negotiate access for research
      • Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (the WARC ISO format)
  20. Research Archives
      • California Digital Library
      • Harvard University Libraries
      • Columbia University Libraries
      • University of North Texas
      … and many more
      • WebCite – (citation service archive)
  21. Example: Columbia University
      • Member of the IIPC
      • They use the Archive-It service
      • A research library that sees web archiving as fundamental to its collecting
      • They complement and coordinate with other web archives
      • Their collecting focus is thematic – e.g. human rights, historic preservation, NY religious institutions
      • They also archive web content as part of personal and organisational archives (cf. manuscript collections)
      • They archive their own web site regularly
  22. Bringing Archives Together
      • Common standards and APIs
      • Memento project – adding time to the web
        – Aggregates CDX files (URL indexes) from multiple archives
        – Has a Firefox plug-in which allows time-based browsing
        – Initiative of Los Alamos Laboratories
        – See http://
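Memento's time-based access works by sending the desired moment to a TimeGate in an Accept-Datetime header (RFC 7089). A minimal sketch of building, but not sending, such a request; the "timetravel.example" TimeGate URL is hypothetical:

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

def timegate_request(timegate: str, original: str, when: datetime) -> Request:
    """Build (but do not send) a Memento TimeGate request: the desired
    moment travels in the Accept-Datetime header, in RFC 1123 format."""
    req = Request(timegate.rstrip("/") + "/" + original)
    req.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    return req

# "timetravel.example" is a hypothetical TimeGate, not a real endpoint.
req = timegate_request("http://timetravel.example/timegate",
                       "http://example.com/",
                       datetime(2012, 8, 1, tzinfo=timezone.utc))
print(req.headers["Accept-datetime"])
# Wed, 01 Aug 2012 00:00:00 GMT
```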
  23. Common Use Cases for a Web Archive
      • Content discovery
      • Nostalgia queries
      • Web site restoration and file recovery
      • Domain name valuation
      • Fall-back for link rot
      • Prior-art analysis and patent/copyright infringement research
      • Legal cases
      • Topic analysis, web trends analysis, popularity analysis, network analysis, linguistic analysis
  24. Create Your Own Archive
      • Use a subscription service
      • Build your own web archiving infrastructure with open-source software (i.e. Heritrix and Wayback)
      • Use web citation services that create archive copies as you bookmark pages
  25. Subscription Services
      • Archive-It (service operated by the non-profit Internet Archive since 2006)
      • The archiving service operated by the non-profit Internet Memory Foundation
      • California Digital Library Web Archiving Service
      • OCLC Harvester Service – webharvester/overview/default.htm
  26. Install a Web Archiving System Locally
      • An easy-to-deploy web archiving toolkit is not yet available
      • Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – though it needs IT systems engineers to set up
      • Archives can be deposited with the NLA for long-term preservation
  27. Personal Web Archiving
      • WARCreate – a recently released free tool which creates Wayback-consumable WARC files from any web page
      • Google Chrome extension
      • Enables preservation by users from their desktop
      • Can target content unreachable by crawlers
      • Brings WARC to personal digital archiving
      • What you do with the WARC files is up to you
      • An install suite is provided to set up a local Wayback instance and a Memento TimeGate
  28. Current Challenges
      • Database-driven features and functions
      • Complex and varying URI formats and non-standard link implementations, e.g. Twitter
      • Dynamically generated, ever-changing URIs
        – for serving the same resources
      • Rich media – e.g. streamed media with custom apps and anti-collection measures
      • Scripted incremental display and page loading
  29. … more …
      • Scripted HTML forms
      • Multi-sourced embedded material
      • Dynamic authentication, e.g. captchas, cross-site authentication, user-sensitive embeds
      • Alternate display based on browser, device, or other parameters
      • Site architecture designed to inhibit crawling and indexing – but if poorly done, even 'polite' harvesters like Heritrix may crash their server
  30. … but wait, there's more …
      • Server-side scripts and remote procedure calls – the full variety of paths through a site is now often hidden in remote/opaque server-side code; not a new problem, but it now affects 80+% of online resources
      • HTML5 WebSockets – effectively codify incremental updates without page reloads
      • Mobile publishing
  31. Transactional Web Archiving
      • Useful for institutional archiving
        – Best for record-keeping purposes, e.g. when challenged in court about content on a web site
        – Can be used to ensure URL persistence, e.g. when a site has a make-over it can intercept 404s
        – No 'gaps' (cf. the crawl approach) – every change in accessed content is archived
        – However, it requires a code snippet to be installed on the web server
        – Open-source software is being developed by Los Alamos Labs
  32. Web Data Mining & Analysis – What Is It? Why Do It?
      • Innovation is increasingly driven by large-scale data analysis
      • Fast iteration is needed to understand the right questions to ask
      • More minds able to contribute = more value (perceived and real) placed on the importance of the data
      • Increased demand for/value of the data = more funding to support it
      • Need to surface the information amongst all that data…
  33. Platform & Toolkit: Overview
      • Software
        – Apache Hadoop
        – Apache Pig
      • Data/file formats
        – WARC
        – CDX
        – WAT (new!)
  34. Apache Hadoop
      • HDFS
        – Distributed storage
        – Durable: default 3x replication
        – Scalable: Yahoo! runs 60+ PB on HDFS
      • MapReduce
        – Distributed computation
        – You write Java functions
        – Hadoop distributes the work across the cluster
        – Tolerates failures
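Hadoop's MapReduce model (map, shuffle by key, reduce) can be illustrated on a single machine. This toy Python version counts outlink targets across pages, standing in for the Java functions a real Hadoop job over WARC inputs would use:

```python
import re
from collections import defaultdict
from itertools import chain

def map_links(url, html):
    """Map step: emit (outlink-host, 1) for each link found in a page."""
    for host in re.findall(r'href="https?://([^/"]+)', html):
        yield (host, 1)

def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)

def run_job(pages, mapper, reducer):
    """Single-machine stand-in for MapReduce: map, shuffle (group by
    key), then reduce. Hadoop does the same over a cluster, with
    replication and failure tolerance."""
    groups = defaultdict(list)
    for k, v in chain.from_iterable(mapper(u, h) for u, h in pages):
        groups[k].append(v)  # the "shuffle" phase
    return dict(reducer(k, vs) for k, vs in groups.items())

pages = [("http://a.example/", '<a href="http://b.example/x">'),
         ("http://c.example/", '<a href="http://b.example/y">')]
print(run_job(pages, map_links, reduce_counts))
# {'b.example': 2}
```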
  35. File formats and data: WARC
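A WARC record is plain-text named headers, a blank line, then the captured bytes, with two trailing CRLFs. A minimal sketch of assembling one WARC/1.0 response record by hand (a real archive would use Heritrix or a WARC library rather than string assembly):

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri: str, http_bytes: bytes) -> bytes:
    """Assemble one WARC/1.0 'response' record: named headers, a blank
    line, the captured HTTP message, then two trailing CRLFs."""
    headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
        b"WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ").encode(),
        b"WARC-Target-URI: " + target_uri.encode(),
        b"Content-Type: application/http; msgtype=response",
        b"Content-Length: " + str(len(http_bytes)).encode(),
    ]
    return b"\r\n".join(headers) + b"\r\n\r\n" + http_bytes + b"\r\n\r\n"

record = warc_response_record(
    "http://example.com/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>")
print(record.startswith(b"WARC/1.0"))  # True
```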
  36. File formats and data: CDX
      • Index used to browse a WARC-based archive
      • Space-delimited text file
      • Only the essential metadata needed by Wayback:
        – URL
        – Content digest
        – Capture timestamp
        – Content-Type
        – HTTP response code
        – etc.
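Because CDX is a space-delimited text file, parsing one line is a simple split. The field order below is illustrative, matching the metadata listed above rather than the canonical CDX header line of any particular archive:

```python
# Field order is illustrative (matching the metadata listed on the slide),
# not the canonical CDX header line of any particular archive.
FIELDS = ["url", "timestamp", "content_type", "status", "digest"]

def parse_cdx_line(line: str) -> dict:
    """Split one space-delimited CDX index line into named fields."""
    return dict(zip(FIELDS, line.split()))

entry = parse_cdx_line("http://example.com/ 20120801000000 text/html 200 SHA1ABC123")
print(entry["timestamp"])  # 20120801000000
```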
  37. File formats and data: WAT
      • Yet another metadata format! ☺ ☹
      • Not a preservation format
      • For data exchange and analysis
      • Less than full WARC, more than CDX
      • Essential metadata for many types of analysis
      • Avoids barriers to data exchange: copyright, privacy
      • Work in progress: we want your feedback
  38. File formats and data: WAT
      • WAT is WARC ☺
        – WAT records are WARC metadata records
        – The WARC-Refers-To header identifies the original WARC record
      • The WAT payload is JSON
        – Compact
        – Hierarchical
        – Supported by every programming environment
      • Sizes for the same data:
        – CDX: 53 MB
        – WAT: 443 MB
        – WARC: 8,651 MB
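A sketch of what a compact, hierarchical WAT-style JSON payload might look like. The field names are loosely modelled on WAT's Envelope/Payload-Metadata layout but invented for this illustration; consult the published WAT documentation for the real schema:

```python
import json

# Field names are loosely modelled on WAT's Envelope/Payload-Metadata
# layout but invented for this sketch, not the published WAT schema.
wat_payload = {
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://example.com/",
            "WARC-Date": "2012-08-01T00:00:00Z",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "Headers": {"Content-Type": "text/html"},
                "HTML-Metadata": {"Links": ["http://b.example/"]},
            }
        },
    }
}

encoded = json.dumps(wat_payload, separators=(",", ":"))  # compact form
links = (json.loads(encoded)["Envelope"]["Payload-Metadata"]
         ["HTTP-Response-Metadata"]["HTML-Metadata"]["Links"])
print(links)  # ['http://b.example/']
```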
  39. Some References
      • http://
      • http://
      • Web Archives: The Future(s) – http:// 2011_06_IIPC_WebArchives-TheFutures.pdf
      • http://
      • Common Crawl: http:// data/accessing-the-data/
  40. Contacts
      • webarchive @
      • secretariat @
      • Queries about the Internet Archive web archive: http://
      • Queries about the Archive-It service: http://www.archive- -us
      • momodei @ (until 31 Aug 2012) or monica.omodei @