Internet Content as Research Data
Digital Humanities Australia, March 2012, Canberra
Monica Omodei & Gordon Mohr
Internet content as research data

  1. Internet Content as Research Data
     Digital Humanities Australia, March 2012, Canberra
     Monica Omodei & Gordon Mohr
  2. Research Examples
     • Social networking
     • Lexicography
     • Linguistics
     • Network Science
     • Political Science
     • Media Studies
     • Contemporary history
  3. Common Collection Strategies
     • Crawl Scope & Focus
       1) Thematic/Topical (elections, events, global warming…)
       2) Resource-specific (video, pdf, etc.)
       3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
       4) Exhaustive (end of life, closure crawls, national domains)
       5) Frequency-Based
     • Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
  4. Existing web archives
     • Internet Archive
     • Common Crawl
     • Pandora Archive
     • Internet Memory Foundation Archive
     • Other national archives
     • Research, University Library archives
  5. Internet Archive’s Web Archive: Positives
     – Very broad: 175+ billion web instances
     – Historic: started 1996
     – Publicly accessible
     – Time-based URL search
     – API access
     – Not constrained by legislation: covered by fair use and fast take-down response
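The "time-based URL search" and "API access" points above can be illustrated with a minimal sketch. The two endpoint patterns and the 14-digit timestamp format are not taken from the slides; they reflect the public Wayback Machine as commonly documented and should be treated as assumptions. The helper names are hypothetical, and the code only builds URLs, it does not contact the archive.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint patterns (not stated in the slides).
WAYBACK_REPLAY = "https://web.archive.org/web"
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def wayback_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a 14-digit YYYYMMDDhhmmss string in UTC."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def replay_url(page_url: str, moment: datetime) -> str:
    """Build a URL asking for the capture of page_url closest to the given time."""
    return f"{WAYBACK_REPLAY}/{wayback_timestamp(moment)}/{page_url}"


def availability_query(page_url: str, moment: datetime) -> str:
    """Build a query URL asking whether a capture exists near the given time."""
    params = urlencode({"url": page_url, "timestamp": wayback_timestamp(moment)})
    return f"{WAYBACK_AVAILABILITY}?{params}"


if __name__ == "__main__":
    when = datetime(2006, 1, 1, tzinfo=timezone.utc)
    print(replay_url("http://www.nla.gov.au/", when))
    print(availability_query("http://www.nla.gov.au/", when))
```

Keeping URL construction separate from fetching makes it easy to generate many time-stamped lookups for a list of research URLs before deciding how to retrieve them.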
  6. Internet Archive’s Web Archive: Negatives
     – Because of size, can’t search by keyword
     – Because of size, fully automated; QA not possible
  7. Common Use Cases for IA’s web archive
     • Content discovery
     • Nostalgia queries
     • Web site restoration and file recovery
     • Domain name valuation
     • Collaborative R&D
     • Prior art analysis and patent/copyright infringement research
     • Legal cases
     • Topic analysis, web trends analysis, popularity analysis
  8. Common Crawl
     • Non-profit foundation building an open crawl of the web to seed research and innovation
     • Currently 5 billion pages
     • Stored on Amazon’s S3
     • Accessible via MapReduce processing in Amazon’s EC2 compute cloud
     • Wholesale extraction, transformation, and analysis of web data cheap and easy
     • commoncrawl.org/data/accessing-the-data/
  9. Common Crawl: Negatives
     • Not designed for human browsing but for machine access
     • Objective is to support large-scale analysis and text mining/indexing, not long-term preservation
     • Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API
  10. Pandora Archive
     • Positives
       – Quality checked
       – Targeted Australian content with selection policy
       – Historical: started 1996
       – Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
       – Keyword search
       – Publicly accessible
       – You can nominate Australian web sites for inclusion: pandora.nla.gov.au/registration_form.html
  11. Pandora Archive
     • Negatives
       – Labour intensive, so small
       – Significant content missed because permission to copy refused
     • Situation will improve markedly if Legal Deposit provisions extended to digital publications
     • Broader coverage will be achieved when infrastructure is upgraded, hence reducing labour costs for checking/fixing crawls
  12. Pandora Archive Stats
     • Size: 6.32 TB
     • Number of files > 140 million
     • Number of ‘titles’ > 30.5K
     • Number of title instances > 73.5K
  13. .au Domain Annual Snapshots
     • Annual crawls since 2005 commissioned from Internet Archive
     • Includes sites on servers located in Australia as well as .au domain
     • Robots.txt respected except for inline images and stylesheets
     • No public access; researcher access protocols are being developed
     • Full text search, tailored to archive search
     • Separate .gov crawl publicly accessible soon
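The robots.txt rule described above (respected, except for inline images and stylesheets) could be expressed roughly as follows. This is only a sketch of the stated policy using Python's standard `urllib.robotparser`; it is not the crawler configuration actually used for the snapshots. The extension list, the `is_inline_resource` flag, the function name, and the sample robots.txt rules are all assumptions made for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumed set of extensions treated as inline images and stylesheets.
EXEMPT_EXTENSIONS = (".css", ".png", ".gif", ".jpg", ".jpeg")


def may_fetch(robots_lines, user_agent, url, is_inline_resource=False):
    """Return True if the policy allows fetching url.

    Inline images and stylesheets bypass robots.txt; everything else
    is checked against the supplied robots.txt lines.
    """
    path = urlparse(url).path.lower()
    if is_inline_resource and path.endswith(EXEMPT_EXTENSIONS):
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = ["User-agent: *", "Disallow: /private/"]
    agent = "examplebot"  # hypothetical crawler name
    print(may_fetch(robots, agent, "http://example.com.au/private/page.html"))
    print(may_fetch(robots, agent, "http://example.com.au/private/style.css",
                    is_inline_resource=True))
```

The exemption exists so that archived pages render properly on replay: a page that was allowed by robots.txt is of limited use if its stylesheet and images were excluded.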
  14. Australian web domain crawls

     Year            2005         2006         2007         2008         2009         2011
     Files           185 million  596 million  516 million  1 billion    765 million  660 million
     Hosts crawled   811,523      1,046,038    1,247,614    3,038,658    1,074,645    1,346,549
     Size (TBs)      6.69         19.04        18.47        34.55        24.29        30.71
  15. Internet Memory Foundation Archive
     • internetmemory.org/en/
     • No keyword search yet, only URL
     • Number of European partners
  16. Other National Archives
     • List of International Internet Preservation Consortium member archives: netpreserve.org/about/archiveList.php
     • Some are whole domain archives, some are selective archives, many are both
     • Some have public access, others you will need to negotiate access for research
     • Most archives have been collected using the heritrix open-source crawler and thus use the standard format (warc ISO format)
  17. Research Archives
     • California Digital Library
     • Harvard University Libraries
     • Columbia University Libraries
     • University of North Texas … and many more
     • WebCITE: webcitation.org (citation service archive)
  18. Bringing Archives Together
     • Common standard and APIs
     • Memento project
  19. Create your own Archive
     • Use a subscription service
     • Build your own archive using open-source crawler heritrix and standard file format .warc
     • Use web citation services that create archive copies as you bookmark pages
  20. Subscription Services
     • archive-it.org (service operated by non-profit Internet Archive since 2006)
     • archivethe.net (service operated by non-profit Internet Memory Foundation)
     • California Digital Library Web Archiving Service: cdlib.org/services/uc3/was.html
     • OCLC Harvester Service: oclc.org/webharvester/overview/default.htm
  21. Install web archiving system locally
     • Easy-to-deploy web archiving toolkit not yet available (that meets web archive standards)
     • Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers; needs IT systems engineers to set up though
     • Archives can be deposited with the NLA for long-term preservation
  22. Memento: adding time to the web
     Protocol and browser add-on (MementoFox)
     • Aids discovery, aggregation of page histories
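To make "adding time to the web" concrete: as the Memento protocol is generally described, a client asks for a past version of a page by sending an `Accept-Datetime` request header in HTTP date format. The slides do not name this header, so treat it and the hypothetical helper below as an assumption-laden sketch rather than a client implementation. The code only builds the header; it makes no network request.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(moment: datetime) -> dict:
    """Build the request header asking for a page as it existed at `moment`.

    `moment` must be timezone-aware; it is converted to UTC and rendered
    in the HTTP date format (for example "Sun, 01 Jan 2006 00:00:00 GMT").
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


if __name__ == "__main__":
    print(accept_datetime_header(datetime(2006, 1, 1, tzinfo=timezone.utc)))
```

Because the desired time travels in a header and not in the URL, the same original URL can be used against any archive that supports the protocol, which is what allows page histories to be aggregated across archives.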
  23. Web Data Mining & Analysis: What is it? Why do it?
     • Innovation is increasingly driven by large-scale data analysis
     • Need fast iteration to understand the right questions to ask
     • More minds able to contribute = more value (perceived and real) placed on the importance of the data
     • Increased demand for/value of the data = more funding to support it
     • Need to surface the information amongst all that data…
  24. Platform & Toolkit: Overview
     • Software
       – Apache Hadoop
       – Apache Pig
     • Data/File format
       – WARC
       – CDX
       – WAT (new!)
  25. Apache Hadoop
     • HDFS
       – Distributed storage
       – Durable, default 3x replication
       – Scalable: Yahoo! 60+ PB HDFS
     • MapReduce
       – Distributed computation
       – You write Java functions
       – Hadoop distributes work across cluster
       – Tolerates failures
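The MapReduce model above can be sketched in a few lines. This is a single-process illustration in Python, not Hadoop and not its Java API: it shows only the shape of the map, group-by-key, and reduce steps that Hadoop would spread across a cluster. The record layout and function names are hypothetical.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one (content_type, 1) pair per capture record."""
    yield record["content_type"], 1


def reduce_counts(key, values):
    """Reduce step: sum all the counts emitted for one key."""
    return key, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    captures = [
        {"url": "http://example.com.au/", "content_type": "text/html"},
        {"url": "http://example.com.au/logo.png", "content_type": "image/png"},
        {"url": "http://example.com.au/about", "content_type": "text/html"},
    ]
    print(run_map_reduce(captures, map_record, reduce_counts))
```

The researcher writes only the two small functions; the framework owns the loop, the grouping, and (in Hadoop's case) distribution and failure recovery.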
  26. File formats and data: WARC
  27. File formats and data: CDX
     • Index for Wayback Machine: used to browse WARC-based archive
     • Space-delimited text file
     • Only essential metadata needed by Wayback
       – URL
       – Content Digest
       – Capture Timestamp
       – Content-Type
       – HTTP response code
       – etc.
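Because CDX is a space-delimited text file, a line can be split into named fields with very little code. The sketch below assumes a five-column order chosen purely for illustration; real CDX files typically declare their own column order in a header line and carry more columns than shown here. The sample line and digest value are made up.

```python
# Assumed column order for this illustration only; real files declare their own.
CDX_FIELDS = ("url", "timestamp", "content_type", "status", "digest")


def parse_cdx_line(line, fields=CDX_FIELDS):
    """Split one space-delimited CDX line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


if __name__ == "__main__":
    sample = "nla.gov.au/ 20060101000000 text/html 200 EXAMPLEDIGEST"  # made-up line
    record = parse_cdx_line(sample)
    print(record["timestamp"], record["status"])
```

Passing the field list as a parameter lets the same function handle files whose header declares a different column order.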
  28. File formats and data: WAT
     • Yet Another Metadata Format! ☺ ☹
     • Not preservation format
     • Data exchange and analysis
     • Less than full WARC, more than CDX
     • Essential metadata for many types of analysis
     • Avoids barriers to data exchange: copyright, privacy
     • Work-in-progress: we want your feedback
  29. File formats and data: WAT
     • WAT is WARC ☺
       – WAT records are WARC metadata records
       – WARC-Refers-To header identifies original WARC record
     • WAT payload is JSON
       – Compact
       – Hierarchical
       – Supported by every programming environment
     File formats & data:
     • CDX: 53 MB
     • WAT: 443 MB
     • WARC: 8,651 MB
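Two points from this slide lend themselves to a short sketch: the relative sizes of the three formats, and the convenience of a JSON payload. The size figures are the ones listed on the slide, and the ratio assumes they describe the same collection. The JSON structure is a simplified, hypothetical stand-in (a top-level `links` list with `url` keys) and is not the actual WAT schema.

```python
import json

# Sizes in MB as listed on the slide.
SIZES_MB = {"CDX": 53, "WAT": 443, "WARC": 8651}


def share_of_warc(fmt):
    """Return the size of a format as a fraction of the full WARC size."""
    return SIZES_MB[fmt] / SIZES_MB["WARC"]


def extract_links(payload_text):
    """Read a JSON payload and return the URLs in its (hypothetical) 'links' list."""
    payload = json.loads(payload_text)
    return [link["url"] for link in payload.get("links", [])]


if __name__ == "__main__":
    for fmt in ("CDX", "WAT"):
        print(f"{fmt}: {share_of_warc(fmt):.1%} of WARC size")
    sample = '{"links": [{"url": "http://example.com.au/a"}, {"url": "http://example.com.au/b"}]}'
    print(extract_links(sample))
```

On these figures the metadata-only WAT is about one twentieth the size of the full WARC, which is what makes it practical to ship and analyse where the full content cannot be shared.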
  30. Some References
     • http://en.wikipedia.org/wiki/Web_archiving
     • http://netpreserve.org/about/archiveList.php
     • Web Archives: The Future(s): http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
  31. Contacts
     • Webarchive @ nla.gov.au
     • Secretariat @ internetmemory.org
     • Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
     • Queries about Archive-It service: http://www.archive-it.org/contact-us
     • momodei @ nla.gov.au
     • gojomo @ xavvy.com
