Internet content as research data
Transcript

  • 1. Internet Content as Research Data. Digital Humanities Australia, March 2012, Canberra. Monica Omodei & Gordon Mohr
  • 2. Research Examples
    • Social networking
    • Lexicography
    • Linguistics
    • Network Science
    • Political Science
    • Media Studies
    • Contemporary history
  • 3. Common Collection Strategies
    • Crawl Scope & Focus
      1) Thematic/Topical (elections, events, global warming…)
      2) Resource-specific (video, pdf, etc.)
      3) Broad survey (domain-wide for .com/.net/.org/.edu/.gov)
      4) Exhaustive (end-of-life/closure crawls, national domains)
      5) Frequency-Based
    • Key inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia
  • 4. Existing web archives
    • Internet Archive
    • Common Crawl
    • Pandora Archive
    • Internet Memory Foundation Archive
    • Other national archives
    • Research and university library archives
  • 5. Internet Archive's Web Archive
    Positives:
    – Very broad: 175+ billion web instances
    – Historic: started 1996
    – Publicly accessible
    – Time-based URL search
    – API access (see the sketch after this list)
    – Not constrained by legislation: covered by fair use and a fast take-down response
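To make the "API access" point concrete, here is a minimal sketch against the Wayback Machine's publicly documented availability endpoint, which returns the archived capture closest to a requested timestamp. The endpoint and response shape are an assumption drawn from the Internet Archive's current public documentation, not from the slides, and may postdate this 2012 talk.

```python
# Minimal sketch: ask the Wayback Machine for the capture of a URL
# closest to a given timestamp, via the documented availability API.
import json
from urllib.request import urlopen

endpoint = "http://archive.org/wayback/available?url=example.com&timestamp=20120301"
with urlopen(endpoint) as resp:
    data = json.load(resp)

# The response nests the nearest capture under archived_snapshots.closest.
closest = data.get("archived_snapshots", {}).get("closest", {})
print(closest.get("timestamp"), closest.get("url"))
```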
  • 6. Internet Archive's Web Archive
    Negatives:
    – Because of its size, keyword search is not possible
    – Because of its size, it is fully automated; QA is not possible
  • 7. Common Use Cases for IA's web archive
    • Content discovery
    • Nostalgia queries
    • Web site restoration and file recovery
    • Domain name valuation
    • Collaborative R&D
    • Prior art analysis and patent/copyright infringement research
    • Legal cases
    • Topic analysis, web trends analysis, popularity analysis
  • 8. Common Crawl
    • Non-profit foundation building an open crawl of the web to seed research and innovation
    • Currently 5 billion pages
    • Stored on Amazon's S3
    • Accessible via MapReduce processing in Amazon's EC2 compute cloud
    • Makes wholesale extraction, transformation, and analysis of web data cheap and easy
    • commoncrawl.org/data/accessing-the-data/
  • 9. Common Crawl Negatives
    • Not designed for human browsing but for machine access
    • The objective is to support large-scale analysis and text mining/indexing, not long-term preservation
    • Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API (see the sketch below)
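A minimal sketch of that Requester-Pays access path, using the AWS boto3 library for Python: the caller's AWS account is billed for the data transfer. The bucket name is Common Crawl's public one; the object key is hypothetical, since real paths are listed in the crawl indexes at commoncrawl.org.

```python
# Minimal sketch: fetch one Common Crawl object from S3 with
# Requester-Pays semantics using boto3 (assumes AWS credentials
# are configured and the transfer charges are acceptable).
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="commoncrawl",               # Common Crawl's public bucket
    Key="crawl-data/.../file.warc.gz",  # hypothetical key; see the crawl index for real paths
    RequestPayer="requester",           # accept the data-transfer charges
)
data = resp["Body"].read()              # gzipped WARC bytes
```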
  • 10. Pandora Archive
    • Positives:
      – Quality checked
      – Targeted Australian content with a selection policy
      – Historical: started 1996
      – Bibliocentric approach: web sites/publications selected for archiving are catalogued (see Trove)
      – Keyword search
      – Publicly accessible
      – You can nominate Australian web sites for inclusion: pandora.nla.gov.au/registration_form.html
  • 11. Pandora Archive
    • Negatives:
      – Labour intensive, so small
      – Significant content is missed because permission to copy is refused
    • The situation will improve markedly if Legal Deposit provisions are extended to digital publications
    • Broader coverage will be achieved when the infrastructure is upgraded, reducing the labour costs of checking and fixing crawls
  • 12. Pandora Archive Stats
    • Size: 6.32 TB
    • Number of files: > 140 million
    • Number of 'titles': > 30.5K
    • Number of title instances: > 73.5K
  • 13. .au Domain Annual Snapshots
    • Annual crawls since 2005, commissioned from the Internet Archive
    • Includes sites on servers located in Australia as well as the .au domain
    • robots.txt respected, except for inline images and stylesheets
    • No public access; researcher access protocols are being developed
    • Full-text search, tailored to archive search
    • A separate .gov crawl will be publicly accessible soon
  • 14. Australian web domain crawls

    Year   Files          Hosts crawled   Size (TB)
    2005   185 million        811,523         6.69
    2006   596 million      1,046,038        19.04
    2007   516 million      1,247,614        18.47
    2008   1 billion        3,038,658        34.55
    2009   765 million      1,074,645        24.29
    2011   660 million      1,346,549        30.71
  • 15. Internet Memory Foundation Archive
    • internetmemory.org/en/
    • No keyword search yet; URL search only
    • A number of European partners
  • 16. Other National Archives
    • List of International Internet Preservation Consortium member archives: netpreserve.org/about/archiveList.php
    • Some are whole-domain archives, some are selective archives, many are both
    • Some have public access; for others you will need to negotiate access for research
    • Most archives have been collected using the Heritrix open-source crawler and thus use the standard WARC (ISO) format
  • 17. Research Archives
    • California Digital Library
    • Harvard University Libraries
    • Columbia University Libraries
    • University of North Texas … and many more
    • WebCite: webcitation.org (citation service archive)
  • 18. Bringing Archives Together
    • Common standards and APIs
    • Memento project
  • 19. Create your own Archive
    • Use a subscription service
    • Build your own archive using the open-source crawler Heritrix and the standard .warc file format (a small-scale alternative is sketched below)
    • Use web citation services that create archive copies as you bookmark pages
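Heritrix is the standard crawler for serious collections, but for capturing a handful of pages a scripting library can also write standards-compliant WARC files. A minimal sketch using the third-party Python warcio library (an assumption; the slides do not name a toolkit), which records live HTTP traffic into a .warc.gz:

```python
# Minimal sketch: capture a few pages into a standard .warc.gz file
# with warcio. A toy alternative to Heritrix, not a replacement.
from warcio.capture_http import capture_http
import requests  # per warcio's docs, import requests after capture_http

with capture_http("my_archive.warc.gz"):     # illustrative filename
    requests.get("http://example.com/")      # each request/response pair
    requests.get("http://example.com/page")  # is written as WARC records
```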
  • 20. Subscription Services
    • archive-it.org (service operated by the non-profit Internet Archive since 2006)
    • archivethe.net (service operated by the non-profit Internet Memory Foundation)
    • California Digital Library Web Archiving Service: cdlib.org/services/uc3/was.html
    • OCLC Harvester Service: oclc.org/webharvester/overview/default.htm
  • 21. Install a web archiving system locally
    • An easy-to-deploy web archiving toolkit that meets web archive standards is not yet available
    • Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers, though it needs IT systems engineers to set up
    • Archives can be deposited with the NLA for long-term preservation
  • 22. Memento: adding time to the web
    • Protocol and browser add-on (MementoFox)
    • Aids discovery and aggregation of page histories (see the sketch below)
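The Memento protocol adds datetime negotiation to HTTP: a client asks a TimeGate for the capture of a URL closest to a desired date via the Accept-Datetime header, and is redirected to that memento. A minimal sketch; the Wayback Machine TimeGate URL pattern below is an assumption based on the Internet Archive's current Memento support.

```python
# Minimal sketch of Memento datetime negotiation: request the capture
# of a URL nearest a given date from a TimeGate and inspect the redirect.
import requests

resp = requests.get(
    "http://web.archive.org/web/http://example.com/",  # assumed TimeGate URL pattern
    headers={"Accept-Datetime": "Thu, 01 Mar 2012 00:00:00 GMT"},
    allow_redirects=False,
)
print(resp.status_code, resp.headers.get("Location"))  # redirect to the chosen memento
```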
  • 23. Web Data Mining & Analysis: what is it? Why do it?
    • Innovation is increasingly driven by large-scale data analysis
    • Fast iteration is needed to understand the right questions to ask
    • More minds able to contribute = more value (perceived and real) placed on the importance of the data
    • Increased demand for/value of the data = more funding to support it
    • The information needs to be surfaced from amongst all that data…
  • 24. Platform & Toolkit: Overview
    • Software:
      – Apache Hadoop
      – Apache Pig
    • Data/file formats:
      – WARC
      – CDX
      – WAT (new!)
  • 25. Apache Hadoop
    • HDFS:
      – Distributed storage
      – Durable: 3x replication by default
      – Scalable: Yahoo! runs 60+ PB of HDFS
    • MapReduce:
      – Distributed computation
      – You write Java functions (a streaming sketch in Python follows below)
      – Hadoop distributes the work across the cluster
      – Tolerates failures
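The slides describe MapReduce in terms of Java functions; Hadoop Streaming accepts mappers and reducers in any language, so here is a minimal Python sketch in that style. It counts captures per content type from CDX-style index lines; the assumption that the content type is the fourth field is illustrative, since real CDX files declare their own column order.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming sketch: count CDX captures per content type.
# Hadoop runs the mapper over input splits, sorts mapper output by key,
# then feeds the grouped lines to the reducer.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 3 and not line.startswith(" CDX"):
            print(f"{fields[3]}\t1")  # assumed: 4th field is Content-Type

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A job is then submitted with the streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input cdx/ -output counts/ -mapper "python3 cdx_count.py map" -reducer "python3 cdx_count.py reduce" -file cdx_count.py (the jar path and the script name cdx_count.py are illustrative).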
  • 26. File formats and data: WARC
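The WARC format stores each capture as a typed record (request, response, metadata, and so on) with headers and a payload. A minimal reading sketch using the third-party Python warcio library (an assumption; any WARC reader would do):

```python
# Minimal sketch: iterate over the records of a (possibly gzipped)
# WARC file and print the URL and size of each captured response.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:  # illustrative filename
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))
```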
  • 27. File formats and data: CDX
    • Index for the Wayback Machine: used to browse a WARC-based archive
    • Space-delimited text file
    • Only the essential metadata needed by Wayback:
      – URL
      – Content digest
      – Capture timestamp
      – Content-Type
      – HTTP response code
      – etc.
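Because CDX is just space-delimited text, it is easy to process directly. A minimal parsing sketch; the fixed five-field order below is an assumption for illustration, since real CDX files begin with a header line whose one-letter codes name the actual columns.

```python
# Minimal sketch: stream a CDX index into dicts, one per capture.
FIELDS = ["url", "timestamp", "digest", "content_type", "status"]  # assumed order

def parse_cdx(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(" CDX"):  # header line naming the real columns
                continue
            yield dict(zip(FIELDS, line.split()))

for entry in parse_cdx("index.cdx"):     # illustrative filename
    print(entry["timestamp"], entry["url"])
```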
  • 28. File formats and data: WAT
    • Yet another metadata format! ☺ ☹
    • Not a preservation format
    • For data exchange and analysis
    • Less than full WARC, more than CDX
    • Essential metadata for many types of analysis
    • Avoids barriers to data exchange: copyright, privacy
    • Work in progress: we want your feedback
  • 29. File formats and data: WAT
    • WAT is WARC ☺
      – WAT records are WARC metadata records
      – A WARC-Refers-To header identifies the original WARC record
    • The WAT payload is JSON:
      – Compact
      – Hierarchical
      – Supported by every programming environment
    • Relative sizes: CDX 53 MB; WAT 443 MB; WARC 8,651 MB
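Since a WAT file is itself a WARC file whose metadata records carry JSON payloads, a WARC reader plus a JSON parser is enough to consume it. A minimal sketch with the third-party warcio library (an assumption); the Envelope/WARC-Header-Metadata key path follows the published WAT layout but should be verified against real data.

```python
# Minimal sketch: read a WAT file as WARC metadata records and pull
# the original capture URL out of each record's JSON payload.
import json
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wat.gz", "rb") as stream:  # illustrative filename
    for record in ArchiveIterator(stream):
        if record.rec_type == "metadata":
            wat = json.loads(record.content_stream().read())
            headers = wat.get("Envelope", {}).get("WARC-Header-Metadata", {})
            print(headers.get("WARC-Target-URI"))
```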
  • 30. Some References
    • http://en.wikipedia.org/wiki/Web_archiving
    • http://netpreserve.org/about/archiveList.php
    • Web Archives: The Future(s): http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
  • 31. Contacts
    • webarchive @ nla.gov.au
    • secretariat @ internetmemory.org
    • Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
    • Queries about the Archive-It service: http://www.archive-it.org/contact-us
    • momodei @ nla.gov.au
    • gojomo @ xavvy.com
