WebART: Facilitating Scholarly Use of Web Archives (IIPC, Apr. 2013)

Presentation at symposium “Scholarly Access to Web Archives: Progress, Requirements and Challenges”, IIPC, April 25, 2013 (Ljubljana, Slovenia). This presentation discusses the results of the WebART project’s first year, in which different research disciplines joined forces to tackle the issue of scholarly access to Web archives. It introduces WebARTist, a novel Web archive search interface, and discusses the potential of scholarly research using Web archives, as well as current barriers to success, based on the experiences gained during a pilot project.

  1. 1. WebART project: Web Archive Retrieval Tools. Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar. Image: Flickr, Luc Viatour. IIPC symposium “Scholarly Access to Web Archives”, Ljubljana, April 25, 2013
  2. 2. WebART project: Web Archive Retrieval Tools. “Facilitating Scholarly Use of Web Archives”. Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar. Image: Flickr, Luc Viatour. IIPC symposium “Scholarly Access to Web Archives”, Ljubljana, April 25, 2013
  3. 3. What are Web archives for?
  4. 4. 2012-2016
  5. 5. Thaer Samar (PhD/programmer), Hugo Huurdeman (PhD researcher), Anat Ben-David (Postdoc), Arjen de Vries, Jaap Kamps, Richard Rogers, Paul Doorenbosch, René Voorburg, Victor-Jan Vos
  6. 6. WebART Goals • Evaluating current curation and selection procedures of Web archives • Getting insights into current use of Web archives • Developing new methods and tools for research using Web archives
  7. 7. KB: Web archive since 2007 • Selective approach • Statistics: • 4,000+ websites • 17,000+ harvests • 7+ terabytes. Image: Flickr, koninklijkebibliotheek
  8. 8. KB: Web archive since 2007 • Selective approach • Statistics: • 4,000+ websites • 17,000+ harvests • 7+ terabytes. Original image: ANP
  9. 9. “Wayback Machine” interface
  10. 10. WebARTist (pilot, beta 1) • Initial dataset (corpus) • 432 crawls, 16 months (13.64 GB) • Full-text search engine: KB, CommonCrawl + nu.nl (Dutch news aggregator)
  11. 11. WebARTist: Use case • Digital Methods Winter School (Jan. ’13) • Co-design workshop (“Living Lab”) • Researchers & developers • First use of WebARTist
  12. 12. Word frequency analysis (chart: word frequency on an axis from 0 to 800, plotted by date from 17/05/2011 to 06/01/2013)
  13. 13. Co-Word Analysis
  14. 14. Outlink Analysis (figure: three panels, Syria, Sandy and US Election, each listing outlinked domains with their link frequencies, e.g. reuters.com, edition.cnn.com, bbc.co.uk)
  15. 15. Geomapping location Wire service
  16. 16. Temporal Image Analyses
  17. 17. Timeline
  18. 18. Use case analysis (1) • DMI Winter School • Analysis types performed: • Word frequency count, outlink frequency count • (Visual) co-word analysis • Geomapping • “Temporal analysis”
  19. 19. Use case analysis (2) • Analysis / visualization: DMI Dorling Map Tool, Gephi, Google Fusion Tables, Google Refine, TimelineJS • Data processing: Excel, Google Spreadsheets
  20. 20. Use case analysis (3) • Basic usage statistics for WebARTist (bar chart: percentage of queries, on an axis from 0 to 30, that used the date filter, site filter and collection filter)
  21. 21. Use case conclusions (1) • Data quality and quantity • Limited dataset, but many analysis types possible (daily news crawls) • Not always clear what’s in & what’s out • Crawl settings (e.g. depth), temporal gaps • Data expansion opportunity: • Combining datasets (but ...) • e.g. KB, CommonCrawl & IA • Callouts: completeness, inconsistencies
  22. 22. Use case conclusions (2) • Search system • Influence of retrieval algorithms & indexing settings • Recall & precision: precision issues • Feature request: duplicate handling • Interface • How to convey uncertainty? • How to convey advanced technical features? • e.g. advanced query mechanisms
  23. 23. Use case conclusions (3) • Users • High demand for export functions (formats) • (Un)familiarity with temporal (archive) search • Trying to use “current Web” tools (e.g. link analysis) that are not applicable to the “past Web” • “Users search as in (regular) Web search engines” (see also [Costa & Silva ’11])
  24. 24. Next steps WebART • New prototype ready (~3 TB) • Faceted search, thumbnail browsing, site categories & advanced metadata • Formal evaluation of pilot project • Web archive critique • Search system • Research scenarios & use cases
  25. 25. Future WebART search tools
  26. 26. webarchiving.nl @webart12
  27. 27. Summary • Introduction WebART & CATCH • Pilot project • WebARTist • DMI Winter School use case • Analysis & conclusions of the use case • The future
  28. 28. WebART project: Web Archive Retrieval Tools. Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar. Image: Flickr, Luc Viatour. IIPC symposium “Scholarly Access to Web Archives”, Ljubljana, April 25, 2013
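
The slides list an "outlink frequency count" among the analysis types performed at the DMI Winter School (slide 18) but do not show how it was computed. The following is a minimal sketch of one way such a count could be produced from archived HTML pages, using only the Python standard library. It is not WebARTist's implementation: the function names, the sample pages and the choice to count by host name are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlink_domains(html, source_host):
    """Return the external host names linked from one archived page."""
    parser = LinkCollector()
    parser.feed(html)
    hosts = []
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        # Relative links have no host; links back to the source site
        # are internal links, not outlinks.
        if host and host != source_host:
            hosts.append(host)
    return hosts


def outlink_frequencies(pages, source_host):
    """Count how often each external host is linked across all pages."""
    counts = Counter()
    for html in pages:
        counts.update(outlink_domains(html, source_host))
    return counts


# Hypothetical sample pages standing in for archived crawls of one site.
sample_pages = [
    '<a href="http://www.reuters.com/a">x</a> <a href="/intern">y</a>',
    '<a href="http://www.reuters.com/b">x</a> <a href="http://bbc.co.uk/c">z</a>',
    '<a href="http://www.nu.nl/d">self</a>',
]

frequencies = outlink_frequencies(sample_pages, "www.nu.nl")
for host, count in frequencies.most_common():
    print(count, host)
```

Counting by full host name keeps `edition.cnn.com` and `cnn.com` apart, which matches how the outlink lists on slide 14 appear to be labelled; grouping by registered domain instead would need a public suffix list and is left out of this sketch.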
