WebART - "Data Digging" - eHumanities Group 2013

  1. WebART project Web Archive Retrieval Tools Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar Flickr: Luc Viatour eHumanities Group, “New Trends in eHumanities”, Sept. 19 2013, Meertens Institute
  2. WebART project Web Archive Retrieval Tools Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar Flickr: Luc Viatour Data Diggin’ @ KB eHumanities Group, “New Trends in eHumanities”, Sept. 19 2013, Meertens Institute
  3. Contents • The WebART project & KB Web archive • Data Diggin’ @ KB • Analysis • Digging Towards the Future
  4. 2012-2016
  5. Thaer Samar (PhD/programmer), Hugo Huurdeman (PhD researcher), Anat Ben-David (Postdoc), Arjen de Vries, Jaap Kamps, Richard Rogers, Paul Doorenbosch, Hildelies Balk, Victor-Jan Vos, René Voorburg
  6. WebART Goals •Evaluating current curation and selection procedures of Web archives •Getting insights into current use of Web archives •Developing new methods and tools for research using Web archives
  7. What are Web archives for?
  8. Flickr: koninklijkebibliotheek KB: Web archive since 2007 Statistics: • 4,000+ websites • 17,000+ harvests • 7+ Terabyte. Selective approach
  9. KB: Web archive since 2007 Statistics: • 4,000+ websites • 17,000+ harvests • 7+ Terabyte. Selective approach. Original image: ANP
  10. ”Wayback Machine” interface
  11. Data Diggin’ @ KB •DMI Summer School (2012) • analysis of selection lists KB •DMI Winter School (2013) • use of nu.nl daily harvests KB dataset •Workshop: Sept ‘11 Day (2013) • use of full Web archive KB dataset
  12. DMI Summer School (2012) Data digging, part 1 Data: Selection lists KB Toolset: Web-based tools Flickr: Silvertje
  13. DMI Summer School (2012)
  14. Data digging, part II Data: nu.nl daily harvests Toolset: Full-text search, Web-based tools • Digital Methods Winter School (Jan. ’13) • Co-design workshop (“Living Lab”) • New Media researchers & developers • first use WebARTist
  15. Data digging, part II Full-text search • Full-text search: WebARTist (pilot - beta 1) • Initial dataset (corpus) • 432 crawls, 16 months (13.64 GB) KB CommonCrawl+ nu.nl (Dutch news aggregator)
  16. Full-text search
  17. Full-text search
  18. Full-text search
  19. Full-text search
  20. Word frequency analysis (chart: value axis from 0 to 800, date axis from 17/05/2011 to 06/01/2013)
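A word frequency analysis over dated harvests could be sketched as follows. This is a minimal illustration, not the WebARTist implementation: the function name `term_frequency_by_date`, the `(date, text)` input format, and the sample texts are all assumptions made for the example.

```python
import re
from collections import Counter


def term_frequency_by_date(snapshots, term):
    """Count whole-word, case-insensitive occurrences of `term` per crawl date.

    `snapshots` is an iterable of (iso_date, page_text) pairs; this input
    format is hypothetical and chosen only for the sketch.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for date, text in snapshots:
        counts[date] += len(pattern.findall(text))
    # ISO dates sort chronologically when sorted as strings.
    return dict(sorted(counts.items()))


# Hypothetical sample data standing in for archived page text.
snapshots = [
    ("2012-10-29", "Sandy nadert New York. Orkaan Sandy bereikt de kust."),
    ("2012-10-29", "Verkiezingen in de VS."),
    ("2012-10-30", "Schade na Sandy."),
]
freq = term_frequency_by_date(snapshots, "sandy")
print(freq)
```

Plotting the resulting counts against the crawl dates gives a term-over-time chart of the kind shown on this slide.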
  21. Co-Word Analysis
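One common way to do co-word analysis is to count how often two terms appear in the same document. The sketch below assumes that document-level co-occurrence is the unit of analysis, which the slide does not specify; the function name `co_word_counts`, the stopword handling, and the sample documents are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def co_word_counts(documents, stopwords=frozenset()):
    """Count unordered term pairs that occur together in the same document."""
    pair_counts = Counter()
    for text in documents:
        # Use a set so each term counts once per document.
        terms = set(re.findall(r"\w+", text.lower())) - stopwords
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical sample documents.
docs = [
    "Sandy storm New York",
    "storm damage New York",
    "Sandy damage",
]
co_words = co_word_counts(docs, stopwords=frozenset({"new"}))
print(co_words.most_common(3))
```

The pair counts can be exported as a weighted edge list and loaded into a network visualization tool.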
  22. Outlink Analysis (figure: number of outlinks per linked host for three topics: Syria, Sandy, US Election)
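An outlink analysis of this kind can be approximated by extracting the `href` targets from archived pages and counting the external hosts. The sketch below uses only the Python standard library; the names `LinkExtractor` and `outlink_hosts` and the sample HTML are assumptions for illustration, not part of the project's tooling.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlink_hosts(html, source_host):
    """Count links to hosts other than `source_host`; relative links are skipped."""
    parser = LinkExtractor()
    parser.feed(html)
    hosts = Counter()
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        if host and host != source_host:
            hosts[host] += 1
    return hosts


# Hypothetical page fragment with two external, one internal and one relative link.
page = (
    '<p><a href="http://edition.cnn.com/a">CNN</a> '
    '<a href="http://www.nu.nl/b">intern</a> '
    '<a href="/c">relatief</a> '
    '<a href="http://edition.cnn.com/d">CNN</a></p>'
)
hosts = outlink_hosts(page, source_host="www.nu.nl")
print(hosts.most_common())
```

Running this over all pages that match a query (for example, a topic keyword) and summing the counters would produce a per-topic ranking of linked hosts.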
  23. Geomapping location Wire service
  24. Temporal Image Analyses
  25. Timeline
  26. DMI “9/11 Day” (2013) Data digging, part III Datasets: Full KB Archive Toolset: Web-based tools nu.nl “host+1” Full-text search+ Geo-index
  27. Full-text search+
  28. Full-text search+
  29. Full-text search+
  30. DMI “9/11 Day” (2013) Data digging, part III • New Media researchers’ interests: • “derive periodizations of the Web” (Web history) • “source hierarchy” (dominant sources in archive) • “keyword uptake” (terms over time) • e.g. ‘geenstijl language in archive’ • “accidental”/“incidental” archiving • e.g. ‘the guilty pleasures of the Web of innocence’
  31. 2009 2010 2011 2012
  32. 2009 2010 2011 2012
  33. 2009 2010 2011 2012
  34. 2009 2010 2011 2012
  35. Analysis (1) • studying the ‘archive’ vs. the ‘archived content’ • researchers’ (un)familiarity with temporal (archive) search • “conditioned” to Google-style searching • high demand for export functions and aggregation features
  36. Analysis (2) • “data is still a crucial factor” • quantity & quality: inherent incompleteness & inconsistencies • not always clear what’s in & what’s out • crawl settings (e.g. depth), temporal gaps • “researchers always want what isn’t there”
  37. Digging towards the future Datasets: Full KB Archive Toolset: “Toolmaker’s tools” ++
  38. A step further... •Build customizable systems, or, toolmakers’ tools •Provide building blocks
  39. A step further... use “Hadoop” computing power to build custom dataset, perform high-level analysis, etc.
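The map and reduce steps behind building a custom dataset summary can be illustrated in plain Python. This sketch does not use any Hadoop API; it only mimics the pattern locally. The record fields `url` and `crawl_date` and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_record(record):
    """Map step: emit ((host, year), 1) for one archived page record."""
    host = urlparse(record["url"]).netloc.lower()
    year = record["crawl_date"][:4]
    yield (host, year), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical records standing in for entries of a Web archive index.
records = [
    {"url": "http://www.nu.nl/a", "crawl_date": "2011-05-17"},
    {"url": "http://www.nu.nl/b", "crawl_date": "2011-08-25"},
    {"url": "http://www.nu.nl/c", "crawl_date": "2012-03-12"},
]
emitted = (pair for record in records for pair in map_record(record))
pages_per_host_year = reduce_counts(emitted)
print(pages_per_host_year)
```

On a cluster, the framework would distribute the map calls over the archive files and group the emitted keys before the reduce step; the logic per record stays the same.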
  40. New tools: examples • select, “clean”, filter & process dataset • employ complex queries & search strategies • search, summarize, aggregate & share
  41. Moving beyond mere “search”: Wayback Machine, Search engine, “Research” engine (explicit support for full research task, including analysis and synthesis steps)
  42. Summary • The WebART project • Data Diggin’ @ KB • Analysis • Digging Towards the Future
  43. webarchiving.nl @webart12