WebART - "Data Digging" - eHumanities Group 2013


Presentation given at eHumanities Group, Meertens Institute, Amsterdam (Sept. 2013)

Published in: Education, Technology, Design

  1. WebART project: Web Archive Retrieval Tools. Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar. Flickr: Luc Viatour. eHumanities Group, “New Trends in eHumanities”, Sept. 19 2013, Meertens Institute
  2. WebART project: Web Archive Retrieval Tools. Data Diggin’ @ KB. Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar. Flickr: Luc Viatour. eHumanities Group, “New Trends in eHumanities”, Sept. 19 2013, Meertens Institute
  3. Contents • The WebART project & KB Web archive • Data Diggin’ @ KB • Analysis • Digging Towards the Future
  4. 2012-2016
  5. Thaer Samar (PhD/programmer), Hugo Huurdeman (PhD researcher), Anat Ben-David (Postdoc), Arjen de Vries, Jaap Kamps, Richard Rogers, Paul Doorenbosch, Hildelies Balk, Victor-Jan Vos, René Voorburg
  6. WebART Goals • Evaluating current curation and selection procedures of Web archives • Getting insights into current use of Web archives • Developing new methods and tools for research using Web archives
  7. What are Web archives for?
  8. KB: Web archive since 2007. Statistics: • 4,000+ websites • 17,000+ harvests • 7+ Terabyte. Selective approach. Flickr: koninklijkebibliotheek
  9. KB: Web archive since 2007. Statistics: • 4,000+ websites • 17,000+ harvests • 7+ Terabyte. Selective approach. Original image: ANP
  10. “Wayback Machine” interface
  11. Data Diggin’ @ KB • DMI Summer School (2012): analysis of selection lists KB • DMI Winter School (2013): use of nu.nl daily harvests KB dataset • Workshop: Sept ’11 Day (2013): use of full Web archive KB dataset
  12. Data digging, part 1: DMI Summer School (2012). Data: Selection lists KB. Toolset: Web-based tools. Flickr: Silvertje
  13. DMI Summer School (2012)
  14. Data digging, part II • Digital Methods Winter School (Jan. ’13) • Co-design workshop (“Living Lab”) • New Media researchers & developers • first use of WebARTist. Data: nu.nl daily harvests. Toolset: Full-text search, Web-based tools
  15. Data digging, part II: Full-text search • Full-text search: WebARTist (pilot - beta 1) • Initial dataset (corpus) • 432 crawls, 16 months (13.64 GB). KB CommonCrawl+ nu.nl (Dutch news aggregator)
  16. Full-text search
  17. Full-text search
  18. Full-text search
  19. Full-text search
  20. Word frequency analysis [chart: vertical axis 0 to 800; horizontal axis dates 17/05/2011 to 06/01/2013]
  21. Co-Word Analysis
  22. Outlink Analysis: 1 abcnews.go.com 1 brucespringsteen.net 1 theverge.com 1 sportamerika.nl 1 reuters.com 1 ebird.org 1 googleblog.blogspot.co.uk 1 presscentre.sony.eu 1 project.wnyc.org 1 bbc.com 1 poynter.org 1 abclocal.go.com 1 en.wikipedia.org 1 nhc.noaa.gov 1 nypost.com 2 earthcam.com 2 maps.google.com 3 hp.com 4 google.org 4 edition.cnn.com Syria Sandy 7 wired.com 7 allthingsd.com 7 abcnews.go.com 7 thesun.co.uk 7 allesoversterrenkunde.nl 8 volkskrant.nl 9 fd.nl 9 nos.nl 9 mobiel.nuvideo.nl 9 guardian.co.uk 10 bit.ly 10 billboard.biz 10 cbsnews.com 11 usmagazine.com 11 variety.com 12 theverge.com 12 people.com 13 Rutte en Verhagen leggen schuld bij PVV 13 telegraaf.nl 14 washingtonpost.com 18 edition.cnn.com 19 bbc.co.uk 20 youtube.com 20 nytimes.com 21 styletoday.nl 21 bloomberg.com 24 thesistools.com 26 hollywoodreporter.com 30 online.wsj.com 30 deadline.com 33 poll.nupubliek.nl 34 spaarrente.nl 39 gamer.nl 48 reuters.com 52 tmz.com 57 open.spotify.com 78 peil.nl 93 gezondheidsnet.nl US Election 4 1 blogs.aljazeera.net 1 youtube.com 1 worldpressphoto.org 1 wikileaks.org 1 washingtonpost.com 1 eubusiness.com 1 vesti.bg 1 trouw.nl 1 #NAME 1 en.wikipedia.org 1 l 1 sana.sy 1 hosted.ap.org 1 shariah4belgium.com 1 nrc.nl 1 guardian.co.uk 1 geopolicity.com 1 nctb.nl 1 rt.com 1 kaspersky.com 2 todayszaman.com 2 volkskrant.nl 2 spaarrente.nl 2 reuters.com 2 peil.nl 2 hrw.org 2 uk.reuters.com 2 cbsnews.com 3 telegraph.co.uk 3 maps.google.nl 4 bbc.co.uk 5 edition.cnn.com 5 aljazeera.com english.alarabiya.net 7 maps.google.com
  23. Geomapping location Wire service
  24. Temporal Image Analyses
  25. Timeline
  26. Data digging, part III: DMI “9/11 Day” (2013). Datasets: Full KB Archive. Toolset: Web-based tools, nu.nl “host+1”, Full-text search+, Geo-index
  27. Full-text search+
  28. Full-text search+
  29. Full-text search+
  30. Data digging, part III: DMI “9/11 Day” (2013) • New Media researchers’ interests: • “derive periodizations of the Web” (Web history) • “source hierarchy” (dominant sources in archive) • “keyword uptake” (terms over time), e.g. ‘geenstijl language in archive’ • “accidental”/“incidental” archiving, e.g. ‘the guilty pleasures of the Web of innocence’
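  31. [Figure with year labels: 2009, 2010, 2011, 2012]
  32. [Figure with year labels: 2009, 2010, 2011, 2012]
  33. [Figure with year labels: 2009, 2010, 2011, 2012]
  34. [Figure with year labels: 2009, 2010, 2011, 2012]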
  35. Analysis (1) • studying the ‘archive’ vs. the ‘archived content’ • researchers’ (un)familiarity with temporal (archive) search • “conditioned” to Google-style searching • high demand for export functions and aggregation features
  36. Analysis (2) • “data is still a crucial factor” • quantity & quality: inherent incompleteness & inconsistencies • not always clear what’s in & what’s out • crawl settings (e.g. depth), temporal gaps • “researchers always want what isn’t there”
  37. Digging towards the future. Datasets: Full KB Archive. Toolset: “Toolmaker’s tools” ++
  38. A step further... • Build customizable systems, or, toolmakers’ tools • Provide building blocks
  39. A step further... use “Hadoop” computing power to build custom dataset, perform high-level analysis, etc.
  40. New tools: examples • select, “clean”, filter & process dataset • employ complex queries & search strategies • search, summarize, aggregate & share
  41. Moving beyond mere “search”: Wayback Machine, Search engine, “Research” engine: explicit support for full research task, including analysis and synthesis steps
  42. Summary • The WebART project • Data Diggin’ @ KB • Analysis • Digging Towards the Future
  43. webarchiving.nl @webart12
  44. WebART project: Web Archive Retrieval Tools. Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar. Flickr: Luc Viatour. eHumanities Group, New Trends in eHumanities, Sept. 19 2013, Meertens Institute
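
The outlink analysis on slide 22 tallies which external hosts archived nu.nl pages link to. The slides do not say how this was computed, so the following is only a minimal sketch of that kind of count, using Python's standard library. The function name `count_outlink_hosts`, the `(page_url, html)` input format, and the `source_host` parameter are illustrative assumptions and not part of the WebART tooling.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_outlink_hosts(pages, source_host="nu.nl"):
    """Count links to external hosts in (page_url, html) pairs.

    Hypothetical helper: links to source_host or its subdomains are
    treated as internal and skipped, as are links without a host
    (for example mailto: links).
    """
    counts = Counter()
    for page_url, html in pages:
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links against the page's own URL.
            host = urlparse(urljoin(page_url, href)).hostname
            if host is None:
                continue
            if host == source_host or host.endswith("." + source_host):
                continue
            counts[host] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample pages, for illustration only.
    sample = [
        ("http://www.nu.nl/a.html",
         '<a href="http://www.reuters.com/x">wire</a> <a href="/intern">nu</a>'),
        ("http://www.nu.nl/b.html",
         '<a href="http://www.reuters.com/y">wire</a>'),
    ]
    for host, n in count_outlink_hosts(sample).most_common():
        print(n, host)
```

Running the same count over the crawls matching each query term (such as the three topics named on the slide) would give one ranked host list per topic.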
