WebART - "Data Digging" - eHumanities Group 2013

Presentation given at eHumanities Group, Meertens Institute, Amsterdam (Sept. 2013)

Presentation Transcript

  • WebART project: Web Archive Retrieval Tools. Data Diggin’ @ KB. Jaap Kamps, Richard Rogers, Arjen de Vries, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, Anat Ben-David, Hugo Huurdeman, Thaer Samar. eHumanities Group, “New Trends in eHumanities”, Sept. 19 2013, Meertens Institute. (Image: Flickr, Luc Viatour)
  • Contents • The WebART project & KB Web archive • Data Diggin’ @ KB • Analysis • Digging Towards the Future
  • 2012-2016
  • Thaer Samar (PhD/programmer), Hugo Huurdeman (PhD researcher), Anat Ben-David (Postdoc), Arjen de Vries, Jaap Kamps, Richard Rogers, Paul Doorenbosch, Hildelies Balk, Victor-Jan Vos, René Voorburg
  • WebART Goals • Evaluating current curation and selection procedures of Web archives • Getting insights into current use of Web archives • Developing new methods and tools for research using Web archives
  • What are Web archives for?
  • KB: Web archive since 2007. Statistics: • 4,000+ websites • 17,000+ harvests • 7+ terabytes. Selective approach. (Images: Flickr, koninklijkebibliotheek; ANP)
  • ”Wayback Machine” interface
  • Data Diggin’ @ KB • DMI Summer School (2012): analysis of selection lists KB • DMI Winter School (2013): use of nu.nl daily harvests KB dataset • Workshop: Sept. 11 Day (2013): use of full Web archive KB dataset
  • Data digging, part I: DMI Summer School (2012). Data: selection lists KB. Toolset: Web-based tools. (Image: Flickr, Silvertje)
  • DMI Summer School (2012)
  • Data digging, part II • Digital Methods Winter School (Jan. ’13) • Co-design workshop (“Living Lab”) • New Media researchers & developers • First use of WebARTist. Data: nu.nl daily harvests. Toolset: full-text search, Web-based tools
  • Data digging, part II: full-text search • Full-text search: WebARTist (pilot, beta 1) • Initial dataset (corpus): nu.nl (Dutch news aggregator) • 432 crawls, 16 months (13.64 GB). Diagram labels: KB, CommonCrawl+
  • Full-text search
  • Word frequency analysis (chart; y-axis 0 to 800, x-axis dates from 17/05/2011 to 06/01/2013)
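A word frequency analysis over time, as on the slide above, can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `crawls` sample, the `tokenize` helper, and the `term_frequency_by_date` function are hypothetical names, and the texts are invented rather than taken from the nu.nl dataset.

```python
import re
from collections import Counter

# Hypothetical sample: crawl date -> extracted page text (invented, not project data).
crawls = {
    "2011-05-17": "Storm warning issued. Storm expected tonight.",
    "2012-10-29": "Sandy storm hits coast; storm damage reported. Storm surge.",
    "2013-01-06": "Recovery continues after the autumn floods.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency_by_date(crawls, term):
    """Return {date: number of occurrences of `term`}, ordered by date."""
    term = term.lower()
    return {
        date: Counter(tokenize(text))[term]
        for date, text in sorted(crawls.items())
    }


if __name__ == "__main__":
    for date, count in term_frequency_by_date(crawls, "storm").items():
        print(date, count)
```

Plotting the resulting counts against the crawl dates would give a curve of the kind the slide shows. A real corpus would also need language-aware tokenization, which this sketch skips.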
  • Co-Word Analysis
  • Outlink Analysis: outlink counts per target (count, then host). Topic labels in the chart: Syria, Sandy, US Election. The three lists follow in extraction order.
    List 1: 1 abcnews.go.com, 1 brucespringsteen.net, 1 theverge.com, 1 sportamerika.nl, 1 reuters.com, 1 ebird.org, 1 googleblog.blogspot.co.uk, 1 presscentre.sony.eu, 1 project.wnyc.org, 1 bbc.com, 1 poynter.org, 1 abclocal.go.com, 1 en.wikipedia.org, 1 nhc.noaa.gov, 1 nypost.com, 2 earthcam.com, 2 maps.google.com, 3 hp.com, 4 google.org, 4 edition.cnn.com
    List 2: 7 wired.com, 7 allthingsd.com, 7 abcnews.go.com, 7 thesun.co.uk, 7 allesoversterrenkunde.nl, 8 volkskrant.nl, 9 fd.nl, 9 nos.nl, 9 mobiel.nuvideo.nl, 9 guardian.co.uk, 10 bit.ly, 10 billboard.biz, 10 cbsnews.com, 11 usmagazine.com, 11 variety.com, 12 theverge.com, 12 people.com, 13 “Rutte en Verhagen leggen schuld bij PVV” [Rutte and Verhagen blame the PVV], 13 telegraaf.nl, 14 washingtonpost.com, 18 edition.cnn.com, 19 bbc.co.uk, 20 youtube.com, 20 nytimes.com, 21 styletoday.nl, 21 bloomberg.com, 24 thesistools.com, 26 hollywoodreporter.com, 30 online.wsj.com, 30 deadline.com, 33 poll.nupubliek.nl, 34 spaarrente.nl, 39 gamer.nl, 48 reuters.com, 52 tmz.com, 57 open.spotify.com, 78 peil.nl, 93 gezondheidsnet.nl
    List 3: 1 blogs.aljazeera.net, 1 youtube.com, 1 worldpressphoto.org, 1 wikileaks.org, 1 washingtonpost.com, 1 eubusiness.com, 1 vesti.bg, 1 trouw.nl, 1 #NAME, 1 en.wikipedia.org, 1 l, 1 sana.sy, 1 hosted.ap.org, 1 shariah4belgium.com, 1 nrc.nl, 1 guardian.co.uk, 1 geopolicity.com, 1 nctb.nl, 1 rt.com, 1 kaspersky.com, 2 todayszaman.com, 2 volkskrant.nl, 2 spaarrente.nl, 2 reuters.com, 2 peil.nl, 2 hrw.org, 2 uk.reuters.com, 2 cbsnews.com, 3 telegraph.co.uk, 3 maps.google.nl, 4 bbc.co.uk, 5 edition.cnn.com, 5 aljazeera.com, english.alarabiya.net (count missing), 7 maps.google.com
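Host counts like those on the outlink slide could be produced along the following lines. This is a sketch under stated assumptions, not the WebART tooling: `LinkCollector`, `outlink_hosts`, and `sample_page` are hypothetical names, the HTML is invented, and only Python's standard `html.parser` and `urllib.parse` modules are used.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlink_hosts(html, source_host):
    """Count links per external host, skipping relative links and
    links to `source_host` or any of its subdomains."""
    parser = LinkCollector()
    parser.feed(html)
    counts = Counter()
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        if host and host != source_host and not host.endswith("." + source_host):
            counts[host] += 1
    return counts


# Hypothetical archived page (invented markup, not from the KB archive).
sample_page = """
<a href="http://www.bbc.co.uk/news/1">BBC</a>
<a href="http://www.bbc.co.uk/news/2">BBC again</a>
<a href="http://www.reuters.com/article">Reuters</a>
<a href="http://www.nu.nl/binnenland">internal</a>
<a href="/sport">relative internal</a>
"""

if __name__ == "__main__":
    for host, count in outlink_hosts(sample_page, "nu.nl").most_common():
        print(count, host)
```

Running the same count over only the pages that match a topic query would give one list per topic, similar in shape to the lists above.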
  • Geomapping location Wire service
  • Temporal Image Analyses
  • Timeline
  • Data digging, part III: DMI “9/11 Day” (2013). Datasets: Full KB Archive, nu.nl “host+1”. Toolset: Web-based tools, Full-text search+, Geo-index
  • Full-text search+
  • Data digging, part III: DMI “9/11 Day” (2013). New Media researchers’ interests: • “derive periodizations of the Web” (Web history) • “source hierarchy” (dominant sources in archive) • “keyword uptake” (terms over time), e.g. ‘geenstijl language in archive’ • “accidental”/“incidental” archiving, e.g. ‘the guilty pleasures of the Web of innocence’
  • 2009 2010 2011 2012
  • Analysis (1) • studying the ‘archive’ vs. the ‘archived content’ • researchers’ (un)familiarity with temporal (archive) search • “conditioned” to Google-style searching • high demand for export functions and aggregation features
  • Analysis (2) • “data is still a crucial factor” • quantity & quality: inherent incompleteness & inconsistencies • not always clear what’s in & what’s out • crawl settings (e.g. depth), temporal gaps • “researchers always want what isn’t there”
  • Digging towards the future. Datasets: Full KB Archive. Toolset: “Toolmaker’s tools” ++
  • A step further... • Build customizable systems, or, toolmakers’ tools • Provide building blocks
  • A step further... use “Hadoop” computing power to build custom datasets, perform high-level analysis, etc.
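To give a feel for the kind of job the slide above has in mind, here is a plain-Python imitation of the map/reduce pattern that Hadoop distributes across machines. It does not use any Hadoop API: the `records` sample, `map_record`, `reduce_counts`, and `run_job` are hypothetical names chosen for illustration, and the data is invented.

```python
from collections import defaultdict

# Hypothetical archive records: (crawl_year, host, page_text). Invented data.
records = [
    (2009, "nu.nl", "verkiezingen nieuws"),
    (2010, "nu.nl", "sport nieuws"),
    (2010, "nu.nl", "economie nieuws"),
    (2010, "kb.nl", "bibliotheek"),
    (2012, "nu.nl", "storm sandy nieuws"),
]


def map_record(record):
    """Map step: emit (year, 1) for each record from the host of interest."""
    year, host, _text = record
    if host == "nu.nl":
        yield year, 1


def reduce_counts(key, values):
    """Reduce step: sum the values emitted for one key."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Run mapper over all records, group emitted pairs by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(records, map_record, reduce_counts))
```

Swapping in a different mapper (for example one that emits matching pages instead of counts) is how the same skeleton could build a custom subset of the archive rather than a summary.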
  • New tools: examples • select, “clean”, filter & process dataset • employ complex queries & search strategies • search, summarize, aggregate & share
  • Moving beyond mere “search”: Wayback Machine, Search engine, “Research” engine (explicit support for the full research task, including analysis and synthesis steps)
  • Summary • The WebART project • Data Diggin’ @ KB • Analysis • Digging Towards the Future
  • webarchiving.nl @webart12