When Search becomes Research and Research becomes Search

SIGIR'13 Workshop on Exploration, Navigation and Retrieval of Information in Cultural Heritage (ENRICH).

  1. 1. When Search becomes Research and Research becomes Search SIGIR’13 Workshop on Exploration, Navigation and Retrieval of Information in Cultural Heritage (ENRICH) August 1, 2013, Dublin, Ireland Jaap Kamps University of Amsterdam
  2. 2. (Re)search (Re)searchers • My current main interest is search related to/supporting research (amongst a few dozen other things) • So what’s different if your searchers are researchers, and their search is (part of) their research? • This talk is rather speculative -- no iron-clad formal results -- but I hope to convince you that this is (at least) an interesting use case • And an area with great opportunities to work in...
  3. 3. Outline • DATA: The Web and Online Heritage • Issues: Archival Silence • USERS: Digital Heritage -- Digital Humanities • Challenges: Digital Methods • TOOLS: Supporting Complex Search Tasks • (Re)search: Digital Methods <-> Complex Search
  4. 4. Lots of cultural heritage (CH) online
  5. 5. CH is digitized on a massive scale
  6. 6. Europeana: millions of objects from 1000s of providers
  7. 7. The UK Web Archive • Permission-based selective archiving since 2004 • 30% success rate • 131,164 websites, 54,604 instances, ~14TB WARCs • Domain crawl from 12 April 2013 to implement non-print legal deposit • Expected to crawl between 4-5 million UK websites • Access in reading rooms only • http://www.webarchive.org.uk • [Chart: Terabytes of Archived Web Data] (From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)
  8. 8. What’s the problem?
  9. 9. Not really that much traffic...
  10. 10. Europeana Web Traffic Report – Q4 2012: Month by Month Overview
      Month | Visits | Unique Visitors | Page Views | Time on site/visit (mm:ss) | Bounce rate
      October 2012 | 534,830 | 441,096 | 2,017,751 | 00:02:17 | 50.27%
      November 2012 | 612,902 | 505,177 | 2,299,244 | 00:02:16 | 49.79%
      December 2012 | 530,747 | 439,919 | 2,079,335 | 00:02:19 | 48.80%
      2. Portal Search
      • 338,574 Visits with Search (36.10% increase from Q3 2012, 52.89% increase from Q4 2011). Visits with Search is the number of visits during which at least one portal search occurred.
      • 743,292 Total Unique Searches (37.82% increase from Q3 2012, 31.67% increase from Q4 2011). Total Unique Searches is the number of times a search is performed on Europeana (duplicate searches within a single visit are excluded).
      3. Object Views, Social Actions & Click-throughs
      • KPI 27: 30,000 object shares in 2012. Jan – Dec 2012: 9,609 shares (from portal).
      Let’s say: less traffic than we hoped for...
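To put the report figures on slide 10 in proportion, here is a small arithmetic sketch. It uses only the numbers quoted on the slide. It assumes that the quarterly "Visits with Search" figure can be compared against the sum of the three monthly visit counts, which the slide does not state explicitly.

```python
# Figures copied from the Europeana Q4 2012 traffic report slide.
monthly = {
    "October 2012": {"visits": 534_830, "page_views": 2_017_751},
    "November 2012": {"visits": 612_902, "page_views": 2_299_244},
    "December 2012": {"visits": 530_747, "page_views": 2_079_335},
}
visits_with_search = 338_574  # reported for the quarter

# Assumption: the quarterly total is the sum of the monthly visit counts.
total_visits = sum(m["visits"] for m in monthly.values())
search_share = visits_with_search / total_visits
pages_per_visit = {
    month: m["page_views"] / m["visits"] for month, m in monthly.items()
}

print(f"Total visits in Q4 2012: {total_visits:,}")
print(f"Share of visits with at least one search: {search_share:.1%}")
for month, value in pages_per_visit.items():
    print(f"{month}: {value:.2f} page views per visit")
```

Under that assumption, roughly one visit in five involved a portal search, and a visit covered fewer than four page views on average.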
  11. 11. How often are web archives used? • Archiving institutions’ focus on data collection, not usage • 19 of 29 IIPC members’ archives (listed on website) have full or partial online access, often permission-based • Large scale national web archives have restricted access – dark archives (e.g. Danish National Web Archive: over 280TB; online access for researchers with PhD or higher level; 20 users since 2005) • “Document-centric” access methods • No agreed way of calculating / benchmarking access statistics • Little evidence of scholarly use of web archives, making it difficult to understand requirements (From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)
  12. 12. Archival Silence • Many online collections suffer from low traffic... • After years of hard work, the data is there • But the users aren’t queuing up to come and explore the data • Why is that happening?
  13. 13. Online Digital Heritage collections are incunabula
  14. 14. Our infrastructure changed in a revolutionary way
  15. 15. Our technology changed in a revolutionary way
  16. 16. How radically did information access methods change?
  17. 17. Think outside the box? • Are we too “framed” by the type of systems that we had before? • And by those that emerged on the Web? • (cf. Diane Kelly’s Contours and Convergence, KSJ lecture at ECIR’13.)
  18. 18. Wrap Up (1) • We have made wonderful progress: CH data is out there in huge volume • More, better, richer, ... every day • Use of the data is often lagging behind • We should learn from “the Web” • But also do really different things! • (This takes time -- at least a generation)
  19. 19. Right, something really different -- but what?
  20. 20. CH as Web search? • Should we really try to “copy” the Web? • Web search optimizes for fast, shallow search • on highly dynamic data with massive numbers of user signals • Could we be *ahead* of the Web (rather than following it)?
  21. 21. Let’s do the obvious :) • Look seriously at the scholarly use of the CH information we have accumulated? • Get in touch with researchers and find out how they (want to) use the data and why they are *not* using our tools • (In fact, heritage institutions traditionally focused on scholars, emphasis on the general public is quite recent...)
  22. 22. Digital Heritage Digital Humanities e-Humanities
  23. 23. The Times They are a-Changin’ ?
  24. 24. Something exciting is happening! • Digital Humanities emerging fast in response to massive volume of data • Digitization of historic sources • Heritage of the future is digital • User-generated content in new media • In short: for many research questions a lot of relevant data is available!
  25. 25. Change in Character
      1.0 | 2.0
      Collection-centered | User-centered
      Supply-driven | Demand-driven
      Professionals | Amateurs
      Individual scholar | Team or lab
      Small scale | Large scale
      Qualitative | Quantitative
  26. 26. Change = Radical! • Change in research paradigm? • Traditional humanities based on interpretative paradigm • Empirical sciences based on a truth-finding paradigm • Did the “success criterion” change? • Use tools of the exact sciences for the benefit of traditional paradigm?
  27. 27. (Actual empirical science is also less rigorous)
  28. 28. DH requires new data-driven research methods
  29. 29. "Google and the politics of tabs" by Govcom.org, Amsterdam, 2008. Website historiography
  30. 30. [Figure: cropped page from a workshop report, text not recoverable] Essentially these are complex search strategies!
  31. 31. Wrap Up (II) • Digital Humanities is emerging fast and leads to new data-driven research methods • Motivated by humanities research questions • Essentially they are crawling, cleaning, tokenizing, ranking, exploring, visualizing • Basically the stuff *we* are experts in • Can we build tools that support their research task from beginning to end?
  32. 32. (Re)search? • Interactively construct complex strategy • data sources, selections, processing, back-and-forth, ... • Explore all results using facets/aspects • explore whole data set -- no 10 links • Store, share, and refine search strategies • “Session” may take minutes, hours, days, ...
  33. 33. How to get there?
  34. 34. (1) Intensive collaborations with CH institutions
  35. 35. (2) Include researchers: Co-creation, Living Lab, ...
  36. 36. (3) Build not a tool, but the toolmaker’s tools
  37. 37. Team up with Arjen de Vries and Spinque :)
  38. 38. Search strategy from building blocks
  39. 39. Strategy Builder • Each block = data or manipulations • Build dedicated search engine “on the fly”
  40. 40. Research methods become search strategies Store, refine, reuse, share strategies (Re)search!
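The idea on slides 38 to 40, a search strategy assembled from blocks that can be stored, refined and reused, can be illustrated with a toy sketch. This is not Spinque's Strategy Builder. The `Strategy` class, the `select` and `rank_by` helpers and the tiny corpus are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Doc = dict
Block = Callable[[Iterable[Doc]], Iterable[Doc]]


@dataclass
class Strategy:
    """A named, ordered list of blocks; each block transforms a document stream."""
    name: str
    blocks: List[Block] = field(default_factory=list)

    def then(self, block: Block) -> "Strategy":
        # Return a new strategy so the original can be stored and reused as-is.
        return Strategy(self.name, self.blocks + [block])

    def run(self, docs: Iterable[Doc]) -> List[Doc]:
        for block in self.blocks:
            docs = block(docs)
        return list(docs)


def select(predicate: Callable[[Doc], bool]) -> Block:
    """Block that keeps only documents matching the predicate."""
    return lambda docs: (d for d in docs if predicate(d))


def rank_by(key: Callable[[Doc], float]) -> Block:
    """Block that orders documents by descending key."""
    return lambda docs: sorted(docs, key=key, reverse=True)


# Hypothetical toy corpus standing in for archived pages.
corpus = [
    {"url": "a", "year": 2012, "text": "storm sandy hits new york"},
    {"url": "b", "year": 2011, "text": "sandy beaches"},
    {"url": "c", "year": 2012, "text": "sandy sandy aftermath"},
    {"url": "d", "year": 2012, "text": "us election results"},
]

base = Strategy("pages-2012").then(select(lambda d: d["year"] == 2012))
sandy = (
    base.then(select(lambda d: "sandy" in d["text"]))
        .then(rank_by(lambda d: d["text"].count("sandy")))
)
results = sandy.run(corpus)
print([d["url"] for d in results])
```

Because `then` returns a new object, `base` stays available as a shared starting point for other strategies, which is one simple way to read "store, refine, reuse, share".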
  41. 41. Web Archive (New Media scholars)
  42. 42. Thaer Samar (PhD/programmer) • Hugo Huurdeman (PhD researcher) • Anat Ben-David (Postdoc) • Arjen de Vries • Jaap Kamps • Richard Rogers • Paul Doorenbosch • René Voorburg • Victor-Jan Vos
  43. 43. WebART Goals • Evaluating current curation and selection procedures of Web archives • Getting insights into current use of Web archives • Developing new methods and tools for research using Web archives
  44. 44. KB: Web archive since 2007 • Statistics: • 4,000+ websites • 17,000+ harvests • 7+ Terabyte • Selective approach (Photo: Flickr, koninklijkebibliotheek)
  45. 45. KB: Web archive since 2007 • Statistics: • 4,000+ websites • 17,000+ harvests • 7+ Terabyte • Selective approach
  46. 46. ”Wayback Machine” interface
  47. 47. • WebARTist (pilot - beta 1) • Initial dataset (corpus) • 432 crawls, 16 months (13.64 GB) • Full-text search engine • KB, CommonCrawl + nu.nl (Dutch news aggregator)
  48. 48. WebARTist: Use case • Digital Methods Winter School (Jan. ’13) • Co-design workshop (“Living Lab”) • researchers & developers • first use of WebARTist
  49. 49. Word frequency analysis [Chart: frequency (0 to 800) over time, 17/05/2011 to 06/01/2013]
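A word frequency analysis like the one charted on slide 49 boils down to counting a term per crawl date. The sketch below is a minimal illustration on made-up text, not the WebARTist implementation. The function names and the sample crawls are assumptions.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def term_frequency_by_date(crawls, term):
    """Count occurrences of `term` per crawl date.

    `crawls` is an iterable of (iso_date, text) pairs; ISO dates sort
    chronologically as plain strings.
    """
    counts = Counter()
    for date, text in crawls:
        counts[date] += tokenize(text).count(term.lower())
    return dict(sorted(counts.items()))


# Hypothetical sample crawls.
crawls = [
    ("2012-10-30", "Sandy makes landfall"),
    ("2012-10-29", "Sandy nears the coast. Sandy warnings issued."),
    ("2012-11-06", "US election day"),
]
series = term_frequency_by_date(crawls, "Sandy")
for date, count in series.items():
    print(date, count)
```

The resulting date-to-count mapping is the kind of series that can be handed to any plotting tool to draw a timeline.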
  50. 50. Co-Word Analysis
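Co-word analysis counts how often pairs of terms appear in the same document. The following is a minimal sketch under the assumption that a document-level co-occurrence count is what is meant here. The slide does not specify the window or weighting used in the pilot, and the sample documents are invented.

```python
from collections import Counter
from itertools import combinations
import re


def co_word_counts(documents, vocabulary):
    """Count, for each pair of vocabulary terms, the documents containing both."""
    vocab = {word.lower() for word in vocabulary}
    pairs = Counter()
    for text in documents:
        tokens = set(re.findall(r"\w+", text.lower()))
        present = sorted(vocab & tokens)  # sorted so each pair has one canonical order
        pairs.update(combinations(present, 2))
    return pairs


# Hypothetical sample documents.
documents = [
    "Sandy hits New York as election nears",
    "Election polls after Sandy",
    "Syria talks resume",
]
pairs = co_word_counts(documents, ["sandy", "election", "syria"])
for (left, right), count in pairs.most_common():
    print(left, right, count)
```

The pair counts can serve as edge weights in a term network, which is a common way to visualize co-word results.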
  51. 51. Outlink Analysis [Figure: hosts linked from archived pages with link counts, in panels titled Syria, Sandy and US Election; hosts shown include reuters.com, edition.cnn.com, bbc.co.uk, nytimes.com and guardian.co.uk]
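An outlink analysis of this kind amounts to extracting the links from each archived page and counting the external hosts they point to. Here is a minimal standard-library sketch. The `outlink_hosts` helper and the sample page are hypothetical, and real archived HTML would need more careful normalization (for example of `www.` prefixes and redirects).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlink_hosts(html, own_host):
    """Count links per external host; relative and same-host links are skipped."""
    parser = LinkCollector()
    parser.feed(html)
    hosts = Counter()
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        if host and host != own_host:
            hosts[host] += 1
    return hosts


# Hypothetical archived page from nu.nl.
page = """
<a href="http://reuters.com/article/1">Reuters</a>
<a href="http://reuters.com/article/2">Reuters again</a>
<a href="http://bbc.co.uk/news">BBC</a>
<a href="/sport">internal, relative</a>
<a href="http://nu.nl/economie">internal, absolute</a>
"""
hosts = outlink_hosts(page, own_host="nu.nl")
for host, count in hosts.most_common():
    print(count, host)
```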
  52. 52. Geomapping location Wire service
  53. 53. Temporal Image Analyses
  54. 54. Timeline
  55. 55. Pilot Tools: Scalable Full Text Search++ • User interface • Search engine (“Zoekmachine”) • Inverted index • Hadoop Distributed Filesystem
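The inverted index in this stack maps each term to the documents that contain it, which is what makes full-text lookup fast. Below is a single-machine toy sketch of that data structure. It leaves out everything that makes the pilot scalable (Hadoop, distribution, ranking), and the sample documents are invented.

```python
from collections import defaultdict
import re


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Storm Sandy hits New York",
    2: "Sandy aftermath in New Jersey",
    3: "US election results",
}
index = build_inverted_index(docs)
print(search(index, "sandy new"))
print(search(index, "sandy york"))
```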
  56. 56. Some Lessons (pilot) • Fun, creative (but hard for control freaks) • unexpected really new ideas! • It is really co-design -- a dialog: • researchers keep talking in “solutions” • unaware of the full potential? • Search engine used to explore • Then want to use their own tools • Emphasis on aggregates, visualizations
  57. 57. Ongoing • Started designing the whole task support • Want folks to stay in the system! • Connect source data to later “information graphics” • For the research prototype: no polished graphics • Volume/Hadoop slow things down • 1. Port “search by strategy” to Hadoop (slow, asynchronous) • 2. After (complex) selection on Hadoop, instantiate a dedicated environment (fast, interactive, bounded size)
  58. 58. Projects with museums, archives, libraries, archaeology
  59. 59. Wrap Up (III) • How far can we push this to support research in a generic way? • Working on many sources, processing components and ways to combine them into search strategies • Working on richer data (also from research use) • Working on scale • Data is still a crucial issue/factor • Researchers always want what isn’t there • Data quality/noise/completeness issues
  60. 60. Work on (Re)search? • (Re)search leads to radically different modes of information access! • (NB: Recall the panel!) • Digital humanities is happening right now • No shortage of data, dedicated users, ... • Still lots of low-hanging fruit • Great opportunities for young researchers!
  61. 61. Questions? • We’re hiring! • 2 PhD (4y), 2 Postdocs (6m/1y). • WebART: http://webarchiving.nl/ • ExPoSe: http://staff.science.uva.nl/~kamps/expose/ • Thank you to all collaborators: Arjen de Vries, Richard Rogers, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Maarten Marx, Wouter Alink, ...