When Search becomes Research and Research becomes Search
SIGIR’13 Workshop on Exploration, Navigation and Retrieval
of Information in Cultural Heritage (ENRICH)
August 1, 2013, Dublin, Ireland
University of Amsterdam
• My current main interest is search related to, or
supporting, research (amongst a few dozen other things)
• So what’s different if your searchers are researchers,
and their search is (part of) their research?
• This talk is rather speculative -- no iron-clad formal
results -- but I hope to convince you that this is (at
least) an interesting use case
• And an area with great opportunities to work in...
• DATA: The Web and Online Heritage
• Issues: Archival Silence
• USERS: Digital Heritage -- Digital Humanities
• Challenges: Digital Methods
• TOOLS: Supporting Complex Search Tasks
• (Re)search: Digital Methods <-> Complex
Europeana: millions of objects from 1000s of providers
The UK Web Archive
archiving since 2004
30% success rate
131,164 websites, 54,604
instances, ~14TB WARCs
Domain crawl from 12 April
2013 to implement non-print
Expected to crawl
between 4-5 million UK
Access in reading rooms
Terabytes of Archived Web Data
(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)
Europeana Web Traffic Report – Q4 2012
Month by Month Overview
Month          Visits   Unique Visitors  Page Views  Time on site/visit
October 2012   534,830  441,096          2,017,751   00:02:17            50.27%
November 2012  612,902  505,177          2,299,244   00:02:16            49.79%
December 2012  530,747  439,919          2,079,335   00:02:19            48.80%
2. Portal Search
338,574 Visits with Search (36.10% increase from Q3 2012; 52.89% increase from Q4 2011)
Visits with Search is the number of visits during which at least one portal search occurred
743,292 Total Unique Searches (37.82% increase from Q3 2012; 31.67% increase from Q4 2011)
Total Unique Searches is the number of times a search is performed on Europeana (duplicate searches within a single visit are excluded)
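To put the figures above in proportion, here is a small back-of-the-envelope calculation. It assumes the "Visits with Search" and "Total Unique Searches" counts cover the same three months as the monthly table; the variable names are only illustrative.

```python
# Monthly figures copied from the Q4 2012 table above.
monthly = {
    "October 2012": {"visits": 534_830, "page_views": 2_017_751},
    "November 2012": {"visits": 612_902, "page_views": 2_299_244},
    "December 2012": {"visits": 530_747, "page_views": 2_079_335},
}
# Quarterly search figures (assumed to cover the same three months).
visits_with_search = 338_574
unique_searches = 743_292

total_visits = sum(m["visits"] for m in monthly.values())
total_page_views = sum(m["page_views"] for m in monthly.values())

search_share = visits_with_search / total_visits            # fraction of visits that search at all
pages_per_visit = total_page_views / total_visits           # average depth of a visit
searches_per_search_visit = unique_searches / visits_with_search

print(f"Total visits in Q4 2012: {total_visits:,}")
print(f"Share of visits with a search: {search_share:.1%}")
print(f"Page views per visit: {pages_per_visit:.2f}")
print(f"Unique searches per searching visit: {searches_per_search_visit:.2f}")
```

Under that assumption, roughly one visit in five involves a portal search at all, and a searching visit issues only about two unique queries.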
3. Object Views, Social Actions & Click-throughs
KPI 27: 30,000 object shares in 2012
Jan – Dec 2012 – 9,609 shares (from portal)
Let’s say: less traffic than we hoped for...
How often are web archives used?
Archiving institutions’ focus on data collection, not usage
19 of 29 IIPC members’ archives (listed on website) have full or partial
online access, often permission-based
Large scale national web archives have restricted access – dark archives
e.g. Danish National Web Archive, over 280 TB
online access for researchers with PhD or higher level
20 users since 2005
“Document-centric” access methods
No agreed way of calculating / benchmarking access statistics
Little evidence of scholarly use of web archives, making it difficult to
(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)
• Many online collections suffer from low
• After years of hard work, the data is
• But the users aren’t queuing up to come
and explore the data
• Why is that happening?
How radically did information access methods change?
Think outside the box?
• Are we too “framed” by the type of
systems that we had before?
• And by those that emerged on the Web?
• (cf. Diane Kelly’s “Contours and
Convergence”, KSJ lecture at ECIR’13.)
Wrap Up (I)
• We have made wonderful progress: CH
data is out there in huge volume
• More, better, richer, ... every day
• Use of the data is often lagging behind
• We should learn from “the Web”
• But also do really different things!
• (This takes time -- at least a generation)
Right, something really
different -- but what?
CH as Web search?
• Should we really try to “copy” the Web?
• Web search optimizes fast, shallow search
• on highly dynamic data with massive #s of
• Could we be *ahead* of the Web (rather
than following them)?
Let’s do the obvious :)
• Look seriously at the scholarly use of the
CH information we have accumulated?
• Get in touch with researchers and find out
how they (want to) use the data and why
they are *not* using our tools
• (In fact, heritage institutions traditionally
focused on scholars, emphasis on the
general public is quite recent...)
Something exciting is
• Digital Humanities emerging fast in
response to massive volume of data
• Digitization of historic sources
• Heritage of the future is digital
• User-generated content in new media
• In short: for many research questions a lot
of relevant data is available!
Change in Character
Individual scholar Team or lab
Small scale Large scale
Change = Radical!
• Change in research paradigm?
• Traditional humanities based on
• Empirical sciences based on a truth-finding
• Did the “success criterion” change?
• Use tools of the exact sciences for the benefit
of the traditional paradigm?
(Actual empirical science is
also less rigorous)
Wrap Up (II)
• Digital Humanities is emerging fast and
leads to new data driven research methods
• Motivated by humanities research questions
• Essentially they are crawling, cleaning,
tokenizing, ranking, exploring, visualizing
• Basically the stuff *we* are experts in
• Can we build tools that support their
research task from beginning to end?
• Interactively construct complex strategy
• data sources, selections, processing, back-
• Explore all results using facets/aspects
• explore whole data set -- no 10 links
• Store, share, and refine search strategies
• “Session” may take minutes, hours, days, ...
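As a rough illustration of the bullets above, here is a minimal sketch of a search strategy as a storable, refinable object with facet exploration over the full result set. The `Strategy` class, the `facet` helper, and the toy records are all hypothetical and do not describe the tools used in the project.

```python
import json
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical toy records standing in for archived pages or heritage objects.
RECORDS = [
    {"id": 1, "source": "web-archive", "year": 2009, "lang": "nl", "text": "museum opening hours"},
    {"id": 2, "source": "web-archive", "year": 2011, "lang": "en", "text": "election blog post"},
    {"id": 3, "source": "museum", "year": 2011, "lang": "nl", "text": "painting of a harbour"},
    {"id": 4, "source": "web-archive", "year": 2011, "lang": "nl", "text": "election campaign site"},
]

@dataclass
class Strategy:
    """A search strategy: an ordered list of declarative selection steps."""
    steps: list = field(default_factory=list)

    def where(self, key, value):
        """Refine: keep records whose field `key` equals `value`."""
        return Strategy(self.steps + [{"op": "eq", "key": key, "value": value}])

    def contains(self, key, term):
        """Refine: keep records whose field `key` contains the substring `term`."""
        return Strategy(self.steps + [{"op": "contains", "key": key, "value": term}])

    def run(self, records):
        """Apply every step in order and return the whole result set."""
        result = list(records)
        for step in self.steps:
            if step["op"] == "eq":
                result = [r for r in result if r[step["key"]] == step["value"]]
            elif step["op"] == "contains":
                result = [r for r in result if step["value"] in r[step["key"]]]
        return result

    def to_json(self):
        """Serialize the steps so the strategy can be stored or shared."""
        return json.dumps(self.steps)

    @classmethod
    def from_json(cls, payload):
        return cls(json.loads(payload))

def facet(records, key):
    """Count the values of one field across all results (not just a top 10)."""
    return Counter(r[key] for r in records)

base = Strategy().where("source", "web-archive")   # choose a data source
refined = base.contains("text", "election")        # refine the selection
hits = refined.run(RECORDS)
lang_facet = facet(hits, "lang")                   # explore results by aspect
restored = Strategy.from_json(refined.to_json())   # pick the session up later
```

Because each refinement returns a new strategy and the steps are plain data, a session that spans hours or days can be resumed, shared with a colleague, or branched into variants.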
Arjen de Vries, Jaap Kamps, Richard Rogers
• Evaluating current curation and selection
procedures of Web archives
• Getting insights into current use of Web
• Developing new methods and tools for
research using Web archives
Pilot Tools: Scalable Full Text Search++
Hadoop Distributed Filesystem!
Some Lessons (pilot)
• Fun, creative (but hard for control freaks)
• unexpected really new ideas!
• It is really co-design -- a dialog:
• researchers keep talking in “solutions”
• unaware of the full potential?
• Search engine used to explore
• Then want to use their own tools
• Emphasis on aggregates, visualizations
• Started designing the whole task support
• Want folks to stay in the system!
• Connect source data to later “information graphics”
• For the research prototype: no polished graphics
• Volume/Hadoop slow things down
• 1. Port “search by strategy” to Hadoop (slow,
• 2. After (complex) selection on Hadoop, instantiate a
dedicated environment (fast, interactive, bounded
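The second option can be sketched in miniature as follows. This is only an illustration under stated assumptions: `batch_select` is a plain-Python stand-in for the slow selection pass (it does not use any Hadoop API), and the synthetic collection, the `limit`, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import islice

def synthetic_collection(n):
    """Hypothetical stand-in for a large archive: yields (doc_id, text) pairs."""
    for i in range(n):
        topic = "heritage" if i % 3 == 0 else "news"
        yield i, f"page-{i} about {topic} in {2000 + i % 14}"

def batch_select(records, predicate, limit):
    """Stage 1 (slow): a single pass over everything, keeping at most `limit` matches."""
    return list(islice((r for r in records if predicate(r)), limit))

def build_index(subset):
    """Stage 2 setup: an in-memory inverted index over the bounded subset."""
    index = defaultdict(set)
    for doc_id, text in subset:
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def query(index, *terms):
    """Stage 2 (fast): ids of documents containing all terms (boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# One expensive selection, then as many cheap interactive queries as needed.
subset = batch_select(synthetic_collection(100_000),
                      lambda record: "heritage" in record[1],
                      limit=1_000)
index = build_index(subset)
hits = query(index, "heritage", "2013")
```

The trade-off is that the interactive environment only sees the bounded subset, so widening the selection means paying for the slow pass again.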
Projects with museums, archives, libraries, archaeology
Wrap Up (III)
• How far can we push this to support research in a
• Working on many sources, processing components
and ways to combine them into search strategies
• Working on richer data (also from research use)
• Working on scale
• Data is still a crucial issue/factor
• Researchers always want what isn’t there
• Data quality/noise/completeness issues
Work on (Re)search?
• (Re)search leads to radically different modes
of information access!
• (NB: Recall the panel!)
• Digital humanities is happening right now
• No shortage of data, dedicated users, ...
• Still lots of low-hanging fruit
• Great opportunities for young researchers!
• We’re hiring!
• 2 PhD (4y), 2 Postdocs (6m/1y).
• WebART: http://webarchiving.nl/
• ExPoSe: http://staff.science.uva.nl/~kamps/
• Thank you to all collaborators: Arjen de Vries,
Richard Rogers, Hugo Huurdeman, Thaer Samar,
Anat Ben David, Maarten Marx, Wouter Alink, ...