Roberto Cornacchia's presentation at the ECIR 2014 Industry Day: http://ecir2014.org/industry-day/
Bibliographic data have always represented an interesting case for Information Retrieval. Books have authors, titles, editions, publishers, identification codes and so on; they can cite other publications and be held by a number of libraries. Digital humanities and the cultural heritage domain invest increasing effort in the preservation, valorisation and exploitation of bibliographic data, with an emphasis on open data. This means not only that larger volumes of data are available, but also that such data sets are increasingly linked together, with consequent challenges for their integration. So, even though “books” and their archival records have not changed for decades, the scale of the problem is changing rapidly.
Secondly, the spectrum of information needs to be satisfied is growing larger. The increase in available (open) data demands that innovative services be developed, whether they target researchers, librarians, or end users, and whether the context is an academic, cultural or commercial setting. The associated information retrieval challenge is no longer just about finding a book by its author’s last name. Full-text search combined with a few facets may address more complex needs, but does not fully exploit the linked nature of today’s open data. The key problem is how to make effective use of the growing volume of linked data being made available online, and how to turn this rich source of information into novel search scenarios: what are the most prestigious academic publishers, based on scientific citations, online consumer reviews and ratings? How can a search system tailor the quest for a book to the age of the expected reader?
We discuss how Spinque addresses these challenges of rich interlinked book data, using its core Search by Strategy concept to separate two concerns: modelling the various types of data and their interrelations, and customizing the ranking of information objects accordingly. Here, search processes are modelled on top of structured and unstructured data, with integrated support for probabilistic reasoning in order to deal transparently with both exact and missing or vague information. We discuss this case of book records in the specific context of the EU-funded project COMSODE (Components Supporting the Open Data Exploitation). The envisioned Open Data Node platform aims at effective reuse of integrated data sources, with a strong emphasis on data quality.
Search challenges for collections of book records
1. Search challenges for collections of book records
Roberto Cornacchia
ECIR 2014 – Industry day
Amsterdam, 16 April 2014
> design > publish > search!
2. Outline
● COMSODE (EU-FP7)
– Publication platform for Linked Open Data
● Spinque
– Search modelling
● A use-case from Digital Humanities
– link, clean, search
● A step further
– Rank. Everything. Always.
– Query-time resolution of data conflicts
3. Unlocking the value of L(O)D... is a hot topic
In the public sector
In industry
Source: Open Data 500 by The GovLab
In science
Source: Bradley Allen, SlideShare
4. COMSODE
Unlock LOD value by improving publication
www.comsode.eu
The COMSODE project has received funding from the Seventh Framework Programme of the European Union under grant agreement number 611358.
5. Spinque
● Spin-off of CWI Amsterdam (2009)
● Develops domain-tailored search technology
– Applied to:
● IP, multimedia, cultural heritage, child-friendly, ...
– Search by Strategy
● visual modelling of search processes
– Rank. Everything. Always.
● integrated support for all-round probabilistic search
● Work in progress in COMSODE
– Search Linked Data
6. A use case in Digital Humanities
● "Can We Rank Scholarly Book Publishers? A Bibliometric Experiment with the Field of History" (Zuccala et al., Journal of the American Society for Information Science and Technology, 2014)
● Goal: indicate publisher prestige quantitatively
– based on bibliographic citations from journal articles to books
● Dataset: Elsevier Scopus journal citations
– Granted via the 2012 Elsevier Bibliometrics Research Program
– 5.6M citations, 3M from journals to books
– History & literature
– Periods 1996-2000 and 2007-2011
7. Elsevier Scopus dataset
citing_eid,cited_eid,source_title,source_id,article_pubyear,authors,article_title,volume,page_start,doctype
4702,232311,"American Antiquity",40554,1996,"Graybill D. (6603866252);Michaelsen J. (7003483600);Neff H. (7005907495);Larson D. (7402633779);Ambos E. (14048059100)","Risk, climatic variability, and the study of southwestern prehistory: An evolutionary perspective",61,217,re
4702,1333725,"American Antiquity",40554,1997,"Raab L. (6601955075);Larson D. (7402633779)","Medieval climatic anomaly and punctuated cultural evolution in coastal Southern California",62,319,ar
4702,7613691,"American Antiquity",40554,1997,"Colten R. (8363369400);Arnold J. (8754215200);Pletka S. (25221793700)","Contexts of cultural change in insular California",62,300,ar
4702,30302643,"Quarternary Science Reviews",26239,1996,"Stuiver M. (7007003882);Reimer P. (7103071876);Taylor R. (26030669400)","Development and extension of the calibration of the radiocarbon time scale: Archaeological applications",15,655,ar
4702,30317536,"Canadian Journal of Earth Sciences",22031,1996,"Dyke A. (7003706220);McNeely R. (7004891098);Hooper J. (7102438470)","Marine reservoir corrections for bowhead whale radiocarbon age determinations",33,1628,ar
4702,30739323,"Journal of Coastal Research",27374,1997,"Mason O. (7004241927);Hopkins D. (7202255075);Plug L. (7801522080)","Chronology and paleoclimate of storm-induced erosion and episodic dune growth across Cape Espenberg spit, Alaska, U.S.A.",13,770,ar
7154,2287569,"American Sociological Review",16929,1997,"Goodwin J. (7402339411)","The libidinal constitution of a high-risk social movement: Affectual ties and solidarity in the Huk rebellion, 1946 to 1954",62,53,re
7154,30495855,"Sociological Theory",18110,1996,"Emirbayer M. (23110549400)","Useful Durkheim",14,109,ar
9412,9986565,"British Journal for the Philosophy of Science",19977,1997,"Eliasmith C. (6603720957);Thagard P. (6701846211)","Waves, Particles, and Explanatory Coherence",48,1,ar
9412,30006171,Gastroenterology,28330,1996,"Hamlet A. (6701690210);Dalenbäck J. (7003418017);Fändriks L. (7005233384);Olbe L. (7006954993)","A mechanism by which Helicobacter pylori infection of the antrum contributes to the development of duodenal ulcer",110,1386,ar
[Figure: one citation modelled as a graph. The article "Power and community: The archaeology of slavery at the hermitage plantation" (Thomas B., 1998; American Antiquity, a history journal) cites the book "MISSISSIPPIAN POLITICAL ECONOMY" (Muller J., 1997). The CSV files are converted to RDF.]
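The CSV-to-RDF step can be sketched in a few lines of Python. This is only an illustration, not the conversion used in the talk: the predicate names (`cites`, `title`, `source`, `year`, `author`, `name`) and the `publication…` / `author…` identifiers are hypothetical, and the sketch assumes that the metadata columns describe the publication identified by `cited_eid`, which is how the sample rows read (the same `citing_eid` appears with different titles).

```python
import csv
import io
import re

# Header and two rows copied from the sample above.
SAMPLE = """\
citing_eid,cited_eid,source_title,source_id,article_pubyear,authors,article_title,volume,page_start,doctype
7154,30495855,"Sociological Theory",18110,1996,"Emirbayer M. (23110549400)","Useful Durkheim",14,109,ar
9412,9986565,"British Journal for the Philosophy of Science",19977,1997,"Eliasmith C. (6603720957);Thagard P. (6701846211)","Waves, Particles, and Explanatory Coherence",48,1,ar
"""

# One author entry looks like "Thagard P. (6701846211)".
AUTHOR = re.compile(r"^(?P<name>.+?) \((?P<id>\d+)\)$")


def row_to_triples(row):
    """Turn one CSV row into (subject, predicate, object) triples.

    Assumption: the metadata columns describe the cited publication.
    All predicate names are hypothetical.
    """
    citing = "publication" + row["citing_eid"]
    cited = "publication" + row["cited_eid"]
    triples = [
        (citing, "cites", cited),
        (cited, "title", row["article_title"]),
        (cited, "source", row["source_title"]),
        (cited, "year", row["article_pubyear"]),
    ]
    for entry in row["authors"].split(";"):
        match = AUTHOR.match(entry.strip())
        if match:
            author = "author" + match["id"]
            triples.append((cited, "author", author))
            triples.append((author, "name", match["name"]))
    return triples


def csv_to_triples(text):
    """Parse CSV text and collect the triples of every row."""
    reader = csv.DictReader(io.StringIO(text))
    return [triple for row in reader for triple in row_to_triples(row)]


triples = csv_to_triples(SAMPLE)
```

Using the `csv` module rather than splitting on commas matters here, because titles such as "Waves, Particles, and Explanatory Coherence" contain commas inside quoted fields.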
8. Warm-up
● Load RDF data
– (subject, predicate, object)
    subject       predicate  object
    publication1  cites      publication2
    publication1  cites      publication3
    publication3  publisher  publisher5
● Most cited publications
    SELECT ?publication (COUNT(*) AS ?nCitations)
    WHERE { [] scopus:cites ?publication }
    GROUP BY ?publication
    ORDER BY DESC(?nCitations)
    publication   nCitations
    publication3  288
    publication5  223
    publication2  124
● No problem with SPARQL or SQL
9. Warm-up (continued)
● The same query, annotated with its two building blocks
– Predicate traversal: WHERE {[] scopus:cites ?publication}
– Aggregation: count(*) as ?nCitations, GROUP BY ?publication
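The two building blocks named on this slide, predicate traversal and aggregation, can also be sketched outside a query language. The following Python fragment is a minimal illustration over an in-memory list of triples; the first, second and last triples come from the slide, while the `publication4` triple is a hypothetical addition so that the counts differ.

```python
from collections import Counter

triples = [
    ("publication1", "cites", "publication2"),
    ("publication1", "cites", "publication3"),
    ("publication4", "cites", "publication3"),  # hypothetical extra citation
    ("publication3", "publisher", "publisher5"),
]


def most_cited(triples):
    """Rank publications by the number of incoming 'cites' triples."""
    # Predicate traversal: keep the object of every "cites" triple.
    cited = (obj for _subj, pred, obj in triples if pred == "cites")
    # Aggregation: count per publication, most cited first.
    return Counter(cited).most_common()


ranking = most_cited(triples)
```

Here `ranking` lists `publication3` (two citations) before `publication2` (one citation), mirroring the shape of the result table on the slide.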
17. Is the DH researcher happy?
● Yes. All very nice...
– ...but...?
● Data are not 100% clean yet.
● Can we rank publishers of books about “women in war”?
The initial database problem needs to deal with uncertainty
25. More uncertainty from...
● Fuzzy matching
● Priors in data
● Ranking
[Figure: data flow with the labels: journal articles, cited books, aggregated "cited" publishers; data sources: Elsevier Scopus (closed data) and OCLC WorldCat (open data).]
In fact...
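To make the "fuzzy matching" source of uncertainty concrete, here is a minimal Python sketch that scores how well a publisher name from one data set matches the names in another. It is an assumption-laden illustration, not the matching method used in the talk: it relies on the standard-library `difflib.SequenceMatcher`, and the publisher names, the `normalise` helper and the `threshold` value are all hypothetical.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case, drop dots and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def match_evidence(cited_name, candidates, threshold=0.6):
    """Return (candidate, score) pairs with score in [0, 1], best first.

    The score is kept as evidence instead of being forced into a
    yes/no decision; the threshold only prunes very weak matches.
    """
    scored = []
    for candidate in candidates:
        score = SequenceMatcher(
            None, normalise(cited_name), normalise(candidate)
        ).ratio()
        if score >= threshold:
            scored.append((candidate, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical catalogue of publisher names.
catalogue = [
    "Cambridge University Press",
    "Oxford University Press",
    "Plenum Press",
]
evidence = match_evidence("Cambridge Univ. Pr.", catalogue)
```

The best candidate for the abbreviated name is "Cambridge University Press", with a score below 1.0 that can be carried along as a match probability rather than discarded.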
26. Rank. Everything. Always.
● Unstructured search: uncertainty is a first-class citizen
● Structured search: let's switch from "facts" to "evidence"
– Forcing uncertainty into "facts" risks corrupting data and search results
● Static data normalisation is good when it comes with high confidence
● Otherwise, evidence can be used at query time, depending on the context
– Strategy blocks contain code for a probabilistic DB
● Based on Probabilistic Relational Algebra (Fuhr 1990, Rölleke et al. 2008)
● Let's just call it "search", finally.
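The switch from "facts" to "evidence" can be illustrated with a toy probabilistic relational algebra in Python. This is a sketch under stated assumptions, not Spinque's implementation: relations are plain dictionaries from tuples to probabilities, all tuples are assumed to be independent events, and the relation names and probabilities are hypothetical. The score it produces is the probability that a publisher has at least one cited book, not a citation count.

```python
from collections import defaultdict


def project(relation, *columns):
    """Project onto the given columns.

    Duplicates are combined as independent events:
    p = 1 - product(1 - p_i).
    """
    miss = defaultdict(lambda: 1.0)
    for row, p in relation.items():
        key = tuple(row[c] for c in columns)
        miss[key] *= 1.0 - p
    return {key: 1.0 - m for key, m in miss.items()}


def join(left, right, left_col, right_col):
    """Equi-join two relations; probabilities multiply (independence)."""
    out = {}
    for lrow, p in left.items():
        for rrow, q in right.items():
            if lrow[left_col] == rrow[right_col]:
                out[lrow + rrow] = p * q
    return out


# Hypothetical evidence: citation links and publisher links, each with
# a confidence (for example, produced by fuzzy matching).
cites = {
    ("article1", "book1"): 0.9,
    ("article2", "book1"): 0.5,
    ("article3", "book2"): 1.0,
}
published_by = {
    ("book1", "publisherA"): 0.8,
    ("book2", "publisherB"): 0.6,
}

# Project before joining, so that every base tuple contributes to each
# result exactly once and the independence assumption stays valid.
cited_books = project(cites, 1)
with_publisher = join(cited_books, published_by, 0, 0)
publishers = project(with_publisher, 2)

ranking = sorted(publishers.items(), key=lambda kv: kv[1], reverse=True)
```

The order of operations is a deliberate choice: joining first and projecting afterwards would count the uncertain `published_by` tuple for `book1` twice and overestimate the score of `publisherA`.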
27. Summary
● The use case shown
– benefits from LOD
● data and results can be expanded / improved
– benefits from Search by Strategy
● probabilistic modelling of search scenarios
● On-going effort in the COMSODE context
– Open Data Node: good quality LOD
– Search by Strategy: exploit uncertainty
● Currently
– improving RDF support (e.g. vocabularies, inference)
– improving query-time resolution of data conflicts