Search challenges for 
collections of book records 
Roberto Cornacchia 
ECIR 2014 – Industry day 
Amsterdam, 16 April 2014 
> design > publish > search!
2 
Outline 
● COMSODE (EU-FP7) 
– Publication platform for Linked Open Data 
● Spinque 
– Search modelling 
● A use-case from Digital Humanities 
– link, clean, search 
● A step further 
– Rank. Everything. Always. 
– Query-time resolution of data conflicts
3 
Unlocking the value of L(O)D is a hot topic 
● In the public sector 
● In industry 
Source: Open Data 500 by The GovLab 
● In science 
Source: Bradley Allen, SlideShare
4 
The COMSODE project has received funding from the Seventh Framework Programme of the European Union under grant agreement number 611358. 
COMSODE 
Unlock LOD value 
by improving publication 
www.comsode.eu
5 
Spinque 
● Spin-off of CWI Amsterdam (2009) 
● Develops domain-tailored search technology 
– Applied to: 
● IP, multimedia, cultural heritage, child-friendly, ... 
– Search by Strategy 
● visual modelling of search processes 
– Rank. Everything. Always. 
● integrated support for all-round probabilistic search 
● Work in progress in COMSODE 
– Search Linked Data
6 
A use case in Digital Humanities 
● "Can We Rank Scholarly Book Publishers? 
A Bibliometric Experiment with the Field of History" 
(Zuccala et al., Journal of the American Society for Information Science and Technology, 2014) 
● Goal: indicate publisher prestige quantitatively 
– using bibliographic citations to books from journal articles 
● Dataset: Elsevier Scopus journal citations 
– Granted via the 2012 Elsevier Bibliometrics Research Program 
– 5.6M citations, 3M from journals to books 
– History & literature 
– Periods 1996-2000 and 2007-2011
7 
Elsevier Scopus dataset 
citing_eid,cited_eid,source_title,source_id,article_pubyear,authors,article_title,volume,page_start,doctype 
4702,232311,"American Antiquity",40554,1996,"Graybill D. (6603866252);Michaelsen J. (7003483600);Neff H. (7005907495);Larson D. (7402633779);Ambos E. (14048059100)","Risk, climatic variability, and the study of southwestern prehistory: An evolutionary perspective",61,217,re 
4702,1333725,"American Antiquity",40554,1997,"Raab L. (6601955075);Larson D. (7402633779)","Medieval climatic anomaly and punctuated cultural evolution in coastal Southern California",62,319,ar 
4702,7613691,"American Antiquity",40554,1997,"Colten R. (8363369400);Arnold J. (8754215200);Pletka S. (25221793700)","Contexts of cultural change in insular California",62,300,ar 
4702,30302643,"Quarternary Science Reviews",26239,1996,"Stuiver M. (7007003882);Reimer P. (7103071876);Taylor R. (26030669400)","Development and extension of the calibration of the radiocarbon time scale: Archaeological applications",15,655,ar 
4702,30317536,"Canadian Journal of Earth Sciences",22031,1996,"Dyke A. (7003706220);McNeely R. (7004891098);Hooper J. (7102438470)","Marine reservoir corrections for bowhead whale radiocarbon age determinations",33,1628,ar 
4702,30739323,"Journal of Coastal Research",27374,1997,"Mason O. (7004241927);Hopkins D. (7202255075);Plug L. (7801522080)","Chronology and paleoclimate of storm-induced erosion and episodic dune growth across Cape Espenberg spit, Alaska, U.S.A.",13,770,ar 
7154,2287569,"American Sociological Review",16929,1997,"Goodwin J. (7402339411)","The libidinal constitution of a high-risk social movement: Affectual ties and solidarity in the Huk rebellion, 1946 to 1954",62,53,re 
7154,30495855,"Sociological Theory",18110,1996,"Emirbayer M. (23110549400)","Useful Durkheim",14,109,ar 
9412,9986565,"British Journal for the Philosophy of Science",19977,1997,"Eliasmith C. (6603720957);Thagard P. (6701846211)","Waves, Particles, and Explanatory Coherence",48,1,ar 
9412,30006171,Gastroenterology,28330,1996,"Hamlet A. (6701690210);Dalenbäck J. (7003418017);Fändriks L. (7005233384);Olbe L. (7006954993)","A mechanism by which Helicobacter pylori infection of the antrum contributes to the development of duodenal ulcer",110,1386,ar 
[Diagram: CSV files are converted to RDF. Example: the article "Power and community: The archaeology of slavery at the hermitage plantation" (Thomas B., 1998, American Antiquity; journal, history) cites the book "Mississippian Political Economy" (Muller J., 1997).]
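The CSV-to-RDF step can be illustrated with a minimal Python sketch. It is not the conversion used in the project: it only reads the `citing_eid` and `cited_eid` columns shown above and emits one "cites" triple per row. The `scopus:` identifier scheme and the function name `rows_to_triples` are assumptions made for the example.

```python
import csv
import io

# Header and two rows copied from the sample above.
SAMPLE = '''citing_eid,cited_eid,source_title,source_id,article_pubyear,authors,article_title,volume,page_start,doctype
7154,30495855,"Sociological Theory",18110,1996,"Emirbayer M. (23110549400)","Useful Durkheim",14,109,ar
9412,9986565,"British Journal for the Philosophy of Science",19977,1997,"Eliasmith C. (6603720957);Thagard P. (6701846211)","Waves, Particles, and Explanatory Coherence",48,1,ar
'''

def rows_to_triples(csv_text):
    """Turn each CSV row into one (subject, predicate, object) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: identifiers are rendered as "scopus:<eid>".
        triples.append(("scopus:" + row["citing_eid"],
                        "scopus:cites",
                        "scopus:" + row["cited_eid"]))
    return triples

triples = rows_to_triples(SAMPLE)
for triple in triples:
    print(triple)
```

The remaining columns (title, authors, year) would become further triples about the same publication; they are left out here to keep the sketch short.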
8 
Warm-up 
● Load RDF data 
– (subject, predicate, object) 
subject       predicate   object 
publication1  cites       publication2 
publication1  cites       publication3 
publication3  publisher   publisher5 
● Most cited publications 
SELECT ?publication (COUNT(*) AS ?nCitations) 
WHERE {[] scopus:cites ?publication} 
GROUP BY ?publication 
ORDER BY DESC(?nCitations) 
publication   nCitations 
publication3  288 
publication5  223 
publication2  124 
● No problem with SPARQL or SQL 
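The warm-up query above can be mirrored in a few lines of Python, as a sketch of the two operations involved: following the "cites" predicate, then counting per cited publication. The triples are the three from the slide plus one hypothetical extra row, added so that the counts differ.

```python
from collections import Counter

triples = [
    ("publication1", "cites", "publication2"),
    ("publication1", "cites", "publication3"),
    ("publication3", "publisher", "publisher5"),
    ("publication4", "cites", "publication3"),  # hypothetical extra row
]

def most_cited(triples):
    """Rank publications by the number of incoming 'cites' triples."""
    # Predicate traversal: keep only the objects of "cites" triples.
    cited = [obj for (_subj, pred, obj) in triples if pred == "cites"]
    # Aggregation: count per publication, sorted by descending count.
    return Counter(cited).most_common()

ranking = most_cited(triples)
print(ranking)
```

Every row here is a plain fact with an implicit weight of 1, which is why SPARQL or SQL handles this case without trouble.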
9 
Warm-up 
● Load RDF data 
– (subject, predicate, object) 
● Most cited publications, in two steps 
– Predicate traversal: WHERE {[] scopus:cites ?publication} 
– Aggregation: COUNT(*) AS ?nCitations, GROUP BY ?publication 
● No problem with SPARQL or SQL
10 
Warm-up .. visually 
"Search by Strategy"
11 
Warm-up .. visually 
"Search by Strategy" 
[Diagram: a visual strategy in which data flows from the Elsevier data source through a Predicate traversal block and an Aggregation block. The strategy can be deployed as a REST API or as a search engine.]
12 
Back to the original goal: 
rank publishers 
[Diagram, built up over slides 12 and 13: journal articles in Elsevier – Scopus (closed data) cite books, and aggregating over the cited books gives the "cited" publishers. On slide 13 the cited books are linked via sameAs to cited books in OCLC - WorldCat (open data) before being aggregated into "cited" publishers.] 
● Open Data Node 
– Links books 
● Search 
– Uses links 
– On-the-fly matching? 
Deploy search engine
14 
Surprise.. 
– University Press,Cambridge [England] 
– University Press,Cambridge [etc.] 
– University Press,"Cambridge, Mass.," 
– University Press,"Cambridge, N.E." 
– University Press,"Cambridge, U.K." 
– University Press,"Cambridge, UK" 
– University Press,Cambridge [U.K.] 
– University Press [etc.],Cambridge 
– University Press [etc.],"Cambridge [Eng., etc.]" 
– University Press [etc.],Cambridge [etc.] 
– "University press [etc., etc.]","Cambridge," 
– University Pressf ats collnutz,Cambridge 
– University Press of Cambridge,"Boston, Mass." 
– University Press of Cambridge,"[Cambridge, Mass.]" 
– Univ. of Cambridge,Cambridge 
– Univ. P.,Cambridge 
– Univ. Pr,Cambridge 
– Univ. Pr.,Cambridge 
– Univ.Pr.,Cambridge 
– Univ. Pr.,Cambridge [u.a.] 
– Univ. Pr.,"Cambridge, U.S.A." 
– Univ. Pr.,Cambridge [usw.] 
2588 variations (just for "Cambridge University Press"). 
Probably only 2 or 3 distinct entities in there.
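To give a feel for what cleaning such names involves, here is a rough Python sketch of key-based normalisation followed by a similarity score. It is an illustration only, not the matching used by Open Data Node or Spinque; the abbreviation table and the function name `normalise` are assumptions, and only the publisher part of each record is handled, not the place.

```python
import difflib
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"univ": "university", "pr": "press", "p": "press"}

def normalise(publisher):
    """Reduce a raw publisher string to a crude comparison key."""
    text = publisher.lower()
    text = re.sub(r"\[[^\]]*\]", " ", text)  # drop bracketed notes such as [etc.]
    text = re.sub(r"[^a-z]+", " ", text)     # punctuation and digits become spaces
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

variants = ["Univ. Pr.", "Univ.Pr.", "Univ. P.", "University Press [etc.]", "Univ. Pr"]
keys = {normalise(v) for v in variants}
print(keys)

# A garbled variant does not collapse to the same key; a similarity score
# gives graded evidence instead of a yes/no answer.
garbled = normalise("University Pressf ats collnutz")
score = difflib.SequenceMatcher(None, garbled, "university press").ratio()
print(round(score, 2))
```

The five clean variants collapse to a single key, while the garbled one only yields a score between 0 and 1. What to do with such a score is the question the rest of the talk addresses.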
15 
De-duplicate publishers 
[Diagram, built up over slides 15 and 16: the same pipeline (journal articles in Elsevier – Scopus (closed data), cited books linked via sameAs to OCLC - WorldCat (open data), aggregated "cited" publishers), with an additional sameAs link between duplicate "cited" publishers.] 
● Open Data Node 
– Links duplicates 
● Search 
– Uses links 
– On-the-fly matching? 
Deploy search engine
17 
Is the DH researcher happy? 
● Yes. All very nice... 
– ...but...? 
● Data are not 100% clean yet. 
● Can we rank publishers of books about “women in war”? 
The initial database problem 
needs to deal with uncertainty
18 
Uncertainty from ranking 
[Diagram, built up over slides 18 to 20: the publisher-ranking pipeline (Elsevier – Scopus (closed data), OCLC - WorldCat (open data), aggregated "cited" publishers), now fed by cited books ranked on being about "women in war".] 
subject  predicate  object 
book1    sameAs     book9 
book7    publisher  publisher3 
book9    publisher  publisher5 
Ranked cited books: 
subject  prob 
book1    0.7 
book1    0.5 
book2    0.4 
● Joins and aggregations become probabilistic joins and probabilistic aggregations 
Deploy search engine
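A minimal Python sketch of what probabilistic joins and aggregations could look like on the data from this slide. It assumes independent events: a join multiplies probabilities, and an aggregation combines rows with the same key as 1 - prod(1 - p). This is one common reading of probabilistic relational algebra, not necessarily the semantics implemented in Spinque's strategy blocks. The 0.9 confidence on the sameAs link is a hypothetical value.

```python
from collections import defaultdict

# Ranked cited books for the topic "women in war" (probabilities from the slide).
ranked_books = [("book1", 0.7), ("book1", 0.5), ("book2", 0.4)]

# Triples from the slide, each with a confidence. The 0.9 on the sameAs link
# is hypothetical; the publisher triples are treated as certain.
same_as = [("book1", "book9", 0.9)]
publisher_of = [("book7", "publisher3", 1.0), ("book9", "publisher5", 1.0)]

def combine_independent(pairs):
    """Probabilistic aggregation: P(key) = 1 - prod(1 - p) over its rows."""
    miss = defaultdict(lambda: 1.0)
    for key, p in pairs:
        miss[key] *= (1.0 - p)
    return {key: 1.0 - m for key, m in miss.items()}

def join(left, right):
    """Probabilistic join: follow matching triples, multiplying probabilities."""
    return [(obj, p * q)
            for key, p in left
            for subj, obj, q in right
            if key == subj]

books = combine_independent(ranked_books)   # book1: about 0.85, book2: 0.4
linked = join(books.items(), same_as)       # book9: about 0.85 * 0.9
publishers = combine_independent(join(linked, publisher_of))
print(publishers)
```

With these inputs only publisher5 receives a score (about 0.765): book2 has no sameAs link in the sample, and book7 is not reached from any ranked book.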
21 
More uncertainty from... 
[Diagram, built up over slides 21 to 25: the same pipeline (journal articles and cited books in Elsevier – Scopus (closed data), cited books in OCLC - WorldCat (open data), aggregated "cited" publishers), annotated step by step with further sources of uncertainty.] 
● Ranking 
● Fuzzy matching 
● Priors in data 
In fact...
26 
Rank. Everything. Always. 
● Unstructured search: uncertainty is a first-class citizen 
● Structured search: let's switch from "facts" to "evidence" 
– Forcing uncertain evidence into "facts" risks corrupting data and search results 
● Static data normalisation is good when it comes with high confidence 
● Otherwise, evidence can be used at query-time, depending on the context 
– Strategy blocks contain code for probabilistic DB 
● Based on Probabilistic Relational Algebra 
(Fuhr 1990, Rölleke et al. 2008) 
● Let's just call it "search", finally.
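The distinction between static normalisation and query-time evidence can be sketched as a simple split on confidence. The candidate links, their confidence values, the 0.95 threshold and the function name `split_links` are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical candidate links between publisher records, with match confidence.
candidates = [
    ("Univ. Pr.,Cambridge", "University Press,Cambridge [England]", 0.97),
    ('Univ. Pr.,"Cambridge, U.S.A."', "University Press,Cambridge [England]", 0.35),
]

def split_links(candidates, threshold=0.95):
    """Separate links safe to normalise statically from query-time evidence."""
    # High confidence: materialise the link once, as a plain fact.
    facts = [(a, b) for a, b, p in candidates if p >= threshold]
    # Otherwise: keep the probability so a query can weigh it in context.
    evidence = [(a, b, p) for a, b, p in candidates if p < threshold]
    return facts, evidence

facts, evidence = split_links(candidates)
print(len(facts), "static link(s),", len(evidence), "kept as evidence")
```

Links in `evidence` are never discarded or forced to true; they stay available as weighted input to probabilistic joins at query time.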
27 
Summary 
● The use case shown 
– benefits from LOD 
● data and results can be expanded / improved 
– benefits from Search by Strategy 
● probabilistic modelling of search scenarios 
● On-going effort in the COMSODE context 
– Open Data Node: good quality LOD 
– Search by Strategy: exploit uncertainty 
● Currently 
– improving RDF support (e.g. vocabularies, inference) 
– improving query-time resolution of data conflicts
Thank you 
www.spinque.com 
www.comsode.eu 
www.youropendata.eu
