Collaboration Recommender
1. A Collaboration Recommender
Based on Linked Open Data
Conforming to the VIVO Ontology
Anup Sawant, Hugh J. Devlin, Noshir Contractor (Northwestern)
Brandyn J. Kusenda, David Eichmann (Iowa)
VIVO 2012 Miami, Florida USA
This research was supported by the following grants: National Science Foundation grants
CNS-1010904, OCI-0904356, IIS-0838564, UL1RR024146-06S2 and NIH CTSA awards
UL1RR025741, 5UL1RR025741-04S3
2. SONIC C-IKNOW VIVO Recommender
Outline
• Motivation & Project overview
• MTML collaboration recommendation heuristics
• Report on our practical experience in building
collaboration recommender systems
• Importance of relational data in recommending
collaborations, citation in particular
• Recommendations, Future Work, Questions,
Comments, Suggestions
• Acknowledge Contributors, Collaborators, and Tools
3. Ascendance of Teams
Studies of 19.9 million research articles over 5 decades as recorded in the Web of Science
database, and an additional 2.1 million patent records from 1975-2005 found three important
facts.
1. For virtually all fields, research is increasingly done in teams
2. Teams typically produce more highly cited research than individuals do (accounting for
self-citations), and this team advantage is increasing over time.
3. Teams now produce the exceptionally high impact research, even where that distinction
was once the domain of solo authors.
Sources: Wuchty, Jones, and Uzzi, 2007a, 2007b
4. Ascendance of Virtual Teams
The trend toward virtual communities was not driven by a growth in teamwork by
scientists working with other co-located scientists. Using the Web of Science
database to analyze the collaboration arrangements of over 4,000,000 papers over
a 30 year period, Jones, Wuchty, Uzzi found that:
1. Team science is increasingly composed of co-authors located at different
universities.
2. These “virtual communities of scholars” produce higher impact work than
comparable co-located teams or solo scientists.
3. This change is true for all fields and team sizes, as well as for research done at
elite universities
Source: Jones, Wuchty, Uzzi (2008)
5. Findings for all proposal collaborations
Explaining Proposal Collaboration Relation (p*/ERGM results)
         Effects                                                          Full model (N=2,186)
Control  Isolates (single author)                                         5.447*
Control  Edge (proposal collaboration relation)                           -6.751*
Control  Weighted degree (negative measure of preferential attachment)    4.623*
H1       Gender (Female)                                                  0.021
H2       Tenure (Years since PhD)                                         0.002*
H3       Institution Tier (Top 10% universities)                          -0.098*
H4       H-index                                                          -0.014*
H5       Co-authorship                                                    2.431*
H6       Citation relation                                                1.132*
* Indicates p<0.05
Researchers are more likely to be familiar with, and to collaborate again with, those with whom they
share a collaboration history (co-authorship) or those they cite.
Lungeanu, Huang, Contractor (2012) “A network perspective on success in collaboration: Stop
citing me for our own good?”, Academy of Management
6. SONIC C-IKNOW VIVO Recommender
Project Goals
• Port the SONIC collaboration recommendation heuristics
to VIVO
• Gain practical experience in building systems that use
– Linked Open Data (LOD)
– SPARQL query language
• Cross-institutional recommending
– Generalize the SONIC collaboration recommendation prototype from a
single institution (Northwestern) to multiple institutions
– Explore use of distributed, federated queries
• Technology adoption study of the utilization and impact
of our social-science grounded recommendation
heuristics
8. Social Drivers:
Why do we create and sustain networks?
• Theories of self-interest
• Theories of social and resource exchange
• Theories of mutual interest and collective action
• Theories of contagion
• Theories of balance
• Theories of homophily
• Theories of proximity
Contractor, N. S., Wasserman, S. & Faust, K. (2006). Testing multi-theoretical multilevel hypotheses
about organizational networks: An analytic framework and empirical example. Academy of
Management Review
9. Multi-theoretical, Multi-level (MTML)
Collaboration Recommendation Heuristics
Heuristic            Social theory         Relations                               Metric
Affiliation          proximity             affiliation; coauthorship               neighbor; neighbor
Cocitation           mutual interest       cocitation; coauthorship                neighbor; neighbor
Most Qualified       self-interest         citation; authorship                    h-index
Friend of a friend   balance               coauthorship                            distance; count of geodesics
Social Exchange      reciprocity           citation                                dyadic in-degree
Follow the crowd     contagion             coauthorship + citation; coauthorship   centrality; distance
Birds of a feather   homophily             (attributes)                            count
Mobilizing           collective action     coauthorship + citation                 shortest path; betweenness
Feeling lucky        probabilistic model   coauthorship; citation                  p*/ERGM
Monge, P. R. and N. S. Contractor (2003)
Theories of communication networks NY:
Oxford University Press
10. Affiliation Heuristic
• The ‘Affiliation’ score is proportional to the
number of experts who are in the same
department as the seeker but have not
collaborated with the seeker in the past
• A form of proximity theory – social relations
are (at least in part) opportunistic
“We both work in the same department so we might want to
collaborate in future.”
Example - Works in the same department (Entomology and
Nematology) but never coauthored.
12. Co-Citation Heuristic
• The Co-Citation score is proportional to the
number of times the seeker is co-cited with an
identified expert
• A Cognitive metric: 3rd party rating of similarity
• Mutual interest theory
– Sherif, M. (1958) "Superordinate Goals in the Reduction of Intergroup
Conflict."
“I have been co-cited with a qualified person quite a few times so
I might want to collaborate with him in future.”
Example – Co-cited with you 3 times.
(specifically disallows previous co-authors)
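A minimal sketch of how such a co-citation count might be computed. It is an illustration, not the recommender's actual code: the function name, the dictionary layout (`cited_by`, `authors_of`), and the sample data are all hypothetical. A citing article counts once for a given expert if it cites at least one article by the seeker and at least one by that expert; previous co-authors are excluded, as the slide specifies.

```python
from collections import defaultdict

def cocitation_counts(cited_by, authors_of, seeker):
    """For each non-co-author, count the citing articles that cite both
    an article by `seeker` and an article by that author."""
    # Map each citing article to the set of authors whose work it cites.
    cited_authors = defaultdict(set)
    for article, citers in cited_by.items():
        for citer in citers:
            cited_authors[citer].update(authors_of.get(article, ()))
    # The seeker and all previous co-authors are excluded from the results.
    excluded = {a for names in authors_of.values() if seeker in names for a in names}
    counts = defaultdict(int)
    for names in cited_authors.values():
        if seeker in names:
            for other in names - excluded:
                counts[other] += 1
    return dict(counts)

# Hypothetical data: article -> authors, and article -> articles citing it.
authors_of = {"a1": ["seeker"], "a2": ["expert"],
              "a3": ["seeker", "colleague"], "a4": ["colleague"]}
cited_by = {"a1": ["c1", "c2", "c3"], "a2": ["c1", "c2", "c3", "c4"], "a4": ["c1"]}
print(cocitation_counts(cited_by, authors_of, "seeker"))
```

With this sample data the expert is co-cited with the seeker by three citing articles, which corresponds to an explanation such as "Co-cited with you 3 times."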
14. H-index
• A scientist has index ‘h’ if ‘h’ of his/her N papers referencing the query term have
at least h citations each, and the other (N-h) papers have no more than h citations
each.
(image source: Wikipedia)
Hirsch, J. E. (2005) An index to quantify an individual's scientific research output
15. Qualified H-index
• A scientist has a “qualified h index,” that is, an h-index qualified by a given
concept, based on the number of their publications which are associated
with that concept as a keyword
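The two definitions above can be sketched as follows. This is an illustrative sketch only: the function names and the representation of an article as a (keywords, citation count) pair are assumptions, and the sample values are invented.

```python
def h_index(citation_counts):
    """Largest h such that h of the papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites < rank:
            break
        h = rank
    return h

def qualified_h_index(articles, concept):
    """h-index restricted to the articles that carry `concept` as a keyword.
    `articles` is a list of (keywords, citation_count) pairs (assumed layout)."""
    return h_index([cites for keywords, cites in articles if concept in keywords])

# Hypothetical publication list for one expert.
articles = [
    ({"nematodes", "biocontrol"}, 5),
    ({"nematodes"}, 2),
    ({"nematodes"}, 0),
    ({"mites"}, 40),
    ({"mites"}, 7),
]
print(h_index([cites for _, cites in articles]))
print(qualified_h_index(articles, "nematodes"))
```

Restricting the list to one concept can only lower or preserve the h-index, so an expert with a high overall h-index is not automatically the most qualified for a given query term.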
16. Most Qualified Heuristic
• The ‘Most Qualified’ score is proportional to
the expert’s “Qualified h-index”
• Self interest theory
– Simon, Herbert (1957). “A Behavioral Model of Rational Choice”
– MacDonald, C. and Ounis, I. (2006) “Voting for candidates: adapting data fusion
techniques for an expert search task”
“I like to work with someone who is most useful to me and seems to have a lot of
expertise to offer.”
Example – 2 of this expert’s articles that include the query term have been cited
at least 2 times each.
17. VIVO Ontology
Representation of Concepts
• Research Areas (associated with researchers)
• Subject Areas (associated with articles)
• Free Text Keywords (associated with articles)
18. Friend of a Friend Heuristic
• The ‘Friend of a Friend’ score is proportional to the
number of distinct paths through which the expert is
indirectly connected to the seeker, and favors experts
close to the seeker in the collaboration network.
• Balance theory, AKA “closing open triangles”
– Monge, P. R. and N. S. Contractor (2003). Theories of communication
networks.
“I like to work with someone I have not previously worked with. If I give
our mutual friend as a reference, they’re more likely to accept.”
Example - Connected indirectly through Hoy, Marjorie Ann via the co-
authorship network.
19. Friend of a Friend …
• Network: (global) Collaboration
– (scalar) Expert attributes
• Path length: distance d from seeker u to expert e
• Number of geodesics n from seeker u to expert e
\[ f_{obj}(u, e) = \frac{\sqrt{n_{sp}(u, e)}}{d(u, e)^{2}} \]
(specifically disallows previous co-authors)
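A sketch of the Friend of a Friend score, assuming the form described in the notes for this slide: the square root of the number of shortest paths, divided by the square of the distance. The adjacency-dictionary representation, the function names, and the sample network are hypothetical; the breadth-first search is a standard way to count geodesics in an unweighted graph, not necessarily the method used in the recommender.

```python
from collections import deque
from math import sqrt

def shortest_path_stats(graph, source):
    """Breadth-first search returning, for every reachable node,
    its distance from `source` and the number of shortest paths to it."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                paths[neighbor] = paths[node]
                queue.append(neighbor)
            elif dist[neighbor] == dist[node] + 1:
                # Another shortest route into an already discovered node.
                paths[neighbor] += paths[node]
    return dist, paths

def friend_of_a_friend(graph, seeker, expert):
    """sqrt(number of geodesics) / distance^2; zero for unreachable experts
    and for direct co-authors (distance 1), who are disallowed."""
    dist, paths = shortest_path_stats(graph, seeker)
    d = dist.get(expert)
    if d is None or d < 2:
        return 0.0
    return sqrt(paths[expert]) / d ** 2

# Hypothetical undirected co-authorship network.
coauthors = {
    "seeker": ["hoy", "lee"],
    "hoy": ["seeker", "expert"],
    "lee": ["seeker", "expert"],
    "expert": ["hoy", "lee", "far"],
    "far": ["expert"],
}
print(friend_of_a_friend(coauthors, "seeker", "expert"))
print(friend_of_a_friend(coauthors, "seeker", "far"))
```

In the sample network the expert is two steps away through two distinct mutual co-authors, so it scores higher than the more distant node reached through it.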
20. Social Exchange Heuristic
• The ‘Social Exchange’ score is proportional to
the number of articles c authored by the
expert e which cite the seeker u
• Reciprocity theory
– Blau, P. M. (2006) Exchange and power in social life.
\[ f_{obj}(u, e) = c(u, e) \]
“I’ve helped them in the past, so they’re more likely to help me
now.”
Example – Cited your work in 3 articles.
21. Follow the Crowd Heuristic
• The ‘Follow the Crowd’ score is proportional to
the expert’s overall popularity in terms of
collaboration and being cited, and favors experts
close to the seeker in the collaboration network.
• Contagion theory
– Krackhardt, D. and Brass, D. J. (1994) Intraorganizational networks: the micro
side.
– Krackhardt, D. M. (1986) Cognitive social structures.
“They seem to be the most qualified person since many others are
working with them.”
Example - Co-authored or cited by 5 people and is within 3 step(s)
from you via Co-authorship and Citation Network.
22. Follow the Crowd …
\[ f_{obj}(u, e) = \frac{deg_{in}(e)}{d(u, e)} \]
• deg_in(e): Expert’s in-degree in the combined
network (Collaboration + citation)
• d: distance from seeker u to expert e in the
collaboration network if connected, max(d)
otherwise
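A sketch of this score under stated assumptions: in-degree is counted as the number of distinct people who co-authored with or cited the expert, and max(d) is taken to be the largest distance from the seeker to any reachable node. Both readings, along with the function names and sample data, are assumptions made for illustration.

```python
from collections import defaultdict, deque

def bfs_distances(graph, source):
    """Hop counts from `source` to every reachable node of an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def combined_in_degree(collab, cites):
    """Distinct co-authors plus distinct citers for each author
    (collaboration network united with the citer -> cited network)."""
    incoming = defaultdict(set)
    for author, coauthors in collab.items():
        for coauthor in coauthors:
            incoming[coauthor].add(author)
    for citer, cited_authors in cites.items():
        for cited in cited_authors:
            incoming[cited].add(citer)
    return {author: len(sources) for author, sources in incoming.items()}

def follow_the_crowd(collab, cites, seeker, expert):
    dist = bfs_distances(collab, seeker)
    # Assumption: unreachable experts get the largest observed distance.
    d = dist.get(expert, max(dist.values()))
    if d == 0:
        return 0.0
    return combined_in_degree(collab, cites).get(expert, 0) / d

# Hypothetical networks.
collab = {"seeker": ["a"], "a": ["seeker", "b"], "b": ["a", "expert"],
          "expert": ["b", "c"], "c": ["expert"], "loner": []}
cites = {"x": ["expert"], "y": ["expert"], "z": ["expert", "loner"]}
print(follow_the_crowd(collab, cites, "seeker", "expert"))
print(follow_the_crowd(collab, cites, "seeker", "loner"))
```

Here the expert is co-authored with or cited by 5 people and sits 3 steps from the seeker, matching the form of the example explanation on the previous slide.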
23. Birds of a Feather Heuristic
• The ‘Birds of a Feather’ score is proportional to the (weighted w)
number of attributes a shared between the seeker u and the
expert e, such as moniker (title), department, grad school and
major field of study
• Homophily theory
– Foucault Welles, B., A. Van Devender, et al. (2010) Is a “Friend” a Friend? Investigating the Structure
of Friendship Networks in Virtual Worlds
• No network measures
\[ f_{obj}(u, e) = \sum_{k} w_k \, a_k(u, e) \]
“I find it easier to communicate with someone who has things in common
with me.”
Example - Shares one or more of the following attributes : moniker, work
department, grad school and major field of study.
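A sketch of the weighted attribute match, where a_k(u, e) is taken to be 1 when the seeker and expert share attribute k and 0 otherwise. The attribute keys, the equal weights, and the profile values are hypothetical; the slides do not state the weights actually used.

```python
def birds_of_a_feather(seeker, expert, weights):
    """Sum of the weights of the attributes whose values match exactly."""
    return sum(
        weight
        for attribute, weight in weights.items()
        if seeker.get(attribute) is not None
        and seeker.get(attribute) == expert.get(attribute)
    )

# Hypothetical weights and profiles.
weights = {"moniker": 1.0, "department": 1.0, "grad_school": 1.0, "field": 1.0}
seeker = {"moniker": "Professor", "department": "Entomology and Nematology",
          "grad_school": "University A", "field": "Entomology"}
expert = {"moniker": "Associate Professor", "department": "Entomology and Nematology",
          "grad_school": "University B", "field": "Entomology"}
print(birds_of_a_feather(seeker, expert, weights))
```

Because no network measures are involved, this score can be computed from profile data alone, even for researchers with no publications.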
24. Mobilizing Heuristic
• The ‘Mobilizing’ score favors experts who are brokers and close to
the seeker in the union of the collaboration and citation networks.
• Theory of Collective Action
– Coleman, J. S. (1966) “Individual interests and collective action.”
– Laumann, E. O. and F. U. Pappi (1976) Networks of collective action
“He seems to be connected to lots of qualified experts and can help me make
more useful connections.”
Example – Qualified expert who is a broker among other experts.
\[ f_{obj}(u, e) = \frac{inDeg(e)}{outDeg(e)} \, \frac{bet(e)}{d(u, e)} \]
– fobj(u,e) : Objective function of user u and expert e
– inDeg(e) : in-degree of expert in union of the Collaboration and Citation networks.
– outDeg(e) : out-degree of expert in union of the Collaboration and Citation networks.
– d(u,e) : seeker to expert distance in union of the Collaboration and Citation networks.
– bet(e) : expert’s betweenness centrality in union of the Collaboration and Citation networks; see
Wasserman, S. and K. Faust (1995) Social Network Analysis: Methods and Applications
25. Feeling Lucky Heuristic
• The ‘Feeling Lucky’ score is an estimate of the probability of
collaboration using a p*/Exponential Random Graph
Model (ERGM) of scientific team formation.
• A Probabilistic Model of relationship formation
– Wasserman, S. and G. Robins (2003) An introduction to random graphs,
dependence graphs, and p*
• Factors affecting probability
– In-Degree Centrality of expert in the union of Collaboration and Citation networks
– Publication count of expert
– Similarity (~ “birds of a feather”): moniker, work department, grad school, major field of study
– Number of times collaborated with seeker
– Number of times cited seeker
27. Scientometric Relations
Bibliometric Relations
• Authorship relations (author-article)
– Primary evidence of historical collaboration
behavior
• Citation relations (article-article)
– An important leading indicator of future
collaboration behavior
28. Bibliometric Relations
Descriptions
Domain-Range      Relation        Directed/Undirected   Magnitude
author-article    authorship      directed              N
author-author     co-authorship   undirected            Y
article-article   citation        directed              N
author-author     citation        directed              Y
article-article   co-citation     undirected            Y
author-author     co-citation     undirected            Y
29. Citation-related Relations
Dependencies
• Article-Article Citation (base relation)
• Derived from it: Article-Article Co-Citation, Author-Author Citation, Author-Author Co-Citation
Garfield, Eugene (1955) "Citation indexes for science"
M. M. Kessler (1963) "Bibliographic coupling between scientific papers"
30. Citation-related Relations
Four Useful Primitive Operations
• Authorship-related (derived from VIVO)
1. Given an author A, find all articles by A
getArticles(authorURI)
2. Given an article A, find all authors of A
getAuthors(articleURI)
• Citation-related (derived from PubMed)
3. Given an article A, find all articles which cite A
getArticleArticleCitationFrom(articleID)
4. Given an article A, find all articles cited by A
getArticleArticleCitationTo(articleID)
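A sketch of how the four primitives compose into an author-author relation, here the Social Exchange count of articles by an expert that cite the seeker. The in-memory class is a hypothetical stand-in for the VIVO and PubMed queries; the method names are snake_case adaptations of the operations listed above, and the sample data is invented.

```python
from collections import defaultdict

class InMemoryBibliography:
    """Hypothetical stand-in for the four primitive retrieval operations."""

    def __init__(self, authorship, citations):
        # authorship: (author, article) pairs; citations: (citing, cited) pairs.
        self._articles = defaultdict(set)
        self._authors = defaultdict(set)
        self._cited_by = defaultdict(set)
        self._cites = defaultdict(set)
        for author, article in authorship:
            self._articles[author].add(article)
            self._authors[article].add(author)
        for citing, cited in citations:
            self._cited_by[cited].add(citing)
            self._cites[citing].add(cited)

    def get_articles(self, author):            # 1. articles by an author
        return set(self._articles.get(author, ()))

    def get_authors(self, article):            # 2. authors of an article
        return set(self._authors.get(article, ()))

    def get_citations_from(self, article):     # 3. articles which cite `article`
        return set(self._cited_by.get(article, ()))

    def get_citations_to(self, article):       # 4. articles cited by `article`
        return set(self._cites.get(article, ()))

def social_exchange(bib, seeker, expert):
    """Number of distinct articles authored by `expert` that cite `seeker`."""
    citing = set()
    for article in bib.get_articles(seeker):
        for citer in bib.get_citations_from(article):
            if expert in bib.get_authors(citer):
                citing.add(citer)
    return len(citing)

authorship = [("seeker", "s1"), ("seeker", "s2"), ("expert", "e1"),
              ("expert", "e2"), ("expert", "e3"), ("other", "o1")]
citations = [("e1", "s1"), ("e2", "s1"), ("e2", "s2"), ("e3", "s2"), ("o1", "s1")]
bib = InMemoryBibliography(authorship, citations)
print(social_exchange(bib, "seeker", "expert"))
```

The same three-step traversal (author to articles, articles to citing articles, citing articles to authors) underlies the other author-author citation relations, which is why efficient implementations of the primitives matter.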
31. Linking Scientometric Data
VIVO Recommender Sources
Data category    VIVO                 PubMed
Researcher ids   Very strong          Very weak
Article ids      Some PubMed Ids      Very strong
Citation data    little or none       Good
Scope            University faculty   International, 1809-present
32. Author Representation
VIVO vs. PubMed
Prof. Alan R. Katritzky,
Department of Chemistry,
University of Florida
UF VIVO: http://vivo.ufl.edu/individual/n3622
PubMed: AR Katritzky; Alan Roy Katritzky; Alan R Katritzky; A R Katritzky
33. Linking UF VIVO to PubMed
Approach Diagram
• VIVO: authorship relations link Author1 to Author4 with Article1 to Article4; some articles carry a PubMed ID
• PubMed: citation links between the articles that have PubMed IDs (Article1 and Article3 in the diagram)
34. Linking UF VIVO to PubMed
Publication coverage
• 8852 publications in UF VIVO
• 8037 distinct PubMed ids
associated with UF VIVO
publications
• ~90% of UF VIVO’s articles key into
PubMed, making article-article
citation data available using Linked
Open Data
(Pie chart: With PubMed ID / Without PubMed ID)
35. Linking UF VIVO to PubMed
Faculty coverage
• 6578 Faculty Members in UF VIVO
• 990 (15%) of UF Faculty Members have
at least one publication in UF VIVO
• 906 UF Faculty Members have at least
one publication in PubMed
• Therefore using our approach
(VIVO+PubMed mash-up) just 14% of
UF Faculty Members have the
possibility of having article-article
citation data (and hence author-author
citation data) available
(Pie chart: With at least one PubMed ID / no pubs or no pubs with PubMed ID)
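The coverage percentages on the two preceding slides follow directly from the reported counts. The short check below recomputes them; the variable names are illustrative and the counts are those given on the slides.

```python
# Counts reported for UF VIVO on the two coverage slides.
publications = 8852
publications_with_pubmed_id = 8037
faculty = 6578
faculty_with_publication = 990
faculty_with_pubmed_publication = 906

article_coverage = publications_with_pubmed_id / publications
faculty_with_pubs_share = faculty_with_publication / faculty
faculty_with_pubmed_share = faculty_with_pubmed_publication / faculty

print(f"Articles keyed into PubMed: {article_coverage:.1%}")
print(f"Faculty with a VIVO publication: {faculty_with_pubs_share:.1%}")
print(f"Faculty with a PubMed publication: {faculty_with_pubmed_share:.1%}")
```

The contrast between the two ratios is the point of these slides: article-level linkage to PubMed is high, while faculty-level coverage is limited by how few faculty have any publications recorded in VIVO.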
36. Cross-Institutional Search
Previous Work (VIVO 2011)
• Direct2Experts
– http://direct2experts.org/
– Distributed query
– Links to a researcher’s home RNS
– Weber GM, Barnett W, Conlon M, Eichmann D, Kibbe W, Falk-Krzesinski
H, Halaas M, Johnson L, Meeks E, Mitchell D, Schleyer T, Stallings
S, Warden M, Kahlon M (2011) Direct2Experts: a pilot national network to
demonstrate interoperability among research-networking platforms
• VIVO Search
– http://beta.vivosearch.org/
– Centralized index of multiple sites
37. SONIC C-IKNOW VIVO Recommender
SPARQL Query Language for RDF
Just Say NO! to Web Crawling
38. SONIC C-IKNOW VIVO
Collaboration Recommender
Architecture diagram:
• Client: web browser (PC, Mac, Smart Phone, tablet)
• SONIC servers: SONIC C-IKNOW VIVO Collaboration Recommender server; SONIC C-IKNOW VIVO p*/ERGM server running R (statnet); user profiles with multiple saved search criteria
• Remote servers, queried via SPARQL: VIVO (Florida), VIVO (Cornell), PubMed (Iowa)
• Data retrieved: community of interest (profiles, publications, citations, keywords)
• Output: ranked recommendations returned to the client
39. Lessons learned
• Researcher Networking Systems (RNSs) should
take article-article citation data seriously
• Adding a robust SPARQL endpoint to each
VIVO-compliant RNS facilitates publishing and
sharing linked open data
• Available free and open source software
(FOSS) tools are mature and more than
adequate to begin building interesting
applications on RNSs
40. Lessons learned …
VIVO Ontology
• Embrace the existing support in the already
included bibo ontology for article-article
citation data and populate the data
• Add researcher attributes
– Year of last degree
– Gender
41. Future Work
• Technology adoption study for an online
collaboration recommendation tool for
research scientists
• p*/ERGM probabilistic recommendations
• Improve navigation through the concept space
using an ontology such as MeSH
• Recommend entities
42. SONIC C-IKNOW VIVO Recommender
Demonstration
• http://ciknow1.northwestern.edu/vivorecommender/
• Migrating soon to:
http://ciknow.northwestern.edu/vivorecommender/
• GitHub:
http://github.com/soniclab
http://github.com/soniclab/vivo-recommender
43. SONIC C-IKNOW VIVO Recommender
Open Source Software Stack
• Java – programming language
• Apache Jena
– RDF interface
– ARQ: SPARQL support
• Java Universal Network/Graph Framework (JUNG) –
social network analysis (SNA) algorithms
– Centrality measures
– Degree of nodes
– etc
• JUnit – unit testing and quality assurance
• Data-Driven Documents (D3) - visualization
44. SONIC C-IKNOW VIVO Recommender
Our Collaborators
• University of Florida
– Mike Conlon
– Nicholas Rejack
– Stephen Williams
• University of Iowa
– David Eichmann
– Brandyn Kusenda
• Cornell University
– Jon Corson-Rikert
– Brian Caruso
– Christopher Manly
– John Fereira
45. SONIC C-IKNOW VIVO Recommender
SONIC Contributors
• Anup Sawant
• Joe Gilborne
• Jinling Li
• Hugh Devlin
• Willem Pieterson
• Noshir Contractor
Editor's Notes
Every study SONIC conducts on collaboration patterns in science reveals that co-authorship data, and then citation data, are important predictors of future collaboration. Prof. Contractor may have more to say about this research on Friday. For now here is one such study, enabled by a unique, brief peek into grant proposal activity offered to our lab by the NSF.
Our overarching goal is to conduct a technology adoption study for an online collaboration recommendation tool for research scientists, with an emphasis on testing and validating the efficacy of, and user satisfaction with, a suite of recommendation heuristics drawn from social science research.
This table serves as a topic slide for the next section of our presentation. We view collaboration recommendation systems as a critical application area to test our ideas of the factors contributing to high-impact collaborations in science.
The FOAF score is proportional to the number of distinct shortest paths between the seeker and the expert, so that an expert with many distinct connections is preferred to one with fewer. More specifically, it is proportional to the square root of the number of shortest paths, so that, for example, the 11th shortest path does not contribute as much as the 2nd, and inversely proportional to the square of the distance, so that closer experts are preferred over more distant experts.
the popularity contest
A similarity measure; our birds of a feather heuristic is perhaps our recommending heuristic that would be most recognizable to those familiar with traditional approaches to recommending.
We find a number of bibliographic relations to be useful in recommending collaborations. The “Magnitude” column suggests whether a relation may be considered to have a magnitude, for example, the number of papers co-authored by two authors.
Quality co-authorship relations can be derived from accurate and complete author-article relations; similarly, variants of citation and co-citation depend critically on the accuracy and completeness of article-article citation data.
In our experience, researcher networking systems (RNSs) need to be designed to include efficient implementations of four primitive data retrieval operations on bibliometric data in order to facilitate collaboration recommendations.
SONIC’s C-IKNOW VIVO Recommender is a so-called “mash-up” of heterogeneous, linked open data (LOD) sources: one or more VIVO instances, supplemented by citation data from PubMed, courtesy of a SPARQL endpoint implemented by our collaborators at the University of Iowa. VIVO’s strength is disambiguation of researchers, while PubMed’s strength is disambiguating publications.
For an example of the different strengths and weaknesses of VIVO and PubMed, consider the representation of Prof. Alan Katritzky of the department of Chemistry at the University of Florida (the most prolific author at UFL). Following LOD best practices, the UFL VIVO team assigned a unique URI to Prof. Katritzky, while in PubMed he is represented by at least 4 different character strings. Similarly, it would be trivial to find, in any given VIVO instance, duplicate publications, perhaps with slight variations in the title. Heuristics for the disambiguation of names are an active area of research. In our work we are more concerned with the recommendation heuristics, so we do not employ any disambiguation heuristics, instead taking an institution’s VIVO as the authority on their faculty’s publications, and taking PubMed as the authority for article-article citation data.
An overview of our approach to supplementing VIVO authorship data with article-article citation data from PubMed. For example, here Author1 and Author2 cite Author3. Note that not all researchers have articles, not all researchers have articles with PubMed IDs, and not all articles in PubMed have citation links. A consequence of this approach is that the recommender is most useful in the biomedical sciences.
The UF VIVO system administration team has done an excellent job of associating PubMed identifiers to publications in their VIVO instance. Notes: 34,313 authorship instances (author-article relations), average 3.8 authors/article.
On the other hand, most UF faculty members have no publications listed in UF’s VIVO. Moreover, only the subset of articles in PubMed comprising PubMed Central, that is, the full-text articles, have article-article citation data in machine-readable form; the “References” section of an article is encoded by human curators as part of the process of ingestion of an article into PubMed Central.
After being inspired by these two divergent approaches to cross-institutional recommending at VIVO 2011, we hoped to find something of a middle ground.
We embraced SPARQL for our project, targeting specifically the data we need at query time, as opposed to alternative approaches involving web crawling or harvesting RDF from dereferencing URIs.
The easy availability of high-quality, free and open source software tools greatly reduces the cost of getting started building sophisticated applications on top of rich linked open data such as VIVO. We must acknowledge the contributions of all those who worked on these projects, without whom our work would not be possible.