SubSift web services and workflows for profiling and comparing scientists and their published works

Simon Price, Peter Flach, Sebastian Spiegler,
Christopher Bailey and Nikki Rogers

Outline of this paper
1. SubSift – submission sifting
2. Background Theory: Vector Space Model
3. SubSift REST API
4. Demonstration Workflows
5. Conclusions

1. SubSift – submission sifting

SubSift
SubSift is a prototype
application to support
academic peer review.
SubSift matches submitted
conference/journal papers to
potential peer reviewers
based on similarity to
published works.
Website:
http://subsift.ilrt.bris.ac.uk

SubSift has been used for...
15

Contribution of this work
SubSift RESTful web services:
• Open Source software (on Google Code)
• Hosted open web service at University of Bristol
Re-usable workflows for profiling and comparing scientists
and their published works.
Tool for constructing, manipulating and publishing
document-centric datasets.

Related Work
• SubSift uses techniques more normally associated with
Information Retrieval
• Full text search tools support text matching on large-scale
document collections
e.g. Apache Lucene, PostgreSQL, Oracle UltraSearch
Designed for 1:M matching but can also do Cartesian product M:M matching.
• How SubSift differs:
• Exposes detailed metadata throughout.
• Partly a research tool: need to plug in + instrument new algorithms.
• Fewer licensing restrictions and dependencies for open source.

2. Background Theory: Vector Space Model

Vector Space Model (from Information Retrieval)
Vector Space Model consists of:
• bag-of-words representation
• cosine similarity
• tf-idf weighting
For a query (q), rank the documents (dj) in collection (D) by
descending similarity to the query.

Vector Space Model: bag-of-words representation
• Number of terms in each abstract
• Number of terms in the DBLP author page of each PC member
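
As an illustration of the bag-of-words idea, the following minimal sketch turns a piece of text into term counts, discarding word order. The tokenisation rule (lowercase runs of letters and digits) and the sample sentence are assumptions made for this example; the slides do not say how SubSift tokenises text.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a term -> count mapping, ignoring word order and case."""
    # Assumed tokenisation: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Made-up sample text standing in for an abstract.
abstract = "Web services for profiling scientists; services for comparing scientists."
bag = bag_of_words(abstract)
print(len(bag), "distinct terms,", sum(bag.values()), "terms in total")
```

The same function could be applied to an author page to obtain the second set of term counts.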

Vector Space Model: cosine similarity
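
The formula on this slide did not survive extraction. The standard definition is cos(q, dj) = (q · dj) / (|q| |dj|), the dot product of the two term vectors divided by the product of their lengths. A minimal sketch over sparse vectors stored as dictionaries follows; the vector values and the names `paper` and `reviewer` are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Illustrative term weights, not real data.
paper = {"kernel": 2.0, "learning": 1.0}
reviewer = {"learning": 3.0, "logic": 4.0}
print(cosine_similarity(paper, reviewer))
```

Identical vectors score 1, and vectors with no terms in common score 0.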

Vector Space Model: tf-idf weighting
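
The weighting formula shown on this slide was lost in extraction, and tf-idf has several variants. The sketch below uses one common form, assumed here rather than taken from the slides: term frequency normalised by document length, multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weighted

# Toy collection of three tokenised documents.
docs = [["rest", "api", "rest"], ["api", "workflow"], ["api", "similarity"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document ("api" here) receives weight 0, so it does not contribute to similarity, while terms concentrated in few documents are weighted up.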

Representational State Transfer (REST)
“RESTful” web services:
• URIs to represent resources
• HTTP POST/GET/PUT/DELETE correspond to usual
Create/Read/Update/Delete (CRUD) operations
• Response formats typically include: XML, JSON, CSV
REST is a design pattern for web services based on HTTP using its
familiar URIs, requests, responses, authentication, etc.
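
The CRUD-to-HTTP correspondence can be sketched with Python's standard library. The base URL, resource path and JSON body below are hypothetical and are not SubSift's actual API; the requests are only constructed, not sent.

```python
from urllib.request import Request

BASE = "http://example.org/api"  # hypothetical service root, not SubSift's

# The usual mapping from CRUD operations to HTTP methods.
CRUD_TO_HTTP = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}

def build_request(operation, resource_path, body=None):
    """Build (but do not send) an HTTP request for a CRUD operation on a URI."""
    data = body.encode("utf-8") if body is not None else None
    return Request(
        BASE + resource_path,
        data=data,
        method=CRUD_TO_HTTP[operation],
        headers={"Accept": "application/json"},  # ask for a JSON response
    )

# Hypothetical resource path and body.
req = build_request("update", "/documents/42", body='{"title": "x"}')
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the example runs without a server.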

3. SubSift REST API

SubSift System Architecture
[Diagram: client, SubSift REST API, SubSift Harvester, web and filestore. Formats shown: XML, CSV, Terms, JSON, YAML, RDF, XSLT]

SubSift – canonical workflow

4. Demonstration Workflows

Workflow 1 – Submission Sifting

Workflow 1 – Web 2.0 Client Implementation

Workflow 1 – Papers is just a list of URLs (e.g. Yahoo! Pipes)

Workflow 2 – Finding an Expert

Workflow 3 – Visualising Similarity

Clustering staff based on homepage similarity
Dendrogram produced in MATLAB from a SubSift-generated similarity matrix

Precision-recall at different thresholds
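
The plot from this slide is not preserved. As a sketch of the underlying calculation, the code below computes precision and recall when every pair scoring at or above a similarity threshold is treated as a match. The scores and the set of true matches are toy values invented for illustration, and the slides do not state what ground truth was used.

```python
def precision_recall(scores, relevant, threshold):
    """scores: {item: similarity}; relevant: set of items judged true matches."""
    retrieved = {item for item, s in scores.items() if s >= threshold}
    hits = len(retrieved & relevant)
    # Convention chosen here: precision is 1.0 when nothing is retrieved.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

scores = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}  # toy similarity scores
relevant = {"a", "c"}                              # toy ground truth
for t in (0.5, 0.3):
    print(t, precision_recall(scores, relevant, t))
```

Lowering the threshold retrieves more items, which tends to raise recall at the cost of precision.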

Similarity networks
Diagram created by Graphviz from a SubSift-generated dot file
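
To show how a similarity matrix can become a Graphviz input file, here is a minimal sketch that writes an undirected graph in the DOT language, keeping only edges at or above a threshold. The function name, threshold and matrix values are assumptions for this example, and the layout of SubSift's own dot output is not shown in the slides.

```python
def similarity_to_dot(names, matrix, threshold):
    """Return DOT source for an undirected graph of sufficiently similar pairs."""
    lines = ["graph similarity {"]
    for name in names:
        lines.append(f'  "{name}";')
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                lines.append(
                    f'  "{names[i]}" -- "{names[j]}" [label="{matrix[i][j]:.2f}"];'
                )
    lines.append("}")
    return "\n".join(lines)

# Toy data: three nodes and a symmetric similarity matrix.
names = ["A", "B", "C"]
matrix = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]
dot = similarity_to_dot(names, matrix, threshold=0.25)
print(dot)
```

Saved to a file, the output can be rendered with Graphviz, for example `dot -Tpng similarity.dot -o similarity.png`. Names containing double quotes would need escaping first.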

Connectivity
Diagram created by Graphviz from a SubSift-generated dot file

Workflow 4 – Profiling Reading Lists

Profiling a research group by its publications
Diagram produced in Wordle using SubSift profile data

Workflow 5 – Ranking News Stories

Future Work
• Scaling-up
• Currently a small-scale web application running on modest hardware.
• Plans to migrate to a larger-scale HPC application at Bristol.
• ExaMiner project
• Mining and mapping the University of Bristol’s research landscape.
• Crawling the University’s web pages to profile and visualise research interests
of and similarities between faculty, departments, research groups and
researchers.
• Plans to apply to websites of other Universities.

5. Conclusions

Conclusion
• SubSift services are useful outside the peer review domain
• Workflows for profiling/comparing scientists
• Promising e-Science and e-Research use cases for profiling and comparing
scientists and their published works.
• Tool for constructing, manipulating and publishing
document-centric datasets
• E.g. information retrieval, data mining, pattern analysis research.
• Publication of datasets in this way supports reproducibility of science.
• Connects data through Linked Data and the Semantic Web.
