Paper presentation at IEEE eScience 2010 conference, December 2010, Brisbane, Australia. Scientific researchers, laboratories and organisations can be profiled and compared by analysing their published works, including documents ranging from academic papers to web sites, blog posts and Twitter feeds. This paper describes how the vector space model from information retrieval, more normally associated with full text search, has been employed in the open source SubSift software to support workflows to profile and compare such collections of documents. SubSift was originally designed to match submitted conference or journal papers to potential peer reviewers based on the similarity between the paper's abstract and the reviewer's publications as found in online bibliographic databases. The software is implemented as a family of RESTful web services that, composed into a re-usable workflow, have already been used to support several major data mining conferences. Alternative workflows and service compositions are now enabling other interesting applications.
SubSift web services and workflows for profiling and comparing scientists and their published works
1. SubSift web services and workflows for profiling and comparing scientists and their published works
Simon Price, Peter Flach, Sebastian Spiegler, Christopher Bailey and Nikki Rogers
2. Outline of this paper
1. SubSift – submission sifting
2. Background Theory: Vector Space Model
3. SubSift REST API
4. Demonstration Workflows
5. Conclusions
4. SubSift
SubSift is a prototype application to support academic peer review.
SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.
Website:
http://subsift.ilrt.bris.ac.uk
6. Contribution of this work
SubSift RESTful web services:
• Open Source software (on Google Code)
• Hosted open web service at University of Bristol
Re-usable workflows for profiling and comparing scientists and their published works.
Tool for constructing, manipulating and publishing document-centric datasets.
7. Related Work
• SubSift uses techniques more normally associated with Information Retrieval
• Full text search tools support text matching on large-scale document collections, e.g. Apache Lucene, PostgreSQL, Oracle UltraSearch. These are designed for 1:M matching but can also do Cartesian product M:M matching.
• How SubSift differs:
• Exposes detailed metadata throughout.
• Partly a research tool: new algorithms need to be plugged in and instrumented.
• Fewer licensing restrictions and dependencies for open source.
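The 1:M versus M:M distinction above can be illustrated with a minimal Python sketch. It is not SubSift code: the function names and the use of raw term-count cosine similarity are assumptions chosen only to show how scoring one query against many documents differs from scoring every pair in the Cartesian product of two collections.

```python
import math
from collections import Counter
from itertools import product

def similarity(a, b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    u, v = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * v[term] for term, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def match_one_to_many(query, docs):
    """1:M matching: one query scored against each document."""
    return [similarity(query, d) for d in docs]

def match_many_to_many(left, right):
    """M:M matching: score every (left, right) pair in the Cartesian product."""
    return {(i, j): similarity(a, b)
            for (i, a), (j, b) in product(enumerate(left), enumerate(right))}

# Hypothetical toy data: two papers and three reviewer profiles.
papers = ["graph kernels", "text mining"]
reviewers = ["kernels for graph data", "mining text streams", "protein folding"]
scores = match_many_to_many(papers, reviewers)
```

Here `scores` holds one entry per paper-reviewer pair, so its size grows as the product of the two collection sizes.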
8. Background Theory: Vector Space Model
9. Vector Space Model (from Information Retrieval)
Vector Space Model consists of:
• bag-of-words representation
• cosine similarity
• tf-idf weighting
For a query (q), rank the documents (dj) in collection (D) by
descending similarity to the query.
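The ranking step above can be sketched in a few lines of Python. This is an illustrative sketch rather than the SubSift implementation: the whitespace tokenizer, the tf × log(N/df) weighting variant, and the choice to include the query when computing document frequencies are all simplifying assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (no stop-word removal or stemming)."""
    return text.lower().split()

def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, weighting tf by log(N / df)."""
    bags = [Counter(tokenize(t)) for t in texts]
    n = len(bags)
    # Document frequency: number of texts in which each term occurs.
    df = Counter(term for bag in bags for term in bag)
    return [{t: tf * math.log(n / df[t]) for t, tf in bag.items()} for bag in bags]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, collection):
    """Return (score, j) pairs for documents d_j, by descending similarity to q."""
    vectors = tfidf_vectors([query] + collection)
    q, docs = vectors[0], vectors[1:]
    scores = [(cosine(q, d), j) for j, d in enumerate(docs)]
    return sorted(scores, key=lambda s: -s[0])

# Hypothetical toy collection D and query q.
D = ["kernel methods for text classification",
     "protein folding simulation",
     "text mining with kernel machines"]
ranking = rank("kernel text classification", D)
```

Documents that share no terms with the query receive a score of zero and fall to the bottom of the ranking.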
10. Vector Space Model: bag-of-words representation
Figure labels: number of terms in each abstract; number of terms in the DBLP author page of each PC member.
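A bag-of-words representation can be sketched as a simple term-count mapping. The example below is a minimal illustration, assuming whitespace tokenization and a made-up abstract; it is not taken from SubSift.

```python
from collections import Counter

def bag_of_words(text):
    """Map each term to its number of occurrences; word order is discarded."""
    return Counter(text.lower().split())

# Hypothetical abstract text used only for illustration.
abstract = "Mining text with kernels and mining graphs with kernels"
bag = bag_of_words(abstract)
```

The same function could be applied to the text of a reviewer's publication list, giving two count vectors over a shared vocabulary that can then be compared.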
13. Representational State Transfer (REST)
“RESTful” web services:
• URIs to represent resources
• HTTP POST/GET/PUT/DELETE correspond to usual
Create/Read/Update/Delete (CRUD) operations
• Response formats typically include: XML, JSON, CSV
REST is a design pattern for web services based on HTTP using its
familiar URIs, requests, responses, authentication, etc.
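The correspondence between CRUD operations and HTTP methods can be sketched with Python's standard library. The base URI, resource path and JSON body below are hypothetical placeholders, not the SubSift API; the sketch only builds request objects and sends nothing over the network.

```python
from urllib.request import Request

# Hypothetical base URI for illustration; not the SubSift service.
BASE = "http://example.org/api"

# CRUD operation -> HTTP method, as listed on the slide.
CRUD_TO_HTTP = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}

def build_request(operation, resource_path, body=None):
    """Build (but do not send) an HTTP request for a CRUD operation on a resource URI."""
    data = body.encode("utf-8") if body is not None else None
    return Request(
        f"{BASE}/{resource_path}",
        data=data,
        method=CRUD_TO_HTTP[operation],
        headers={"Accept": "application/json"},  # ask for a JSON response
    )

req = build_request("update", "documents/42", body='{"title": "revised"}')
```

Passing such a request to `urllib.request.urlopen` would perform the call against a real service.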
14. SubSift REST API
35. Future Work
• Scaling-up
• Currently a small-scale web application running on modest hardware.
• Plans to migrate to a larger-scale HPC application at Bristol.
• ExaMiner project
• Mining and mapping the University of Bristol’s research landscape.
• Crawling the University’s web pages to profile and visualise the research interests of, and similarities between, faculty, departments, research groups and researchers.
• Plans to apply to websites of other Universities.
37. Conclusion
• SubSift Services useful outside of peer review domain
• Workflows for profiling/comparing scientists
Promising e-Science and e-Research use cases for profiling and comparing
scientists and their published works.
• Tool for constructing, manipulating and publishing
document-centric datasets
E.g. information retrieval, data mining, pattern analysis research.
Publication of datasets in this way supports reproducibility of science.
Connects data through Linked Data and the Semantic Web.