Data You May Like: A Recommender System for Research Data Discovery
Anusuriya Devaraju1, Rob Davy2 and Dominic Hogan2
1 Mineral Resources, 2 Information Management & Technology
IN21D: New Approaches to Data Discovery Across Geoscience Domains I, AGU 2016, 13th December 2016.
image: orbital-recruitment.co.uk
Introducing Recommender Systems
We can classify recommender systems into two broad groups:
• Content-based filtering systems examine properties of the items recommended.
• Collaborative filtering systems recommend items based on similarity measures between users or item co-occurrences.
2 | Data You May Like: A Recommender System for Research Data Discovery | Anusuriya Devaraju et al.
Introducing Case Study
• Total collections (as of 30.11.2016) : 1853 (1802 public, 51 private)
• Domains
• Agriculture & food
• Astronomy & space science
• Data61
• Energy
• Food & nutrition
• Health & biosecurity
• Land & water
• Manufacturing
• Mineral resources
• Oceans & atmosphere
CSIRO Data Access Portal (DAP)
Motivations
• Direct search: matching is limited to the title, keyword and description fields.
• Faceted browsing: the filters are exhaustive and time-consuming to work through.
A Recommender System for Research Data
• Search and recommendations are complementary.
• Enhances data visibility, especially for users unfamiliar with the datasets.
[Diagram: a data user who is interested in (views, downloads) Dataset A is likely interested in datasets similar to Dataset A.]
Similar datasets may be determined based on :
• Explicit information, e.g., metadata of datasets
• Implicit information, e.g., data consumption details inferred from logs.
Data Sources
• Explicit information: DAP Web Service (https://ws.data.csiro.au/)
• Implicit information: DAP Google Analytics Reporting API and server log files
Data Similarity Model
The overall similarity between two datasets (Di, Dj) is the weighted sum of their per-feature similarities:

$$S(D_i, D_j) = \omega_1 S_{f_1}(D_i, D_j) + \omega_2 S_{f_2}(D_i, D_j) + \omega_3 S_{f_3}(D_i, D_j) + \dots + \omega_n S_{f_n}(D_i, D_j)$$

• $D_i, D_j$ : Datasets
• $S$ : Overall similarity
• $\omega_1 \dots \omega_n$ : Feature weights
• $S_{f_n}(D_i, D_j)$ : Similarity between datasets $D_i$ and $D_j$ based on feature class $n$ (normalized to [0, 1])
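The weighted sum above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the feature names, the weights and the per-feature similarity values are all hypothetical, and the choice of weights that sum to 1 (which keeps S in [0, 1]) is an assumption the slides do not state.

```python
from typing import Dict, List, Tuple


def overall_similarity(feature_sims: Dict[str, float],
                       weights: Dict[str, float]) -> float:
    """Weighted sum S = sum_k w_k * S_fk over the feature classes in `weights`."""
    total = 0.0
    for feature, weight in weights.items():
        # A feature with no computed similarity contributes nothing.
        sim = feature_sims.get(feature, 0.0)
        if not 0.0 <= sim <= 1.0:
            raise ValueError(f"similarity for {feature!r} must lie in [0, 1]")
        total += weight * sim
    return total


def top_n(scores: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n highest-scoring candidate datasets, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical weights (assumed to sum to 1) and hypothetical per-feature
# similarities between a target dataset and two candidate datasets.
weights = {"title": 0.5, "keyword": 0.25, "download": 0.25}
sims_to_target = {
    "dataset_b": {"title": 0.8, "keyword": 0.4, "download": 0.0},
    "dataset_c": {"title": 0.2, "keyword": 1.0, "download": 1.0},
}

scores = {name: overall_similarity(sims, weights)
          for name, sims in sims_to_target.items()}
ranked = top_n(scores, 1)
```

With these made-up numbers, dataset_c ranks above dataset_b even though its title similarity is lower, which shows how the weights trade off the feature classes against each other.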
Best Choice of Weights?
• Survey Period : 08.06.2016 – 23.08.2016
• Respondents : Data owners and consumers
• Number of respondents : 151
Feature Extraction
Feature Class      | Feature Extraction                   | Similarity Measure
Title              | TF-IDF score                         | Cosine similarity
Description        | TF-IDF score                         | Cosine similarity
Keyword            | TF-IDF score                         | Cosine similarity
Activity           | TF-IDF score                         | Cosine similarity
Fields of Research | Presence/absence of research fields  | Jaccard coefficient
Lead Researcher    | Presence/absence of lead researchers | Jaccard coefficient
Contributor        | Presence/absence of contributors     | Jaccard coefficient
Search behaviour   | Common query term                    | Cosine similarity
Download           | Datasets downloaded together         | Cosine similarity
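The two families of measures in the table can be illustrated with a small standard-library sketch. The slides do not say which TF-IDF weighting or tokenization was used, so the smoothed IDF below is just one common variant, and the tokenized titles and research-field sets are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Raw term frequency times a smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical, already-tokenized dataset titles.
titles = [
    ["soil", "moisture", "map"],
    ["soil", "carbon", "map"],
    ["radio", "telescope", "survey"],
]
vecs = tfidf_vectors(titles)
title_sim_01 = cosine(vecs[0], vecs[1])  # two shared terms
title_sim_02 = cosine(vecs[0], vecs[2])  # no shared terms

# Hypothetical Fields of Research assigned to two datasets.
for_sim = jaccard({"Geology", "Geochemistry"}, {"Geology", "Hydrology"})
```

Both measures return values in [0, 1] for these inputs, which matches the normalization that the similarity model requires of each feature class.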
Examples : Infer Related Datasets
[Figure, two panels: "Common Search Term" and "Daily Data Download". In the first panel, the search term "rock" returns a list of search results: Dataset 1, Dataset 2, Dataset 3, Dataset 4, …]
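The feature table lists cosine similarity for both the "Search behaviour" and "Download" feature classes, but the slides do not say how the underlying vectors are built. A minimal sketch, assuming each dataset is represented by the set of contexts in which it occurs (a query term that returned it, or a download session that included it), so that cosine similarity is taken over binary occurrence vectors. The session identifiers and dataset names below are hypothetical.

```python
import math
from typing import Dict, Set


def cooccurrence_cosine(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two binary vectors, each given as the set of
    contexts (query terms or download sessions) where a dataset occurs."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical download log: dataset -> sessions in which it was downloaded.
downloads: Dict[str, Set[str]] = {
    "dataset_1": {"u1:2016-11-01", "u2:2016-11-01", "u3:2016-11-02"},
    "dataset_2": {"u1:2016-11-01", "u3:2016-11-02"},
    "dataset_3": {"u4:2016-11-03"},
}

sim_12 = cooccurrence_cosine(downloads["dataset_1"], downloads["dataset_2"])
sim_13 = cooccurrence_cosine(downloads["dataset_1"], downloads["dataset_3"])
```

The same function would apply to search behaviour by replacing the session sets with the sets of query terms that returned each dataset.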
Example
System Architecture
[Figure: system architecture. Labelled components: CSIRO Data Access Portal (DAP) with a "You may also like …" panel, RecommenderService, HDF, SQL database, Research Data Recommender Engine, CSIRO-DAP Web Service, Analytics Reporting API, DAP Server Logs.]
Examples of Web Service Requests
• Obtain the similarity results for a target dataset with a GET request:
http://{server-name}/simhdf?collection=DAP&nn=5&uw=0&target=csiro:6110
• Get the features and weights associated with a collection:
http://{server-name}/features?collection=DAP
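As a sketch of how a client might assemble these two request URLs, the helpers below use only the Python standard library. The function names and the example host are hypothetical, and the slides do not define the `nn` and `uw` parameters, so they are passed through unchanged with the values shown in the example request.

```python
from urllib.parse import urlencode


def similarity_url(server: str, target: str, collection: str = "DAP",
                   nn: int = 5, uw: int = 0) -> str:
    """Build the /simhdf request URL for a target dataset identifier."""
    # safe=":" keeps identifiers such as "csiro:6110" unescaped, as in the slide.
    query = urlencode(
        {"collection": collection, "nn": nn, "uw": uw, "target": target},
        safe=":",
    )
    return f"http://{server}/simhdf?{query}"


def features_url(server: str, collection: str = "DAP") -> str:
    """Build the /features request URL for a collection."""
    return f"http://{server}/features?{urlencode({'collection': collection})}"


# "recommender.example.org" is a placeholder host, not a real endpoint.
sim_request = similarity_url("recommender.example.org", "csiro:6110")
feat_request = features_url("recommender.example.org")
```

The resulting strings can be handed to any HTTP client; the response format is not described in the slides, so no parsing is shown.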
Offline Evaluation
• Respondents: Lead researchers
• Number of datasets evaluated : 52
• Evaluation period : 10.11.2016 – 30.11.2016
• Binary relevance tests
a. Top-ranked datasets : 51/52 datasets (98%) were rated as relevant.
– 1 dataset was rated ‘undecided’.
b. Next-ranked datasets (not created by the evaluator) : 46/52 datasets (88%) were rated as relevant.
– 6 datasets were rated ‘less relevant’.
Business Unit     | Number of Evaluators
Agriculture       | 11
ICT               | 11
Energy            | 3
Land & Water      | 14
Manufacturing     | 3
Mineral Resources | 10
What’s Next?
• Enhance the model by adding new features: spatial and temporal information.
• Extend the evaluation: more evaluators, comparison of ranked lists, and 10-fold cross-validation.
• Apply the recommender model to infer similar research datasets from other repositories, e.g., data.gov.au, TERN, etc.
Anusuriya Devaraju
Postdoctoral Fellow, Mineral Resources
e anusuriya.devaraju@csiro.au

Robert Davy
Research Engineer, IMT
e Robert.Davy@csiro.au

Dominic Hogan
Data Librarian, IMT
e dominic.hogan@csiro.au
Acknowledgement:
• CSIRO eResearch Collaboration Project (ERRFP-368).
• CSIRO IMT Data Management Capability Enhancement Program (DMCEP).
