Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability of Hong Kong University Research Experiment

Scott Edmunds, GigaScience/HKU
Quantifying how FAIR is Hong Kong: The Hong Kong
Shareability of Hong Kong University Research Experiment

The Hong Kong experience.
Asia’s Academic City?
8 Universities, many ranked top 50 worldwide
100K students (UG/PG/FT/PT)
1 major research funder (UGC/RGC)
UGC Policy: “Realization of
making Hong Kong Asia's
world city is only possible if it
is based upon the platform of
a very strong education and
higher education sector.”
http://www.ugc.edu.hk/eng/ugc/policy/policy.htm

Research Data policies growing globally
http://ec.europa.eu/research/openscience/index.cfm?section=monitor&pg=researchdata#1

http://dx.doi.org/10.17477/jcea.2018.17.2.200
…meanwhile in Hong Kong
“This ambivalence was reflected by the chairman of the Research Grants Council, who
stated in an interview that ‘there is no relationship between world-class research and
release of data’, questioning whether anyone might be interested in the completeness of
data.
The chairman also saw a conflict between competitiveness and openness, arguing that
the reputation of a researcher is built on publications, not on the underlying data.”

If Government doesn’t act,
Universities need to lead the way
http://www.rss.hku.hk/integrity/research-data-records-management

First CRIS in HK, built upon Scholars Hub
http://hub.hku.hk/advanced-search?location=crisdataset
(CRIS = current research information system)

First CRIS in HK, built upon Scholars Hub
http://lib.hku.hk/researchdata/rpg.htm
“Beginning with the September 2017 intake, all HKU
research postgraduate (rpg) students have responsibility
for 1) using a data management plan (DMP), where
applicable, to describe the use of data in preparation for,
or in the generation of their theses, and 2) depositing,
where applicable, a dataset in the HKU Scholars Hub.”

Growing # of OA journals addressing this
http://dx.doi.org/10.1371/journal.pmed.1001607

CAN WE QUANTIFY IF THIS IS
WORKING?

http://reproducibility.cs.arizona.edu/
Arizona Repeatability in
Computer Science Experiment
• 2015 study examining the extent to which Computer Systems
researchers share their research artifacts (code)
• NSF policies on sharing code since 2005
• Examined 613 papers from ACM conferences & journals
• Attempted to locate source code that backed up results
• If found, tried to build the code.

• Manual curation/look for
code that backed up results
• If missing, emailed authors
• Chased if no reply
• If found, tried to build the
code
• Resolve issues
• Survey results

613 papers tested
123 successful reproductions (20%)

Can we do something similar in HK?
Teaching HKU MLIM students module on data curation and management.

HKU Repeatability in HK
Research Experiment
• HKU policy on data sharing from 2015
• PLOS policy mandating sharing of supporting data from March 1,
2014
• HKU has published ≈400 PLOS ONE papers 2014-date
• Can we quantify reproducibility in a sample of these?
• Compare with other less stringent journals (e.g. Springer
Nature data policy ranked journals [1])
• Can we follow Arizona and harness crowdsourced
(student) power?
1. https://www.springernature.com/gp/authors/research-data-policy/data-policy-types/12327096

Research Experiment
• Easy exercise in literature curation for HKU MLIM
students
• Set as a project for 59 students, 2017-2019
http://hub.hku.hk/simple-search?query=&location=publication&sort_by=score&order=desc&rpp=25&filter_field_1=journal&filter_type_1=equals&filter_value_1=plos+one&etal=0&filtername=dateIssued&filterquery=[2014+TO+2019]&filtertype=equals
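
The Scholars Hub search above can also be assembled programmatically, which makes it easier to repeat for another journal or date range. A minimal sketch, assuming only the parameter names visible in the URL on this slide; the function name `build_hub_query` is hypothetical and this is not the code used in the course.

```python
from urllib.parse import urlencode

# Base address taken from the search URL on the slide.
BASE_URL = "http://hub.hku.hk/simple-search"


def build_hub_query(journal: str, start_year: int, end_year: int, per_page: int = 25) -> str:
    """Build a Scholars Hub search URL filtered by journal name and year range.

    Hypothetical helper: it only mirrors the query parameters seen on the slide.
    """
    params = [
        ("query", ""),
        ("location", "publication"),
        ("sort_by", "score"),
        ("order", "desc"),
        ("rpp", per_page),
        ("filter_field_1", "journal"),
        ("filter_type_1", "equals"),
        ("filter_value_1", journal),
        ("etal", 0),
        ("filtername", "dateIssued"),
        ("filterquery", f"[{start_year} TO {end_year}]"),
        ("filtertype", "equals"),
    ]
    # urlencode turns spaces into "+" and percent-encodes the square brackets.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_hub_query("plos one", 2014, 2019)
print(url)
```

Note that `urlencode` percent-encodes the square brackets of the date range (`%5B` and `%5D`), which is equivalent to the literal brackets in the slide's URL.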

Another question:
Rise (and fall) of megajournals
NPG (Scientific Reports) copies the PLOS ONE model…
https://scholarlykitchen.sspnet.org/2018/01/10/future-oa-megajournal/

Research Experiment
https://scholarlykitchen.sspnet.org/2016/01/06/plos-one-shrinks-by-11-percent/
Rise (and fall) of megajournals
Driven by impact factor or “easier” data policies?
“Because data requirements are not uniform
across all journals, PLOS has put itself at a
disadvantage as far as attracting authors because
other journals offer an easier path. If strictly
enforced, this new policy is likely to result in a
drop in submissions to PLOS journals. While no
other mega-journal has been able to shake PLOS
ONE’s hold on the market, this policy may provide
an opening for competitors to gain on PLOS ONE
and even overtake it.”
Can we quantify this?

Research Experiment
• Students assigned 2 PLOS + 2 SciRep papers (268 total)
• Quickly scan paper looking for supporting data
• If no data, go to the next paper
• If uses data, is it all associated with the paper?
• If external data, is it available from URL or accession?
• If “data available on request”, are they contactable?
• Spend up to about 10 minutes per article
• Add data into a Google Doc; the teacher double-checks &
marks students on accuracy
Homework/Case study: literature curation exercise
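
The checklist above amounts to a small decision procedure per paper. The sketch below is one possible encoding, not the course's actual spreadsheet: the field names, the category labels and the rule that a paper counts as accessible once its authors respond are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperRecord:
    """One row of a curation sheet (all field names are hypothetical)."""
    has_data: bool                        # Is there data presented in the paper?
    all_data_in_paper: bool = False       # Is all of it associated with the paper?
    external_link: Optional[str] = None   # URL or accession for external data, if any
    external_link_works: bool = False     # Did the URL or accession resolve?
    on_request: bool = False              # "Data available on request" statement?
    author_responded: bool = False        # Did the authors reply when contacted?


def classify(record: PaperRecord) -> str:
    """Assign a paper to one category, following the order of the checklist."""
    if not record.has_data:
        return "no data"
    if record.all_data_in_paper:
        return "accessible"
    if record.external_link and record.external_link_works:
        return "accessible"
    if record.on_request:
        # Assumption: a reply from the authors is counted as access.
        return "accessible" if record.author_responded else "on request, not obtained"
    return "missing"


# A paper with a questionnaire but only an "available upon request" statement.
example = PaperRecord(has_data=True, on_request=True)
print(classify(example))
```

Keeping the answers as explicit fields, rather than a single free-text verdict, lets a second checker re-derive the category and spot disagreements.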

Research Experiment
Alternative: webscraping option (code in GitHub)…
https://github.com/jessesiu/hku_scholars_hub

Research Experiment
See protocols in protocols.io: http://dx.doi.org/10.17504/protocols.io.6x7hfrn
Teachers protocol: http://dx.doi.org/10.17504/protocols.io.6x8hfrw
Students protocol: http://dx.doi.org/10.17504/protocols.io.6yahfse

Research Experiment
Example
http://hub.hku.hk/handle/10722/223364

Research Experiment
Is there data presented in the paper? – Yes
Is there external data, and if so what is the
link/accession? – No
Is all the data in the paper available? – No
Comments - Has questionnaire, but not data, as it
says “minimal anonymized dataset will be made
available upon request”
Example

Research Experiment
If data “available on request”, do the authors respond if contacted?
Example

Interesting examples
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165978
Several examples of missing Infectious Disease data

Interesting examples
Several examples of missing Infectious Disease data
http://www.vox.com/2015/6/17/8796225/mers-virus-data-sharing
http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966

Data accessibility (flowchart)
148 papers
• 114 with data
• 27 data on request (Respond 7, No response 17, Bounce 5)
• 7 missing
121 accessible data (82%)

Data accessibility (flowchart)
120 papers
• 79 with data
• 16 data on request (Respond 8, No response 8)
• 25 missing
87 accessible data (72.5%)
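
The percentages on the two data accessibility slides can be re-derived from the counts shown. A minimal sketch, assuming that "accessible" means papers with data plus papers whose authors responded to a request (which reproduces the 121 on the first slide, and gives 79 + 8 = 87 for the second); the function names are hypothetical.

```python
def accessibility(total: int, with_data: int, responded: int) -> tuple[int, float]:
    """Return (accessible papers, percentage of the sample) for one flowchart."""
    accessible = with_data + responded
    return accessible, round(100 * accessible / total, 1)


def request_failure_rate(requests: int, responded: int) -> float:
    """Percentage of "data on request" papers where no data was obtained."""
    return round(100 * (requests - responded) / requests, 1)


first = accessibility(total=148, with_data=114, responded=7)
second = accessibility(total=120, with_data=79, responded=8)
print(first)    # (121, 81.8)
print(second)   # (87, 72.5)

# Pooling both samples: 27 + 16 requests, 7 + 8 responses.
print(request_failure_rate(27 + 16, 7 + 8))   # 65.1
```

The pooled failure rate of about 65% matches the figure quoted for "data available on request" in the lessons learned.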

External Data Sources
• Growing number of papers host data via
general-purpose open-access repositories:
– figshare (12), Dryad (5), OSF (4), Zenodo (2), Dataverse
(2), PANGAEA (2), DANS (1)
– Since 2016 figshare use has been dropping &
OSF/Zenodo increasing
– Large numbers of government, IR & institutional
websites
– Other than one broken Dryad link, OA data repositories
much more stable than other URLs (many broken)
https://figshare.com/projects/HKU_Repeatability_in_HK_Research_Experiment/64118

Do not rely on handles
Instability of older HKU Scholars Hub Identifiers & data
• Going back to older papers (collected in early 2017), 3/49 (6%) handles have
changed
• Checking back over time, the number of 2016/2017/2018 PLOS/SR papers
listed keeps increasing (have had to update our results)

Do not rely on “data available from our website”
http://bioinformatics.oxfordjournals.org/content/24/11/1381.long

Do not rely on “data available on request”
https://doi.org/10.1101/633255

Do not rely on “data available from the government”
HK Hospital Authority only shares data with researchers at UGC-funded universities
in Hong Kong, with data access charges on average 35,700 HKD per request [1]
Emailing the authors for the data:
“Thanks for your interest. I'm afraid we can't as the data came from our hospital
authority which is highly strict in using of their data and would not allow us to
use the data other the purposed we stated before.”
So why say it was available upon request?
1. https://www.accessinfo.hk/en/request/request_for_statistics_on_data_c
2. https://www.nature.com/articles/s41598-017-15579-z

Do not rely on GitHub (or google)
https://dev.to/mjraadi/if-you-don-t-know-now-you-know-github-is-restricting-access-for-users-from-iran-and-a-few-other-embargoed-countries-5ga9

Lessons Learned: never trust “data on request”
• “Data Available on Request” does not work (65% of requests failed after
2 attempts).
• Hong Kong Government (esp. Hospital Authority) data access policies
incompatible with international journal policies
• Email addresses not checked by journals: 5 bounced (one wasn’t
even in the correct format). One example gave a postal address only.
• Data Access Committee system not working. None of the DACs of the
listed Consortia/Cohort projects responded to emails (Children of
1997, Guangzhou Biobank Cohort Study, JAGES, and China Research
Center on Aging DACs).
• Even if authors respond there are often problems:
– Terms and conditions, e.g. MTAs or co-authorship required; some could
share only a sample of the processed data, not the raw data, as they
were still writing publications.
– Data missing, e.g. they deleted the raw sequencing data.
Lessons Learned: problems with Scholars Hub
• Unstable identifiers – 6% (3/49) of examples changed in 2
years
• Unstable indexing – numbers of historic publications
keep increasing (self-reporting by authors?)
• Unstable source of datasets: one example of data in a
thesis that was blocked for a period
• Inconsistent indexing/metadata – one example lacked a
link/DOI to the paper, inconsistent keywords & tagging
• Inconsistent authorship – multiple, unused ORCID IDs
registered by HKU

Importance of FAIR snapshots
Why GigaScience set up
http://gigadb.org/

Importance of FAIR snapshots
Why GigaScience set up
https://doi.org/10.1093/database/baz016
Foundational Principles
• Can’t trust “data available on request” – need independent, trusted broker
• Follow FAIR principles (Findability, Accessibility, Interoperability, and
Reusability) for data stewardship & offer unlimited data hosting
• Use globally unique and persistent (stable) identifiers, e.g. DataCite DOIs
• Need to take unlimited sized snapshots of “version of record” (data, code…)
• Increase Reusability with Interoperable CC licensing (we use CC0)
• Increase Findability & Reusability with rich open metadata (field specific,
DataCite, schema.org) and wide indexing (DataCite, NIH datamed, DCI, etc.)
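
To make the metadata points above concrete, the sketch below builds a small schema.org `Dataset` description as JSON-LD, carrying a DOI-based identifier and a CC0 license. It is an illustration under stated assumptions, not GigaDB's actual metadata export: the helper name `dataset_jsonld` is hypothetical and the DOI and title are placeholders, not real records.

```python
import json

# URL of the Creative Commons CC0 1.0 public domain dedication.
CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def dataset_jsonld(name: str, doi: str, description: str, keywords: list[str]) -> dict:
    """Describe a dataset with a few common schema.org Dataset properties."""
    doi_url = f"https://doi.org/{doi}"
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": doi_url,   # persistent identifier
        "url": doi_url,          # landing page resolved through the DOI
        "license": CC0_URL,      # interoperable licensing
        "keywords": keywords,    # aids findability in indexes
    }


record = dataset_jsonld(
    name="Example supporting data",            # placeholder title
    doi="10.1234/example-dataset",             # placeholder DOI, not a real record
    description="Supporting data for an example article.",
    keywords=["example", "supporting data"],
)
print(json.dumps(record, indent=2))
```

Embedding a block like this in a dataset landing page is one common way to expose open metadata to indexers.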

Thanks to:
Laurie Goodman, Editor in Chief
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Hongling Zhao, Assistant Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Chris Armit, Data Scientist
Mary Ann Tulli, Data Editor
Xiao (Jesse) Si Zhe, Database Developer
Chen Qi, Shenzhen Office.
+ HKU MLIM students
Follow us:
@GigaScience
facebook.com/GigaScience
http://gigasciencejournal.com/blog/
www.gigasciencejournal.com
www.gigadb.org
+ Weibo & WeChat
